Run all Jupyter notebooks inside a folder and its subfolders in order, displaying progress


I want to run (in alphabetical order, not in parallel) all Jupyter notebooks inside a folder that contains many subfolders.

  • Each subfolder may contain other folders.
  • Folders contain jupyter notebooks and the resulting outcome (csv, json, excel, jpg files).
  • It is important that files run in order, since the outcome of one notebook is used by others as input source.

As the Jupyter notebooks are executed, I would like to see a print stating the notebook name and its path.

Until now I have created a Jupyter notebook inside each folder and used it to run all the notebooks in that folder with %run samplenotebook1.ipynb. However, this becomes tedious when there are numerous folders and subfolders, so I need to speed it up.

I have tried the solution from this post. The notebook seems to be running, but if I open any of the folders where the Jupyter notebooks are supposed to run, I cannot see any file generated by running them.

Code below is the one I used, but the result was not the desired one.

import papermill as pm
from glob import glob

for nb in glob('*.ipynb'):
    pm.execute_notebook(
        input_path=nb,
        output_path=nb,
        engine_name='embedded',
    )

I have also tried the code below. I get a print as if it were done, but it only runs notebooks in the folder, not in subfolders.

import papermill as pm
from pathlib import Path

for nb in Path('../').glob('*.ipynb'):
    pm.execute_notebook(
        input_path=nb,
        output_path=nb  # Path to save executed notebook
    )
print ('done')

Code above will run jupyter notebooks from the path, but not from inner folders. I tried adding /* in the path, but it would not work.


It seems the main thing you needed to change was to make the glob that collects the notebook names recursive, so that it looks in subdirectories.

This doesn't run them alphabetically (yet), but it does run all notebooks, as you asked in your latest comment:

# Recursively run each notebook found by Path('.').glob('**/*.ipynb'); based on /a/64162840/8508004
import papermill as pm
from pathlib import Path

for nb in Path('.').glob('**/*.ipynb'):
    print(f"Processing {nb}...")
    if 'Untitled' not in str(nb):
        pm.execute_notebook(
            input_path=nb,
            output_path=nb  # Path to save executed notebook
        )
print ('done')

If you need to specify the order, you'd sort the results of the glob() by your preference.
You'll most likely want or need to convert it to a list object for this. (One way would be to cast it with list(Path('.').glob('**/*.ipynb')).) Something like this combines the sorting with iterating over the sorted list, as a replacement for the line for nb in Path('.').glob('**/*.ipynb'):

notebooks_to_be_run = Path('.').glob('**/*.ipynb')
notebooks_to_be_run_ordered = sorted([str(x) for x in notebooks_to_be_run], key=str.casefold)
for nb in notebooks_to_be_run_ordered:

That code includes the entire path in the sort, so it may not be what you want, but it may work for those with notebooks in only one or two directories, especially if you add , reverse=True after key=str.casefold.
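To show how the pieces above could fit together, here is a minimal sketch that sorts by folder first and file name second, prints each notebook's name and path, and runs the notebooks one at a time. The helper names collect_notebooks and run_all are hypothetical, skipping .ipynb_checkpoints is my assumption about what you'd want, and passing cwd to Papermill is a guess at why generated files did not show up next to the notebooks, not a confirmed diagnosis.

```python
from pathlib import Path


def collect_notebooks(root="."):
    """Return notebooks under root, ordered by folder and then by file name."""
    found = [
        nb for nb in Path(root).glob("**/*.ipynb")
        # Assumption: autosave copies and untitled scratch notebooks are skipped.
        if ".ipynb_checkpoints" not in nb.parts and "Untitled" not in nb.name
    ]
    return sorted(
        found,
        key=lambda nb: (str(nb.parent).casefold(), nb.name.casefold()),
    )


def run_all(root=".", execute=None):
    """Run every collected notebook sequentially, printing its name and path."""
    if execute is None:
        import papermill as pm  # imported lazily so collecting works without it

        def execute(nb):
            nb = nb.resolve()
            # cwd asks Papermill to execute with the notebook's own folder as
            # the working directory, so relative output paths land beside it.
            pm.execute_notebook(
                input_path=str(nb),
                output_path=str(nb),
                cwd=str(nb.parent),
            )

    for nb in collect_notebooks(root):
        print(f"Running {nb.name} in {nb.parent.resolve()}")
        execute(nb)
    print("done")
```

Calling run_all('.') from the top-level folder would then process everything beneath it. The execute argument exists only so the ordering can be checked with a stand-in function before any notebook is actually run.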

key=str.casefold comes from here. You'd want to leave that out if you want a case-sensitive sort, in which names that begin with an uppercase letter come before names that begin with lowercase.
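A quick illustration of the difference, using made-up file names:

```python
names = ["b.ipynb", "A.ipynb", "a.ipynb", "B.ipynb"]

# Default sort compares code points, so every uppercase name comes first.
case_sensitive = sorted(names)

# casefold() compares names as if they were all lowercase; ties keep input order.
case_insensitive = sorted(names, key=str.casefold)

print(case_sensitive)    # ['A.ipynb', 'B.ipynb', 'a.ipynb', 'b.ipynb']
print(case_insensitive)  # ['A.ipynb', 'a.ipynb', 'b.ipynb', 'B.ipynb']
```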

If your file names involve numbers, you may be interested in natsort.natsorted(), a drop-in replacement for sorted(); see here and here.
natsort.natsorted() works with key=str.casefold, too.
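If you'd rather not install anything, the idea behind natural ordering can be approximated with the standard library. This is a rough stand-in that I wrote for illustration, not natsort's actual implementation, and natural_key is a hypothetical name:

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so that 2 sorts before 10."""
    chunks = re.split(r"(\d+)", name)
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions are always the digit runs.
    return [
        int(chunk) if index % 2 else chunk.casefold()
        for index, chunk in enumerate(chunks)
    ]


names = ["step10.ipynb", "step2.ipynb", "Step1.ipynb"]

plain_order = sorted(names, key=str.casefold)
natural_order = sorted(names, key=natural_key)

print(plain_order)    # ['Step1.ipynb', 'step10.ipynb', 'step2.ipynb']
print(natural_order)  # ['Step1.ipynb', 'step2.ipynb', 'step10.ipynb']
```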

I should probably add that if your notebook list is going to be predictable, and/or you need an order so custom that it is hard to code, you can print the result of notebooks_to_be_run and hand-edit it into a list notebooks_to_be_run_ordered with the exact order you want. For example, imagine notebooks_to_be_run gives ["a.ipynb", "b.ipynb", "c.ipynb"] and for whatever reason you'd like c to run first, then a, and then b. Just before the line for nb in notebooks_to_be_run_ordered: you can assign the list in the order you want, 'hard coding' it, like so:

notebooks_to_be_run_ordered = ["c.ipynb", "a.ipynb", "b.ipynb"]
for nb in notebooks_to_be_run_ordered:

That would be tedious if you have a lot with long paths, but it is doable. And if it changes often, you could even imagine making a widget that runs prior to the for loop that lets you sort the list as you wish.
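A possible middle ground, sketched here with a hypothetical helper order_with_pinned, is to hard-code only the notebooks whose position matters and let the rest follow in whatever order they were discovered or sorted:

```python
def order_with_pinned(discovered, pinned_first):
    """Put the hand-picked notebooks first, then the rest in their existing order."""
    missing = [nb for nb in pinned_first if nb not in discovered]
    if missing:
        # Fail early rather than silently skipping a notebook that was renamed.
        raise ValueError(f"Pinned notebooks not found: {missing}")
    rest = [nb for nb in discovered if nb not in pinned_first]
    return list(pinned_first) + rest


discovered = ["a.ipynb", "b.ipynb", "c.ipynb", "d.ipynb"]
ordered = order_with_pinned(discovered, ["c.ipynb"])
print(ordered)  # ['c.ipynb', 'a.ipynb', 'b.ipynb', 'd.ipynb']
```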


One reason I like Jupytext for running notebooks is that Papermill will sometimes add cruft directly into your notebooks. And so if you find that annoying, you can substitute in the use of Jupytext for the step to execute the notebooks.
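As a rough sketch of that substitution, assuming the jupytext command-line tool is installed and on your PATH and that its --execute option runs the notebook and saves it in place, the execution step could be swapped for a subprocess call. The function names here are hypothetical:

```python
import subprocess
from pathlib import Path


def jupytext_command(nb):
    """Build the command line; only the file name is used because we run in its folder."""
    return ["jupytext", "--execute", Path(nb).name]


def run_with_jupytext(nb):
    """Execute one notebook with Jupytext from inside the notebook's own folder."""
    nb = Path(nb).resolve()
    print(f"Running {nb.name} in {nb.parent}")
    # check=True raises CalledProcessError if execution fails, which stops a
    # loop before later notebooks run on missing inputs.
    subprocess.run(jupytext_command(nb), cwd=nb.parent, check=True)
```

Inside the for loop, run_with_jupytext(nb) would take the place of the pm.execute_notebook(...) call.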
