How to set a lower priority for scikit-learn parallel processes spawned by n_jobs?

When using the n_jobs parameter to parallelize scikit-learn routines (like GridSearchCV), is there a way to set the spawned parallel processes to a below-normal priority?

Often I would like to take maximum advantage of parallelization while still being able to use my machine for regular computing tasks. I'm using Windows; opening Task Manager and manually setting the priority to below normal for each of the processes spawned by n_jobs=-1 works well, but is very tedious.

Setting n_jobs to fewer than the available CPU threads or cores isn't a satisfactory solution either. My machine only responds smoothly to regular tasks if I reserve at least two physical CPU cores (e.g., n_jobs=6 on an 8-core, 16-thread CPU), which leaves many threads underutilized, and I would like all threads to be used when I'm not also working on the machine.

I found this answer to a question about lowering the priority of child processes using the psutil.Process() class.

The following code mostly works:

import psutil

# parallel processing for GridSearchCV.fit()
current_process = psutil.Process()
original_priority = current_process.nice()
current_process.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)  # set parent priority
grid_search.fit(X, y)
current_process.nice(original_priority)  # reset parent priority
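
One tweak I'm considering: wrapping the fit in try/finally so the parent's priority is restored even if fit() raises or is interrupted. A small variation on the snippet above:

import psutil

current_process = psutil.Process()
original_priority = current_process.nice()
current_process.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)  # set parent priority
try:
    grid_search.fit(X, y)
finally:
    current_process.nice(original_priority)  # always reset parent priority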

Here is a sample dataset and model to run the above code:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# sample dataset
dataset = load_digits()
X, y = dataset.data, dataset.target

# sample model
rf_model = RandomForestClassifier(n_jobs=-1)
param_grid = {'max_depth': list(range(5, 20)),
              'min_samples_split': list(range(2, 10))}
grid_search = GridSearchCV(rf_model, param_grid, n_jobs=-1)

This code usually allows grid_search.fit() to spawn its parallel processes at below-normal priority, but not always. Sometimes only some of the processes run at below-normal priority while the majority stay at normal, and I can't consistently reproduce that behavior. I'm also not sure what the consequences of setting the parent process to below-normal priority are (e.g. for interacting with a Jupyter Lab session or for stopping the kernel).
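
One workaround that might sidestep the inheritance question entirely would be to demote the workers directly once they exist, e.g. by running fit() in a background thread and repeatedly lowering the priority of whatever child processes joblib has spawned. A rough sketch (it assumes the workers show up as children of this process and that polling once per second is frequent enough):

import threading
import time

import psutil

def fit_at_low_priority(grid_search, X, y):
    # Run fit() in a background thread so this thread is free to keep
    # demoting whatever worker processes joblib spawns along the way.
    fitter = threading.Thread(target=grid_search.fit, args=(X, y))
    fitter.start()
    parent = psutil.Process()
    while fitter.is_alive():
        for worker in parent.children(recursive=True):
            try:
                worker.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass  # the worker exited or can't be modified
        time.sleep(1)  # poll until the fit finishes
    fitter.join()

fit_at_low_priority(grid_search, X, y)

But I don't know whether that is any more reliable than changing the parent's priority, or whether there is a supported way to do it through joblib itself.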

Is this the right way to go about setting a lower priority for scikit-learn parallel processes spawned by n_jobs? The scikit-learn docs on parallelism indicate that the n_jobs parameter uses joblib as its underlying implementation, but I didn’t see anything about process priority in either set of docs.
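
For reference, the only related knobs I can see in those docs are the backend and worker count, e.g. the joblib parallel_backend context manager, which doesn't appear to say anything about OS-level priority:

from joblib import parallel_backend

# Explicitly select the process-based loky backend with all cores;
# as far as I can tell there is no option here for worker priority.
with parallel_backend('loky', n_jobs=-1):
    grid_search.fit(X, y)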
