How does Random Forest handle missing values?


What technique does the Random Forest regressor in scikit-learn use to handle missing values?

At first I thought that a Random Forest regressor could natively handle missing values during training and in production without losing any accuracy (sklearn doc, stack question).
Therefore, I used the RandomForestRegressor from scikit-learn on data with a large number of missing values (see image below).

It works, but after some research it now seems to me that the Random Forest model itself does not handle missing values natively, and that libraries instead rely on various techniques such as imputation, k-NN, MissForest, etc. to deal with them.
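
To make the imputation idea concrete, here is a minimal sketch of imputing explicitly before the forest, so the behaviour does not depend on what the library does internally. The synthetic data, the 30% missing rate, and the helper name `score_with` are all assumptions made for illustration; this is not a description of what RandomForestRegressor does on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 observations, 5 features.
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# Blank out roughly 30% of the entries at random to mimic missing values.
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.3] = np.nan


def score_with(imputer):
    """Mean cross-validated R^2 of an imputer followed by a random forest."""
    model = make_pipeline(
        imputer,
        RandomForestRegressor(n_estimators=50, random_state=0),
    )
    return cross_val_score(model, X_missing, y, cv=3, scoring="r2").mean()


# Compare two of the imputation strategies mentioned above.
scores = {
    "median": score_with(SimpleImputer(strategy="median")),
    "knn": score_with(KNNImputer(n_neighbors=5)),
}
print(scores)
```

Putting the imputer inside the pipeline means it is fitted on the training folds only, so the cross-validated score is not inflated by information from the held-out fold.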

I would like to know how scikit-learn deals with missing values in its implementation of RandomForestRegressor. This would help me understand whether it is acceptable to keep a large number of missing values without hurting accuracy, or whether it is better to clean most of them upstream (by removing observations and features that have many missing values).


My example:
Each column is a feature, and each row is an observation. We can see, for example, that the 4 features on the right are present in only a few observations. Depending on how Random Forest handles missing values, it may be preferable to remove these features before training the model.
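
If removing sparse features turns out to be the better option, a sketch along these lines could measure the share of missing values per feature and drop the sparsest ones. The toy table, the column names `f1` to `f4`, and the 50% threshold are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: rows are observations, columns are features.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [0.5, np.nan, 1.5, 2.0, 2.5],
    "f3": [np.nan, np.nan, np.nan, 7.0, np.nan],
    "f4": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Fraction of missing values in each feature (column).
missing_fraction = df.isna().mean()

# Arbitrary cut-off (an assumption): drop features missing in more than half the rows.
max_missing = 0.5
sparse_features = missing_fraction[missing_fraction > max_missing].index.tolist()

df_reduced = df.drop(columns=sparse_features)
print(missing_fraction)
print("Dropped:", sparse_features)
```

The threshold is a judgment call; comparing cross-validated scores with and without the dropped features is one way to check whether removing them actually helps.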
