Jun 24 2020 · Piotr Płoński

How to save and load Random Forest from Scikit-Learn in Python?

In this post I will show you how to save and load a Random Forest model trained with scikit-learn in Python. The method presented here can be applied to any algorithm from scikit-learn (this is what's so amazing about scikit-learn!).

Additionally, I will show you how to compress the model to get a smaller file.

For saving and loading I will be using the joblib package.

Let's load scikit-learn and joblib:

import os
import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

Create some dataset (I will use the Iris dataset, which is built into sklearn):

iris = load_iris()
X = iris.data
y = iris.target

Train the Random Forest classifier:

rf = RandomForestClassifier()
rf.fit(X, y)

Let's check the predicted output:

rf.predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Let's save the Random Forest. I'm using the joblib.dump method. The first argument of the method is the variable with the model. The second argument is the path and file name where the resulting file will be created.

# save
joblib.dump(rf, "./random_forest.joblib")

To load the model back I use the joblib.load method. It takes as an argument the path and file name. I will load the forest into a new variable loaded_rf. Please notice that I don't need to initialize this variable, I just load the model into it.

# load, no need to initialize the loaded_rf
loaded_rf = joblib.load("./random_forest.joblib")

Let's check if it works by computing predictions. They should be exactly the same as from the rf model:

loaded_rf.predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

They are the same. We successfully saved and loaded back the Random Forest.
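Instead of comparing the printed arrays by eye, the check can also be done programmatically. Below is a minimal sketch (not part of the original code), assuming the rf and loaded_rf variables from above:

# sanity check: the original and the loaded model should give identical predictions
assert np.array_equal(rf.predict(X), loaded_rf.predict(X))
print("Predictions are identical")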

Extra tip for saving the Scikit-Learn Random Forest in Python

While saving the scikit-learn Random Forest with joblib, you can use the compress parameter to save disk space. The joblib docs mention that compress=3 is a good compromise between size and speed. Example below:

joblib.dump(rf, "RF_uncompressed.joblib", compress=0) 
print(f"Uncompressed Random Forest: {np.round(os.path.getsize('RF_uncompressed.joblib') / 1024 / 1024, 2) } MB")

>>> Uncompressed Random Forest: 0.17 MB

joblib.dump(rf, "RF_compressed.joblib", compress=3)  # compression is ON!
print(f"Compressed Random Forest: {np.round(os.path.getsize('RF_compressed.joblib') / 1024 / 1024, 2) } MB")

>>> Compressed Random Forest: 0.03 MB

The compressed Random Forest is 5.6 times smaller! Compression can be applied to any scikit-learn model (sklearn is amazing!).
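As an illustration (the model choice here is my own, not part of the original example), the same dump/load pattern with compression works for any other scikit-learn estimator, for instance a LogisticRegression trained on the same Iris data:

from sklearn.linear_model import LogisticRegression

# train a different scikit-learn model and save it the same way
lr = LogisticRegression(max_iter=1000)
lr.fit(X, y)
joblib.dump(lr, "logistic_regression.joblib", compress=3)

# load it back and predict, no initialization needed
loaded_lr = joblib.load("logistic_regression.joblib")
loaded_lr.predict(X)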
