Nov 04 2022 · Aleksandra Płońska, Piotr Płoński

The 2 ways to save and load scikit-learn model

After training a Machine Learning model, you need to save it for future use. In this article, I will show you 2 ways to save and load scikit-learn models. One method uses the pickle package: it is fast, but the saved model can take more storage than with the second approach. The alternative is the joblib package, which can save some disk space but is slower than pickle.

We will first show you how to save and load scikit-learn models with pickle and joblib. Then we will measure the time each package needs to save and load the same model. Additionally, we will check how much disk space the saved model takes with each library.

1. Save and load the scikit-learn model with pickle

The pickle library is part of the Python standard library - you don't need to install anything additional. It can be used to save most Python objects to disk and load them back.

Here is a Python snippet that shows how to save and load the scikit-learn model:

import pickle

filename = "my_model.pickle"

# save model (the with block closes the file even if dump fails)
with open(filename, "wb") as f:
    pickle.dump(model, f)

# load model
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)

The full code example is below. The outline:

  • train Random Forest model,
  • save model to disk,
  • load model from disk,
  • compute predictions with loaded model.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# library for save and load scikit-learn models
import pickle

# load example data from sklearn
X, y = load_iris(return_X_y=True)

# create Random Forest Classifier
rf = RandomForestClassifier()

# fit model with all data - it is just example!
rf.fit(X, y)

# file name, I'm using *.pickle as a file extension
filename = "random_forest.pickle"

# save model (the with block closes the file even if dump fails)
with open(filename, "wb") as f:
    pickle.dump(rf, f)

# load model
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)

# you can use loaded model to compute predictions
y_predicted = loaded_model.predict(X)

2. Save and load the scikit-learn model with joblib

The joblib package is not part of the Python standard library, but scikit-learn depends on it, so it is usually already installed. If it is missing from your Python environment, it can be added with the command below:

pip install joblib

Below is example Python code that shows how to save and load the scikit-learn model with joblib:

import joblib

filename = "my_model.joblib"

# save model
joblib.dump(rf, filename)

# load model
loaded_model = joblib.load(filename)

The complete example code with joblib:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# library for save and load scikit-learn models
import joblib

# load example data from sklearn
X, y = load_iris(return_X_y=True)

# create Random Forest Classifier
rf = RandomForestClassifier()

# fit model with all data - it is just example!
rf.fit(X, y)

# file name, I'm using *.joblib as a file extension
filename = "random_forest.joblib"

# save model
joblib.dump(rf, filename)

# load model
loaded_model = joblib.load(filename)

# you can use loaded model to compute predictions
y_predicted = loaded_model.predict(X)

In joblib you can pass the file name directly to the dump() and load() functions. With pickle we need to pass an open file handle.
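If you prefer a file-name based interface for pickle as well, a thin wrapper is enough. The sketch below is only an illustration: the helper names save_pickle and load_pickle are hypothetical and not part of the pickle package, and a plain dictionary stands in for a trained model.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Pickle obj to path; the with block closes the file even on errors."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object from path (use only with trusted files)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object; a fitted model works the same way.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pickle"
    example = {"weights": [0.1, 0.2, 0.3], "name": "demo"}
    save_pickle(example, path)
    restored = load_pickle(path)

print(restored == example)
```

With such helpers, the calling code looks the same for both packages: one call with an object and a path to save, one call with a path to load.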

Compare the performance of pickle vs. joblib

First, let's compare the time needed to save and load the scikit-learn model. I'm using the %timeit magic command from Jupyter Notebook, which runs the code several times and reports the mean execution time. In both cases, I'm saving and loading the same model.


filename = 'random_forest.pickle'

# save with pickle
%timeit -n 100 pickle.dump(rf, open(filename, 'wb'))

>> 2.73 ms ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# load with pickle
%timeit -n 100 loaded_model = pickle.load(open(filename, 'rb'))

>> 2.31 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


filename = "random_forest.joblib"

# save with joblib
%timeit -n 100 joblib.dump(rf, filename)

>> 35.7 ms ± 317 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# load with joblib
%timeit -n 100 loaded_model = joblib.load(filename)

>> 26.7 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here is a screenshot from my notebook:

Measure time needed to save and load model

The pickle package is over 10 times faster than joblib for saving and loading models. It is a huge difference, especially if you are building an online service with Machine Learning models used for inference - the speed of response is crucial, and even milliseconds can make a difference.
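The %timeit magic works only in Jupyter or IPython. A rough equivalent for a plain Python script, using the timeit module from the standard library, might look like the sketch below. The helper mean_ms, the number of repetitions, and the temporary directory are my assumptions, and the measured times will differ between machines.

```python
import pickle
import tempfile
import timeit
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(random_state=0).fit(X, y)


def pickle_save(path):
    with open(path, "wb") as f:
        pickle.dump(rf, f)


def pickle_load(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def mean_ms(func, number=20):
    """Mean wall-clock time of a single call, in milliseconds."""
    return timeit.timeit(func, number=number) / number * 1000


with tempfile.TemporaryDirectory() as tmp:
    pkl = str(Path(tmp) / "random_forest.pickle")
    jbl = str(Path(tmp) / "random_forest.joblib")
    # Each save is timed before its load, so the file exists when loading.
    timings = {
        "pickle save": mean_ms(lambda: pickle_save(pkl)),
        "pickle load": mean_ms(lambda: pickle_load(pkl)),
        "joblib save": mean_ms(lambda: joblib.dump(rf, jbl)),
        "joblib load": mean_ms(lambda: joblib.load(jbl)),
    }

for name, ms in timings.items():
    print(f"{name}: {ms:.2f} ms")
```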

Let's compare the file size. The file created with pickle is 165.3 KB. The file created with joblib is 170.4 KB. Both files store the same scikit-learn model. So with default settings, pickle is faster (for both saving and loading) and produces a slightly smaller file.
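One way to check the file sizes yourself is os.path.getsize. This is a minimal sketch that trains its own Random Forest, so the exact numbers will not match the ones quoted above; the temporary directory and the sizes_kb dictionary are illustrative choices.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "random_forest.pickle")
    jbl_path = os.path.join(tmp, "random_forest.joblib")

    with open(pkl_path, "wb") as f:
        pickle.dump(rf, f)
    joblib.dump(rf, jbl_path)

    # File sizes in kilobytes (1 KB = 1024 bytes here)
    sizes_kb = {
        "pickle": os.path.getsize(pkl_path) / 1024,
        "joblib": os.path.getsize(jbl_path) / 1024,
    }

for name, size in sizes_kb.items():
    print(f"{name}: {size:.1f} KB")
```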

However, the dump() function in joblib has a compress argument that controls the level of file compression. It accepts an integer, a boolean, or a tuple (please check the docs for more details). Here, we will use an integer value from 0 to 9, where a higher number means more compression. Let's check the final file size for different levels of compression.

How scikit-learn model file size depends on compress parameter

For the highest compression level, the file size is 19.6 KB - over 8.6 times smaller than with no compression.
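The file sizes for each compression level can be collected with a short loop like the one below. It is a sketch under my own assumptions (a freshly trained model, a temporary directory, sizes reported in KB), so the values it prints will differ from the numbers in this article.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(random_state=0).fit(X, y)

sizes_kb = {}
with tempfile.TemporaryDirectory() as tmp:
    for level in range(10):
        path = os.path.join(tmp, f"random_forest_{level}.joblib")
        # compress=0 means no compression, 9 is the strongest level
        joblib.dump(rf, path, compress=level)
        sizes_kb[level] = os.path.getsize(path) / 1024

for level, size in sizes_kb.items():
    print(f"compress={level}: {size:.1f} KB")
```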

Let's check how saving time depends on compression level:

How scikit-learn model file save time depends on compress parameter

As expected, the larger the compression level, the more time is needed to save the model. Surprisingly, the load time is almost constant:

How scikit-learn model file load time depends on compress parameter
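Save and load times per compression level can be measured in a similar way. The sketch below uses time.perf_counter with a small number of repetitions; the REPEATS value and the mean_seconds helper are hypothetical choices of mine, not the setup used for the plots above.

```python
import os
import tempfile
import time

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(random_state=0).fit(X, y)

REPEATS = 5  # assumed value; raise it for steadier measurements


def mean_seconds(func, repeats=REPEATS):
    """Average wall-clock time of func() over several calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


save_ms, load_ms = {}, {}
with tempfile.TemporaryDirectory() as tmp:
    for level in range(10):
        path = os.path.join(tmp, f"random_forest_{level}.joblib")
        # Time saving first, so the file exists when loading is timed.
        save_ms[level] = 1000 * mean_seconds(
            lambda: joblib.dump(rf, path, compress=level)
        )
        load_ms[level] = 1000 * mean_seconds(lambda: joblib.load(path))

for level in range(10):
    print(
        f"compress={level}: "
        f"save {save_ms[level]:.1f} ms, load {load_ms[level]:.1f} ms"
    )
```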

In joblib documentation, there is a note that compress=3 is often a good compromise between compression size and writing speed.

Security

There is a note in the pickle documentation that it can be insecure. It shouldn't be used to load files from untrusted sources, because loading a file can execute malicious code. The joblib documentation carries the same warning: joblib.load() relies on pickle, so it is not a safe alternative. For example, if you are building an online Machine Learning service that accepts uploaded models, you should not load them with either package. If you are building a Machine Learning system that will use only scikit-learn models produced by your own system, then pickle will be a good choice.
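To see why loading an untrusted file is risky, here is a small and deliberately harmless sketch. The Payload class is hypothetical; it uses the __reduce__ hook to tell pickle which callable to run while loading. Here that callable only evaluates a harmless arithmetic expression, but an attacker could choose something destructive instead.

```python
import pickle


class Payload:
    """Stand-in for a malicious object: __reduce__ picks what runs on load."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") while loading.
        # A real attack could put any Python expression here.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading does not return a Payload instance; it returns whatever the
# callable chosen in __reduce__ produced.
result = pickle.loads(data)
print(result)  # prints 42
```

The person loading the file has no control over what runs, which is why only files from trusted sources should be loaded.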

Summary

There are two packages, pickle and joblib, that can be used to save and load scikit-learn models. They have very similar APIs. The pickle package is faster at saving and loading models. The joblib package can produce smaller files thanks to compression. Additionally, scikit-learn models can be saved to PMML or ONNX formats, but additional packages are needed: sklearn2pmml and sklearn-onnx.
