How to reduce memory used by Random Forest from Scikit-Learn in Python?

The Random Forest algorithm from the scikit-learn package can sometimes consume too much memory.

The Random Forest Classifier and Random Forest Regressor have default hyper-parameters:

max_depth=None,
min_samples_split=2,
min_samples_leaf=1,

which means that full trees are built. Building full trees is by design (see Leo Breiman's Random Forests article from 2001). The Random Forest grows full trees to fit the data well. If there were only one tree in the Random Forest, the model would overfit the data. However, a Random Forest consists of a set of trees (for example, 100 trees). To counter overfitting (and increase stability), bagging and random subspace sampling are used: bagging trains each tree on a random sample of the rows, and random subspace sampling restricts each node's split search to a random subset of the columns.
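
The two sampling ideas can be illustrated with plain NumPy. This is only a sketch of the concept, not scikit-learn's internal code; the sizes (1000 rows, 14 features) and the square-root rule for the number of candidate columns are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 1000, 14  # illustrative sizes, not the real dataset

# Bagging: draw n_rows row indices with replacement (a bootstrap sample).
row_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(row_idx).size / n_rows

# Random subspace sampling: a split search looks only at a random subset
# of the columns. sqrt(n_features) is a common choice for classification.
n_candidates = max(1, int(np.sqrt(n_features)))
col_idx = rng.choice(n_features, size=n_candidates, replace=False)

print(f"Distinct rows in the bootstrap sample: {unique_fraction:.0%}")
print(f"Columns considered at this split: {sorted(col_idx.tolist())}")
```

Because rows are drawn with replacement, each tree sees only roughly two thirds of the distinct rows, which is what makes the trees differ from each other.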

For large or complex datasets, a full tree can be really deep and have thousands of nodes. Such a single decision tree uses a lot of memory, so the memory consumption of the Random Forest grows very fast. In this post I will show how to reduce the memory consumption of the Random Forest. In the example I will use the Adult Income dataset.
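
The number of nodes is what drives the size of a tree, and it can be read from a fitted model. Below is a minimal sketch on synthetic data (generated with make_classification, an assumption made so the snippet runs without downloading anything); the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, slightly noisy data so that full trees grow large.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, flip_y=0.1, random_state=0
)
demo_rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Each fitted tree exposes its node count, leaf count and depth.
node_counts = [est.tree_.node_count for est in demo_rf.estimators_]
leaf_counts = [est.get_n_leaves() for est in demo_rf.estimators_]
tree_depths = [est.get_depth() for est in demo_rf.estimators_]

print(f"Nodes per tree: min={min(node_counts)}, max={max(node_counts)}")
print(f"Depth per tree: min={min(tree_depths)}, max={max(tree_depths)}")
print(f"Total nodes in the forest: {sum(node_counts)}")
```

Since every internal node has exactly two children, a tree with L leaves has 2L - 1 nodes, so the leaf count is an equally good proxy for size.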

Let's load the packages and the data:

import os
import joblib
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from matplotlib import pyplot as plt

df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv", 
                 skipinitialspace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

The dataset has 32,561 rows and 15 columns (including the target column). We see that the data use about 3.8 MB of memory (the `3.7+ MB` reported above is a lower bound; a similar amount is also needed to store the data on disk).
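
The `+` in the pandas report appears because the contents of object columns are not counted by default. A small sketch of how to ask for the fuller figure, using a tiny hand-made frame (the `demo` frame is an assumption for illustration, not the Adult dataset):

```python
import pandas as pd

# Tiny stand-in frame with one numeric and one object (string) column.
demo = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": pd.Series(
        ["State-gov", "Self-emp-not-inc", "Private", "Private"], dtype="object"
    ),
})

# Default: object columns are counted as pointers only.
shallow_bytes = int(demo.memory_usage().sum())
# deep=True also counts the Python string objects behind the pointers.
deep_bytes = int(demo.memory_usage(deep=True).sum())

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```

On the real data, `df.info(memory_usage="deep")` prints the same deep figure in the summary.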

df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

The data needs preprocessing. We will fill the missing values with the most frequent values and convert categoricals into integers.

df = df.fillna(df.mode().iloc[0])
for col in df.columns:
    if df[col].dtype == "object":
        encode = LabelEncoder()
        df[col] = encode.fit_transform(df[col])
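
As an aside, scikit-learn documents LabelEncoder as a tool for target values, while OrdinalEncoder is the counterpart meant for feature columns. A hedged sketch of the same preprocessing with OrdinalEncoder on a toy frame (the `toy` frame and its values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the feature columns, with one missing value.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", None, "Private", "Private"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
})

# Fill missing values with the most frequent value of each column.
toy = toy.fillna(toy.mode().iloc[0])

# Encode every non-numeric column as integers, one column at a time.
cat_cols = list(toy.select_dtypes(exclude="number").columns)
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[cat_cols]).astype(int)
for i, col in enumerate(cat_cols):
    toy[col] = encoded[:, i]

print(toy)
```

For tree-based models both encoders lead to the same kind of integer codes, so this is a matter of API fit rather than model quality.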

The first 14 columns will be used as input to the model. The last column income will be the target column.

X = df[df.columns[:-1]]
y = df["income"]

Let's use 25% of the data for testing and the rest for training.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=56)

I create the Random Forest Classifier with default parameters. This means that full trees will be built. 100 trees will be created (the default value of n_estimators).

rf = RandomForestClassifier()

Let's train the model:

rf.fit(X_train, y_train)

Check the depth of the first tree in the Random Forest:

print(rf.estimators_[0].tree_.max_depth)
>>> 41

Let's check the depth of all the trees in the Forest:

depths = [tree.tree_.max_depth for tree in rf.estimators_]
print(f"Mean tree depth in the Random Forest: {np.round(np.mean(depths))}")

>>> Mean tree depth in the Random Forest: 42.0

Check the size of a single tree on disk after saving it with joblib:

joblib.dump(rf.estimators_[0], "first_tree_from_RF.joblib") 
print(f"Single tree size: {np.round(os.path.getsize('first_tree_from_RF.joblib') / 1024 / 1024, 2) } MB")

>>> Single tree size: 0.52 MB

joblib.dump(rf, "RandomForest_100_trees.joblib") 
print(f"Random Forest size: {np.round(os.path.getsize('RandomForest_100_trees.joblib') / 1024 / 1024, 2) } MB")

>>> Random Forest size: 49.67 MB

Our dataset size was 3.8 MB, so the resulting Random Forest is about 13 times larger than the dataset! The dataset was pretty small; you can easily imagine how the Random Forest size will explode for larger files (the complexity of the dataset matters a lot because it determines the depth of the full trees).
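
The save-and-measure pattern is repeated several times in this post, so it can be wrapped in a small helper. The function name `saved_size_mb` is hypothetical (it is not part of joblib or scikit-learn), and the sketch writes to a temporary directory so no files are left behind.

```python
import os
import tempfile

import joblib
import numpy as np


def saved_size_mb(obj, compress=0):
    """Dump obj with joblib into a temporary directory and return the file size in MB."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.joblib")
        joblib.dump(obj, path, compress=compress)
        return os.path.getsize(path) / 1024 / 1024


# Demonstration on a plain NumPy array: one million float64 zeros.
payload = np.zeros(1_000_000)
print(f"Uncompressed: {saved_size_mb(payload):.2f} MB")
print(f"Compressed:   {saved_size_mb(payload, compress=3):.2f} MB")
```

The same helper accepts any picklable object, including a fitted forest or a single tree from `estimators_`.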

Before changing anything in the Random Forest let's check its performance.

y_predicted = rf.predict_proba(X_test)
rf_loss = log_loss(y_test, y_predicted)
print(rf_loss)

>>> 0.34350442620035054

Reduce memory usage of the Scikit-Learn Random Forest

The memory usage of the Random Forest depends on the size of a single tree and the number of trees. The most straightforward way to reduce memory consumption is to reduce the number of trees. For example, 10 trees will use about 10 times less memory than 100 trees. However, more trees in the Random Forest generally means better performance, so I will look for other hyper-parameters to control the Random Forest size.
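
The trade-off between the number of trees and the model size can be checked with a short sweep. The sketch below uses synthetic data and measures the length of the pickled model in memory instead of a file on disk; both choices are assumptions made to keep the example self-contained.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Map n_estimators -> (pickled size in MB, log loss on the held-out part).
results = {}
for n_trees in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    size_mb = len(pickle.dumps(model)) / 1024 / 1024
    loss = log_loss(y_te, model.predict_proba(X_te))
    results[n_trees] = (size_mb, loss)
    print(f"n_estimators={n_trees:>3}: size={size_mb:.2f} MB, logloss={loss:.4f}")
```

The size grows roughly linearly with the number of trees, while the log loss typically improves quickly at first and then levels off.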

The simplest way to reduce the memory consumption is to limit the depth of the trees. Shallow trees use less memory. Let's train a shallow Random Forest with max_depth=6 (keeping the number of trees at the default of 100):

shallow_rf = RandomForestClassifier(max_depth=6)
shallow_rf.fit(X_train, y_train)

Let's save a single shallow Decision Tree to disk:

joblib.dump(shallow_rf.estimators_[0], "first_tree_from_shallow_RF.joblib") 
print(f"Single tree size from shallow RF: {np.round(os.path.getsize('first_tree_from_shallow_RF.joblib') / 1024 / 1024, 2) } MB")

>>> Single tree size from shallow RF: 0.01 MB

As you can see, the full single tree size was 0.52 MB, while the shallow tree size is 0.01 MB. Let's save the whole forest:

joblib.dump(shallow_rf, "Shallow_RandomForest_100_trees.joblib") 
print(f"Shallow Random Forest size: {np.round(os.path.getsize('Shallow_RandomForest_100_trees.joblib') / 1024 / 1024, 2) } MB")

>>> Shallow Random Forest size: 0.75 MB

The Random Forest with full trees has a size of 49.67 MB and the shallow Random Forest has a size of 0.75 MB, so it is about 66 times smaller!

49.67 / 0.75

>>> 66.22666666666667

Let's check the performance of the shallow Random Forest:

y_predicted = shallow_rf.predict_proba(X_test)
shallow_rf_loss = log_loss(y_test, y_predicted)
print(shallow_rf_loss)

>>> 0.33017571925200956

The performance is better! The shallow Random Forest has about 4% better logloss (the lower the value, the better). So we reduced the size of the Random Forest by a factor of 66 and increased the performance! :-)
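
The value max_depth=6 is just one choice; a reasonable depth can be picked by comparing a few candidates on a validation split. A minimal sketch on synthetic data follows (the data, the candidate depths and the forest size of 20 trees are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, flip_y=0.1, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Map max_depth -> (total number of nodes in the forest, validation log loss).
results = {}
for depth in (2, 6, None):  # None means full trees
    model = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    total_nodes = sum(est.tree_.node_count for est in model.estimators_)
    loss = log_loss(y_val, model.predict_proba(X_val))
    results[depth] = (total_nodes, loss)
    print(f"max_depth={depth}: nodes={total_nodes}, logloss={loss:.4f}")
```

Whether a shallow forest beats the full one depends on the data, so it is worth running such a comparison rather than assuming the result seen on Adult Income carries over.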

Smaller trees can also be obtained by tuning min_samples_split or min_samples_leaf (or even other hyper-parameters, such as min_weight_fraction_leaf, max_features, or max_leaf_nodes). However, I prefer to tune max_depth because it is more intuitive.
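
To see how two of these alternatives limit tree size, the total node count can be compared across settings. This is a sketch on synthetic data; the specific values (min_samples_leaf=20, max_leaf_nodes=32) are arbitrary examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, flip_y=0.1, random_state=0
)

configs = {
    "default": {},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=32": {"max_leaf_nodes": 32},
}

# Map configuration name -> total number of nodes over all trees.
total_nodes = {}
for name, params in configs.items():
    model = RandomForestClassifier(n_estimators=10, random_state=0, **params)
    model.fit(X_demo, y_demo)
    total_nodes[name] = sum(est.tree_.node_count for est in model.estimators_)
    print(f"{name:<20} total nodes: {total_nodes[name]}")
```

max_leaf_nodes gives a hard upper bound on tree size (at most 2 * 32 - 1 = 63 nodes per tree here), which makes the final model size easy to predict.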

Extra tip for saving the Scikit-Learn Random Forest in Python

When saving the scikit-learn Random Forest with joblib, you can use the compress parameter to save disk space. The joblib docs state that compress=3 is a good compromise between size and speed. Example below:

joblib.dump(rf, "RF_uncompressed.joblib", compress=0) 
print(f"Uncompressed Random Forest: {np.round(os.path.getsize('RF_uncompressed.joblib') / 1024 / 1024, 2) } MB")

>>> Uncompressed Random Forest: 49.67 MB

joblib.dump(rf, "RF_compressed.joblib", compress=3)  # compression is ON!
print(f"Compressed Random Forest: {np.round(os.path.getsize('RF_compressed.joblib') / 1024 / 1024, 2) } MB")

>>> Compressed Random Forest: 8.3 MB

np.round(49.67 / 8.3, 2)

>>> 5.98

The compressed Random Forest is 6 times smaller!
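
Compression is lossless, and joblib.load detects it automatically, so a compressed model can be restored without any extra arguments. A small round-trip sketch on synthetic data (the data and file name are assumptions for illustration):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=14, n_informative=6, random_state=0
)
demo_rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_compressed.joblib")
    joblib.dump(demo_rf, path, compress=3)
    restored = joblib.load(path)  # no compress argument needed when loading

# The restored forest should give exactly the same probabilities.
same_predictions = np.array_equal(
    demo_rf.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print(f"Predictions identical after round trip: {same_predictions}")
```

Note that compression only helps on disk: once loaded, the forest occupies the same amount of RAM as before, so limiting tree size is still the way to reduce memory at prediction time.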

The same observation about memory consumption should be valid for the Extra Trees Classifier and Extra Trees Regressor.
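
A quick way to check this is to compare node counts for a full and a depth-limited Extra Trees model. The sketch below uses synthetic data and a small forest of 10 trees, both assumptions for illustration; `count_nodes` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, flip_y=0.1, random_state=0
)


def count_nodes(forest):
    """Return the total number of nodes over all trees of a fitted forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)


full_et = ExtraTreesClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
shallow_et = ExtraTreesClassifier(n_estimators=10, max_depth=6, random_state=0).fit(X_demo, y_demo)

full_nodes = count_nodes(full_et)
shallow_nodes = count_nodes(shallow_et)
print(f"Full Extra Trees nodes:    {full_nodes}")
print(f"Shallow Extra Trees nodes: {shallow_nodes}")
```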