How to reduce memory used by Random Forest from Scikit-Learn in Python?
The Random Forest algorithm from the scikit-learn package can sometimes consume too much memory:
- RandomForest - Reasons for memory usage / consumption?
- understanding scikit learn Random Forest memory requirement for prediction
- Scikit Learn RandomForest Memory Error
- Random Forest: Running out of memory
- Why is scikit-learn's random forest using so much memory?
- Memory allocation error in sklearn random forest classification python
The Random Forest Classifier and Random Forest Regressor have the following default hyper-parameters: max_depth=None, min_samples_split=2, min_samples_leaf=1, which means that full trees are built. Building full trees is by design (see Leo Breiman's Random Forests article from 2001). Each tree is grown fully so that it fits the data well. A Random Forest with only one such tree would overfit the data. However, a Random Forest builds a set of trees (for example, 100 trees). To overcome overfitting (and increase stability), bagging and random subspace sampling are used. (Bagging: each tree is trained on a random sample of rows drawn with replacement; random subspace sampling: a random subset of columns is considered in each node split search.)
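To make the two sampling ideas concrete, here is a minimal NumPy sketch of what one tree "sees". It is an illustration only, not scikit-learn's internal code; the row count, the seed, and the square-root rule for the number of candidate columns are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 14  # illustrative sizes, not the real dataset

# Bagging: draw n_rows row indices with replacement (a bootstrap sample).
row_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(row_idx).size / n_rows

# Random subspace: consider only a random subset of columns for one split.
# sqrt(n_cols) is a common choice for classification (assumed here).
n_candidates = max(1, int(np.sqrt(n_cols)))
col_idx = rng.choice(n_cols, size=n_candidates, replace=False)

print(f"Fraction of unique rows in the bootstrap sample: {unique_fraction:.2f}")
print(f"Candidate columns for one split: {sorted(col_idx.tolist())}")
```

Because rows are drawn with replacement, a bootstrap sample contains roughly 63% of the distinct rows, so every tree is trained on slightly different data.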
In the case of large or complex datasets, a full tree can be really deep and have thousands of nodes. A single decision tree like that uses a lot of memory, so the memory consumption of the Random Forest grows very fast. In this post I will show how to reduce the memory consumption of the Random Forest. In the example I will use the Adult Income dataset.
Let's load packages and the data
import os
import joblib
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # sklearn.ensemble.forest is a private path removed in newer releases
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from matplotlib import pyplot as plt
df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
                 skipinitialspace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 30725 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education-num 32561 non-null int64
5 marital-status 32561 non-null object
6 occupation 30718 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital-gain 32561 non-null int64
11 capital-loss 32561 non-null int64
12 hours-per-week 32561 non-null int64
13 native-country 31978 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
The dataset has 32,561 rows and 15 columns (including the target column). The data uses about 3.8 MB of memory (a similar amount is needed to store it on disk).
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
The data needs preprocessing. We will fill the missing values with the most frequent values and convert categoricals into integers.
df = df.fillna(df.mode().iloc[0])
for col in df.columns:
    if df[col].dtype == "object":
        encode = LabelEncoder()
        df[col] = encode.fit_transform(df[col])
The first 14 columns will be used as input to the model. The last column, income, will be the target column.
X = df[df.columns[:-1]]
y = df["income"]
Let's use 25% of the data for testing and the rest for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=56)
I create the Random Forest Classifier with default parameters. This means that full trees will be built. The forest will contain 100 trees (the default of n_estimators).
rf = RandomForestClassifier()
Let's train the model:
rf.fit(X_train, y_train)
Check the depth of the first tree in the Random Forest
print(rf.estimators_[0].tree_.max_depth)
>>> 41
Let's check the depth of all the trees in the Forest:
depths = [tree.tree_.max_depth for tree in rf.estimators_]
print(f"Mean tree depth in the Random Forest: {np.round(np.mean(depths))}")
>>> Mean tree depth in the Random Forest: 42.0
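Depth is only part of the picture; the number of nodes is what actually takes memory. Below is a small, self-contained sketch that counts nodes per tree. It uses synthetic data from make_classification as a stand-in (an assumption, so that the snippet runs on its own), and the names X_demo, y_demo and demo_rf are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the Adult Income dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, random_state=0
)

demo_rf = RandomForestClassifier(n_estimators=20, random_state=0)
demo_rf.fit(X_demo, y_demo)

# Per-tree statistics read from the fitted estimators.
depths = [est.tree_.max_depth for est in demo_rf.estimators_]
node_counts = [est.tree_.node_count for est in demo_rf.estimators_]
leaf_counts = [est.get_n_leaves() for est in demo_rf.estimators_]

print(f"Mean depth: {np.mean(depths):.1f}")
print(f"Mean nodes per tree: {np.mean(node_counts):.0f}")
print(f"Total nodes in the forest: {sum(node_counts)}")
```

Every split creates exactly two children, so a tree with n leaves has 2n - 1 nodes. The same lines work on the rf model trained above if you replace demo_rf with rf.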
Check the size of a single tree on disk after saving with joblib:
joblib.dump(rf.estimators_[0], "first_tree_from_RF.joblib")
print(f"Single tree size: {np.round(os.path.getsize('first_tree_from_RF.joblib') / 1024 / 1024, 2) } MB")
>>> Single tree size: 0.52 MB
joblib.dump(rf, "RandomForest_100_trees.joblib")
print(f"Random Forest size: {np.round(os.path.getsize('RandomForest_100_trees.joblib') / 1024 / 1024, 2) } MB")
>>> Random Forest size: 49.67 MB
Our dataset size was 3.8 MB, so the resulting Random Forest is about 13 times larger than the dataset! The dataset was pretty small; you can easily imagine how the Random Forest size will explode for larger files (the complexity of the dataset matters a lot because it determines the depth of the full tree).
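If you do not want to write files just to measure the model, the serialized size can also be checked in memory. The sketch below uses pickle for that and divides by the total node count to get a rough cost per node. The data is synthetic (an assumption) and the numbers will differ from the ones reported in this post.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the Adult Income dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, random_state=0
)
demo_rf = RandomForestClassifier(n_estimators=20, random_state=0)
demo_rf.fit(X_demo, y_demo)

# Serialized size of the whole forest, measured without touching the disk.
model_bytes = len(pickle.dumps(demo_rf))
# Size of the raw feature matrix (float64 values).
data_bytes = X_demo.nbytes
# Total number of nodes across all trees.
total_nodes = sum(est.tree_.node_count for est in demo_rf.estimators_)

print(f"Forest size: {model_bytes / 1024 / 1024:.2f} MB")
print(f"Data size: {data_bytes / 1024 / 1024:.2f} MB")
print(f"Approximate bytes per node: {model_bytes / total_nodes:.0f}")
```

The bytes-per-node figure is only an estimate (it includes some fixed overhead per tree), but it gives a feeling for how the size scales with the number of nodes.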
Before changing anything in the Random Forest let's check its performance.
y_predicted = rf.predict_proba(X_test)
rf_loss = log_loss(y_test, y_predicted)
print(rf_loss)
>>> 0.34350442620035054
Reduce memory usage of the Scikit-Learn Random Forest
The memory usage of the Random Forest depends on the size of a single tree and the number of trees. The most straightforward way to reduce memory consumption is to reduce the number of trees. For example, 10 trees will use about 10 times less memory than 100 trees. However, more trees in the Random Forest usually give better performance, so I will look at other hyper-parameters to control the Random Forest size.
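The roughly linear relation between the number of trees and the model size is easy to check. Here is a minimal sketch on synthetic data (an assumption; the sizes are illustrative and measured with pickle in memory rather than with joblib on disk):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the Adult Income dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=14, n_informative=6, random_state=0
)

sizes_mb = {}
for n_trees in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_demo, y_demo)
    # Serialized size of the fitted forest in MB.
    sizes_mb[n_trees] = len(pickle.dumps(model)) / 1024 / 1024
    print(f"n_estimators={n_trees}: {sizes_mb[n_trees]:.2f} MB")

print(f"Size ratio 100 vs 10 trees: {sizes_mb[100] / sizes_mb[10]:.1f}")
```

The ratio should come out close to 10, which confirms that each extra tree costs about the same amount of memory.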
Another simple way to reduce the memory consumption is to limit the depth of the trees. Shallow trees use less memory. Let's train a shallow Random Forest with max_depth=6 (keeping the number of trees at the default 100):
shallow_rf = RandomForestClassifier(max_depth=6)
shallow_rf.fit(X_train, y_train)
Let's save the first tree of the shallow forest to disk:
joblib.dump(shallow_rf.estimators_[0], "first_tree_from_shallow_RF.joblib")
print(f"Single tree size from shallow RF: {np.round(os.path.getsize('first_tree_from_shallow_RF.joblib') / 1024 / 1024, 2) } MB")
>>> Single tree size from shallow RF: 0.01 MB
As you can see, the full single tree was 0.52 MB while the shallow tree is 0.01 MB. Let's save the whole forest:
joblib.dump(shallow_rf, "Shallow_RandomForest_100_trees.joblib")
print(f"Shallow Random Forest size: {np.round(os.path.getsize('Shallow_RandomForest_100_trees.joblib') / 1024 / 1024, 2) } MB")
>>> Shallow Random Forest size: 0.75 MB
The Random Forest with full trees has a size of 49.67 MB and the shallow Random Forest is 0.75 MB, so it is about 66 times smaller!
49.67 / 0.75
>>> 66.22666666666667
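The large gap is not surprising once you look at how many nodes a tree can have. A binary tree of depth d has at most 2^(d+1) - 1 nodes, and it cannot have more leaves than training rows. The helper below is a hypothetical function written for this illustration (it is not part of scikit-learn), and 24,420 is the approximate number of training rows after the 75/25 split used in this post.

```python
def max_nodes(depth: int, n_samples: int) -> int:
    """Upper bound on the number of nodes in a binary decision tree.

    Two limits apply: a tree of the given depth has at most
    2**(depth + 1) - 1 nodes, and a tree with at most n_samples
    leaves has at most 2 * n_samples - 1 nodes.
    """
    return min(2 ** (depth + 1) - 1, 2 * n_samples - 1)


n_train = 24_420  # roughly 75% of 32,561 rows

for depth in (6, 10, 41):
    print(f"depth={depth}: at most {max_nodes(depth, n_train):,} nodes")
```

With max_depth=6 a tree can never have more than 127 nodes, while a full tree is limited only by the number of training rows. Real trees are usually well below these bounds, but the bounds show why limiting the depth shrinks the model so much.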
Let's check the performance of the shallow forest:
y_predicted = shallow_rf.predict_proba(X_test)
shallow_rf_loss = log_loss(y_test, y_predicted)
print(shallow_rf_loss)
>>> 0.33017571925200956
The performance is better! The shallow Random Forest has about 4% better logloss (lower is better). So we reduced the size of the Random Forest by 66 times and increased the performance! :-)
Shallow trees can also be obtained by tuning min_samples_split or min_samples_leaf (or even other hyper-parameters, like min_weight_fraction_leaf, max_features, max_leaf_nodes). However, I prefer to tune max_depth because it is more intuitive.
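To see how these hyper-parameters affect the tree size, here is a small comparison of total node counts. It is a sketch on synthetic data (an assumption), and the specific values 6, 20 and 32 are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the Adult Income dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, random_state=0
)

# Illustrative settings; the values are arbitrary.
configs = {
    "default": {},
    "max_depth=6": {"max_depth": 6},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
    "max_leaf_nodes=32": {"max_leaf_nodes": 32},
}

n_trees = 20
total_nodes = {}
for name, params in configs.items():
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0, **params)
    model.fit(X_demo, y_demo)
    # Sum the node counts over all trees in the forest.
    total_nodes[name] = sum(est.tree_.node_count for est in model.estimators_)
    print(f"{name}: {total_nodes[name]} nodes in {n_trees} trees")
```

Each of the three constraints puts a hard cap on the tree size: max_depth=6 allows at most 127 nodes per tree, max_leaf_nodes=32 at most 63, and min_samples_leaf=20 limits the number of leaves to the number of rows divided by 20.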
Extra tip for saving the Scikit-Learn Random Forest in Python
When saving the scikit-learn Random Forest with joblib, you can use the compress parameter to save disk space. The joblib docs say that compress=3 is a good compromise between size and speed. Example below:
joblib.dump(rf, "RF_uncompressed.joblib", compress=0)
print(f"Uncompressed Random Forest: {np.round(os.path.getsize('RF_uncompressed.joblib') / 1024 / 1024, 2) } MB")
>>> Uncompressed Random Forest: 49.67 MB
joblib.dump(rf, "RF_compressed.joblib", compress=3) # compression is ON!
print(f"Compressed Random Forest: {np.round(os.path.getsize('RF_compressed.joblib') / 1024 / 1024, 2) } MB")
>>> Compressed Random Forest: 8.3 MB
np.round(49.67 / 8.3, 2)
>>> 5.98
The compressed Random Forest is 6 times smaller!
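A compressed file loads back with a plain joblib.load call, and the loaded model should give the same predictions. The sketch below checks this on a small synthetic example (the data, the temporary directory and the file name are assumptions made for the illustration).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the Adult Income dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=14, n_informative=6, random_state=0
)
demo_rf = RandomForestClassifier(n_estimators=20, random_state=0)
demo_rf.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_rf_compressed.joblib")
    joblib.dump(demo_rf, path, compress=3)  # write a compressed file
    compressed_mb = os.path.getsize(path) / 1024 / 1024
    loaded_rf = joblib.load(path)  # decompression happens automatically

# The loaded forest should reproduce the original predictions.
same_predictions = np.array_equal(
    demo_rf.predict_proba(X_demo), loaded_rf.predict_proba(X_demo)
)
print(f"Compressed file size: {compressed_mb:.2f} MB")
print(f"Predictions identical after reload: {same_predictions}")
```

Keep in mind that compression only helps with disk space. After loading, the model takes the same amount of RAM as the uncompressed one, so it does not replace limiting the tree size.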
The same observation about memory consumption should be valid for the Extra Trees Classifier and Extra Trees Regressor.
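A quick way to check this for Extra Trees is to compare node counts with and without a depth limit. The sketch below uses synthetic data (an assumption), so the exact numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (an assumption, not the Adult Income dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=14, n_informative=6, random_state=0
)

n_trees = 20
full_et = ExtraTreesClassifier(n_estimators=n_trees, random_state=0)
full_et.fit(X_demo, y_demo)
shallow_et = ExtraTreesClassifier(n_estimators=n_trees, max_depth=6, random_state=0)
shallow_et.fit(X_demo, y_demo)

# Total number of nodes across all trees in each model.
full_nodes = sum(est.tree_.node_count for est in full_et.estimators_)
shallow_nodes = sum(est.tree_.node_count for est in shallow_et.estimators_)

print(f"Full Extra Trees: {full_nodes} nodes")
print(f"Shallow Extra Trees (max_depth=6): {shallow_nodes} nodes")
```

As with the Random Forest, limiting max_depth caps each tree at 127 nodes, so the shallow model is much smaller.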