Random Forest vs AutoML (with python code)

Random Forest versus AutoML you say. Hmmm..., it's obvious that the performance of AutoML will be better. You will check many models and then ensemble them. This is true, but I would like to show you other advantages of AutoML, that will help you deal with dirty, real-life data.

  1. I will show you how to prepare the data and train Random Forest model on Adult dataset with python and scikit-learn.
  2. On the same dataset, I will show you how to train Random Forest with AutoML mljar-supervised, which is an open source package developed by me :) You will see how AutoML can make your life easier when dealing with real-life, dirty data.

The link to the code in Jupyter Notebook is here.

Random Forest in python and sklearn

Let's load python packages that we will use.

# packages for data manipulation
import numpy as np
import pandas as pd
# packages to split data and compute logloss
import sklearn.model_selection
from sklearn.metrics import log_loss
# packages to train models
from sklearn.ensemble import RandomForestClassifier
from supervised.automl import AutoML

As you see, loading AutoML package is as simple as importing sklearn package. Let's load data, and split it into train and test subsets.

# load dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)
x_cols = [c for c in df.columns if c != 'income']
# set input matrix and target column
X = df[x_cols]
y = df['income']
# split data into training and testing
seed = 1234
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.3, random_state=seed)

As you see I use 30% of data for testing. Data is loaded from my github repository with a list of datasets which are good for starting the machine learning journey. Let's take a preview of the data.

UCI Adult dataset preview

OK, data is loaded so, let's train machine learning model!

rf = RandomForestClassifier(n_estimators = 1000)
rf = rf.fit(X_train, y_train)

As you see, training Random Forest with sklearn is simple (and as you will see further, training is simple, but preparing the data is not). In the first line, we initialize the model and in the second line, we train it. But wait, let's execute the code. Bum! Errors!

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
      1 rf = RandomForestClassifier(n_estimators = 1000)
----> 2 rf = rf.fit(X_train, y_train)

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
    248 
    249         # Validate or convert input data
--> 250         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    251         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    252         if sample_weight is not None:

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    525             try:
    526                 warnings.simplefilter('error', ComplexWarning)
--> 527                 array = np.asarray(array, dtype=dtype, order=order)
    528             except ComplexWarning:
    529                 raise ValueError("Complex data not supported\n"

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'Private'

We have ValueError, because there is 'Private' value in the workclass column. This column has categorical values. Hmmm ..., this is strange, the Random Forest is an ensemble of decision trees, and decision trees should work with categorical values ... Quick googling, and you got confirmation that scikit-learn Random Forest doesn't work with categorical values and that somebody is working on this in sklearn (stackoverflow link). We need to convert our categorical columns into numerical values. The common option here is one-hot encoding or converting into integers. Let's convert categorical values to integers.

from sklearn.preprocessing import LabelEncoder

for column in ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex','native-country']:
    categorical_convert = LabelEncoder()
    X_train[column] = categorical_convert.fit_transform(X_train[column])
    X_test[column] = categorical_convert.transform(X_test[column])

Execute the code, and ... error again!

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
      3 for column in ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex','native-country']:
      4     categorical_convert = LabelEncoder()
----> 5     X_train[column] = categorical_convert.fit_transform(X_train[column])
      6     X_test[column] = categorical_convert.transform(X_test[column])

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    235         y = column_or_1d(y, warn=True)
--> 236         self.classes_, y = _encode(y, encode=True)
    237         return y
    238 

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in _encode(values, uniques, encode)
    107     if values.dtype == object:
--> 108         return _encode_python(values, uniques, encode)
    109     else:
    110         return _encode_numpy(values, uniques, encode)

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in _encode_python(values, uniques, encode)
     61     # only used in _encode below, see docstring there for details
     62     if uniques is None:
---> 63         uniques = sorted(set(values))
     64         uniques = np.array(uniques, dtype=values.dtype)
     65     if encode:

TypeError: '<' not supported between instances of 'str' and 'float'

Uhhh!!! Now we got the error because there are missing values in the column. Take a look:

X_train['workclass'].head(10)
# the output
29700    Private
1529     Private
27477    Private
31950    Private
4732     Private
10858    Private
24518    Private
10035        NaN
1324     Private
26727    Private

Ok, we need to fill missing values. We will fill missing values with mode (the most common value in each column).

train_mode = X_train.mode().iloc[0]
X_train = X_train.fillna(train_mode)
X_test = X_test.fillna(train_mode)

OK, missing values filled! :) Now, let's convert categorical columns and we are ready for Random Forest training.

# let's convert categorical columns
from sklearn.preprocessing import LabelEncoder

for column in ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex','native-country']:
    categorical_convert = LabelEncoder()
    X_train[column] = categorical_convert.fit_transform(X_train[column])
    X_test[column] = categorical_convert.transform(X_test[column])

Error!!! Again?!?!?

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in _encode_python(values, uniques, encode)
     67         try:
---> 68             encoded = np.array([table[v] for v in values])
     69         except KeyError as e:

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in [HTML_REMOVED](.0)
     67         try:
---> 68             encoded = np.array([table[v] for v in values])
     69         except KeyError as e:

KeyError: 'United-States'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
      4     categorical_convert = LabelEncoder()
      5     X_train[column] = categorical_convert.fit_transform(X_train[column])
----> 6     X_test[column] = categorical_convert.transform(X_test[column])

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in transform(self, y)
    255             return np.array([])
    256 
--> 257         _, y = _encode(y, uniques=self.classes_, encode=True)
    258         return y
    259 

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in _encode(values, uniques, encode)
    107     if values.dtype == object:
--> 108         return _encode_python(values, uniques, encode)
    109     else:
    110         return _encode_numpy(values, uniques, encode)

~/sandbox/rf/automl-rf/venv/lib/python3.6/site-packages/sklearn/preprocessing/label.py in _encode_python(values, uniques, encode)
     69         except KeyError as e:
     70             raise ValueError("y contains previously unseen labels: %s"
---> 71                              % str(e))
     72         return uniques, encoded
     73     else:

ValueError: y contains previously unseen labels: 'United-States'

The error occurs because of new categorical values in test data. Yes, you can avoid this error when you will convert categoricals in train and test data together. However, there is a high chance that such a situation will appear for you on production! And your super-machine-learning script will be broken, because of dirty-data! ;) Let's fix this:

column = 'native-country'
all_values = np.unique(list(X_test[column].values))
diff = np.setdiff1d(all_values, categorical_convert.classes_)
categorical_convert.classes_ = np.concatenate((categorical_convert.classes_, diff))
X_test[column] = categorical_convert.transform(X_test[column])

To fix, this problem, I just add new integer for new categorical values. Such new categoricals check should be implemented in the production script!

Data is ready! Let's train Random Forest! :)

rf = RandomForestClassifier(n_estimators = 1000)
rf = rf.fit(X_train, y_train)

The Random Forest hyperparameters are left as default, except the number of trees which I set to 1000 (the more trees in Random Forest the better). The training part was easy, but preparing the data was hard. You know the 80/20 rule of data preprocessing. Let's revise what we need to do to prepare data:

  1. Fill missing values.
  2. Convert categoricals to integers.
  3. Add additional check for new categoricals.

Keep in mind, that if you decide to use this Random Forest model in production you will need to implement these preprocessing steps in production script as well (to make data understandable to your machine learning model).

Finally, let's compute the logloss on test data:

log_loss(y_test, rf.predict_proba(X_test)[:,1])
0.3281618346723231

The Random Forest AutoML way

We will reload the data.

df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)
x_cols = [c for c in df.columns if c != 'income']
X = df[x_cols]
y = df['income']
seed = 1234
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.3, random_state=seed)

Let's train the Random Forest with AutoML:

automl = AutoML(total_time_limit=60, algorithms=["RF"], train_ensemble=False)
automl.fit(X_train, y_train)

That's all!

I set training limit for 60 seconds, select for training only the Random Forest algorithm (RF) and set False for train_ensemble to not ensemble models at the end (I want single RF model as result).

Let's compute the logloss:

log_loss(y_test, automl.predict(X_test)['p_>50K'])
0.3148709233696557

As you see the logloss is slightly better for Random Forest trained with AutoML but it is not the most important here. We were able to get the Random Forest model quickly without wasting time on data preprocessing which is automated in the AutoML. What is more, the all details about preprocessing can be easily exported to json and saved.

automl.to_json()
{'best_model': {'uid': 'bf1dc426-ff83-4897-8151-1172b95ce2a3',
  'algorithm_short_name': 'RF',
  'framework_file': 'bf1dc426-ff83-4897-8151-1172b95ce2a3.framework',
  'framework_file_path': '/tmp/bf1dc426-ff83-4897-8151-1172b95ce2a3.framework',
  'preprocessing': [{'missing_values': [{'fill_method': 'na_fill_median',
      'fill_params': {'workclass': 'Private',
       'occupation': 'Prof-specialty',
       'native-country': 'United-States'}}],
    'categorical': [{'convert_method': 'categorical_to_int',
      'convert_params': {'workclass': {'Federal-gov': 0,
        'Local-gov': 1,
        'Never-worked': 2,
        'Private': 3,
        'Self-emp-inc': 4,
        'Self-emp-not-inc': 5,
        'State-gov': 6,
        'Without-pay': 7},
       'education': {'10th': 0,
        '11th': 1,
        '12th': 2,
        '1st-4th': 3,
        '5th-6th': 4,
        '7th-8th': 5,
        '9th': 6,
        'Assoc-acdm': 7,
        'Assoc-voc': 8,
        'Bachelors': 9,
        'Doctorate': 10,
        'HS-grad': 11,
        'Masters': 12,

... removed for readability ...

   {'library_version': '0.20.3',
    'algorithm_name': 'Random Forest',
    'algorithm_short_name': 'RF',
    'uid': '23f4469a-c545-4efd-91e6-066cd75f59f6',
    'model_file': '23f4469a-c545-4efd-91e6-066cd75f59f6.rf.model',
    'model_file_path': '/tmp/23f4469a-c545-4efd-91e6-066cd75f59f6.rf.model',
    'params': {'model_type': 'RF',
     'seed': 4,
     'criterion': 'entropy',
     'max_features': 0.3,
     'min_samples_split': 50,
     'min_samples_leaf': 5}}],
  'params': {'additional': {'trees_in_step': 10,
    'train_cant_improve_limit': 5,
    'max_steps': 500,
    'max_rows_limit': None,
    'max_cols_limit': None},
   'preprocessing': {'columns_preprocessing': {'workclass': ['na_fill_median',
      'categorical_to_int'],
     'education': ['categorical_to_int'],
     'marital-status': ['categorical_to_int'],
     'occupation': ['na_fill_median', 'categorical_to_int'],
     'relationship': ['categorical_to_int'],
     'race': ['categorical_to_int'],
     'sex': ['categorical_to_int'],
     'native-country': ['na_fill_median', 'categorical_to_int']},
    'target_preprocessing': ['na_exclude', 'categorical_to_int']},
   'validation': {'validation_type': 'kfold', 'k_folds': 5, 'shuffle': True},
   'learner': {'model_type': 'RF',
    'seed': 4,
    'criterion': 'entropy',
    'max_features': 0.3,
    'min_samples_split': 50,
    'min_samples_leaf': 5}}},
 'threshold': 0.3342849477117008}

In the JSON output, you can see selected Random Forest hyperparameters values:

'criterion': 'entropy',
'max_features': 0.3,
'min_samples_split': 50,
'min_samples_leaf': 5

The number of trees (n_estimators) in the Random Forest is determined with early stopping.

You can easily check how many models were checked for you in AutoML and what scores they had:

automl.get_leaderboard()

AutoML Leaderboard

The AutoML has checked 15 Random Forest models with different hyperparameters and select the model with the best logloss score (the lowest) as the final model.

It is important to mention, that the AutoML framework is prepared for new categorical values or missing values in new columns. So, you don't need to worry about them on production.

Random Forest variable importance

As a result of AutoML training, we have a single Random Forest from scikit-learn. So you can easily check variable importance for it (amazing, isn't it? :)).

Access the scikit model inside AutoML:

# the sklearn Random Forest inside AutoML
automl._best_model.learners[0].model
# the RF with parameters
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features=0.3, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=50,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1,
            oob_score=False, random_state=4, verbose=0, warm_start=True)

Let's do the variables importance plot.

import matplotlib.pyplot as plt
var_imp = automl._best_model.learners[0].model.feature_importances_
plt.barh(X_train.columns, var_imp)

Random Forest variable importance

Conclusion

In this article, I've trained the Random Forest with sklearn and with mljar-supervised AutoML package. For model training with the sklearn, there were needed preprocessing steps: filling missing values and converting categoricals columns to numerical values. The AutoML solution automates data preprocessing and therefore makes model creation faster. What is more, the AutoML has hyperparameters tuning built-in so the final model performance was better. The AutoML final model is the RandomForestClassifier from the sklearn library and can be easily accessed, for example for variable importance plot.