MLJAR AutoML adds integration with Optuna

MLJAR provides an open-source Automated Machine Learning (AutoML) framework for creating Machine Learning pipelines. It has a built-in heuristic algorithm for hyperparameter tuning that combines random search over a defined set of hyperparameter values with hill climbing over the best solutions to search for further improvements. This approach works very well on Machine Learning tasks under a selected time budget. However, there are situations where model performance is the primary goal and computation time is not a constraint. For these cases we propose a new mode in the MLJAR framework: "Optuna". In this mode, we use the Optuna hyperparameter tuning framework. It is available in the [mljar-supervised](https://github.com/mljar/mljar-supervised) package starting from version `0.10.0`.
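To make the built-in heuristic concrete, here is a minimal sketch of random search followed by hill climbing. It is not the MLJAR implementation: the search space, the `toy_score` function, and all helper names are hypothetical and only illustrate the idea on a toy objective.

```python
import random

# Hypothetical discrete search space (names and values are illustrative only).
SEARCH_SPACE = {
    "learning_rate": [0.01, 0.025, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}


def toy_score(params):
    """Stand-in for a validation score (higher is better); best at 0.05 / 5 / 0.8."""
    return -(
        abs(params["learning_rate"] - 0.05) * 10
        + abs(params["max_depth"] - 5) * 0.1
        + abs(params["subsample"] - 0.8)
    )


def random_search(n_trials, rng):
    """Draw random configurations from the predefined value sets."""
    return [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(n_trials)
    ]


def neighbours(params):
    """Configurations that differ from `params` by one step in one hyperparameter."""
    for name, values in SEARCH_SPACE.items():
        idx = values.index(params[name])
        for step in (-1, 1):
            if 0 <= idx + step < len(values):
                yield {**params, name: values[idx + step]}


def hill_climb(start):
    """Move to the best neighbour until no neighbour improves the score."""
    best, best_score = start, toy_score(start)
    while True:
        candidate = max(neighbours(best), key=toy_score)
        if toy_score(candidate) <= best_score:
            return best, best_score
        best, best_score = candidate, toy_score(candidate)


rng = random.Random(0)
trials = random_search(n_trials=5, rng=rng)   # phase 1: random search
start = max(trials, key=toy_score)            # keep the best random solution
best_params, best_score = hill_climb(start)   # phase 2: hill climbing
print(best_params, best_score)
```

In a real run, `toy_score` would be replaced by training a model and measuring its validation metric, which is what makes each trial expensive and the time budget relevant.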

Optuna framework

Optuna is an open-source framework for hyperparameter optimization developed by Preferred Networks. It provides several algorithms for sampling hyperparameters, such as:

Sampler using grid search: GridSampler,
Sampler using random sampling: RandomSampler,
Sampler using TPE (Tree-structured Parzen Estimator) algorithm: TPESampler

It also has an option to prune unpromising trials, which can greatly speed up the search process.

To learn more about Optuna please check:

Optuna website: https://optuna.org/
GitHub repository: https://github.com/optuna/optuna
Documentation: https://optuna.readthedocs.io/

AutoML + Optuna

Optuna provides a hyperparameter search framework. However, that alone is not enough to build an end-to-end Machine Learning pipeline. To create a complete ML pipeline, there is still a lot of work to be done:

define the training algorithm,
setup the hyperparameters with search spaces,
perform data preprocessing,
evaluate the trial model to assess its performance.

MLJAR AutoML can help with all of these! Combining MLJAR and Optuna therefore gives a powerful framework for model construction.

Implementation

The integration with Optuna is implemented as an additional mode called Optuna. MLJAR now has four modes:

Explain for initial data exploration,
Perform for creating production-level ML systems,
Compete for creating the best performing ML systems under a selected time constraint,
Optuna for creating the best performing ML systems without a hard time budget.

There are two new arguments that can be passed to the AutoML() class during initialization:

optuna_time_budget - the time, in seconds, that Optuna will use to tune each algorithm. The default is 3600 seconds.
optuna_init_params - if you already have parameters tuned by Optuna, you can pass them with this argument. The default is an empty dict {}.

Optuna integration works with the following algorithms: Extra Trees, Random Forest, Xgboost, LightGBM, and CatBoost. If you set the optuna_time_budget=3600 and select all algorithms, it means that each algorithm will be tuned for 1 hour by Optuna.

By default, all feature engineering steps are switched on in the Optuna mode. Algorithms trained on data with newly engineered features are not tuned again; only algorithms trained on the original data (no feature engineering) are tuned with Optuna.

Note that Optuna runs its optimization on a single data split: the same split as the first fold of cross-validation. When AutoML runs with a train/test split, Optuna uses that train/test split for optimization.
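To illustrate what "the first fold" means, here is a small sketch that builds k-fold index splits and keeps only the first one for tuning. The `kfold_indices` helper is hypothetical and uses contiguous, unshuffled folds for readability; how MLJAR actually shuffles or stratifies its folds is not covered here.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


folds = list(kfold_indices(n_samples=10, n_folds=5))

# Tuning would use only the first fold; the final models use all folds.
tune_train, tune_valid = folds[0]
print(tune_valid)        # [0, 1]
print(len(tune_train))   # 8
```

Tuning on one split keeps each trial cheap, at the cost of a noisier estimate than full cross-validation would give.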

When using Optuna, set total_time_limit high, for example 48*3600, to make sure that AutoML has time to execute all steps.
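A quick back-of-the-envelope check shows why the total limit should be generous. The helper below is hypothetical and only encodes the rule stated above (each selected algorithm gets its own optuna_time_budget); it does not model how AutoML schedules the remaining steps.

```python
def minimum_tuning_seconds(algorithms, optuna_time_budget):
    """Lower bound on tuning time: each selected algorithm gets its own budget."""
    return len(algorithms) * optuna_time_budget


algorithms = ["Extra Trees", "Random Forest", "Xgboost", "LightGBM", "CatBoost"]
optuna_time_budget = 3600      # seconds per algorithm
total_time_limit = 48 * 3600   # seconds for the whole AutoML run

tuning = minimum_tuning_seconds(algorithms, optuna_time_budget)
headroom = total_time_limit - tuning
print(f"tuning: {tuning / 3600:.1f} h, left for other steps: {headroom / 3600:.1f} h")
# tuning: 5.0 h, left for other steps: 43.0 h
```

The remaining time has to cover everything after tuning, such as training the tuned models and the models built on engineered features.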

The Optuna mode can optimize the following metrics: "auc", "logloss", "rmse", "mae", "mape".

Simple Interface for Tuning

You can use MLJAR AutoML to easily tune algorithms with Optuna. Here is the code for tuning LightGBM:

from supervised.automl import AutoML

# X, y: your training features and target
automl = AutoML(mode="Optuna",
                eval_metric="auc",
                algorithms=["LightGBM"],
                optuna_time_budget=3600,   # tune each algorithm for 1 hour
                total_time_limit=48*3600,  # total time limit, set large enough to have time to compute all steps
                features_selection=False,  # switch off feature engineering
                mix_encoding=False,
                golden_features=False,
                kmeans_features=False)
automl.fit(X, y)

That's all! You don't need to manually define search spaces, or preprocess the data. It is all handled by the AutoML.

Tabular Playground Series Mar-2021

I've used the data from the Kaggle competition Tabular Playground Series - Mar 2021. It is a binary classification challenge with Area Under the ROC Curve (AUC) as the evaluation metric.

import pandas as pd
from supervised.automl import AutoML

# read data
train = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')
x_cols = train.columns[1:-1]  # skip the first (id) and last (target) columns
y_col = "target"

# train AutoML
automl = AutoML(mode="Optuna",
                eval_metric="auc",
                algorithms=["LightGBM", "Xgboost", "Extra Trees"],
                optuna_time_budget=1800,   # tune each algorithm for 30 minutes
                total_time_limit=48*3600,  # total time limit, set large enough to have time to compute all steps
                features_selection=False)
automl.fit(train[x_cols], train[y_col])

# compute prediction on test data
preds = automl.predict_proba(test[x_cols])

# save submission
submission = pd.DataFrame({'id':test.id, 'target': preds[:,1]})
submission.to_csv('1_submission.csv', index=False)

You can check the Kaggle notebook with results at this link. This AutoML run ranks 100th out of 828 teams (at the time of writing; the final results are still pending).

Performance on Otto and BNP Kaggle competitions

We tested Optuna mode on two Kaggle competitions:

In the Otto competition, AutoML ran for 58,936 seconds (~16.4 hours). It scored 0.42018 (LogLoss) on the Private Leaderboard (evaluated internally by the Kaggle system). This score would place 167th out of 3507 participants.

In the BNP competition, AutoML was trained for 30,015 seconds (~8.3 hours). It scored 0.43924 (LogLoss) on the Private Leaderboard. This score would place 42nd out of 2920 teams.
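To put these placements in perspective, the short calculation below converts each rank into a leaderboard percentage. The `top_percent` helper is hypothetical; the numbers are the ones quoted above.

```python
def top_percent(place, participants):
    """Share of the leaderboard at or above the given place, in percent."""
    return 100.0 * place / participants


results = {
    "Otto": {"place": 167, "participants": 3507},
    "BNP": {"place": 42, "participants": 2920},
}

for name, r in results.items():
    print(f"{name}: top {top_percent(r['place'], r['participants']):.1f}%")
# Otto: top 4.8%
# BNP: top 1.4%
```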

The results were obtained without any human help. The code for both competitions:

automl = AutoML(
    mode="Optuna",
    total_time_limit=48 * 3600,
    optuna_time_budget=3600,
)
automl.fit(X, y)

Summary

We hope that the Optuna integration in the MLJAR AutoML framework will help you get really nice ML results. We are looking forward to your feedback about this integration, and we are open to feature requests.