Sep 12 2020 · Jeff King

AutoML as easy as MLJar

If one open-source library has turned me into an avid machine learning practitioner and won the battle of the AutoMLs hands down, it has to be mljar. I can't stop praising this library: it has helped me overcome my shortcomings in coding while automating the predictive modeling flow with very little user involvement. I have taken it for a spin in a few hackathons and was not overly surprised to find it among the top performers. It also saves a lot of time, because it handles data preprocessing and feature engineering itself, so you do not have to do them before feeding the dataset to the model.

Automated machine learning (AutoML) has the potential to increase the productivity of data scientists significantly, and believe me, it does. AutoML services aim to automate some or all steps of the machine learning process, which include data preprocessing, feature engineering, feature extraction, algorithm selection and hyperparameter optimization.
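To make that list concrete, here is a rough sketch of what those steps look like when written by hand in scikit-learn. It is not how mljar works internally; the toy table, the column names and the tiny parameter grid are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy data, only here to make the sketch runnable.
toy = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29, 50, 45, 23, 36, 48, 31],
    "city": ["A", "B", "A", "C", "B", "A", "C", "C", "A", "B", "C", "B"],
    "target": [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0],
})
X_toy = toy.drop(columns="target")
y_toy = toy["target"]

# Data preprocessing and feature engineering, spelled out manually:
# impute and scale the numeric column, one-hot encode the categorical one.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = Pipeline([("prep", preprocess), ("model", LogisticRegression())])

# Algorithm selection and hyperparameter optimization as one small grid search.
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1.0]},
    {"model": [RandomForestClassifier(random_state=0)],
     "model__n_estimators": [10, 50]},
]
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_toy, y_toy)

print(type(search.best_estimator_.named_steps["model"]).__name__)
print(search.best_params_)
```

Every one of these choices (which imputer, which encoder, which models, which grid) is something an AutoML library decides for you.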

mljar does all of this legwork on the dataset before fitting a model to it. Rather than going on about its simplicity and efficiency, I will use this tutorial to show how easy it is to use and how good results can be achieved with a minimal approach and a few lines of code.

The first step is to install the library using:

pip install mljar-supervised

Import pandas (along with numpy and whichever sklearn metrics you need) and then AutoML:

import pandas as pd
from supervised.automl import AutoML

Then read in the train and test files:

h_train = pd.read_csv('/PATH TO THE FILE')
h_test = pd.read_csv('/PATH TO THE FILE')

Define X & y as follows:

X = h_train.drop("target", axis=1)
y = h_train["target"]

Split the train dataset as under:

from sklearn.model_selection import train_test_split

# Hold out 25% of the training rows for validation
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=123,
)

Train models with AutoML:

automl = AutoML(mode="Compete")
automl.fit(X_train, y_train)

Whatever your prediction task, you create the object and then fit it on your training dataset. What do you do next? Put your feet up and let your hair down. It begins to spin and spit out tons of information: the type of task it has identified (binary or multiclass classification, or regression), the evaluation metric, the algorithms it will use, whether it will stack and ensemble them, and the steps it will follow before training is complete.

Then compute the predictions on the test set, upload the resulting submission, and you may be pleasantly surprised to find yourself among the top performers of the competition.
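The prediction step itself is not shown above, so here is a minimal sketch of how it might look. It assumes the fitted object exposes predict(), which AutoML does, but the shape of the return value has differed between mljar-supervised releases (a DataFrame with a "label" or "prediction" column in some, a plain array in others), so the helper handles both. The helper names, the "target" column and the file name are hypothetical; use whatever your competition expects.

```python
import numpy as np
import pandas as pd


def predicted_values(model, features):
    """Return the model's predictions as a flat NumPy array."""
    raw = model.predict(features)
    # Assumption: some releases return a DataFrame whose "label"
    # (classification) or "prediction" (regression) column holds the answer.
    if isinstance(raw, pd.DataFrame):
        for column in ("label", "prediction"):
            if column in raw.columns:
                return raw[column].to_numpy()
        # Fall back to the first column if neither name is present.
        return raw.iloc[:, 0].to_numpy()
    return np.asarray(raw).ravel()


def build_submission(model, features, target_name="target"):
    """Put predictions for unseen rows into a one-column DataFrame."""
    return pd.DataFrame({target_name: predicted_values(model, features)})


# Usage with the objects from this post (commented out because it needs a
# fitted AutoML object and the competition files):
# holdout_pred = predicted_values(automl, X_test)   # compare against y_test
# submission = build_submission(automl, h_test)
# submission.to_csv("submission.csv", index=False)  # hypothetical file name
```

Scoring holdout_pred against y_test with the competition metric before uploading gives you a rough idea of what to expect on the leaderboard.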

No manual hyperparameter tuning is required; the optimization is done by default. An ensemble of the top-performing models is created and used for prediction. What more could you ask for?

To demonstrate the predictive prowess of mljar, I took it for a ride on the Data Scientists salary prediction dataset. I also ran H2O's AutoML, AutoGluon and TPOT on the same dataset. Despite the hyperparameter optimization (HPO) steps those libraries offer, I could not get them to come even close to the score achieved by mljar.

You can also choose which algorithms are used, so you are not at the mercy of the AutoML; you stay in control of the parameters and can tweak them whenever required. For example, on the aforementioned dataset I used the following code:

automl = AutoML(
    # Restrict the search to these algorithm families
    algorithms=[
        "Random Forest",
        "Neural Network",
        "Linear",
        "Xgboost",
        "LightGBM",
    ],
    total_time_limit=100,  # training budget in seconds
    explain_level=0,       # skip the explanation reports
    mode="Perform",
)
automl.fit(X, y)

You can play with total_time_limit (the training budget in seconds) and mode depending on your requirements; please refer to the documentation for the details. I leave it to you to try the various AutoMLs and see how mljar makes it easy to select ML algorithms and their parameter settings, improve accuracy, and turn any given data into actionable insights.
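As a rough sketch of that kind of tweaking, the helper here switches between a quick first look and a longer competition run. The mode names "Explain" and "Compete" are the library's own, but the helper names and the time budgets are my own illustrative choices, not recommendations from the documentation.

```python
def automl_settings(quick):
    """Return keyword arguments for AutoML; the time budgets are illustrative."""
    if quick:
        # Short run for a first look at a new dataset.
        return {"mode": "Explain", "total_time_limit": 5 * 60}
    # Longer search meant for competition submissions.
    return {"mode": "Compete", "total_time_limit": 4 * 60 * 60}


def train(X, y, quick=True):
    """Fit AutoML on X and y with the preset chosen above."""
    # Imported here so the settings helper works even without mljar installed.
    from supervised.automl import AutoML

    automl = AutoML(**automl_settings(quick))
    automl.fit(X, y)
    return automl
```

Calling train(X, y) gives a fast baseline, and train(X, y, quick=False) is the one to leave running while you put your feet up.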