Testimonial - MLJAR to the rescue
I was and still am fascinated by Machine Learning. Coming from a pharmaceutical background with no programming or coding experience of any kind, I thought I would not be able to get a piece of this new tech cake. But with the advent of Automated Machine Learning (AutoML), non-data scientists like myself have an array of tools to satisfy their once-thought incurable itch to create ML models without writing a single line of code. Perseverance is the name of the game, and through application and with the help of countless videos I taught myself how to build ML models. Having explored a few of the available AutoML tools, I want to outline my trip into this amazing world of AutoML with a use-case scenario, and at the same time provide some insight into the performance of the various open-source AutoML solutions. I assume you are aware of the major tasks in the machine learning workflow, namely data preparation, feature engineering, training a model, evaluating the model, hyperparameter tuning and finally serving the model.
A heads-up on this article: it is my personal opinion and perspective on the field of Data Science, Machine Learning and the tools I have tried and used to learn and to vie for honours in Kaggle competitions. It is not a pedantic approach and is devoid of machine learning practitioner's jargon. It is straight from the heart and mind of a self-taught and fascinated person. The dataset I used to cut my ML teeth on is from Kaggle: the Santander Customer Transaction Prediction competition. The challenge is to identify which customers will make a specific transaction. The data has about 200,000 records with about 200 features, and the task is to predict the value of the target column in the test set. The dataset is clean, with no missing values and no categorical variables, so, to borrow a word from the data scientists, there is no need for preprocessing.
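For anyone who wants to follow along, below is a minimal sketch of how the data can be loaded with pandas. The file names (train.csv, test.csv) and column names (ID_code, target) follow the usual Kaggle layout for this competition and are assumptions on my part; adjust them if your copy differs.

# Minimal loading sketch, assuming the standard Kaggle file and column names.
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# The features are anonymized numeric columns; the label is the binary "target".
X = train.drop(columns=["ID_code", "target"])
y = train["target"]
X_test = test.drop(columns=["ID_code"])

print(X.shape)  # roughly (200000, 200)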
But as I wanted to see which of these ML tools come out on top, I decided to give a few of them a whirl. I started with Scikit-learn and then used Keras, which is a high-level neural networks API. In other words, it is a Python deep learning library developed with a focus on enabling fast experimentation. I chose Keras because, while reading the Keras documentation, I read that besides making you more productive it allows you to try more ideas faster, which in turn can help you win machine learning competitions. After all, there is no substitute for winning! After creating and training the model, I used the test dataset from the Kaggle competition to compute predictions, which I then submitted to the Kaggle scoring system. My submission got a score of 86.1%. I was glad but not thrilled, because the highest score on the leaderboard was around 92%.
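To give a flavour of this step, here is a hedged sketch of a Keras binary classifier on this data. The layer sizes, dropout, epochs and batch size are illustrative assumptions, not my exact architecture, and it builds on the X, y and X_test frames from the loading sketch above.

# Illustrative Keras model, not the exact architecture or settings I used.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of the binary target
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.fit(X, y, epochs=10, batch_size=1024, validation_split=0.1)

# Predicted probabilities go into the Kaggle submission file.
submission = test[["ID_code"]].copy()
submission["target"] = model.predict(X_test).ravel()
submission.to_csv("submission_keras.csv", index=False)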
I then decided to try another state-of-the-art machine learning algorithm called CatBoost, which is available as an open-source library. One outstanding feature is its support for categorical features out of the box. My submission got an impressive score of 89.6%, and I was happy to see a considerable increase in the score.
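For comparison, here is a hedged sketch of a CatBoost run on the same frames; the iterations, learning rate and depth below are placeholder values, not the settings I actually used.

# Illustrative CatBoost classifier with placeholder hyperparameters.
from catboost import CatBoostClassifier

cat = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.03,
    depth=6,
    eval_metric="AUC",
    verbose=200,
)
cat.fit(X, y)

submission = test[["ID_code"]].copy()
submission["target"] = cat.predict_proba(X_test)[:, 1]
submission.to_csv("submission_catboost.csv", index=False)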
That got me excited, and then I stumbled across H2O AutoML, a function of the H2O ML library that automates the process of building a large number of models, with the goal of finding the best model without any prior knowledge or effort by the data scientist. AutoML in general is considered to be about algorithm selection, hyperparameter tuning, iterative modelling and model assessment. What I liked about H2O AutoML is its interface, which is designed to have as few parameters as possible: all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or a limit on the total number of models trained. My submission using H2O AutoML received a score of 88.7%.
I was taken aback, because after reading the H2O documentation I was convinced it would outscore my previous attempts, but that was not to be.
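Still, the workflow is genuinely as simple as described above. Here is a minimal sketch of it; the runtime and model limits are illustrative assumptions, not the exact constraints I used.

# Minimal H2O AutoML sketch; max_models and max_runtime_secs are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train_hf = h2o.import_file("train.csv")
test_hf = h2o.import_file("test.csv")

y_col = "target"
x_cols = [c for c in train_hf.columns if c not in ("ID_code", y_col)]
train_hf[y_col] = train_hf[y_col].asfactor()  # treat the target as a class label

aml = H2OAutoML(max_models=20, max_runtime_secs=3600, seed=1)
aml.train(x=x_cols, y=y_col, training_frame=train_hf)

preds = aml.leader.predict(test_hf)  # predictions from the leaderboard's best model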
So I thought I could not beat CatBoost and was quite content with the score I had achieved. But then I read about MLJAR on Quora and set out to try it immediately. I have not looked anywhere else since. It is what it claims to be: a platform for rapid prototyping, developing and deploying machine learning models. MLJAR has taken AutoML to a whole new level. What I love about it is that each selected algorithm has its set of hyperparameters tuned in accordance with your dataset. MLJAR automatically runs through features, algorithms and hyperparameters for basic machine learning algorithms. Training many algorithms is time-consuming, but MLJAR reduces this time quite significantly by training on many machines simultaneously to give quick results. I have realized that hyperparameter tuning is a trial-and-error affair; MLJAR makes the entire algorithm search and training-tuning a painless process. It handles all the hard work and trains and tunes your model for you. It enables those with no machine learning expertise to train high-quality ML models and improves the efficiency of finding optimal solutions to various machine learning problems.
Kindly forgive me if I eulogize MLJAR. It got me my highest score of 89.8%, and I was thrilled and ecstatic. Yes, I achieved this score without using all the resources available and with only 5 minutes of tuning for a single model, because I was pressed for computational credits, so I limited my algorithm-search adventure to LightGBM, Random Forest, Neural Networks and Logistic Regression. The ensemble of all the algorithms gave me the highest score, which I then used on the test dataset for prediction. I am sure I can better this score by using all the algorithms and training the models for longer, but that's for the future.
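For completeness, here is a sketch of a comparable search using the open-source mljar-supervised package. I ran my experiment on the MLJAR platform with limited credits, so the algorithm names, time limit and mode below are my assumptions about how to mirror that run rather than an exact reproduction.

# Illustrative mljar-supervised run mirroring the limited search described above.
from supervised.automl import AutoML

automl = AutoML(
    algorithms=["LightGBM", "Random Forest", "Neural Network", "Linear"],  # "Linear" plays the logistic-regression role here
    total_time_limit=5 * 60,  # roughly the 5-minute budget mentioned above
    mode="Compete",           # builds an ensemble of the best models
)
automl.fit(X, y)

submission = test[["ID_code"]].copy()
submission["target"] = automl.predict_proba(X_test)[:, 1]
submission.to_csv("submission_mljar.csv", index=False)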