Introduction

This post reports the performance of MLJAR on the Kaggle dataset from the “Give Me Some Credit” challenge. The results are compared with those of other predictive APIs from Amazon, Google, PredicSis and BigML. The post was inspired by Louis Dorard's article.

Dataset

The dataset used in this article comes from the Kaggle website and can be downloaded from here. The training dataset has 150,000 samples with 10 input attributes and a binary target. The distributions and attribute types are presented in the picture below. There are no categorical values in the dataset, but there are missing values; MLJAR handles them automatically by filling them with median values. No additional preprocessing is applied. The testing dataset from this competition has 101,503 samples (their values are not used for missing-value imputation in the training dataset). This dataset is used for computing predictions, which are submitted to the Kaggle scoring system.
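As a rough illustration of median imputation fitted on the training data only, here is a minimal sketch in plain Python. It is not MLJAR's implementation: the function names `fit_medians` and `fill_missing`, the toy values, and the choice to fill the test rows with the training medians are all assumptions made for this example.

```python
import math
from statistics import median


def fit_medians(rows):
    """Compute the per-column median of the given rows, skipping NaN values."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        medians.append(median(observed))
    return medians


def fill_missing(rows, medians):
    """Return a copy of rows with every NaN replaced by its column median."""
    return [
        [medians[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]


# Toy data with two hypothetical columns; NaN marks a missing value.
nan = float("nan")
train = [[1.0, 10.0], [nan, 20.0], [3.0, nan], [5.0, 40.0]]
test = [[nan, nan], [2.0, 30.0]]

# The medians are learned from the training rows only, so no information
# from the test set leaks into the training data.
medians = fit_medians(train)
train_filled = fill_missing(train, medians)
# Assumption: the test rows are filled with the same training medians.
test_filled = fill_missing(test, medians)
```

Computing the medians once on the training rows and reusing them keeps the test set out of the preprocessing, which matches the statement that test values are not used for imputation in the training dataset.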

Dataset Preview

Models

In this analysis we used the algorithms available on the MLJAR platform:

  • Extreme Gradient Boosting (xgboost package)
  • Random Forest (sklearn package)
  • Extra Trees (sklearn package)
  • Regularized Greedy Forest (C++ implementation)
  • k Nearest Neighbors (sklearn package)
  • Deep Neural Networks (Keras + Tensorflow)

Each algorithm is tuned separately, which includes training and hyper-parameter search. Additionally, an ensemble is created from all trained models. Models are trained with 10-fold stratified cross-validation on the training dataset. The Area Under the ROC Curve (AUC) metric is used to measure classifier performance.
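To make the evaluation procedure concrete, here is a minimal sketch of 10-fold stratified cross-validation scored with AUC, written in plain Python. Everything in it is illustrative: the helper names, the toy one-feature data, and the trivial `fit_direction` stand-in for a real model are assumptions, and scoring the pooled out-of-fold predictions (rather than averaging per-fold AUCs) is also an assumption, since the post does not say which MLJAR uses.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so each fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of one class round-robin across the folds.
        for k, i in enumerate(indices):
            folds[k % n_folds].append(i)
    return folds


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form is O(n^2) and meant for small data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_direction(xs, ys):
    """Toy stand-in for a model: learn whether larger x means class 1."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0


# Toy data: one hypothetical feature, 20 negatives and 10 positives.
rng = random.Random(42)
ys = [0] * 20 + [1] * 10
xs = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in ys]

folds = stratified_folds(ys, n_folds=10)
oof = [0.0] * len(ys)  # out-of-fold scores, one per sample
for held_out in folds:
    held = set(held_out)
    train_idx = [i for i in range(len(ys)) if i not in held]
    sign = fit_direction([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    for i in held_out:
        oof[i] = sign * xs[i]  # score each sample with a model that never saw it

cv_auc = auc(ys, oof)
print(f"cross-validated AUC: {cv_auc:.3f}")
```

On a dataset the size of the one used here, a sort-based AUC and a library splitter such as scikit-learn's `StratifiedKFold` and `roc_auc_score` would be the practical choice; the pairwise version is used only because it is easy to read.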

Results

The results are summarized in the table below. The highest AUC was obtained by the ensemble of models. The best single-algorithm performance was obtained by Xgboost. Surprisingly, Neural Networks had the poorest performance, just a little above that of a random classifier. This is probably the effect of not scaling the input features properly.

Table 1: AUC by algorithm

For algorithms such as Xgboost, Random Forest and Extra Trees, feature importance information is available; it is presented in figure 2. No single feature dominates across all algorithms, because each algorithm uses the features differently. That is why the ensemble of all algorithms improves the overall performance.

Figure 2: Feature importance for Xgboost, Random Forest and Extra Trees
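The claim that an ensemble can outperform its members can be illustrated with a small sketch. The post does not describe how MLJAR selects or combines models, so the code below assumes one common approach, greedy forward selection with simple averaging of predicted scores. The model names and prediction values are made up for the example.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average(prediction_lists):
    """Element-wise mean of several models' predicted scores."""
    n = len(prediction_lists)
    return [sum(col) / n for col in zip(*prediction_lists)]


def greedy_ensemble(labels, model_preds, max_size):
    """Greedy forward selection (assumed technique, not MLJAR's documented one).

    Repeatedly add the model, with replacement, whose inclusion gives the
    highest AUC for the averaged prediction; stop when nothing improves.
    """
    chosen = []
    best_score = 0.0
    for _ in range(max_size):
        best_name = None
        for name, preds in model_preds.items():
            candidate = average([model_preds[c] for c in chosen] + [preds])
            score = auc(labels, candidate)
            if score > best_score:
                best_score, best_name = score, name
        if best_name is None:
            break
        chosen.append(best_name)
    return chosen, best_score


# Hypothetical validation labels and per-model predicted scores.
labels = [0, 0, 0, 1, 1, 1]
preds = {
    "model_a": [0.1, 0.2, 0.7, 0.6, 0.8, 0.9],  # misranks the third sample
    "model_b": [0.3, 0.6, 0.2, 0.5, 0.7, 0.9],  # misranks the second sample
    "model_c": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],  # uninformative
}

chosen, best = greedy_ensemble(labels, preds, max_size=3)
print(chosen, best)
```

In this toy case each of the first two models makes a different mistake, so their average ranks every positive above every negative even though neither model does so alone. In practice the selection should be scored on held-out (for example out-of-fold) predictions to avoid overfitting the ensemble to the training data.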

Louis Dorard’s article compares the performance of predictive APIs from Amazon, Google, PredicSis and BigML. Here we add MLJAR's performance to this comparison. All results are presented in table 2. MLJAR is the most accurate, but its training and testing times are quite high, because MLJAR searches for the best model for each learning algorithm. As a result, 69 models were trained in total, and the ensemble was built from 16 models selected from them. The prediction time is also high because the prediction comes from an ensemble of models. The results for Amazon, Google, PredicSis and BigML are taken from Louis Dorard's article, where he trained the algorithms on 90% of the training data and validated on the remaining 10%, and computed prediction times on 5k samples. MLJAR, by contrast, was trained with 10-fold CV on the full training dataset, and its prediction time is for the full test set of 101k samples.

METRIC                      MLJAR    AMAZON*    GOOGLE*    PREDICSIS*    BIGML*
Accuracy (AUC)              0.867    0.862      0.743      0.858         0.853
Time for training (s)       32400    135        76         17            5
Time for predictions (s)    600      188        369        5             1
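Because the prediction times in the table were measured on different numbers of samples, a per-sample figure makes them easier to compare. The short calculation below is only a rough normalization: it assumes that "5k" means exactly 5,000 samples, uses the 101,503-sample test set for MLJAR, and ignores differences in hardware and request batching between services.

```python
# Prediction time in seconds and number of samples scored, as stated in the post.
# Assumption: "5k samples" is taken as exactly 5,000.
prediction_runs = {
    "MLJAR": (600, 101_503),
    "Amazon": (188, 5_000),
    "Google": (369, 5_000),
    "PredicSis": (5, 5_000),
    "BigML": (1, 5_000),
}

# Convert each run to milliseconds per predicted sample.
ms_per_sample = {
    name: 1000.0 * seconds / n_samples
    for name, (seconds, n_samples) in prediction_runs.items()
}

# Print the services from fastest to slowest per sample.
for name, ms in sorted(ms_per_sample.items(), key=lambda kv: kv[1]):
    print(f"{name:<10} {ms:8.2f} ms per sample")
```

Under these assumptions the MLJAR ensemble takes roughly 6 ms per sample, which puts its prediction time in a different light than the raw totals in the table suggest.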

Louis Dorard’s article also estimated an approximate rank in the Kaggle competition for each API. Below we present these approximate Kaggle ranks for the compared APIs; for MLJAR, the rank was computed by the Kaggle scoring system:

  • #6 for MLJAR
  • #60 for Amazon
  • #570 for PredicSis
  • #770 for BigML
  • #810 for Google

Summary

This comparison gives a taste of how MLJAR can be used in data analysis. It is slower than the other services (Google, Amazon, PredicSis, BigML) because it trains several models for each algorithm, which is what allows MLJAR to find the most accurate model. MLJAR training can easily be sped up by adding more machines (currently 4 machines with 8 CPUs and 15 GB RAM per user). The MLJAR project with all results is public and can be accessed from here. There is also a YouTube video of the project creation, here.