AutoML Technological Use Case

How AutoML Can Help

In this use case we work with a credit card fraud detection dataset, where the goal is to predict the probability that a transaction is fraudulent. This extensive dataset contains 284,807 transaction samples, each described by attributes such as the transaction amount, the time of the transaction, and anonymised features obtained through principal component analysis (PCA). By studying this data, our goal is to build predictive models that correctly identify fraudulent transactions, helping financial institutions reduce losses and safeguard customers from fraud. AutoML is transforming how financial institutions approach fraud detection, risk management, and data analysis, and MLJAR AutoML in particular automates the model development process in ways that can translate into notable business growth.

Business Value

30%
Faster

Compared to conventional approaches, MLJAR AutoML can shorten model development time by about 30%, encouraging experimentation and reducing the time it takes for new features and products to reach the market.

25%
More Scalable

By improving scalability and flexibility by approximately 25%, MLJAR AutoML tools make it easier to handle diverse data sources and to adapt to evolving business requirements and technological breakthroughs.

40%
More Accessible

MLJAR AutoML can democratize AI capabilities by making advanced machine learning techniques roughly 40% more approachable for teams without deep machine learning expertise, enabling more teams to incorporate machine learning into their projects.

20%
More Competitive

Tech companies can gain a roughly 20% competitive advantage by using MLJAR AutoML, which speeds up the construction and deployment of sophisticated machine learning models and improves product differentiation and market positioning.

AutoML Report

MLJAR AutoML offers deep insight into model performance, data analysis, and evaluation metrics through the extensive reports it generates. Here are a few examples.

Leaderboard

To evaluate the effectiveness of the trained models, AutoML used logloss as its performance metric. As can be seen in the table and plot below, 3_Default_Xgboost was consequently chosen as the best model.

Best      name                     model_type      metric_type  metric_value  train_time
          1_Baseline               Baseline        logloss      0.0128227     1.96
          2_DecisionTree           Decision Tree   logloss      0.00327765    27.18
the best  3_Default_Xgboost        Xgboost         logloss      0.00171326    17.88
          4_Default_NeuralNetwork  Neural Network  logloss      0.00380802    28.83
          5_Default_RandomForest   Random Forest   logloss      0.00244796    38.73
          Ensemble                 Ensemble        logloss      0.00171326    6.43
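For reference, logloss penalizes confident but wrong probability estimates: the lower the value, the better calibrated the model's probabilities. A minimal sketch of how it is computed with scikit-learn, using made-up probabilities rather than the leaderboard's actual predictions:

```python
# Minimal illustration of the logloss metric used in the leaderboard.
# The labels and probabilities below are made-up values, not AutoML output.
import numpy as np
from sklearn.metrics import log_loss

# true labels: 1 = fraud, 0 = legitimate
y_true = np.array([0, 0, 1, 0, 1])
# predicted probability of fraud for each transaction
y_prob = np.array([0.05, 0.10, 0.90, 0.20, 0.80])

loss = log_loss(y_true, y_prob)
print(f"logloss: {loss:.4f}")
```

A perfect model would reach a logloss of 0; confidently predicting the wrong class drives the value up sharply.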

AutoML Performance


Spearman Correlation of Models

The plot illustrates the relationships between different models based on their rank-order correlation. Spearman correlation measures how well the relationship between two variables can be described using a monotonic function, highlighting the strength of the models' performance rankings. Higher correlation values indicate stronger relationships, where models perform similarly across different metrics or datasets.

Models Spearman correlation
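The idea behind this heatmap can be reproduced with pandas' built-in Spearman correlation; the score vectors below are hypothetical model outputs, not the report's data:

```python
# Rank-order (Spearman) correlation between two hypothetical models' scores
import pandas as pd

scores = pd.DataFrame({
    "model_a": [0.01, 0.90, 0.15, 0.70, 0.05],
    "model_b": [0.02, 0.85, 0.20, 0.60, 0.04],
})
# pairwise Spearman correlation matrix, as visualized in the heatmap
corr = scores.corr(method="spearman")
# the two columns rank the samples identically, so their correlation is 1.0
print(corr.loc["model_a", "model_b"])
```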

Feature Importance

The plot visualizes the significance of various features across different models. In this heatmap, each cell represents the importance of a specific feature for a particular model, with the color intensity indicating the level of importance. Darker or more intense colors signify higher importance, while lighter colors indicate lower importance. This visualization helps in comparing the contribution of features across multiple models, highlighting which features consistently play a critical role and which are less influential in predictive performance.

Feature Importance across models
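One common way such per-model importances are estimated is permutation importance: shuffle one feature at a time and measure how much the model's score drops. A sketch on a small synthetic dataset (not the fraud data), with scikit-learn's RandomForestClassifier as a stand-in model:

```python
# Permutation importance on synthetic data:
# column 0 ("signal") drives the label, column 1 ("noise") does not.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(42)
X = rng.normal(size=(300, 2))     # two features: signal and noise
y = (X[:, 0] > 0).astype(int)     # label depends only on the signal column

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)
# the signal column should receive a much larger importance than the noise column
print(dict(zip(["signal", "noise"], result.importances_mean.round(3))))
```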

Install and import necessary packages

Install the packages with the command:

pip install pandas mljar-supervised scikit-learn

Import the packages into your code:

# import packages
import pandas as pd
from supervised import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Load data

Import the dataset containing information about credit card fraud.

# read data from csv file
df = pd.read_csv(r"C:\Users\Your_user\technological-use-case\creditcard.csv")
# display data shape
print(f"Loaded data shape {df.shape}")
# display first rows
df.head()
Loaded data shape (284807, 31)
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99 0

5 rows × 31 columns

Split dataframe to train/test

To split a dataframe into train and test sets, we divide the data to create separate datasets for training and evaluating a model. This ensures we can assess the model's performance on unseen data.

This step is essential when you have only one base dataset.

# split data
train, test = train_test_split(df, train_size=0.95, shuffle=True, random_state=42)
# display data shapes
print(f"All data shape {df.shape}")
print(f"Train shape {train.shape}")
print(f"Test shape {test.shape}")
All data shape (284807, 31)
Train shape (270566, 31)
Test shape (14241, 31)

Select X,y for ML training

We will split the training set into features (X) and target (y) variables for model training.

# create X columns list and set y column
x_cols = ["Time", "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20", "V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "Amount"]
y_col = "Class"
# set input matrix
X = train[x_cols]
# set target vector
y = train[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
X shape is (270566, 30)
y shape is (270566,)

Fit AutoML

We need to train a model for our dataset. The fit() method will handle the model training and optimization automatically.

# create automl object
automl = AutoML(total_time_limit=300, mode="Explain")
# train automl
automl.fit(X, y)

Compute predictions

Use the trained AutoML model to make predictions on test data.

# predict with AutoML (pass only the feature columns, not the target)
predictions = automl.predict(test[x_cols])
# predicted values
print(predictions)
[1 0 0 ... 0 0 0]

Compute accuracy

We need to retrieve the true values of credit card fraud to compare with our predictions. After that, we compute the accuracy score.

# get true values of fraud
true_values = test["Class"]
# compute metric
metric_accuracy = accuracy_score(true_values, predictions)
print(f"Accuracy: {metric_accuracy}")
Accuracy: 0.9915736254476512
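Because fraud is rare in this dataset, accuracy alone can look high even for weak models, so precision and recall on the fraud class are often checked alongside it. A minimal sketch with scikit-learn, using short illustrative label vectors rather than the actual predictions:

```python
# Precision and recall for an imbalanced binary problem (illustrative values)
from sklearn.metrics import precision_score, recall_score

true_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
predicted   = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# precision: of the transactions flagged as fraud, how many really were
print(f"Precision: {precision_score(true_labels, predicted):.2f}")
# recall: of the actual frauds, how many were caught
print(f"Recall: {recall_score(true_labels, predicted):.2f}")
```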

Conclusions

When applied to credit card fraud detection, MLJAR AutoML provides a number of benefits for improving model reliability and understanding fraudulent activity in transaction data. By automating the intricate process of data analysis, MLJAR AutoML improves the precision of fraud prediction and the analysis of which factors drive fraudulent transactions. Thanks to this advanced automation, analysts can handle large amounts of transaction data quickly and effectively, identifying trends and patterns that they might have missed using more conventional techniques. The growing adoption of this technology has the potential to strengthen fraud detection and better protect both financial institutions and their customers.

See you soon👋.