AutoML Financial Use Case

How AutoML Can Help

In this example, we use a credit scoring dataset to predict the likelihood that an individual will experience financial distress within the next two years. The dataset contains 150,000 records of individuals, with characteristics such as age, monthly income, debt ratio, and number of dependents. By analyzing it, we aim to build predictive models that estimate the likelihood of financial distress, helping institutions make informed credit decisions and manage risk proactively.

MLJAR AutoML automates the model development process, which changes how financial institutions can approach credit scoring, risk management, and data analysis. The main benefits are summarized below.

Business Value

25%
More Accurate

Compared to conventional techniques, MLJAR AutoML may produce predictions that are up to 25% more accurate, which enhances decision-making and lowers financial losses.

40%
Less Time

MLJAR AutoML can save around 40% of time and resources by reducing the need for complex coding and manual adjustments, freeing teams to concentrate on important priorities.

30%
Faster

Using MLJAR AutoML to automate model development can speed up the process by roughly 30% while cutting expenses and increasing output.

20%
More Insightful

Institutions can get a competitive edge by utilizing MLJAR AutoML's sophisticated data analysis capabilities, which can yield around 20% deeper insights from huge datasets.

AutoML Report

MLJAR AutoML generates a comprehensive report, offering a wealth of information that provides valuable insight into model performance, data analysis, and evaluation metrics. Some examples of these are shown below.

Leaderboard

To evaluate the trained models, AutoML used logloss as its performance measure (lower is better). As the table and graph below show, 3_Default_Xgboost achieved the lowest logloss (0.177134) and was chosen as the best model; the Ensemble reached the same value.

Best      name                     model_type      metric_type  metric_value  train_time
          1_Baseline               Baseline        logloss      0.245422      1.12
          2_DecisionTree           Decision Tree   logloss      0.192961      21.89
the best  3_Default_Xgboost        Xgboost         logloss      0.177134      12.41
          4_Default_NeuralNetwork  Neural Network  logloss      0.190949      15.9
          5_Default_RandomForest   Random Forest   logloss      0.18375       15.63
          Ensemble                 Ensemble        logloss      0.177134      4.43
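
To make the logloss column more concrete, here is a minimal sketch of the metric for a binary target. The helper name `log_loss` and the toy labels and probabilities are illustrative assumptions, not values taken from the report.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # clip the probability so that log(0) never occurs
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# toy labels (assumption): 1 = financial distress within two years
y_true = [0, 0, 0, 1]
confident = [0.1, 0.1, 0.1, 0.9]      # close to the true labels
constant = [0.25, 0.25, 0.25, 0.25]   # always predicts the share of positives

print(f"confident model logloss: {log_loss(y_true, confident):.4f}")
print(f"constant model logloss:  {log_loss(y_true, constant):.4f}")
```

The model whose probabilities sit closer to the true labels gets the lower logloss, which is the sense in which the leaderboard ranks models. In practice, `sklearn.metrics.log_loss` provides a library implementation of the same metric.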

Performance

[Figure: AutoML Performance]

Spearman Correlation of Models

The plot shows how closely the models agree with one another, based on rank-order correlation. Spearman correlation measures how well the relationship between two variables can be described by a monotonic function. Higher values indicate that two models order their outputs in a similar way, while lower values indicate that the models behave differently.

[Figure: Spearman correlation of models]
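
As a rough illustration of rank-order correlation, the sketch below computes Spearman correlation between the outputs of hypothetical models. It assumes the comparison is made on predicted probabilities; the function names and the numbers are made up for the example.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# hypothetical predicted probabilities for the same five individuals
model_a = [0.02, 0.10, 0.35, 0.80, 0.55]
model_b = [0.05, 0.20, 0.30, 0.95, 0.60]  # different values, same ordering
model_c = [0.90, 0.40, 0.30, 0.05, 0.20]  # reversed ordering

print(spearman(model_a, model_b))
print(spearman(model_a, model_c))
```

Two models can output different probabilities and still be perfectly rank-correlated, as `model_a` and `model_b` are here. Only the ordering matters.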

Feature Importance

The plot visualizes the significance of various features across different models. In this heatmap, each cell represents the importance of a specific feature for a particular model, with the color intensity indicating the level of importance. Darker or more intense colors signify higher importance, while lighter colors indicate lower importance. This visualization helps in comparing the contribution of features across multiple models, highlighting which features consistently play a critical role and which are less influential in predictive performance.

[Figure: Feature importance across models]
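
One common way to estimate feature importance is permutation importance: shuffle one feature column and measure how much the score drops. The sketch below assumes this technique purely for illustration. The toy model, data, and function names are hypothetical and are not taken from the report.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model output equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features, seed=0):
    """Drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # rebuild the rows with feature j replaced by its shuffled values
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(base - accuracy(model, shuffled, labels))
    return importances

# hypothetical toy model: uses only feature 0 and ignores feature 1
def toy_model(row):
    return int(row[0] > 0.5)

data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

Shuffling the feature the model relies on hurts accuracy, while shuffling the ignored feature changes nothing. A heatmap like the one described above shows this kind of score for every feature and model pair.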

Install and import necessary packages

Install the packages with the command:

pip install pandas scikit-learn mljar-supervised

Import the packages into your code:

# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised import AutoML
from sklearn.metrics import accuracy_score

Load data

Import relevant financial data for credit scoring analysis.

# load example dataset
df = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/credit/data.csv")
# display DataFrame shape
print(f"Loaded data shape {df.shape}")
# display first rows
df.head()
Loaded data shape (150000, 12)
Id SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse NumberOfDependents
0 1 1 0.766127 45 2 0.802982 9120.0 13 0 6 0 2.0
1 2 0 0.957151 40 0 0.121876 2600.0 4 0 0 0 1.0
2 3 0 0.658180 38 1 0.085113 3042.0 2 1 0 0 0.0
3 4 0 0.233810 30 0 0.036050 3300.0 5 0 0 0 0.0
4 5 0 0.907239 49 1 0.024926 63588.0 7 0 1 0 0.0

5 rows × 12 columns

Split dataframe to train/test

We split the dataframe into separate train and test sets: the model is trained on the first and evaluated on the second. This lets us assess the model's performance on unseen data.

This step is essential when you have only one base dataset.

# split data
train, test = train_test_split(df, train_size=0.75, shuffle=True, random_state=42)
# display data shapes
print(f"All data shape {df.shape}")
print(f"Train shape {train.shape}")
print(f"Test shape {test.shape}")
All data shape (150000, 12)
Train shape (112500, 12)
Test shape (37500, 12)
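
When the target is a rare event, a plain random split can leave the train and test sets with slightly different shares of positive cases. The sketch below shows the idea of a stratified split in plain Python. The function name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_size=0.75, seed=42):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # shuffle within each class, then cut each class at the same fraction
        rng.shuffle(indices)
        cut = round(len(indices) * train_size)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx

# toy target (assumption): 10% positives
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positives: {train_rate:.3f}, test positives: {test_rate:.3f}")
```

With scikit-learn, the same effect can be requested by passing the target column to the `stratify` parameter of `train_test_split`, for example `stratify=df["SeriousDlqin2yrs"]`.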

Select X,y for ML training

We will split the training set into features (x_train) and target (y_train) variables for model training.

# create X columns list and set y column
x_cols = ["Id", "RevolvingUtilizationOfUnsecuredLines", "age", "NumberOfTime30-59DaysPastDueNotWorse", "DebtRatio", "MonthlyIncome", "NumberOfOpenCreditLinesAndLoans", "NumberOfTimes90DaysLate", "NumberRealEstateLoansOrLines", "NumberOfTime60-89DaysPastDueNotWorse", "NumberOfDependents"]
y_col = "SeriousDlqin2yrs"
# set input matrix
x_train = train[x_cols]
# set target vector
y_train = train[y_col]
# display data shapes
print(f"x_train shape is {x_train.shape}")
print(f"y_train shape is {y_train.shape}")
x_train shape is (112500, 11)
y_train shape is (112500,)

Select X,y for evaluating the ML model

We will split the test set into features (x_test) and target (y_test) variables to evaluate the model's performance.

# create X columns list and set y column
x_cols = ["Id", "RevolvingUtilizationOfUnsecuredLines", "age", "NumberOfTime30-59DaysPastDueNotWorse", "DebtRatio", "MonthlyIncome", "NumberOfOpenCreditLinesAndLoans", "NumberOfTimes90DaysLate", "NumberRealEstateLoansOrLines", "NumberOfTime60-89DaysPastDueNotWorse", "NumberOfDependents"]
y_col = "SeriousDlqin2yrs"
# set input matrix
x_test = test[x_cols]
# set target vector
y_test = test[y_col]
# display data shapes
print(f"x_test shape is {x_test.shape}")
print(f"y_test shape is {y_test.shape}")
x_test shape is (37500, 11)
y_test shape is (37500,)

Fit AutoML

We need to train a model for our dataset. The fit() method will handle the model training and optimization automatically.

# create automl object
automl = AutoML(total_time_limit=300, mode="Explain")
# train automl
automl.fit(x_train, y_train)

Compute predictions

Generate predictions on the test data and display the results.

# predict with AutoML
predictions = automl.predict(x_test)
# predicted values
print(predictions)
[0 0 0 ... 0 0 0]
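
Class labels like these come from applying a cut-off to predicted probabilities. The sketch below shows how a different cut-off changes the labels. The probabilities are made up, and the cut-off AutoML applies internally is not shown in this article, so the values here are assumptions.

```python
def apply_threshold(positive_probs, threshold=0.5):
    """Turn probabilities of the positive class into 0/1 labels."""
    return [int(p >= threshold) for p in positive_probs]

# hypothetical probabilities of class 1 (financial distress);
# with a fitted model they could be taken from automl.predict_proba(x_test),
# assuming the second column holds the probability of class 1
probs = [0.03, 0.18, 0.42, 0.07, 0.66]

default_labels = apply_threshold(probs)
cautious_labels = apply_threshold(probs, threshold=0.15)
print(default_labels)
print(cautious_labels)
```

A lower cut-off flags more individuals as likely to experience difficulties, which may suit a lender that prefers to review borderline cases manually.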

Display result

To make the predictions easier to read, we map them to Yes/No labels and index them by Id.

# create mapping dict
pred_mapping = {0: 'No', 1: 'Yes'}
# convert list using comprehension
cat_list = [pred_mapping[x] for x in predictions]
# create data frame and display it
result = pd.DataFrame(data = {"Difficulties": cat_list}, index=test["Id"])
result
Id Difficulties
59771 No
21363 No
127325 No
140510 No
144298 No
... ...
77730 No
87367 No
135908 No
70825 No
40590 No

37500 rows × 1 columns

Compute accuracy

We compute the accuracy score by comparing the true values (y_test) with our predictions.

# compute metric
metric_accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {metric_accuracy}")
Accuracy: 0.9376
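
Accuracy can look high when most individuals belong to one class, so it helps to compare it with a trivial baseline and to look at recall for the positive class. The sketch below uses toy data; the helper names and the numbers are assumptions and are not computed from the dataset above.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(y_true)
    return counts.most_common(1)[0][1] / len(y_true)

def recall_positive(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted as positive."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual = sum(1 for t in y_true if t == positive)
    return hits / actual if actual else 0.0

# toy data (assumption): 2 positives out of 20, one of them missed
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 18 + [1, 0]

print(f"model accuracy:    {accuracy(y_true, y_pred):.2f}")
print(f"baseline accuracy: {majority_baseline_accuracy(y_true):.2f}")
print(f"recall (class 1):  {recall_positive(y_true, y_pred):.2f}")
```

The same helpers can be applied to `y_test` and `predictions` from the steps above to see how much the model improves over always predicting the majority class.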

Conclusions

Applying MLJAR AutoML to credit scoring shows how much of the model development process can be automated. In this example, a few lines of code were enough to train several models, compare them, and reach an accuracy of 0.9376 on the test set. By automating complex data analysis, AutoML helps financial institutions make more informed, timely decisions while improving efficiency and reducing operational costs. As AutoML technology continues to evolve, its role in credit scoring and other financial applications is likely to grow.

See you soon👋.