AutoML Student Performance Use Case

How AutoML Can Help

In this use case, we explore a dataset from two Portuguese schools that looks at secondary education student accomplishment. This dataset includes a wide range of characteristics gathered from school reports and surveys, including student grades, demographic information, social variables, and school-related elements. Predicting students final grades using these inputs is the aim. By automating the creation of prediction models, MLJAR AutoML can speed up this procedure and provide a thorough examination of the variables affecting academic achievement. We can quickly identify key predictors of student success and generate actionable insights to improve educational strategies and support systems by utilizing the sophisticated algorithms and automated model evaluation of MLJAR AutoML. This will ultimately assist educators in making data-driven decisions to improve student outcomes.

Business Value

25%
More Accurate

MLJAR AutoML can improve prediction accuracy of student outcomes by up to 25% by automating model selection and refining. This increased accuracy makes it easier to detect at-risk pupils and create interventions that work.

30%
Lower costs

MLJAR AutoML can reduce the expenses of data science skills and model generation by up to 30% by automating complicated data analysis tasks. This makes advanced analytical tools more affordable for educational institutions with tight budgets.

40%
Resource Optimization

By putting MLJAR into practice, AutoML can reduce the amount of time spent on data preprocessing and model validation by up to 40%, freeing up resources for educational institutions that can be better spent implementing practical strategies rather than managing data.

AutoML Report

Because of its ability to provide comprehensive reports with informative data, MLJAR AutoML provides deep insights into model performance, data analysis, and evaluation measures. Below are a few instances of this kind.

Leaderboard

To evaluate the effectiveness of trained models, AutoML has used logloss as its performance measure. As can be seen in the table and graph below, Ensemble was consequently chosen as the best model.

Best model name model_type metric_type metric_value train_time
1_Baseline Baseline logloss 2.48491 0.76
2_DecisionTree Decision Tree logloss 1.83772 20.24
3_Linear Linear logloss 0.25672 15.34
4_Default_Xgboost Xgboost logloss 0.213171 9.17
5_Default_NeuralNetwork Neural Network logloss 0.184801 1.22
6_Default_RandomForest Random Forest logloss 1.42314 17.79
the best Ensemble Ensemble logloss 0.168914 0.29

AutoML Performance

AutoML Performance

Spearman Correlation of Models

The graphic illustrates how important different features are in relation to different models. Each cell in this heatmap shows the significance of a single characteristic for a certain model, and the color intensity corresponds to the degree of significance. Lighter colors suggest lesser importance, while darker or more vivid colors indicate higher importance. By comparing the contributions of various models, this visualization aids in identifying which elements regularly contribute significantly to prediction performance and which have less of an impact.

models spearman correlation

Feature Importance

The plot visualizes the significance of various features across different models. In this heatmap, each cell represents the importance of a specific feature for a particular model, with the color intensity indicating the level of importance. Darker or more intense colors signify higher importance, while lighter colors indicate lower importance. This visualization helps in comparing the contribution of features across multiple models, highlighting which features consistently play a critical role and which are less influential in predictive performance.

Feature Importance across models

Install and import necessary packages

Install the packages with the command:

pip install pandas, mljar-supervised, scikit-learn

Import the packages into your code:

# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised import AutoML
from sklearn.metrics import accuracy_score

Load dataset

Import the dataset containing information about students performance.

# read data from csv file
df = pd.read_csv(r"C:\Users\my_notebooks\student\student-por.csv", delimiter=";")
# display data shape
print(df.shape)
# display first rows
df.head()
(649, 33)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4 3 4 1 1 3 4 0 11 11
1 GP F 17 U GT3 T 1 1 at_home other ... 5 3 3 1 1 3 2 9 11 11
2 GP F 15 U LE3 T 1 1 at_home other ... 4 3 2 2 3 3 6 12 13 12
3 GP F 15 U GT3 T 4 2 health services ... 3 2 2 1 1 5 0 14 14 14
4 GP F 16 U GT3 T 3 3 other other ... 4 3 2 1 2 5 0 11 13 13

5 rows × 33 columns

Select features and target

We will split the dataset into features (X) and target (y) variables for model training.

# create X columns list and set y column
x_cols = ["school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "guardian", "traveltime", "studytime", "failures", "schoolsup", "famsup", "paid", "activities", "nursery", "higher", "internet", "romantic", "famrel", "freetime", "goout", "Dalc", "Walc", "health", "absences"]
y_col = "G3"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
X shape is (649, 30)
y shape is (649,)

Split dataframe to train/test

To split a dataframe into train and test sets, we divide the data to create separate datasets for training and evaluating a model. This ensures we can assess the model's performance on unseen data.

This step is essential when you have only one base dataset.

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.90, shuffle=True, random_state=42)
# display data shapes
print(f"X_train shape {X_train.shape}")
print(f"X_test shape {X_test.shape}")
print(f"y_train shape {y_train.shape}")
print(f"y_test shape {y_test.shape}")
X_train shape (584, 30)
X_test shape (65, 30)
y_train shape (584,)
y_test shape (65,)

Fit AutoML

We need to train a model for our dataset. The fit() method will handle the model training and optimization automatically.

# create automl object
automl = AutoML(total_time_limit=300, mode="Explain")
# train automl
automl.fit(X_train, y_train)

Compute predictions

Use the trained AutoML model to make predictions on test data.

# predict with AutoML
predictions = automl.predict(X_test)
# predicted values
print(predictions)
[19 12 18 11 11 17 18  8 10 11 18 11 12  9 12 14 13  8 15 14 15 13 13 13
 16 13  8 12 10 15 16 12  9  8 18 16 17 15 14 11 13 10  8 11 14 12 18 12
 14 12 10 11 14 11 11 18 10 11 11 10  8 11 16 14 17]

Compute accuracy

We are comparing valid values with our predictions. To do that we will use accuracy score.

# compute metric
metric_accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {metric_accuracy}")
Accuracy: 0.9692307692307692

Conlusions

By improving predictive accuracy and expediting the model development process, MLJAR AutoML provides a number of noteworthy benefits when it comes to student performance prediction. MLJAR AutoML expedites the development of efficient models and enhances the accuracy of student outcome predictions by automating the intricate analysis of a variety of demographic and educational data. With the help of this sophisticated automation, researchers and educators can quickly analyze big datasets and find important patterns and insights that would have gone unnoticed using more conventional techniques. Consequently, educational institutions can enhance their ability to make data-driven decisions, optimize their teaching strategies, and better support student achievement with the help of MLJAR AutoML. This leads to improvements in student achievement and academic performance.

See you soon👋.