MLJAR AutoML solution for Kaggle Playground series s4e6

Kaggle is a platform for data science competitions. It runs the Playground series of competitions, where participants can sharpen their data science skills. In this notebook, we will build a Machine Learning pipeline for the Playground series s4e6 competition.

The data set describes student dropout and academic success. I will use the MLJAR AutoML package to create the Machine Learning pipeline. AutoML will be trained for ~1 hour, with accuracy as the evaluation metric. This notebook scores 0.83571 on the public leaderboard. Increasing the training time to ~2 hours raises the accuracy to 0.83699, a small improvement, but it counts on Kaggle 😊.

This notebook was created with MLJAR Studio, a new notebook-based programming environment for everyone. Please note that some code cells have User Interface forms above them; the forms are used to generate the code by clicking instead of typing. MLJAR Studio provides a Graphical User Interface for generating Python code; we call these code recipes. MLJAR Studio is a desktop app that can be downloaded from our portal.

Happy programming! 🤓

# import packages
import pandas as pd
from supervised import AutoML

Load training data

We will load the training data from a CSV file and display the first rows.

# read data from csv file
df = pd.read_csv(r"C:\Users\pplon\Downloads\playground-series-s4e6\train.csv")
# display first rows
df.head()
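
The path in the cell above is specific to the author's Windows machine. A minimal sketch of building the same file paths with pathlib, assuming a hypothetical `DATA_DIR` folder that holds the unpacked competition files (the folder location and the variable names are illustrative, not part of the original notebook):

```python
from pathlib import Path

# Hypothetical folder with the unpacked competition files; adjust to your machine.
DATA_DIR = Path("playground-series-s4e6")

# File names are the ones used in this notebook.
TRAIN_PATH = DATA_DIR / "train.csv"
TEST_PATH = DATA_DIR / "test.csv"
SAMPLE_SUBMISSION_PATH = DATA_DIR / "sample_submission.csv"

print(TRAIN_PATH.name, TEST_PATH.name, SAMPLE_SUBMISSION_PATH.name)
```

Each path can then be passed straight to `pd.read_csv`, for example `pd.read_csv(TRAIN_PATH)`, which keeps the notebook runnable on other operating systems.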

Select X,y for ML training

We need to select X and y for Machine Learning training. X is the input matrix with features. y is the target vector with the values that the ML model will predict.

# create X columns list and set y column
x_cols = [
    "Marital status",
    "Application mode",
    "Application order",
    "Course",
    "Daytime/evening attendance",
    "Previous qualification",
    "Previous qualification (grade)",
    "Nacionality",
    "Mother's qualification",
    "Father's qualification",
    "Mother's occupation",
    "Father's occupation",
    "Admission grade",
    "Displaced",
    "Educational special needs",
    "Debtor",
    "Tuition fees up to date",
    "Gender",
    "Scholarship holder",
    "Age at enrollment",
    "International",
    "Curricular units 1st sem (credited)",
    "Curricular units 1st sem (enrolled)",
    "Curricular units 1st sem (evaluations)",
    "Curricular units 1st sem (approved)",
    "Curricular units 1st sem (grade)",
    "Curricular units 1st sem (without evaluations)",
    "Curricular units 2nd sem (credited)",
    "Curricular units 2nd sem (enrolled)",
    "Curricular units 2nd sem (evaluations)",
    "Curricular units 2nd sem (approved)",
    "Curricular units 2nd sem (grade)",
    "Curricular units 2nd sem (without evaluations)",
    "Unemployment rate",
    "Inflation rate",
    "GDP",
]
y_col = "Target"
# set input matrix
X = df[x_cols]
# set target vector
y = df[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
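
The feature list above is written out by hand. As an alternative sketch, the same selection can be derived from the DataFrame itself by excluding the non-feature columns. The toy frame below is made up for illustration, and the presence of an `id` column in train.csv is an assumption; the name `feature_cols` is hypothetical.

```python
import pandas as pd

# Tiny made-up frame that mimics the layout of train.csv (values are illustrative only).
toy = pd.DataFrame(
    {
        "id": [0, 1, 2],  # assumed identifier column
        "Admission grade": [122.6, 119.8, 144.7],
        "Age at enrollment": [18, 18, 21],
        "Target": ["Graduate", "Dropout", "Enrolled"],
    }
)

# Every column except the identifier and the target is treated as a feature.
feature_cols = [c for c in toy.columns if c not in ("id", "Target")]

# Input matrix and target vector, built the same way as X and y above.
X_toy = toy[feature_cols]
y_toy = toy["Target"]

print(feature_cols)
print(X_toy.shape, y_toy.shape)
```

The explicit list is easier to audit, while the derived list stays in sync if columns are added or renamed.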

Fit AutoML

We will set the training time to 1 hour and the evaluation metric to accuracy. The fit() method does the heavy lifting.

# create automl object
automl = AutoML(total_time_limit=3600, mode="Compete", eval_metric="accuracy")
# train automl
automl.fit(X, y)
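
For reference, accuracy is the fraction of predictions that match the true label. The sketch below computes it by hand on made-up labels; the `accuracy` helper is a hypothetical illustration and not the function that AutoML calls internally.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("accuracy is undefined for empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration: three of the four predictions are correct.
y_true = ["Graduate", "Dropout", "Enrolled", "Graduate"]
y_pred = ["Graduate", "Dropout", "Graduate", "Graduate"]

score = accuracy(y_true, y_pred)
print(score)
```

On this scale, the difference between the two leaderboard scores quoted in the introduction (0.83699 versus 0.83571) is about 0.13 percentage points.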

Load test data

Let's load the test data. It has no Target column; we will predict it with AutoML.

# read data from csv file
test = pd.read_csv(r"C:\Users\pplon\Downloads\playground-series-s4e6\test.csv")
# display first rows
test.head()
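
Before predicting, it can help to confirm that the test data contains every feature used in training and that the target is indeed absent. A minimal sketch on a made-up frame, where `feature_cols` and `test_toy` are hypothetical stand-ins for `x_cols` and `test`:

```python
import pandas as pd

# Shortened, hypothetical feature list and a made-up test frame.
feature_cols = ["Admission grade", "Age at enrollment"]
test_toy = pd.DataFrame(
    {
        "id": [10, 11],  # assumed identifier column
        "Admission grade": [130.0, 125.5],
        "Age at enrollment": [19, 23],
    }
)

# Features that were used for training but are missing from the test data.
missing = [c for c in feature_cols if c not in test_toy.columns]

# The target should not be present in the test data.
has_target = "Target" in test_toy.columns

print(missing, has_target)
```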

Compute predictions

Compute predictions on the test data and print them.

# predict with AutoML
predictions = automl.predict(test)
# predicted values
print(predictions)

Load example submission

We load the example submission to get the correct format for the result file.

# read data from csv file
submission = pd.read_csv(
    r"C:\Users\pplon\Downloads\playground-series-s4e6\sample_submission.csv"
)
# display first rows
submission.head()

Assign predictions

Let's copy our predictions to the submission DataFrame.

submission["Target"] = predictions
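
The assignment above relies on the predictions being in the same row order and of the same length as the submission file. A hedged sketch of a sanity check before assigning, using made-up data; the label set Graduate, Dropout, Enrolled is an assumption about this competition's target, and all names ending in `_toy` are hypothetical.

```python
import pandas as pd

# Made-up sample submission and predictions for illustration.
submission_toy = pd.DataFrame({"id": [10, 11, 12], "Target": ["Graduate"] * 3})
predictions_toy = ["Dropout", "Graduate", "Enrolled"]

# Assumed set of valid class labels for this competition.
expected_labels = {"Graduate", "Dropout", "Enrolled"}

# One prediction is needed per submission row.
if len(predictions_toy) != len(submission_toy):
    raise ValueError("number of predictions does not match number of submission rows")

# Every predicted value should be one of the known labels.
unexpected = set(predictions_toy) - expected_labels
if unexpected:
    raise ValueError(f"unexpected labels in predictions: {unexpected}")

submission_toy["Target"] = predictions_toy
print(submission_toy)
```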

Save predictions to file

We will save the predictions to a CSV file in a format ready to submit on the Kaggle platform.

# write DataFrame to CSV
submission.to_csv(
    r"C:\Users\pplon\Downloads\playground-series-s4e6\automl_submission.csv",
    index=False,
)
print(
    r"DataFrame saved at C:\Users\pplon\Downloads\playground-series-s4e6\automl_submission.csv"
)
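
The `index=False` argument in the cell above keeps the pandas row index out of the file, so the CSV has only the columns from the sample submission. A small sketch that shows the difference by writing a made-up frame to an in-memory buffer instead of disk (the `id` column name is an assumption):

```python
import io

import pandas as pd

# Made-up submission frame for illustration.
frame = pd.DataFrame({"id": [10, 11], "Target": ["Graduate", "Dropout"]})

# Write without the index, as in the cell above.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
header_without_index = without_index.getvalue().splitlines()[0]

# Write with the default index for comparison.
with_index = io.StringIO()
frame.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]

print(header_without_index)
print(header_with_index)
```

With the index included, the header gains an unnamed leading column, which would not match the expected submission format.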

Conclusions

AutoML solutions make it easy to build Machine Learning models. It is a great tool. However, you still need an environment where you can load and prepare data for analysis.

That's why we created MLJAR Studio, a programming environment for everyone.

Packages used in the automl-kaggle-playground-series-s4e6.ipynb

The packages listed below need to be installed in your Python environment to run this notebook. Please note that MLJAR Studio automatically installs and imports the required modules for you.

pandas>=1.0.0

mljar-supervised>=1.1.7