AutoML HR Use Case

How AutoML Can Help

In this example we use the employee attrition dataset. It contains detailed employee information such as age, job role, job satisfaction, and monthly income. The prediction task is to identify which employees are likely to leave the company. With attributes like satisfaction level, work environment, and promotion history, the dataset is well suited to training models that predict employee turnover, which can support HR strategies such as retention planning and workforce management.

MLJAR AutoML automates predictive analytics for employee attrition. It simplifies model development, which helps HR teams build effective retention strategies, optimize workforce allocation, and improve employee satisfaction, all of which support business growth.

Business Value

30%
Faster

MLJAR AutoML can greatly reduce the time and effort needed for initial screenings: it can screen and shortlist resumes approximately 30% faster than traditional manual processes.

40%
More Efficient

MLJAR AutoML can handle large amounts of data 40% more efficiently than conventional techniques, so HR procedures remain scalable and effective as the company grows.

25%
More Effective

By studying patterns in behavior and feedback, MLJAR AutoML can detect employees at risk of leaving 25% more effectively than traditional methods.

30%
Less Bias

With thorough data analysis, MLJAR AutoML can reduce bias in HR decision-making by around 30%, producing more equitable and objective results.

AutoML Report

MLJAR AutoML produces an extensive report that sheds light on evaluation metrics, data analysis, and model performance. A few examples follow.

Leaderboard

AutoML used logloss (lower is better) as the metric to evaluate model performance. The table and graph below show that Ensemble was selected as the best model.

Best model  name                     model_type      metric_type  metric_value  train_time
            1_Baseline               Baseline        logloss      0.445174      2.16
            2_DecisionTree           Decision Tree   logloss      0.422558      18.87
            3_Linear                 Linear          logloss      0.344594      9.88
            4_Default_Xgboost        Xgboost         logloss      0.375246      6.77
            5_Default_NeuralNetwork  Neural Network  logloss      0.380813      5.42
            6_Default_RandomForest   Random Forest   logloss      0.368629      11.51
the best    Ensemble                 Ensemble        logloss      0.333119      3.29
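
To make the leaderboard metric concrete, here is a minimal sketch of how binary logloss is computed. The function name `binary_log_loss`, the labels, and the probabilities are made up for illustration; this is not MLJAR's internal code.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # clip the probability so that log(0) can never occur
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# hypothetical labels (1 = employee left) and predicted probabilities of leaving
y_true = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]   # mostly right and fairly sure
uninformed = [0.5, 0.5, 0.5, 0.5]  # always predicts 50%

loss_confident = binary_log_loss(y_true, confident)
loss_uninformed = binary_log_loss(y_true, uninformed)
print(f"confident model:  {loss_confident:.4f}")
print(f"uninformed model: {loss_uninformed:.4f}")
```

A model that always answers 50% scores ln(2), about 0.693, so every model in the table above is doing better than a coin flip.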

Performance

AutoML Performance

Spearman Correlation of Models

The heatmap shows the pairwise Spearman correlation coefficients between models. Each cell shows the strength and direction of the monotonic relationship between two models' predictions. Values close to 1 indicate a strong correlation, and values close to 0 indicate a weak correlation or none. The heatmap makes it easier to see how similarly different models rank the data points.

models spearman correlation
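
As a rough illustration of what each heatmap cell contains, the sketch below computes a pairwise Spearman matrix with pandas by ranking the predictions and then taking the ordinary Pearson correlation of the ranks. The model names and probabilities are invented for the example, and the report may compute its matrix differently.

```python
import pandas as pd

# hypothetical predicted probabilities of attrition from three models
preds = pd.DataFrame({
    "model_a": [0.10, 0.40, 0.35, 0.80, 0.65],
    "model_b": [0.20, 0.50, 0.45, 0.90, 0.70],
    "model_c": [0.70, 0.30, 0.60, 0.20, 0.10],
})

# Spearman correlation is the Pearson correlation computed on ranks
spearman = preds.rank().corr(method="pearson")
print(spearman.round(3))
```

Here model_a and model_b order the employees identically, so their coefficient is 1 even though the probabilities differ. model_c orders them almost in reverse, which gives a strongly negative value.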

Feature Importance

The graph shows how much each feature influences the model's predictions. Every feature is assigned a value based on how much it contributes to the model's predictive power; larger values mean greater importance. This visualization shows which features influence model performance the most, which helps with feature selection and model interpretation.

Feature Importance across models
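
One common way to obtain such scores is permutation importance: shuffle one feature at a time and measure how much the model's score drops. The sketch below shows the idea in plain Python. The helper names, the toy model, and the data are all hypothetical, and whether this matches the report's exact procedure is an assumption.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            # shuffle one column while leaving the others untouched
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances[feature] = sum(drops) / n_repeats
    return importances

# toy stand-in model (an assumption): predicts attrition from overtime only
def toy_model(row):
    return "Yes" if row["OverTime"] == "Yes" else "No"

rows = [
    {"OverTime": "Yes", "Age": 25},
    {"OverTime": "No", "Age": 41},
    {"OverTime": "Yes", "Age": 33},
    {"OverTime": "No", "Age": 52},
    {"OverTime": "No", "Age": 29},
    {"OverTime": "Yes", "Age": 47},
]
labels = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Because the toy model ignores Age, shuffling that column never changes a prediction and its importance is exactly zero, while shuffling OverTime lowers accuracy.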

Install and import necessary packages

Install the packages with the command:

pip install pandas mljar-supervised scikit-learn

Import the packages into your code:

# import packages
import pandas as pd
from supervised import AutoML
from sklearn.metrics import accuracy_score

Load training data

Import the employee attrition dataset containing information about employees.

# load example dataset
train = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/employee_attrition/HR-Employee-Attrition-train.csv")
# display DataFrame shape
print(f"Loaded data shape {train.shape}")
# display first rows
train.head()
Loaded data shape (1200, 35)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 55 No Travel_Rarely 452 Research & Development 1 3 Medical 1 374 ... 3 80 0 37 2 3 36 10 4 13
1 47 Yes Non-Travel 666 Research & Development 29 4 Life Sciences 1 376 ... 4 80 1 10 2 2 10 7 9 9
2 28 No Travel_Rarely 1158 Research & Development 9 3 Medical 1 377 ... 4 80 1 5 3 2 5 2 0 4
3 37 No Travel_Rarely 228 Sales 6 4 Medical 1 378 ... 2 80 1 7 5 4 5 4 0 1
4 21 No Travel_Rarely 996 Research & Development 3 2 Medical 1 379 ... 1 80 0 3 4 4 3 2 1 0

5 rows × 35 columns

Select X,y for ML training

Identify the feature variables (X), such as employee attributes, and the target variable (y), such as whether the employee left or stayed.

# create X columns list and set y column
x_cols = ["Age", "BusinessTravel", "DailyRate", "Department", "DistanceFromHome", "Education", "EducationField", "EmployeeCount", "EmployeeNumber", "EnvironmentSatisfaction", "Gender", "HourlyRate", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "Over18", "OverTime", "PercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "StandardHours", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager"]
y_col = "Attrition"
# set input matrix
X = train[x_cols]
# set target vector
y = train[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")
X shape is (1200, 34)
y shape is (1200,)
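
Instead of typing every column name by hand, the feature list can also be derived from the DataFrame. The sketch below does this on a tiny hand-made frame (values copied from a few columns of the preview above) and also flags columns that hold a single value, since such columns carry no signal. Whether EmployeeCount and StandardHours are constant across the full dataset is something to check on the real data, not something this sketch proves.

```python
import pandas as pd

# small hypothetical frame mimicking a few columns of the attrition data
df = pd.DataFrame({
    "Age": [55, 47, 28, 37],
    "Attrition": ["No", "Yes", "No", "No"],
    "EmployeeCount": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "DailyRate": [452, 666, 1158, 228],
})

y_col = "Attrition"
# every column except the target is a candidate feature
x_cols = [c for c in df.columns if c != y_col]
# columns with a single distinct value cannot help the model
constant_cols = [c for c in x_cols if df[c].nunique(dropna=False) <= 1]

print(f"features: {x_cols}")
print(f"constant columns: {constant_cols}")
```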

Fit AutoML

Train the AutoML model using the fit() method to predict employee attrition.

# create automl object
automl = AutoML(total_time_limit=300, mode="Explain")
# train automl
automl.fit(X, y)

Load test data

Let's load the test data. It also contains the true target in the Attrition column, so after predicting with AutoML we can check the accuracy of our predictions.

# load example dataset
test = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/employee_attrition/HR-Employee-Attrition-test.csv")
# display DataFrame shape
print(f"Loaded data shape {test.shape}")
# display first rows
test.head()
Loaded data shape (270, 35)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

Compute predictions

Generate predictions on the test data to identify which employees are predicted to leave.

# predict with AutoML, passing only the feature columns (not the target)
predictions = automl.predict(test[x_cols])
# predicted values
print(predictions)
['Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No'
 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'Yes' 'No'
 'No' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'Yes' 'No'
 'No' 'Yes' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No']

Display result

Put the predictions in a DataFrame to make them easier to read.

# create data frame and display it
result = pd.DataFrame(data={"Attrition": predictions})
result
Attrition
0 Yes
1 No
2 No
3 No
4 No
... ...
265 No
266 No
267 No
268 No
269 No

270 rows × 1 columns

Compute accuracy

We need to retrieve the true values of employee attrition to compare with our predictions. After that, we compute the accuracy score.

# select true value column
true_values = test["Attrition"]
# compute metric
metric_accuracy = accuracy_score(true_values, predictions)
print(f"Accuracy: {metric_accuracy}")
Accuracy: 0.8777777777777778
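
Most of the predictions printed earlier are "No", so accuracy alone can hide how many actual leavers the model misses. The sketch below compares accuracy with recall for the "Yes" class and with a majority-class baseline. The labels are made up for illustration and are not taken from the test set; `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count true/false positives and negatives for one positive label."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if t == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return counts

# hypothetical labels: 3 of 10 employees actually left
y_true = ["Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No"]
y_pred = ["Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No"]

c = confusion_counts(y_true, y_pred)
accuracy = (c["tp"] + c["tn"]) / len(y_true)
# share of actual leavers that the model caught
recall = c["tp"] / (c["tp"] + c["fn"])
# accuracy of always predicting the most frequent class
majority_baseline = Counter(y_true).most_common(1)[0][1] / len(y_true)

print(f"accuracy: {accuracy:.2f}")
print(f"recall for 'Yes': {recall:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

On the real data, the same comparison can be made with true_values and predictions from the step above.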

Conclusions

Using MLJAR AutoML makes predicting employee attrition easier. It automatically builds and fine-tunes models, so HR teams can quickly analyze employee data and spot those likely to leave. With AutoML, there's less need for manual data work, making it a handy tool for improving staff retention.

See you soon👋.