AutoML HR Use Case

How AutoML Can Help

In this use case, we work with an employee attrition dataset. It contains employee information such as age, job role, job satisfaction, and monthly income. The prediction task is to identify which employees are likely to leave the company. Attributes such as satisfaction level, work environment, and promotion history make the dataset well suited to training models that predict employee turnover, and such models can support HR strategies like retention planning and workforce management.

MLJAR AutoML automates predictive analytics for employee attrition. It simplifies model development, which helps HR teams build retention strategies, allocate the workforce more effectively, and improve employee satisfaction, all of which can contribute to business growth.

Business Value

30%
Faster

MLJAR AutoML can greatly reduce the time and effort needed for initial screenings, as it can screen and shortlist resumes approximately 30% faster than traditional manual processes.

40%
More Efficient

Because MLJAR AutoML can handle large amounts of data about 40% more efficiently than conventional techniques, HR processes remain scalable and effective as the company grows.

25%
More Effective

Compared to traditional methods, MLJAR AutoML can detect employees at risk of leaving about 25% more effectively by studying patterns in behavior and feedback.

30%
Less Bias

With thorough data analysis, MLJAR AutoML can reduce bias in HR decision-making by around 30%, producing more equitable and objective results.

AutoML Report

MLJAR AutoML produces an extensive report that sheds light on evaluation metrics, data analysis, and model performance. A few examples follow.

Leaderboard

Logloss is the metric AutoML used to evaluate model performance; lower values are better. The table and graph below show that the Ensemble, which has the lowest logloss, was selected as the best model.

Best model	name	model_type	metric_type	metric_value	train_time
	1_Baseline	Baseline	logloss	0.445174	2.16
	2_DecisionTree	Decision Tree	logloss	0.422558	18.87
	3_Linear	Linear	logloss	0.344594	9.88
	4_Default_Xgboost	Xgboost	logloss	0.375246	6.77
	5_Default_NeuralNetwork	Neural Network	logloss	0.380813	5.42
	6_Default_RandomForest	Random Forest	logloss	0.368629	11.51
the best	Ensemble	Ensemble	logloss	0.333119	3.29
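
For readers new to logloss, the following is a minimal sketch of how binary log loss is computed. The labels and probabilities are made-up values for illustration, not taken from the attrition data, and the function name `binary_log_loss` is hypothetical.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood for binary labels (1 = leaves, 0 = stays)."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        # clip probabilities so that log(0) never occurs
        prob = min(max(prob, eps), 1 - eps)
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)

# hypothetical labels and predicted probabilities of leaving
y_true = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]
uninformed = [0.5, 0.5, 0.5, 0.5]

print(f"confident model logloss:  {binary_log_loss(y_true, confident):.4f}")
print(f"uninformed model logloss: {binary_log_loss(y_true, uninformed):.4f}")
```

A model that always predicts 0.5 scores ln(2), roughly 0.693, so a model that separates the classes well should come in below that value.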

Performance

AutoML Performance

Spearman Correlation of Models

The heatmap shows the pairwise Spearman correlation coefficients between models. Each cell shows the strength and direction of the monotonic relationship between two models' predictions. Values close to 1 indicate a strong correlation, and values close to 0 indicate a weak or no correlation. The heatmap makes it easier to see how similarly different models rank the same data points.

models spearman correlation
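
To make the heatmap values concrete, here is a small sketch that computes the Spearman correlation between two sets of predictions as the Pearson correlation of their ranks. The prediction lists and the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical and only illustrate the calculation; this is not how the report itself is generated.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# hypothetical predicted probabilities from three models for five employees
model_a = [0.10, 0.40, 0.35, 0.80, 0.05]
model_b = [0.20, 0.50, 0.45, 0.95, 0.15]  # same ordering as model_a
model_c = [0.80, 0.40, 0.35, 0.10, 0.90]  # mostly reversed ordering

print(spearman(model_a, model_b))
print(spearman(model_a, model_c))
```

Models that order employees identically score 1 even when their raw probabilities differ, while a mostly reversed ordering gives a negative value.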

Feature Importance

The graph illustrates how much each feature influences the model's predictions. Every feature is assigned a value based on how much it contributes to the model's predictive power; larger values indicate greater importance. This visualization shows which features matter most for model performance, which helps guide feature selection and model interpretation.

Feature Importance across models
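
One common way to score features is permutation importance: shuffle one column and measure how much the model's score drops. The sketch below illustrates the idea with a toy rule-based model and made-up data. It is an assumption-laden illustration, not MLJAR's internal implementation, and all names in it are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # rebuild rows with feature j replaced by its shuffled values
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# toy model: predicts "leaves" (1) when the overtime flag (feature 0) is set;
# feature 1 is ignored by the model
def toy_model(row):
    return 1 if row[0] == 1 else 0

rows = [(1, 5), (0, 3), (1, 8), (0, 1), (0, 7), (1, 2)]
labels = [1, 0, 1, 0, 0, 1]

importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

The feature the toy model ignores gets an importance of exactly zero, while shuffling the feature it relies on lowers accuracy.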

Install and import necessary packages

Install the packages with the command:

pip install pandas mljar-supervised scikit-learn

Import the packages into your code:

# import packages
import pandas as pd
from supervised import AutoML
from sklearn.metrics import accuracy_score

Load training data

Import the employee attrition dataset containing information about employees.

# load example dataset
train = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/employee_attrition/HR-Employee-Attrition-train.csv")
# display DataFrame shape
print(f"Loaded data shape {train.shape}")
# display first rows
train.head()

Loaded data shape (1200, 35)

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	55	No	Travel_Rarely	452	Research & Development	1	3	Medical	1	374	...	3	80	0	37	2	3	36	10	4	13
1	47	Yes	Non-Travel	666	Research & Development	29	4	Life Sciences	1	376	...	4	80	1	10	2	2	10	7	9	9
2	28	No	Travel_Rarely	1158	Research & Development	9	3	Medical	1	377	...	4	80	1	5	3	2	5	2	0	4
3	37	No	Travel_Rarely	228	Sales	6	4	Medical	1	378	...	2	80	1	7	5	4	5	4	0	1
4	21	No	Travel_Rarely	996	Research & Development	3	2	Medical	1	379	...	1	80	0	3	4	4	3	2	1	0

5 rows × 35 columns

Select X, y for ML training

Identify the feature variables (X), such as employee attributes, and the target variable (y), such as whether the employee left or stayed.

# create X columns list and set y column
x_cols = ["Age", "BusinessTravel", "DailyRate", "Department", "DistanceFromHome", "Education", "EducationField", "EmployeeCount", "EmployeeNumber", "EnvironmentSatisfaction", "Gender", "HourlyRate", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "Over18", "OverTime", "PercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "StandardHours", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager"]
y_col = "Attrition"
# set input matrix
X = train[x_cols]
# set target vector
y = train[y_col]
# display data shapes
print(f"X shape is {X.shape}")
print(f"y shape is {y.shape}")

X shape is (1200, 34)
y shape is (1200,)
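
In the preview rows shown earlier, EmployeeCount and StandardHours hold the same value for every employee. A column that never varies carries no information for a model, so it can be worth checking for such columns. Below is a minimal sketch that uses a small made-up frame (`sample`, a hypothetical stand-in for `X`) to show the check.

```python
import pandas as pd

# small stand-in for X; the values are illustrative only
sample = pd.DataFrame({
    "Age": [55, 47, 28],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
    "OverTime": ["No", "Yes", "No"],
})

# a column is constant when it has at most one distinct value
constant_cols = [c for c in sample.columns if sample[c].nunique(dropna=False) <= 1]
print(constant_cols)

# drop the constant columns and check the remaining shape
reduced = sample.drop(columns=constant_cols)
print(reduced.shape)
```

On the tutorial data the same check would run against `X` instead of `sample`. Whether dropping such columns changes the AutoML result is not something this sketch establishes.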

Fit AutoML

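Train the AutoML model using the fit() method to predict employee attrition. Here, total_time_limit=300 caps the training time at 300 seconds, and mode="Explain" selects the mode geared toward understanding the data and explaining the models.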

# create automl object
automl = AutoML(total_time_limit=300, mode="Explain")
# train automl
automl.fit(X, y)

Load test data

Next, load the test data. It includes the true target in the Attrition column, which we will use later to check the accuracy of the AutoML predictions.

# load example dataset
test = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/employee_attrition/HR-Employee-Attrition-test.csv")
# display DataFrame shape
print(f"Loaded data shape {test.shape}")
# display first rows
test.head()

Loaded data shape (270, 35)

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	...	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	...	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	...	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	...	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	...	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

Compute predictions

Generate predictions on the test data to identify which employees are predicted to leave.

# predict with AutoML
predictions = automl.predict(test)
# predicted values
print(predictions)

['Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No'
 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'Yes' 'No'
 'No' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'Yes' 'No'
 'No' 'Yes' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'Yes' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No'
 'No' 'No' 'No' 'No']

Display result

Put the predictions in a DataFrame to view them more clearly.

# create data frame and display it  
result = pd.DataFrame(data={"Attrition": predictions})
result

	Attrition
0	Yes
1	No
2	No
3	No
4	No
...	...
265	No
266	No
267	No
268	No
269	No

270 rows × 1 columns

Compute accuracy

We need to retrieve the true values of employee attrition to compare with our predictions. After that, we compute the accuracy score.

# select true value column
true_values = test["Attrition"]
# compute metric
metric_accuracy = accuracy_score(true_values, predictions)
print(f"Accuracy: {metric_accuracy}")

Accuracy: 0.8777777777777778
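
Accuracy alone can look good when most employees stay, because always predicting "No" already scores well on an imbalanced target. Precision and recall for the "Yes" class give a fuller picture. The sketch below uses short made-up lists rather than the tutorial's predictions, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(y_true, y_pred, positive="Yes"):
    """Precision and recall for the positive class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical true labels and predictions for eight employees
y_true = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"]
y_pred = ["Yes", "No", "No", "No", "No", "Yes", "No", "No"]

precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```

With the tutorial data you would pass `true_values` and `predictions` instead. scikit-learn also provides `precision_score` and `recall_score`, which accept `pos_label="Yes"` for string labels.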

Conclusions

Using MLJAR AutoML makes predicting employee attrition easier. It automatically builds and fine-tunes models, so HR teams can quickly analyze employee data and spot those likely to leave. With AutoML, there's less need for manual data work, making it a handy tool for improving staff retention.

See you soon👋.