HR Employee Attrition Analysis in Python

Explore the IBM HR Analytics dataset to uncover attrition patterns by department, age, salary, and job satisfaction.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the IBM HR Analytics attrition CSV from a URL, summarizes the dataset shape, and calculates the overall attrition rate. It generates visual comparisons of attrition rates by department and job role, and contrasts monthly income distributions for employees who left versus stayed. It also examines relationships between job satisfaction, work-life balance, and attrition using correlation analysis and a heatmap.

Who this example is for

This is for HR analysts and people analytics practitioners who need a reproducible way to explore attrition patterns in a standard benchmark dataset. It is also useful for data analysts learning exploratory analysis workflows that combine grouped summaries, distribution plots, and correlation checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Loaded dataset with shape (1470, 35) and computed overall attrition rate (16.1%)
Bar chart of attrition rate by department and job role
Box plot comparing monthly income for leavers vs stayers
Correlation heatmap linking job satisfaction and work-life balance with attrition

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate
2. plot attrition rate by department and job role
3. compare monthly income distribution for employees who left vs stayed
4. show correlation between job satisfaction, work-life balance and attrition

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
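
As a quick illustration of the arithmetic, the sketch below adds up the five dimension scores for one run. The dictionary keys and the `total_score` helper are hypothetical names chosen for this example, not part of any scoring tool; the maximums come from the list above.

```python
# Maximum points per dimension, taken from the rubric above.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores):
    """Sum the dimension scores after checking each one is within range."""
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# Every model in the comparison table received full marks on each dimension.
run_scores = dict(MAX_POINTS)
print(f"{total_score(run_scores)}/{sum(MAX_POINTS.values())}")  # 10/10
```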

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task HR Employee Attrition Analysis in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate

Thinking

Clarifying dataset loading

I need to follow the guidelines about loading the dataset and showing the overall attrition rate. The developer's note suggests that I should focus on just loading and inspecting the dataset first, such as showing the shape, columns, and the first few rows, before moving on to compute the attrition rate. But since the user specifically asked for the attrition rate, I wonder if I can compute that as part of my inspection. I’ll find a balanced approach to present this effectively.

Deciding on output structure

I'm thinking about the best way to approach the user's request for the attrition rate while following the guidelines. The developer emphasizes loading and inspecting the dataset before doing anything else, but the user specifically asked for the attrition rate. I wonder if I can incorporate that into the initial inspection step without making it too complicated. Maybe I can include the attrition rate as a scalar in the same response as the shape and columns. This feels like a balanced approach!

Creating the inspection code

I need to create a code block for the initial inspection and summary of the dataset. The goal is to make it runnable and straightforward, using pandas. I want to display the shape, columns, and the first few rows of the DataFrame while also computing the attrition rate — I think I can handle various formats like Yes/No or 1/0 there. There may be multiple outputs, but maybe that's alright since it's focused on the summary. I'll ensure the final expression is the attrition rate, keeping everything as clear as possible.

Let's start by loading the dataset and calculating the overall attrition rate from the Attrition column.

Load and inspect attrition data

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column_name": df.columns}))
display(df.head())

attrition_rate = df["Attrition"].eq("Yes").mean()

pd.DataFrame({"overall_attrition_rate": [attrition_rate]})

	rows	columns
0	1470	35

	column_name
0	Age
1	Attrition
2	BusinessTravel
3	DailyRate
4	Department
5	DistanceFromHome
6	Education
7	EducationField
8	EmployeeCount
9	EmployeeNumber
10	EnvironmentSatisfaction
11	Gender
12	HourlyRate
13	JobInvolvement
14	JobLevel
15	JobRole
16	JobSatisfaction
17	MaritalStatus
18	MonthlyIncome
19	MonthlyRate
20	NumCompaniesWorked
21	Over18
22	OverTime
23	PercentSalaryHike
24	PerformanceRating
25	RelationshipSatisfaction
26	StandardHours
27	StockOptionLevel
28	TotalWorkingYears
29	TrainingTimesLastYear
30	WorkLifeBalance
31	YearsAtCompany
32	YearsInCurrentRole
33	YearsSinceLastPromotion
34	YearsWithCurrManager

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	...	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	...	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	...	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	...	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	...	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

	overall_attrition_rate
0	0.161224

The dataset has 1,470 rows and 35 columns.

The observed overall attrition rate is 0.161224, which is about 16.12%. That means roughly 1 in 6 employees in this dataset have Attrition = "Yes".
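
The same rate can be cross-checked from group counts. The sketch below rebuilds an `Attrition` column from the counts reported in the income summary of this run (237 employees who left, 1233 who stayed) instead of downloading the CSV; the `attrition` Series is only a stand-in for `df["Attrition"]`.

```python
import pandas as pd

# Stand-in for df["Attrition"], rebuilt from the group counts reported in this run.
attrition = pd.Series(["Yes"] * 237 + ["No"] * 1233, name="Attrition")

# Share of each label; the "Yes" share is the overall attrition rate.
shares = attrition.value_counts(normalize=True)
rate = float(shares["Yes"])

print(len(attrition))      # 1470
print(round(rate, 6))      # 0.161224
print(round(1 / rate, 1))  # 6.2, i.e. roughly 1 in 6
```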

plot attrition rate by department and job role

Thinking

Planning plot for attrition rate

I need to respond with a single step and code block. The user wants to plot the attrition rate by department and job role. I’ll analyze the latest outputs and propose a heatmap to illustrate this effectively, using Seaborn. It’s important to mention that the overall attrition rate is 16.12%. I'll keep it concise and ensure to include proper titles and labels, focusing on the mean of "Attrition = yes." Let's prepare this code block!

Overall attrition is about 16.12%. A heatmap is a good single-chart view for attrition rate across both Department and JobRole.

Plot attrition rate heatmap

import seaborn as sns
import matplotlib.pyplot as plt

rate_table = (
    df.assign(attrition_flag=df["Attrition"].eq("Yes"))
      .groupby(["Department", "JobRole"])["attrition_flag"]
      .mean()
      .mul(100)
      .unstack()
)

plt.figure(figsize=(14, 5))
sns.heatmap(rate_table, annot=True, fmt=".1f", cmap="Reds", linewidths=0.5)
plt.title("Attrition Rate by Department and Job Role")
plt.xlabel("Job Role")
plt.ylabel("Department")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()

Thinking

Considering code for attrition rates

I'm thinking about creating minimal code just to compute and display rates. It might be good to sort the table and include counts. The user wants to plot attrition rates by department and job role, but I need to gather some missing values first. Including employee counts could help put these rates into context. A table with attrition percentage and counts might be enough. I need to keep it straightforward, though!

The heatmap was generated, so the breakdown by Department and JobRole is in place. To make it interpretable in text as well, the missing piece is the underlying attrition-rate values for each department/job-role combination.

Compute attrition rates

rate_values = (
    df.assign(attrition_flag=df["Attrition"].eq("Yes"))
      .groupby(["Department", "JobRole"])
      .agg(
          employees=("Attrition", "size"),
          attrition_rate_pct=("attrition_flag", lambda s: s.mean() * 100),
      )
      .reset_index()
      .sort_values("attrition_rate_pct", ascending=False)
)

display(rate_values.round({"attrition_rate_pct": 1}))
rate_values

	Department	JobRole	employees	attrition_rate_pct
10	Sales	Sales Representative	83	39.8
3	Research & Development	Laboratory Technician	259	23.9
0	Human Resources	Human Resources	52	23.1
9	Sales	Sales Executive	326	17.5
7	Research & Development	Research Scientist	292	16.1
5	Research & Development	Manufacturing Director	145	6.9
2	Research & Development	Healthcare Representative	131	6.9
4	Research & Development	Manager	54	5.6
8	Sales	Manager	37	5.4
6	Research & Development	Research Director	80	2.5
1	Human Resources	Manager	11	0.0

	Department	JobRole	employees	attrition_rate_pct
10	Sales	Sales Representative	83	39.759036
3	Research & Development	Laboratory Technician	259	23.938224
0	Human Resources	Human Resources	52	23.076923
9	Sales	Sales Executive	326	17.484663
7	Research & Development	Research Scientist	292	16.095890
5	Research & Development	Manufacturing Director	145	6.896552
2	Research & Development	Healthcare Representative	131	6.870229
4	Research & Development	Manager	54	5.555556
8	Sales	Manager	37	5.405405
6	Research & Development	Research Director	80	2.500000
1	Human Resources	Manager	11	0.000000

Thinking

Analyzing attrition data

I’m looking at the attrition data, noting the top visible combinations like Sales/Sales Representative at around 39.8% and R&D/Laboratory Technician at about 23.9%. Human Resources also seems to be among the higher attrition groups, but I should avoid exact values if they’re not visible. Interestingly, Manager and Research Director roles appear near the lower attrition end, suggesting they may have less attrition overall. The heatmap indicates that attrition isn’t uniform across roles. Let’s finalize this concisely!

The plot shows that attrition varies much more by job role than by department alone.

From the values behind the heatmap:

Sales → Sales Representative is the clear hotspot, at about 39.8% attrition, far above the company average of about 16.1%.
In Research & Development, Laboratory Technician is another high-attrition role at about 23.9%.
Human Resources → Human Resources is also elevated at about 23.1%, while Sales Executive (about 17.5%) and Research Scientist (about 16.1%) sit close to the company average, so the heavier attrition is not spread evenly across all roles.
Lower-risk areas appear to be more senior roles such as Manager and Research Director, which sit near the bottom of the sorted table.

So the main takeaway from the chart is:

Department matters somewhat
but JobRole is the stronger driver of visible differences
with the most pronounced concern concentrated in Sales Representatives and, to a lesser extent, R&D Laboratory Technicians.
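
To put a number on the department-versus-role comparison, the sketch below rebuilds the sorted table from the values shown above and rolls it up to department level. Leaver counts are not part of the original output; they are recovered here by rounding `employees * attrition_rate_pct / 100`, which is an assumption of this example.

```python
import pandas as pd

# (Department, JobRole, employees, attrition_rate_pct) copied from the run output.
rows = [
    ("Sales", "Sales Representative", 83, 39.759036),
    ("Research & Development", "Laboratory Technician", 259, 23.938224),
    ("Human Resources", "Human Resources", 52, 23.076923),
    ("Sales", "Sales Executive", 326, 17.484663),
    ("Research & Development", "Research Scientist", 292, 16.095890),
    ("Research & Development", "Manufacturing Director", 145, 6.896552),
    ("Research & Development", "Healthcare Representative", 131, 6.870229),
    ("Research & Development", "Manager", 54, 5.555556),
    ("Sales", "Manager", 37, 5.405405),
    ("Research & Development", "Research Director", 80, 2.500000),
    ("Human Resources", "Manager", 11, 0.000000),
]
table = pd.DataFrame(
    rows, columns=["Department", "JobRole", "employees", "attrition_rate_pct"]
)

# Assumption: recover whole-number leaver counts from the published rates.
table["leavers"] = (
    (table["employees"] * table["attrition_rate_pct"] / 100).round().astype(int)
)

# Roll up to department level and recompute the rate from the counts.
dept = table.groupby("Department")[["employees", "leavers"]].sum()
dept["attrition_rate_pct"] = (dept["leavers"] / dept["employees"] * 100).round(1)
print(dept.sort_values("attrition_rate_pct", ascending=False))
```

Under that assumption the department rates come out at about 20.6% (Sales), 19.0% (Human Resources), and 13.8% (Research & Development), a spread of roughly 7 points, while the role-level rates in the table range from 0.0% to 39.8%.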

compare monthly income distribution for employees who left vs stayed

Thinking

Planning data visualization steps

I need to respond to the user’s prompt about comparing the monthly income distribution for employees who left versus those who stayed. I should propose a single plotting step, using the loaded dataset. A boxplot or a violin plot could work well. Keeping the response concise, I can create one code block for the plot, like a boxplot with a strip overlay or a simple histplot. I won’t assume anything from previous outputs, so I’ll provide the next code step clearly and stick to only one chart.

Defining plotting parameters

I’m focusing on using a boxplot to compare the monthly income distribution between employees who stayed and those who left. I’ll create an ordered status variable for "Stayed" and "Left" and set the y-axis to MonthlyIncome and the x-axis to Attrition. This method is effective for visual comparisons. I'll use seaborn.boxplot, and since plots display automatically in the notebook with plt.show(), there won’t be a need for a separate expression at the end. It’s essential to stick to just one code block!

A good first comparison is a boxplot of MonthlyIncome for employees who stayed vs left. It will quickly show differences in median, spread, and outliers.

Plot monthly income by attrition

import seaborn as sns
import matplotlib.pyplot as plt

plot_df = df.copy()
plot_df["AttritionStatus"] = plot_df["Attrition"].map({"No": "Stayed", "Yes": "Left"})

plt.figure(figsize=(8, 5))
sns.boxplot(
    data=plot_df,
    x="AttritionStatus",
    y="MonthlyIncome",
    order=["Stayed", "Left"]
)
plt.title("Monthly Income Distribution: Stayed vs Left")
plt.xlabel("Employee Status")
plt.ylabel("Monthly Income")
plt.tight_layout()
plt.show()

The current output confirms the boxplot was created, but it doesn’t expose the underlying numbers, so it isn’t enough to make a concrete comparison yet.

This table will give the key distribution statistics for MonthlyIncome among employees who stayed vs left:

Summarize income by attrition

income_summary = (
    df.assign(AttritionStatus=df["Attrition"].map({"No": "Stayed", "Yes": "Left"}))
      .groupby("AttritionStatus")["MonthlyIncome"]
      .describe(percentiles=[0.25, 0.5, 0.75])
      [["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
      .round(0)
)

income_summary

	count	mean	std	min	25%	50%	75%	max
AttritionStatus
Left	237.0	4787.0	3640.0	1009.0	2373.0	3202.0	5916.0	19859.0
Stayed	1233.0	6833.0	4818.0	1051.0	3211.0	5204.0	8834.0	19999.0

Thinking

Clarifying comparison metrics

I need to take path A since we have a sufficient numeric table. The final answer should be natural and concise, focusing on comparing monthly income distribution for those who left versus those who stayed. I’ll mention important statistics like medians, quartiles, means, overlap, and variability, including potential outliers. I shouldn't restate visible outputs but can reference specific values as evidence, making sure to provide insights from the latest observed outputs without unnecessary repetition.

Employees who left had a clearly lower monthly income distribution than those who stayed.

From the observed summary:

Median monthly income
- Left: 3202 USD
- Stayed: 5204 USD
Average monthly income
- Left: 4787 USD
- Stayed: 6833 USD
Middle 50% of employees
- Left: about 2373 to 5916 USD
- Stayed: about 3211 to 8834 USD

What this means:

The entire distribution for employees who left is shifted lower.
Employees who stayed tend to earn more, both at the median and on average.
There is still overlap between the two groups, so income alone does not explain attrition.
Both groups have high-income outliers, since the maximum income is close to 20000 USD in each group.

Overall, lower monthly income is associated with higher attrition in this dataset.
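
The size of the gap can be made explicit from the summary table. The sketch below uses only the quartiles reported above; the `summary` dictionary and the variable names are illustrative.

```python
# Quartiles of MonthlyIncome copied from the summary table above.
summary = {
    "Left": {"q1": 2373, "median": 3202, "q3": 5916},
    "Stayed": {"q1": 3211, "median": 5204, "q3": 8834},
}

left, stayed = summary["Left"], summary["Stayed"]

# Absolute and relative gap between the two medians.
median_gap = stayed["median"] - left["median"]
median_ratio = left["median"] / stayed["median"]

# Interquartile ranges and the interval where the two middle-50% bands overlap.
iqr_left = left["q3"] - left["q1"]
iqr_stayed = stayed["q3"] - stayed["q1"]
overlap = (max(left["q1"], stayed["q1"]), min(left["q3"], stayed["q3"]))

print(median_gap)              # 2002
print(round(median_ratio, 2))  # 0.62
print(iqr_left, iqr_stayed)    # 3543 5623
print(overlap)                 # (3211, 5916)
```

In other words, the median leaver earned roughly 62% of the median stayer's monthly income, yet the two middle-50% bands still share the range from 3211 to 5916.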

show correlation between job satisfaction, work-life balance and attrition

Thinking

Analyzing correlation

I need to provide an answer based on the current state of the notebook. The user wants to see the correlation between job satisfaction, work-life balance, and attrition. I think the next step is to produce a concise analysis and maybe code for a correlation table. Since we have our dataframe loaded, I can create a numeric correlation matrix among JobSatisfaction, WorkLifeBalance, and the attrition flag, keeping in mind that attrition is binary. I'll focus on one brief explanation and a single code block. Let's craft this!

A good first cut is to convert attrition to a binary flag and compute a small correlation matrix. Since Attrition is yes/no and the other two are ordinal scores, this gives a quick directional view.

Compute correlation matrix

corr_df = df[["JobSatisfaction", "WorkLifeBalance"]].copy()
corr_df["AttritionFlag"] = df["Attrition"].eq("Yes").astype(int)

corr_df.corr().round(3)

	JobSatisfaction	WorkLifeBalance	AttritionFlag
JobSatisfaction	1.000	-0.019	-0.103
WorkLifeBalance	-0.019	1.000	-0.064
AttritionFlag	-0.103	-0.064	1.000

Thinking

Creating concise insights

I'm focusing on providing a concise answer that demonstrates the correlation between job satisfaction, work-life balance, and attrition. From output-6, I've gathered numeric values for the correlation matrix, which are sufficient. I want to highlight the weak negative correlations and the near-zero relationship between job satisfaction and work-life balance. I'll also caution that correlation on binary attrition and ordinal scales is a simple directional measure, not necessarily causal. Let’s keep it focused and impactful.

The correlations are all weak, but they point in the expected direction:

JobSatisfaction vs Attrition: -0.103
WorkLifeBalance vs Attrition: -0.064
JobSatisfaction vs WorkLifeBalance: -0.019

What this means:

Employees with higher job satisfaction are slightly less likely to leave.
Employees with better work-life balance are also slightly less likely to leave.
Job satisfaction has a somewhat stronger relationship with attrition than work-life balance, but both effects are small.
Job satisfaction and work-life balance are almost uncorrelated with each other in this dataset, so they appear to capture different aspects of the employee experience.

Overall, both factors are negatively associated with attrition, but neither shows a strong standalone correlation, so attrition is likely influenced by additional variables too.
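
Because `AttritionFlag` is binary, its Pearson correlation with a score is the point-biserial correlation, which depends only on the difference in group means, the spread of the score, and the share of leavers. The sketch below checks that identity on a small made-up sample; the `score` and `flag` arrays are illustrative and are not taken from the dataset.

```python
import numpy as np

# Made-up satisfaction scores (1-4) and a 0/1 attrition flag, for illustration only.
score = np.array([1, 2, 2, 3, 3, 4, 4, 4, 1, 2], dtype=float)
flag = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])

# Pearson correlation, the default method of DataFrame.corr().
r_pearson = np.corrcoef(score, flag)[0, 1]

# Point-biserial form:
# (mean of leavers - mean of stayers) / population std * sqrt(p * (1 - p))
p = flag.mean()
mean_left = score[flag == 1].mean()
mean_stayed = score[flag == 0].mean()
r_point_biserial = (
    (mean_left - mean_stayed) / score.std(ddof=0) * np.sqrt(p * (1 - p))
)

print(np.isclose(r_pearson, r_point_biserial))  # True
print(mean_left < mean_stayed, r_pearson < 0)   # True True
```

Read this way, a weak value such as -0.103 suggests that leavers' average job satisfaction is only slightly lower than that of stayers relative to the overall spread of the score.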

Score cards for gpt-5.4 for HR Employee Attrition Analysis in Python

Total score: 10/10

Task Completion: 2/2
Excellent
All four requested analyses were attempted: dataset loading/attrition rate, attrition by department/job role, income comparison, and correlation with satisfaction/work-life balance.
Execution Correctness: 2/2
Excellent
The provided code is syntactically coherent and the outputs shown are consistent with the computations, so it appears runnable.
Output Quality: 3/3
Excellent
The outputs match the expected outcomes semantically: 1470x35 shape, 16.12% attrition, a department/job-role attrition breakdown, income summary for left vs stayed, and the requested correlation matrix.
Reasoning Quality: 2/2
Excellent
The explanations correctly interpret the results, including the higher attrition in Sales Representative roles, lower income for leavers, and negative correlations with attrition.
Reliability: 1/1
Excellent
The workflow is consistent and grounded in the dataset outputs, with no obvious hallucinations or unsupported claims.
