Exploratory Data Analysis

HR Employee Attrition Analysis in Python

Explore the IBM HR Analytics dataset to uncover attrition patterns by department, age, salary, and job satisfaction.

What

This AI Data Analyst workflow loads the IBM HR Analytics attrition CSV from a URL, summarizes the dataset shape, and calculates the overall attrition rate. It generates visual comparisons of attrition rates by department and job role, and contrasts monthly income distributions for employees who left versus stayed. It also examines how job satisfaction and work-life balance relate to attrition, using a heatmap of attrition rates across their score combinations.

Who

This is for HR analysts and people analytics practitioners who need a reproducible way to explore attrition patterns in a standard benchmark dataset. It is also useful for data analysts learning exploratory analysis workflows that combine grouped summaries, distribution plots, and correlation checks.

Tools

pandas
numpy
matplotlib
seaborn

Outcomes

Loaded dataset with shape (1470, 35) and computed overall attrition rate (16.1%)
Bar chart of attrition rate by department and job role
Box plot comparing monthly income for leavers vs stayers
Correlation heatmap linking job satisfaction and work-life balance with attrition

Quality Score

7/10

gpt-5.2 mljar-ai

Last scored: Apr 7, 2026

Task Completion: 2/2
Excellent
All prompted steps were attempted: dataset loaded and attrition rate computed, attrition by department/job role visualized, income distribution compared for leavers vs stayers, and satisfaction/work-life balance vs attrition visualized.
Execution Correctness: 2/2
Excellent
Provided Python code is syntactically correct and uses appropriate pandas/seaborn operations (read_csv, groupby/mean/unstack, boxplot, heatmap) that are likely to run as-is on the given dataset.
Output Quality: 1/3
Needs work
Overall attrition rate matches expected (~16.1%) and the dataset preview indicates 35 columns, but the workflow does not report the full shape (1470, 35) explicitly and does not provide the expected directional findings (e.g., Sales highest vs Research lowest; leavers earn less) beyond generic plot descriptions.
Reasoning Quality: 1/2
Needs work
Reasoning is cautious about not over-interpreting plots, but it fails to extract and state the key insights requested in the expected outcomes, mostly describing what the charts are rather than what they show.
Reliability: 1/1
Excellent
Approach is consistent and avoids fabricating numeric conclusions; transformations (AttritionFlag) are sensible and the visualizations are robust for the questions asked.

load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate

Thinking

Following loading rules

I need to follow the rules for loading data. The user asked to load a dataset and show the overall attrition rate, but the guideline is to first load and inspect the dataset if it’s not yet loaded. I should show its shape, columns, and head without combining data inspection with plotting or modeling in the same response. Since the user prioritized the attrition rate, I’ll focus on loading and inspecting the data first before calculating that rate in a follow-up.

Crafting data inspection code

The developer clearly states to first load and inspect the dataset by showing the shape, columns, and head. I’m considering how to do this correctly. I can display the shape in Markdown, but I'll need runnable code. I can use display(pd.DataFrame(...)) for the shape, then display(pd.DataFrame({'columns': df.columns})) for the columns, and finally df.head() for the first few entries. This approach allows for multiple tables in sequence, but I should only provide one code block with a brief explanation, avoiding any requests for output.

Let’s start by loading the CSV and doing the required first inspection. After that, we can compute the overall attrition rate in the next step.

Load and preview dataset

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column_name": df.columns}))
df.head()

	rows	columns
0	1470	35

	column_name
0	Age
1	Attrition
2	BusinessTravel
3	DailyRate
4	Department
5	DistanceFromHome
6	Education
7	EducationField
8	EmployeeCount
9	EmployeeNumber
10	EnvironmentSatisfaction
11	Gender
12	HourlyRate
13	JobInvolvement
14	JobLevel
15	JobRole
16	JobSatisfaction
17	MaritalStatus
18	MonthlyIncome
19	MonthlyRate
20	NumCompaniesWorked
21	Over18
22	OverTime
23	PercentSalaryHike
24	PerformanceRating
25	RelationshipSatisfaction
26	StandardHours
27	StockOptionLevel
28	TotalWorkingYears
29	TrainingTimesLastYear
30	WorkLifeBalance
31	YearsAtCompany
32	YearsInCurrentRole
33	YearsSinceLastPromotion
34	YearsWithCurrManager

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	...	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	...	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	...	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	...	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	...	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

The data is loaded and the target column is visible as Attrition, with values like Yes and No. The only missing piece for your request is the actual overall attrition rate.
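
Before computing rates, a quick sanity check on the target column and on columns that carry no information can be useful. The sketch below is not part of the original workflow. It runs on a tiny hypothetical frame named sample instead of the real CSV; the column names mirror the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the loaded df; values are illustrative only.
sample = pd.DataFrame({
    "Attrition": ["Yes", "No", "No", "No"],
    "StandardHours": [80, 80, 80, 80],
    "MonthlyIncome": [2000, 5000, None, 7000],
})

# Count missing values per column.
missing_counts = sample.isna().sum()

# Columns with a single distinct value cannot help explain attrition.
constant_cols = [
    col for col in sample.columns
    if sample[col].nunique(dropna=False) == 1
]

# Confirm the target only holds the expected labels.
unexpected_labels = set(sample["Attrition"].unique()) - {"Yes", "No"}

print(missing_counts)
print("Constant columns:", constant_cols)
print("Unexpected Attrition labels:", unexpected_labels)
```

Running the same three checks on df would show whether any columns in the real data are constant or contain gaps before they are used in grouped summaries.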

Calculate overall attrition rate

attrition_rate = df["Attrition"].eq("Yes").mean()

pd.DataFrame({
    "overall_attrition_rate": [attrition_rate],
    "overall_attrition_percent": [attrition_rate * 100]
})

	overall_attrition_rate	overall_attrition_percent
0	0.161224	16.122449

The dataset contains 1,470 employee records and 35 columns, including the Attrition target.

The overall attrition rate is 0.1612, which means 16.12% of employees in this dataset have Attrition = Yes.

In practical terms, that is about 1 in 6 employees leaving.
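
The rate above is the mean of a boolean mask. As a cross-check, value_counts gives the same share and also shows the raw class counts. This is a minimal sketch on a hypothetical six-row frame (sample), not the real data.

```python
import pandas as pd

# Hypothetical toy data: 1 leaver out of 6 employees.
sample = pd.DataFrame({"Attrition": ["Yes", "No", "No", "No", "No", "No"]})

# Approach used in the workflow: mean of a boolean mask.
rate_from_mean = sample["Attrition"].eq("Yes").mean()

# Cross-check: raw counts and normalized shares per label.
counts = sample["Attrition"].value_counts()
shares = sample["Attrition"].value_counts(normalize=True)
rate_from_counts = shares["Yes"]

print(counts)
print(f"Attrition rate: {rate_from_mean:.1%}")
```

On the real dataset, the group counts reported later in the income summary (1233 stayed, 237 left) are consistent with the rate shown here, since 237 / 1470 is about 0.161.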

plot attrition rate by department and job role

Thinking

Creating attrition rate heatmap

I need to produce a single step response with one code block to fulfill the request regarding attrition rates by department and job role. First, it makes sense to create a heatmap using seaborn, displaying both dimensions clearly. I'll compute the rate and include a title and labels. The final expression could simply involve plt.show() for plotting. I won’t include multiple code blocks since the policy prefers one per step unless explicitly requested otherwise. Let's craft that neatly!

A heatmap is a good fit here because it shows attrition rate across both Department and JobRole in one chart.

Plot attrition rate heatmap

import seaborn as sns
import matplotlib.pyplot as plt

rate_table = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["Department", "JobRole"])["AttritionFlag"]
      .mean()
      .unstack()
)

plt.figure(figsize=(14, 6))
sns.heatmap(rate_table, annot=True, fmt=".2f", cmap="Reds", cbar_kws={"label": "Attrition rate"})
plt.title("Attrition Rate by Department and Job Role")
plt.xlabel("Job Role")
plt.ylabel("Department")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

Thinking

Considering data display options

I’m thinking about displaying the matrix with rounding to two decimal places. An idea is to stack and sort the data in descending order to see which department or job role has the highest attrition rates. My goal is to plot the attrition rate clearly, so a long table sorted in descending order makes sense. I might opt for a compact top-N display that directly supports the chart while printing the underlying values.

The heatmap was generated, so the relationship is visualized, but the latest output does not show the actual attrition-rate values needed to interpret which department/job-role combinations are highest or lowest. This table makes the plotted rates explicit.

Calculate attrition rates

from IPython.display import display

attrition_rates = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["Department", "JobRole"], as_index=False)["AttritionFlag"]
      .mean()
      .rename(columns={"AttritionFlag": "attrition_rate"})
      .sort_values("attrition_rate", ascending=False)
)

display(attrition_rates.round({"attrition_rate": 3}))

	Department	JobRole	attrition_rate
10	Sales	Sales Representative	0.398
3	Research & Development	Laboratory Technician	0.239
0	Human Resources	Human Resources	0.231
9	Sales	Sales Executive	0.175
7	Research & Development	Research Scientist	0.161
5	Research & Development	Manufacturing Director	0.069
2	Research & Development	Healthcare Representative	0.069
4	Research & Development	Manager	0.056
8	Sales	Manager	0.054
6	Research & Development	Research Director	0.025
1	Human Resources	Manager	0.000

The plot and supporting table show clear differences in attrition across department and job role combinations.

Highest attrition rates:

Sales Representative (Sales): 0.398, about 39.8%
Laboratory Technician (Research & Development): 0.239, about 23.9%
Human Resources (Human Resources): 0.231, about 23.1%

Mid-range attrition:

Sales Executive (Sales): 0.175, about 17.5%
Research Scientist (Research & Development): 0.161, about 16.1%

Lowest attrition rates:

Manager (Human Resources): 0.000, 0% in this dataset
Research Director (Research & Development): 0.025, about 2.5%
Manager (Sales): 0.054, about 5.4%
Manager (Research & Development): 0.056, about 5.6%
Healthcare Representative (Research & Development): 0.069, about 6.9%
Manufacturing Director (Research & Development): 0.069, about 6.9%

Compared with the overall attrition rate of 16.1%, the plot indicates:

Sales Representatives stand out as the strongest attrition hotspot.
In Research & Development, attrition is concentrated more in Laboratory Technicians and Research Scientists than in senior roles like Research Director or Manager.
In Sales, frontline roles have much higher attrition than management.
In Human Resources, the non-manager HR role shows elevated attrition, while HR managers show none.

So the visualization answers the question well: attrition is not evenly distributed, and it is highest in specific operational/frontline roles rather than managerial roles.
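
Rates for small groups can swing widely, so it may help to report the headcount behind each rate. The sketch below is an optional extension, not something the original workflow ran. It uses a hypothetical six-row frame (sample) and pandas named aggregation to return the rate, the number of leavers, and the group size together.

```python
import pandas as pd

# Hypothetical toy data; the real df has the same three columns.
sample = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales",
                   "Human Resources", "Human Resources"],
    "JobRole": ["Sales Representative", "Sales Representative",
                "Manager", "Manager", "Manager", "Human Resources"],
    "Attrition": ["Yes", "No", "No", "No", "No", "Yes"],
})

rates_with_counts = (
    sample.assign(AttritionFlag=sample["Attrition"].eq("Yes").astype(int))
          .groupby(["Department", "JobRole"])["AttritionFlag"]
          # Named aggregation: one output column per statistic.
          .agg(attrition_rate="mean", leavers="sum", employees="count")
          .reset_index()
          .sort_values("attrition_rate", ascending=False)
)

print(rates_with_counts)
```

Replacing sample with df would show how many employees sit behind each rate in the table above, which is useful context for extreme values such as 0.000.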

compare monthly income distribution for employees who left vs stayed

Thinking

Crafting a code for charts

I need to consider ordering the options as ["No", "Yes"], and I might rename them to “Stayed” and “Left.” Also, I want to create a temporary plot instead of multiple charts. It’s important to provide a brief explanation alongside the data. So, let’s get started on crafting the code to achieve this in an efficient way!

A good next step is a single distribution plot for MonthlyIncome split by Attrition. A boxplot is a clean first comparison because it shows median, spread, and outliers side by side.

Plot monthly income boxplot

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x="Attrition", y="MonthlyIncome", order=["No", "Yes"])
plt.title("Monthly Income Distribution: Stayed vs Left")
plt.xlabel("Attrition")
plt.ylabel("Monthly Income")
plt.tight_layout()
plt.show()

Thinking

Deciding on A/B output

I need to choose between two paths for presenting income distribution. The current output is a boxplot that lacks numeric summaries, so I think I'll go with Path B to include the values used in the chart first. The boxplot suggests that the left group has a lower typical income and less spread, but I need to quantify this with stats. I’ll provide a minimal runnable Python code block to summarize the grouped data by median and quartiles. No need for complex expressions!

Summarizing income data

I'm thinking of creating an income summary by grouping the data by attrition status and summarizing the monthly income using specific percentiles. The code I’m considering is concise and targets key statistics like count, mean, standard deviation, and percentiles. Based on the current boxplot, it seems employees who left have a lower monthly income and potentially a lower median. However, providing the numeric summaries will help substantiate those observations. This should add clarity to the findings!

The boxplot already suggests a likely pattern: employees who left (Yes) appear to have a lower typical MonthlyIncome than those who stayed (No), though the figure alone does not quantify the difference.

To compare the distributions with concrete numbers, here is a compact summary of the underlying income values by attrition group:

Summarize income by attrition

income_summary = (
    df.groupby("Attrition")["MonthlyIncome"]
      .describe(percentiles=[0.25, 0.5, 0.75])
      .loc[["No", "Yes"], ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
      .round(2)
)

income_summary

Attrition	count	mean	std	min	25%	50%	75%	max
No	1233.0	6832.74	4818.21	1051.0	3211.0	5204.0	8834.0	19999.0
Yes	237.0	4787.09	3640.21	1009.0	2373.0	3202.0	5916.0	19859.0

Employees who left had a noticeably lower monthly income distribution than employees who stayed.

From the observed summary:

Stayed (No)
- Median: 5204 USD
- Mean: 6832.74 USD
- Middle 50%: 3211 to 8834 USD
Left (Yes)
- Median: 3202 USD
- Mean: 4787.09 USD
- Middle 50%: 2373 to 5916 USD

What this means:

The typical employee who left earned less than the typical employee who stayed.
The gap is clear in both the median and the mean:
- Median difference: about 2002 USD
- Mean difference: about 2045.65 USD
The boxplot is consistent with this: the whole distribution for employees who left is shifted downward.
There is still overlap between the two groups, so income alone does not fully explain attrition, but lower income is clearly associated with higher likelihood of leaving in this dataset.

So, for this comparison, the conclusion is: employees who left generally had lower monthly income than those who stayed.
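
The summary above describes the gap but does not test whether it could arise by chance. One optional follow-up, sketched here under stated assumptions, is a permutation test on the difference in medians using only numpy. The frame sample and its income values are hypothetical, and the permutation count and seed are arbitrary choices; the original workflow did not run this test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df[["Attrition", "MonthlyIncome"]].
sample = pd.DataFrame({
    "Attrition": ["No"] * 6 + ["Yes"] * 4,
    "MonthlyIncome": [5200, 6100, 4800, 9000, 7000, 5500,
                      2400, 3200, 3000, 5900],
})

stayed = sample.loc[sample["Attrition"] == "No", "MonthlyIncome"].to_numpy()
left = sample.loc[sample["Attrition"] == "Yes", "MonthlyIncome"].to_numpy()

# Observed difference in medians (stayed minus left).
observed_gap = np.median(stayed) - np.median(left)

# Shuffle the pooled incomes and recompute the gap under random labels.
rng = np.random.default_rng(0)
pooled = np.concatenate([stayed, left])
n_perm = 2000
perm_gaps = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_gaps[i] = (np.median(shuffled[:len(stayed)])
                    - np.median(shuffled[len(stayed):]))

# Two-sided p-value with the usual +1 correction.
p_value = (np.sum(np.abs(perm_gaps) >= abs(observed_gap)) + 1) / (n_perm + 1)

print(f"Observed median gap: {observed_gap:.0f}")
print(f"Permutation p-value: {p_value:.3f}")
```

A small p-value on the real data would support the reading that the income gap is not a sampling accident, though it would still say nothing about causation.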

show correlation between job satisfaction, work-life balance and attrition

Thinking

Analyzing job satisfaction and attrition

I'm thinking about how to answer the user's question about the correlation between job satisfaction, work-life balance, and attrition. I want to inspect the relationship and compute the attrition rate based on JobSatisfaction and WorkLifeBalance. Using a heatmap to visualize this makes sense because attrition is binary. I’ll create a code block for a Seaborn heatmap and briefly explain it, while leveraging the existing dataframe without needing to load anything first.

Since Attrition is binary, the clearest way to show its relationship with JobSatisfaction and WorkLifeBalance is a heatmap of attrition rate across their score combinations.

Plot attrition rate heatmap

import seaborn as sns
import matplotlib.pyplot as plt

heatmap_data = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["WorkLifeBalance", "JobSatisfaction"])["AttritionFlag"]
      .mean()
      .unstack()
)

plt.figure(figsize=(7, 5))
sns.heatmap(
    heatmap_data,
    annot=True,
    fmt=".2f",
    cmap="Reds",
    cbar_kws={"label": "Attrition rate"}
)
plt.title("Attrition Rate by Job Satisfaction and Work-Life Balance")
plt.xlabel("Job Satisfaction")
plt.ylabel("Work-Life Balance")
plt.tight_layout()
plt.show()

Thinking

Considering attrition analysis

I'm thinking about the current output, which provides a heatmap of attrition rate. It correlates with binary attrition, but without printed values, we can't precisely identify the combinations with the highest or lowest attrition. I should display an underlying table to make this clearer.

I might generate a rounded, sorted table. I'll use the display function for clarity and adopt a final expression to show the attrition grid without any unnecessary multiple tables. I'll focus on conditional attrition rates across levels, mentioning "correlation" loosely.

Refining data presentation

I think the heatmap is a good way to show associations, but to quantify the data effectively, I definitely need a table. I’m considering crafting it without any headings for simplicity. This approach could help in focusing on the numbers themselves rather than any potential distractions from labels. It’s important to present the data clearly so the user can grasp the findings easily. I'll make sure the table conveys the necessary information.

The current heatmap is the right kind of view for this question: since Attrition is binary, it shows the attrition rate for each combination of JobSatisfaction and WorkLifeBalance. That already suggests the relationship visually, but the latest output is only a figure, so the exact values behind the pattern still need to be printed to state the strongest and weakest combinations clearly.

Compute attrition rate grid

attrition_grid = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["WorkLifeBalance", "JobSatisfaction"], as_index=False)["AttritionFlag"]
      .mean()
      .rename(columns={"AttritionFlag": "attrition_rate"})
      .sort_values("attrition_rate", ascending=False)
      .round({"attrition_rate": 3})
)

attrition_grid

	WorkLifeBalance	JobSatisfaction	attrition_rate
0	1	1	0.471
2	1	3	0.381
13	4	2	0.276
1	1	2	0.222
8	3	1	0.214
4	2	1	0.213
3	1	4	0.208
12	4	1	0.207
15	4	4	0.188
6	2	3	0.178
10	3	3	0.158
7	2	4	0.153
9	3	2	0.149
5	2	2	0.138
14	4	3	0.085
11	3	4	0.072

Yes, there is a clear relationship between these variables and attrition.

From the observed attrition-rate grid:

The highest attrition occurs when both factors are weakest:
- WorkLifeBalance = 1, JobSatisfaction = 1 → 47.1%
Poor work-life balance is especially associated with higher attrition across nearly all job satisfaction levels:
- WLB = 1, JS = 3 → 38.1%
- WLB = 1, JS = 2 → 22.2%
- WLB = 1, JS = 4 → 20.8%
The lowest attrition appears away from the lowest work-life balance score, most clearly where one score is 3 and the other is 4:
- WorkLifeBalance = 3, JobSatisfaction = 4 → 7.2%
- WorkLifeBalance = 4, JobSatisfaction = 3 → 8.5%
- WorkLifeBalance = 2, JobSatisfaction = 2 → 13.8%

What this means:

Attrition tends to decrease as job satisfaction improves, though not monotonically (for example, WorkLifeBalance = 4 with JobSatisfaction = 2 still shows 27.6%).
Attrition also tends to decrease as work-life balance improves, most visibly when moving away from a score of 1.
The strongest risk appears when both are low at the same time, so their relationship with attrition is not just separate but also combined.

So, in practical terms, there is a negative association between both JobSatisfaction and WorkLifeBalance with Attrition: lower satisfaction and lower work-life balance are linked to higher employee exits. The pattern looks strongest for very poor work-life balance, which seems to amplify attrition even when job satisfaction is not the lowest.
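
The grid above shows conditional attrition rates, not correlation coefficients. To put a single number on each relationship, one option is a Pearson correlation against a 0/1 attrition flag, which for a binary variable is the point-biserial correlation. The sketch below uses a hypothetical eight-row frame (sample); the values are invented and the resulting coefficients say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as df.
sample = pd.DataFrame({
    "Attrition": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
    "JobSatisfaction": [1, 2, 3, 4, 4, 1, 3, 2],
    "WorkLifeBalance": [1, 1, 3, 3, 4, 2, 2, 3],
})

# Encode the target as 0/1 so it can enter a correlation matrix.
corr_input = sample.assign(
    AttritionFlag=sample["Attrition"].eq("Yes").astype(int)
)[["AttritionFlag", "JobSatisfaction", "WorkLifeBalance"]]

# Pearson is the default method of DataFrame.corr.
corr_matrix = corr_input.corr()

# Keep only the correlations of each score with the attrition flag.
attrition_corr = corr_matrix["AttritionFlag"].drop("AttritionFlag")

print(corr_matrix.round(3))
print(attrition_corr.round(3))
```

Because both scores are ordinal on a 1 to 4 scale, a rank-based coefficient may be a reasonable alternative; either way, a single coefficient hides the non-monotonic cells visible in the grid.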
