Exploratory Data Analysis

HR Employee Attrition Analysis in Python

Explore the IBM HR Analytics dataset to uncover attrition patterns by department, age, salary, and job satisfaction.

What

This AI Data Analyst workflow loads the IBM HR Analytics attrition CSV from a URL, summarizes the dataset shape, and calculates the overall attrition rate. It generates visual comparisons of attrition rates by department and job role, and contrasts monthly income distributions for employees who left versus stayed. It also examines how job satisfaction and work-life balance relate to attrition using an attrition-rate heatmap and a supporting rate table.

Who

This is for HR analysts and people analytics practitioners who need a reproducible way to explore attrition patterns in a standard benchmark dataset. It is also useful for data analysts learning exploratory analysis workflows that combine grouped summaries, distribution plots, and correlation checks.

Tools

  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • Loaded dataset with shape (1470, 35) and computed overall attrition rate (16.1%)
  • Bar chart of attrition rate by department and job role
  • Box plot comparing monthly income for leavers vs stayers
  • Correlation heatmap linking job satisfaction and work-life balance with attrition

Quality Score

7/10

gpt-5.2 | mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All prompted steps were attempted: dataset loaded and attrition rate computed, attrition by department/job role visualized, income distribution compared for leavers vs stayers, and satisfaction/work-life balance vs attrition visualized.

  • Execution Correctness: 2/2

    Excellent

    Provided Python code is syntactically correct and uses appropriate pandas/seaborn operations (read_csv, groupby/mean/unstack, boxplot, heatmap) that are likely to run as-is on the given dataset.

  • Output Quality: 1/3

    Needs work

    Overall attrition rate matches expected (~16.1%) and the dataset preview indicates 35 columns, but the workflow does not report the full shape (1470, 35) explicitly and does not provide the expected directional findings (e.g., Sales highest vs Research lowest; leavers earn less) beyond generic plot descriptions.

  • Reasoning Quality: 1/2

    Needs work

    Reasoning is cautious about not over-interpreting plots, but it fails to extract and state the key insights requested in the expected outcomes, mostly describing what the charts are rather than what they show.

  • Reliability: 1/1

    Excellent

    Approach is consistent and avoids fabricating numeric conclusions; transformations (AttritionFlag) are sensible and the visualizations are robust for the questions asked.

load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate
Thinking

Following loading rules

I need to follow the rules for loading data. The user asked to load a dataset and show the overall attrition rate, but the guideline is to first load and inspect the dataset if it’s not yet loaded. I should show its shape, columns, and head without combining data inspection with plotting or modeling in the same response. Since the user prioritized the attrition rate, I’ll focus on loading and inspecting the data first before calculating that rate in a follow-up.

Crafting data inspection code

The developer clearly states to first load and inspect the dataset by showing the shape, columns, and head. I’m considering how to do this correctly. I can display the shape in Markdown, but I'll need runnable code. I can use display(pd.DataFrame(...)) for the shape, then display(pd.DataFrame({'columns': df.columns})) for the columns, and finally df.head() for the first few entries. This approach allows for multiple tables in sequence, but I should only provide one code block with a brief explanation, avoiding any requests for output.

Let’s start by loading the CSV and doing the required first inspection. After that, we can compute the overall attrition rate in the next step.

Load and preview dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column_name": df.columns}))
df.head()
   rows  columns
0  1470       35
column_name
0 Age
1 Attrition
2 BusinessTravel
3 DailyRate
4 Department
5 DistanceFromHome
6 Education
7 EducationField
8 EmployeeCount
9 EmployeeNumber
10 EnvironmentSatisfaction
11 Gender
12 HourlyRate
13 JobInvolvement
14 JobLevel
15 JobRole
16 JobSatisfaction
17 MaritalStatus
18 MonthlyIncome
19 MonthlyRate
20 NumCompaniesWorked
21 Over18
22 OverTime
23 PercentSalaryHike
24 PerformanceRating
25 RelationshipSatisfaction
26 StandardHours
27 StockOptionLevel
28 TotalWorkingYears
29 TrainingTimesLastYear
30 WorkLifeBalance
31 YearsAtCompany
32 YearsInCurrentRole
33 YearsSinceLastPromotion
34 YearsWithCurrManager
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

The data is loaded and the target column is visible as Attrition, with values like Yes and No. The only missing piece for your request is the actual overall attrition rate.

Calculate overall attrition rate
attrition_rate = df["Attrition"].eq("Yes").mean()

pd.DataFrame({
    "overall_attrition_rate": [attrition_rate],
    "overall_attrition_percent": [attrition_rate * 100]
})
   overall_attrition_rate  overall_attrition_percent
0                0.161224                  16.122449

The dataset contains 1,470 employee records and 35 columns, including the Attrition target.

The overall attrition rate is 0.1612, which means 16.12% of employees in this dataset have Attrition = Yes.

In practical terms, that is about 1 in 6 employees leaving.
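As a cross-check, the same rate can be derived from raw counts rather than from the mean of a boolean mask. The sketch below does not download the file; it rebuilds a stand-in Attrition column from the group sizes that appear in the income summary later in this workflow (237 Yes, 1,233 No). The names `attrition`, `summary`, `rate_from_mask`, and `rate_from_counts` are illustrative only.

```python
import pandas as pd

# Stand-in for df["Attrition"], rebuilt from the group sizes reported in the
# income summary of this workflow (237 "Yes", 1233 "No"). Not the real file.
attrition = pd.Series(["Yes"] * 237 + ["No"] * 1233, name="Attrition")

# Absolute counts and shares per class
counts = attrition.value_counts()
shares = attrition.value_counts(normalize=True)

summary = pd.DataFrame({"employees": counts, "share": shares.round(4)})
print(summary)

# Two equivalent ways to get the overall attrition rate
rate_from_mask = attrition.eq("Yes").mean()
rate_from_counts = counts["Yes"] / counts.sum()
print(f"Attrition rate: {rate_from_mask:.4f} ({rate_from_mask * 100:.2f}%)")
```

Reporting the counts next to the rate makes the class imbalance visible, which is worth keeping in mind when reading the group comparisons that follow.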

plot attrition rate by department and job role
Thinking

Creating attrition rate heatmap

I need to produce a single step response with one code block to fulfill the request regarding attrition rates by department and job role. First, it makes sense to create a heatmap using seaborn, displaying both dimensions clearly. I'll compute the rate and include a title and labels. The final expression could simply involve plt.show() for plotting. I won’t include multiple code blocks since the policy prefers one per step unless explicitly requested otherwise. Let's craft that neatly!

A heatmap is a good fit here because it shows attrition rate across both Department and JobRole in one chart.

Plot attrition rate heatmap
import seaborn as sns
import matplotlib.pyplot as plt

rate_table = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["Department", "JobRole"])["AttritionFlag"]
      .mean()
      .unstack()
)

plt.figure(figsize=(14, 6))
sns.heatmap(rate_table, annot=True, fmt=".2f", cmap="Reds", cbar_kws={"label": "Attrition rate"})
plt.title("Attrition Rate by Department and Job Role")
plt.xlabel("Job Role")
plt.ylabel("Department")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
Thinking

Considering data display options

I’m thinking about displaying the matrix with rounding to two decimal places. An idea is to stack and sort the data in descending order to see which department or job role has the highest attrition rates. My goal is to plot the attrition rate clearly, so a long table sorted in descending order makes sense. I might opt for a compact top-N display that directly supports the chart while printing the underlying values.

The heatmap was generated, so the relationship is visualized, but the latest output does not show the actual attrition-rate values needed to interpret which department/job-role combinations are highest or lowest. This table makes the plotted rates explicit.

Calculate attrition rates
from IPython.display import display

attrition_rates = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["Department", "JobRole"], as_index=False)["AttritionFlag"]
      .mean()
      .rename(columns={"AttritionFlag": "attrition_rate"})
      .sort_values("attrition_rate", ascending=False)
)

display(attrition_rates.round({"attrition_rate": 3}))
    Department              JobRole                    attrition_rate
10  Sales                   Sales Representative       0.398
3   Research & Development  Laboratory Technician      0.239
0   Human Resources         Human Resources            0.231
9   Sales                   Sales Executive            0.175
7   Research & Development  Research Scientist         0.161
5   Research & Development  Manufacturing Director     0.069
2   Research & Development  Healthcare Representative  0.069
4   Research & Development  Manager                    0.056
8   Sales                   Manager                    0.054
6   Research & Development  Research Director          0.025
1   Human Resources         Manager                    0.000

The plot and supporting table show clear differences in attrition across department and job role combinations.

Highest attrition rates:

  • Sales Representative (Sales): 0.398, about 39.8%
  • Laboratory Technician (Research & Development): 0.239, about 23.9%
  • Human Resources (Human Resources): 0.231, about 23.1%

Mid-range attrition:

  • Sales Executive (Sales): 0.175, about 17.5%
  • Research Scientist (Research & Development): 0.161, about 16.1%

Lowest attrition rates:

  • Manager (Human Resources): 0.000, 0% in this dataset
  • Research Director (Research & Development): 0.025, about 2.5%
  • Manager (Sales): 0.054, about 5.4%
  • Manager (Research & Development): 0.056, about 5.6%
  • Healthcare Representative (Research & Development): 0.069, about 6.9%
  • Manufacturing Director (Research & Development): 0.069, about 6.9%

Compared with the overall attrition rate of 16.1%, the plot indicates:

  • Sales Representatives stand out as the strongest attrition hotspot.
  • In Research & Development, attrition is concentrated more in Laboratory Technicians and Research Scientists than in senior roles like Research Director or Manager.
  • In Sales, frontline roles have much higher attrition than management.
  • In Human Resources, the non-manager HR role shows elevated attrition, while HR managers show none.

So the visualization answers the question well: attrition is not evenly distributed, and it is highest in specific operational/frontline roles rather than managerial roles.
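The rates above are reported without group sizes, so an extreme value such as 0% or 39.8% could rest on a small number of employees. One way to check is to aggregate the headcount and leaver count alongside the rate. The sketch below uses a small made-up frame (`toy`) that only mimics the Department, JobRole, and Attrition columns; its rows and the name `rates_with_n` are hypothetical and do not reproduce the real figures.

```python
import pandas as pd

# Toy data only: invented rows that mimic the relevant columns of the dataset.
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales",
                   "Human Resources", "Human Resources"],
    "JobRole": ["Sales Representative", "Sales Representative", "Manager", "Manager",
                "Manager", "Human Resources"],
    "Attrition": ["Yes", "No", "No", "No", "No", "Yes"],
})

# Headcount, number of leavers, and attrition rate per department/role pair
rates_with_n = (
    toy.assign(AttritionFlag=toy["Attrition"].eq("Yes").astype(int))
       .groupby(["Department", "JobRole"], as_index=False)
       .agg(
           employees=("AttritionFlag", "size"),
           leavers=("AttritionFlag", "sum"),
           attrition_rate=("AttritionFlag", "mean"),
       )
       .sort_values("attrition_rate", ascending=False)
)
print(rates_with_n)
```

On the real data, replacing `toy` with `df` should give the same rate column as the table above, with the two extra columns indicating how much weight each rate can bear.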

compare monthly income distribution for employees who left vs stayed
Thinking

Crafting a code for charts

I need to consider ordering the options as ["No", "Yes"], and I might rename them to “Stayed” and “Left.” Also, I want to create a temporary plot instead of multiple charts. It’s important to provide a brief explanation alongside the data. So, let’s get started on crafting the code to achieve this in an efficient way!

A good next step is a single distribution plot for MonthlyIncome split by Attrition. A boxplot is a clean first comparison because it shows median, spread, and outliers side by side.

Plot monthly income boxplot
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x="Attrition", y="MonthlyIncome", order=["No", "Yes"])
plt.title("Monthly Income Distribution: Stayed vs Left")
plt.xlabel("Attrition")
plt.ylabel("Monthly Income")
plt.tight_layout()
plt.show()
Thinking

Deciding on A/B output

I need to choose between two paths for presenting income distribution. The current output is a boxplot that lacks numeric summaries, so I think I'll go with Path B to include the values used in the chart first. The boxplot suggests that the left group has a lower typical income and less spread, but I need to quantify this with stats. I’ll provide a minimal runnable Python code block to summarize the grouped data by median and quartiles. No need for complex expressions!

Summarizing income data

I'm thinking of creating an income summary by grouping the data by attrition status and summarizing the monthly income using specific percentiles. The code I’m considering is concise and targets key statistics like count, mean, standard deviation, and percentiles. Based on the current boxplot, it seems employees who left have a lower monthly income and potentially a lower median. However, providing the numeric summaries will help substantiate those observations. This should add clarity to the findings!

The boxplot already suggests a likely pattern: employees who left (Yes) appear to have a lower typical MonthlyIncome than those who stayed (No), though the figure alone does not quantify the difference.

To compare the distributions with concrete numbers, here is a compact summary of the underlying income values by attrition group:

Summarize income by attrition
income_summary = (
    df.groupby("Attrition")["MonthlyIncome"]
      .describe(percentiles=[0.25, 0.5, 0.75])
      .loc[["No", "Yes"], ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
      .round(2)
)

income_summary
Attrition  count   mean     std      min     25%     50%     75%     max
No         1233.0  6832.74  4818.21  1051.0  3211.0  5204.0  8834.0  19999.0
Yes        237.0   4787.09  3640.21  1009.0  2373.0  3202.0  5916.0  19859.0

Employees who left had a noticeably lower monthly income distribution than employees who stayed.

From the observed summary:

  • Stayed (No)

    • Median: 5204 USD
    • Mean: 6832.74 USD
    • Middle 50%: 3211 to 8834 USD
  • Left (Yes)

    • Median: 3202 USD
    • Mean: 4787.09 USD
    • Middle 50%: 2373 to 5916 USD

What this means:

  • The typical employee who left earned less than the typical employee who stayed.
  • The gap is clear in both the median and the mean:
    • Median difference: about 2002 USD
    • Mean difference: about 2045.65 USD
  • The boxplot is consistent with this: the whole distribution for employees who left is shifted downward.
  • There is still overlap between the two groups, so income alone does not fully explain attrition, but lower income is clearly associated with higher likelihood of leaving in this dataset.

So, for this comparison, the conclusion is: employees who left generally had lower monthly income than those who stayed.
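The gaps quoted above follow directly from the summary table. The short sketch below redoes that arithmetic and also expresses each gap relative to the stayers' value, which the original summary does not report. The numbers are copied from the table; the variable names are illustrative.

```python
# Values copied from the income summary table in this section.
stayed = {"median": 5204.0, "mean": 6832.74}
left = {"median": 3202.0, "mean": 4787.09}

# Absolute gaps (stayed minus left)
median_gap = stayed["median"] - left["median"]
mean_gap = stayed["mean"] - left["mean"]

# Gaps relative to the stayers' value, in percent
median_gap_pct = median_gap / stayed["median"] * 100
mean_gap_pct = mean_gap / stayed["mean"] * 100

print(f"Median gap: {median_gap:.0f} ({median_gap_pct:.1f}% of stayers' median)")
print(f"Mean gap:   {mean_gap:.2f} ({mean_gap_pct:.1f}% of stayers' mean)")
```

The relative median gap is larger than the relative mean gap, which is consistent with a right-skewed income distribution in which a few high earners pull both means upward.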

show correlation between job satisfaction, work-life balance and attrition
Thinking

Analyzing job satisfaction and attrition

I'm thinking about how to answer the user's question about the correlation between job satisfaction, work-life balance, and attrition. I want to inspect the relationship and compute the attrition rate based on JobSatisfaction and WorkLifeBalance. Using a heatmap to visualize this makes sense because attrition is binary. I’ll create a code block for a Seaborn heatmap and briefly explain it, while leveraging the existing dataframe without needing to load anything first.

Since Attrition is binary, the clearest way to show its relationship with JobSatisfaction and WorkLifeBalance is a heatmap of attrition rate across their score combinations.

Plot attrition rate heatmap
import seaborn as sns
import matplotlib.pyplot as plt

heatmap_data = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["WorkLifeBalance", "JobSatisfaction"])["AttritionFlag"]
      .mean()
      .unstack()
)

plt.figure(figsize=(7, 5))
sns.heatmap(
    heatmap_data,
    annot=True,
    fmt=".2f",
    cmap="Reds",
    cbar_kws={"label": "Attrition rate"}
)
plt.title("Attrition Rate by Job Satisfaction and Work-Life Balance")
plt.xlabel("Job Satisfaction")
plt.ylabel("Work-Life Balance")
plt.tight_layout()
plt.show()
Thinking

Considering attrition analysis

I'm thinking about the current output, which provides a heatmap of attrition rate. It correlates with binary attrition, but without printed values, we can't precisely identify the combinations with the highest or lowest attrition. I should display an underlying table to make this clearer.

I might generate a rounded, sorted table. I'll use the display function for clarity and adopt a final expression to show the attrition grid without any unnecessary multiple tables. I'll focus on conditional attrition rates across levels, mentioning "correlation" loosely.

Refining data presentation

I think the heatmap is a good way to show associations, but to quantify the data effectively, I definitely need a table. I’m considering crafting it without any headings for simplicity. This approach could help in focusing on the numbers themselves rather than any potential distractions from labels. It’s important to present the data clearly so the user can grasp the findings easily. I'll make sure the table conveys the necessary information.

The current heatmap is the right kind of view for this question: since Attrition is binary, it shows the attrition rate for each combination of JobSatisfaction and WorkLifeBalance. That already suggests the relationship visually, but the latest output is only a figure, so the exact values behind the pattern still need to be printed to state the strongest and weakest combinations clearly.

Compute attrition rate grid
attrition_grid = (
    df.assign(AttritionFlag=df["Attrition"].eq("Yes").astype(int))
      .groupby(["WorkLifeBalance", "JobSatisfaction"], as_index=False)["AttritionFlag"]
      .mean()
      .rename(columns={"AttritionFlag": "attrition_rate"})
      .sort_values("attrition_rate", ascending=False)
      .round({"attrition_rate": 3})
)

attrition_grid
    WorkLifeBalance  JobSatisfaction  attrition_rate
0   1                1                0.471
2   1                3                0.381
13  4                2                0.276
1   1                2                0.222
8   3                1                0.214
4   2                1                0.213
3   1                4                0.208
12  4                1                0.207
15  4                4                0.188
6   2                3                0.178
10  3                3                0.158
7   2                4                0.153
9   3                2                0.149
5   2                2                0.138
14  4                3                0.085
11  3                4                0.072

Yes, these variables show a clear relationship with attrition.

From the observed attrition-rate grid:

  • The highest attrition occurs when both factors are weakest:
    • WorkLifeBalance = 1, JobSatisfaction = 1 → 47.1%
  • Poor work-life balance is especially associated with higher attrition across nearly all job satisfaction levels:
    • WLB = 1, JS = 3 → 38.1%
    • WLB = 1, JS = 2 → 22.2%
    • WLB = 1, JS = 4 → 20.8%
  • The lowest attrition appears where both scores are in the upper range:
    • WorkLifeBalance = 3, JobSatisfaction = 4 → 7.2%
    • WorkLifeBalance = 4, JobSatisfaction = 3 → 8.5%
  • The next-lowest cell is a mid-range combination (WorkLifeBalance = 2, JobSatisfaction = 2 → 13.8%), so the pattern is not strictly monotonic.

What this means:

  • Attrition tends to decrease as job satisfaction improves, though not in every cell of the grid.
  • Attrition also tends to decrease as work-life balance improves, with exceptions such as WorkLifeBalance = 4, JobSatisfaction = 2 (27.6%, the third-highest rate).
  • The strongest risk appears when both are low at the same time, so the two factors appear to compound rather than act only separately.

So, in practical terms, there is a negative association between both JobSatisfaction and WorkLifeBalance with Attrition: lower satisfaction and lower work-life balance are linked to higher employee exits. The pattern looks strongest for very poor work-life balance, which seems to amplify attrition even when job satisfaction is not the lowest.
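The steps above report conditional attrition rates rather than correlation coefficients. If a coefficient is wanted, one option is to encode Attrition as a 0/1 flag and call `DataFrame.corr()`; Pearson correlation between a binary flag and a score is the point-biserial correlation. The sketch below runs on a small invented frame (`toy`), so its coefficients say nothing about the real dataset; only the mechanics carry over.

```python
import pandas as pd

# Toy data only: invented rows, built so that leavers have lower scores.
toy = pd.DataFrame({
    "JobSatisfaction": [1, 1, 2, 2, 3, 3, 4, 4],
    "WorkLifeBalance": [1, 2, 1, 3, 2, 4, 3, 4],
    "Attrition": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
})

# Encode the target as 0/1, then compute pairwise Pearson correlations
corr = (
    toy.assign(AttritionFlag=toy["Attrition"].eq("Yes").astype(int))
       .loc[:, ["AttritionFlag", "JobSatisfaction", "WorkLifeBalance"]]
       .corr()
)
print(corr.round(3))
```

On the real data, replacing `toy` with `df` gives a 3×3 matrix that can be passed to `sns.heatmap(corr, annot=True)`. Because both scores are ordinal with only four levels, the coefficients are best read as a rough directional summary alongside the rate grid, not as a replacement for it.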
