HR Employee Attrition Analysis in Python

Explore the IBM HR Analytics dataset to uncover attrition patterns by department, age, salary, and job satisfaction.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the IBM HR Analytics attrition CSV from a URL, summarizes the dataset shape, and calculates the overall attrition rate. It generates visual comparisons of attrition rates by department and job role, and contrasts monthly income distributions for employees who left versus stayed. It also examines relationships between job satisfaction, work-life balance, and attrition using correlation analysis and a heatmap.

Who this example is for

This is for HR analysts and people analytics practitioners who need a reproducible way to explore attrition patterns in a standard benchmark dataset. It is also useful for data analysts learning exploratory analysis workflows that combine grouped summaries, distribution plots, and correlation checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Loaded dataset with shape (1470, 35) and computed overall attrition rate (16.1%)
  • Bar chart of attrition rate by department and job role
  • Box plot comparing monthly income for leavers vs stayers
  • Correlation heatmap linking job satisfaction and work-life balance with attrition

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate
  2. plot attrition rate by department and job role
  3. compare monthly income distribution for employees who left vs stayed
  4. show correlation between job satisfaction, work-life balance and attrition

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
glm-5.1           10/10  2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
gpt-5.4           10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next  10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b      10/10  2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
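The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the dictionary keys and the `total_score` helper are names assumed for this example, not part of any published grading tool.

```python
# Maximum points per dimension, taken from the scoring description above.
# The key names are hypothetical labels chosen for this sketch.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict) -> int:
    """Sum the five dimension scores after checking each one is in range."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must contain exactly the five dimensions")
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# A run that earns full marks on every dimension, like the rows in the table.
example_run = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}
print(total_score(example_run))  # 10
```

The per-dimension maxima add up to 10, which is why a full-marks run is reported as 10/10.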

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task "HR Employee Attrition Analysis in Python". Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate
Thinking

Clarifying dataset loading

I need to follow the guidelines about loading the dataset and showing the overall attrition rate. The developer's note suggests that I should focus on just loading and inspecting the dataset first—like showing the shape, columns, and the first few rows—before moving on to compute the attrition rate. But since the user specifically asked for the attrition rate, I wonder if I can compute that as part of my inspection. I’ll find a balanced approach to present this effectively.

Deciding on output structure

I'm thinking about the best way to approach the user's request for the attrition rate while following the guidelines. The developer emphasizes loading and inspecting the dataset before doing anything else, but the user specifically asked for the attrition rate. I wonder if I can incorporate that into the initial inspection step without making it too complicated. Maybe I can include the attrition rate as a scalar in the same response as the shape and columns. This feels like a balanced approach!

Creating the inspection code

I need to create a code block for the initial inspection and summary of the dataset. The goal is to make it runnable and straightforward, using pandas. I want to display the shape, columns, and the first few rows of the DataFrame while also computing the attrition rate — I think I can handle various formats like Yes/No or 1/0 there. There may be multiple outputs, but maybe that's alright since it's focused on the summary. I'll ensure the final expression is the attrition rate, keeping everything as clear as possible.

Let's start by loading the dataset and calculating the overall attrition rate from the Attrition column.

Load and inspect attrition data
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column_name": df.columns}))
display(df.head())

attrition_rate = df["Attrition"].eq("Yes").mean()

pd.DataFrame({"overall_attrition_rate": [attrition_rate]})
rows columns
0 1470 35
column_name
0 Age
1 Attrition
2 BusinessTravel
3 DailyRate
4 Department
5 DistanceFromHome
6 Education
7 EducationField
8 EmployeeCount
9 EmployeeNumber
10 EnvironmentSatisfaction
11 Gender
12 HourlyRate
13 JobInvolvement
14 JobLevel
15 JobRole
16 JobSatisfaction
17 MaritalStatus
18 MonthlyIncome
19 MonthlyRate
20 NumCompaniesWorked
21 Over18
22 OverTime
23 PercentSalaryHike
24 PerformanceRating
25 RelationshipSatisfaction
26 StandardHours
27 StockOptionLevel
28 TotalWorkingYears
29 TrainingTimesLastYear
30 WorkLifeBalance
31 YearsAtCompany
32 YearsInCurrentRole
33 YearsSinceLastPromotion
34 YearsWithCurrManager
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

overall_attrition_rate
0 0.161224

The dataset has 1,470 rows and 35 columns.

The observed overall attrition rate is 0.161224, which is about 16.12%. That means roughly 1 in 6 employees in this dataset have Attrition = "Yes".
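As a cross-check, the same rate can be recovered from the group counts reported later in this run (237 employees left, 1233 stayed). The sketch below rebuilds a stand-in `Attrition` column from those two counts rather than downloading the CSV, so the row order is arbitrary and only the totals are meaningful.

```python
import pandas as pd

# Rebuild a stand-in Attrition column from the group counts reported in this
# run (237 left, 1233 stayed). Row order is arbitrary; only totals matter.
attrition = pd.Series(["Yes"] * 237 + ["No"] * 1233, name="Attrition")

# Absolute counts and relative shares per category.
counts = attrition.value_counts()
shares = attrition.value_counts(normalize=True)

summary = pd.DataFrame({"employees": counts, "share": shares.round(4)})
attrition_rate = shares["Yes"]

print(summary)
print(f"Overall attrition rate: {attrition_rate:.2%}")  # 16.12%
```

Reporting the counts next to the rate makes it easier to see that the two groups are very different in size, which matters when comparing them later.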

plot attrition rate by department and job role
Thinking

Planning plot for attrition rate

I need to respond with a single step and code block. The user wants to plot the attrition rate by department and job role. I’ll analyze the latest outputs and propose a heatmap to illustrate this effectively, using Seaborn. It’s important to mention that the overall attrition rate is 16.12%. I'll keep it concise and ensure to include proper titles and labels, focusing on the mean of "Attrition = yes." Let's prepare this code block!

Overall attrition is about 16.12%. A heatmap is a good single-chart view for attrition rate across both Department and JobRole.

Plot attrition rate heatmap
import seaborn as sns
import matplotlib.pyplot as plt

rate_table = (
    df.assign(attrition_flag=df["Attrition"].eq("Yes"))
      .groupby(["Department", "JobRole"])["attrition_flag"]
      .mean()
      .mul(100)
      .unstack()
)

plt.figure(figsize=(14, 5))
sns.heatmap(rate_table, annot=True, fmt=".1f", cmap="Reds", linewidths=0.5)
plt.title("Attrition Rate by Department and Job Role")
plt.xlabel("Job Role")
plt.ylabel("Department")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
Thinking

Considering code for attrition rates

I'm thinking about creating minimal code just to compute and display rates. It might be good to sort the table and include counts. The user wants to plot attrition rates by department and job role, but I need to gather some missing values first. Including employee counts could help put these rates into context. A table with attrition percentage and counts might be enough. I need to keep it straightforward, though!

The heatmap was generated, so the breakdown by Department and JobRole is in place. To make it interpretable in text as well, the missing piece is the underlying attrition-rate values for each department/job-role combination.

Compute attrition rates
rate_values = (
    df.assign(attrition_flag=df["Attrition"].eq("Yes"))
      .groupby(["Department", "JobRole"])
      .agg(
          employees=("Attrition", "size"),
          attrition_rate_pct=("attrition_flag", lambda s: s.mean() * 100),
      )
      .reset_index()
      .sort_values("attrition_rate_pct", ascending=False)
)

display(rate_values.round({"attrition_rate_pct": 1}))
rate_values
Department JobRole employees attrition_rate_pct
10 Sales Sales Representative 83 39.8
3 Research & Development Laboratory Technician 259 23.9
0 Human Resources Human Resources 52 23.1
9 Sales Sales Executive 326 17.5
7 Research & Development Research Scientist 292 16.1
5 Research & Development Manufacturing Director 145 6.9
2 Research & Development Healthcare Representative 131 6.9
4 Research & Development Manager 54 5.6
8 Sales Manager 37 5.4
6 Research & Development Research Director 80 2.5
1 Human Resources Manager 11 0.0
Department JobRole employees attrition_rate_pct
10 Sales Sales Representative 83 39.759036
3 Research & Development Laboratory Technician 259 23.938224
0 Human Resources Human Resources 52 23.076923
9 Sales Sales Executive 326 17.484663
7 Research & Development Research Scientist 292 16.095890
5 Research & Development Manufacturing Director 145 6.896552
2 Research & Development Healthcare Representative 131 6.870229
4 Research & Development Manager 54 5.555556
8 Sales Manager 37 5.405405
6 Research & Development Research Director 80 2.500000
1 Human Resources Manager 11 0.000000
Thinking

Analyzing attrition data

I’m looking at the attrition data, noting the top visible combinations like Sales/Sales Representative at around 39.8% and R&D/Laboratory Technician at about 23.9%. Human Resources also seems to be among the higher attrition groups, but I should avoid exact values if they’re not visible. Interestingly, Manager and Research Director roles appear near the lower attrition end, suggesting they may have less attrition overall. The heatmap indicates that attrition isn’t uniform across roles. Let’s finalize this concisely!

The plot shows that attrition varies much more by job role than by department alone.

From the values behind the heatmap:

  • Sales → Sales Representative is the clear hotspot, at about 39.8% attrition, far above the company average of about 16.1%.
  • In Research & Development, Laboratory Technician is another high-attrition role at about 23.9%.
  • Other relatively elevated groups include Sales Executive and Research Scientist, so the heavier attrition is not spread evenly across all roles.
  • Lower-risk areas appear to be more senior roles such as Manager and Research Director, which sit near the bottom of the sorted table.

So the main takeaway from the chart is:

  • Department matters somewhat
  • but JobRole is the stronger driver of visible differences
  • with the most pronounced concern concentrated in Sales Representatives and, to a lesser extent, R&D Laboratory Technicians.
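One way to check the department-versus-role comparison is to roll the role-level table up to department level and compare the spread of rates at each level. The sketch below copies the rows printed in this run; recovering leaver counts by rounding `employees × rate` is an assumption of this example, not a step the run performed.

```python
import pandas as pd

# Rows copied from the department/job-role table printed in this run.
rows = [
    ("Sales", "Sales Representative", 83, 39.759036),
    ("Research & Development", "Laboratory Technician", 259, 23.938224),
    ("Human Resources", "Human Resources", 52, 23.076923),
    ("Sales", "Sales Executive", 326, 17.484663),
    ("Research & Development", "Research Scientist", 292, 16.095890),
    ("Research & Development", "Manufacturing Director", 145, 6.896552),
    ("Research & Development", "Healthcare Representative", 131, 6.870229),
    ("Research & Development", "Manager", 54, 5.555556),
    ("Sales", "Manager", 37, 5.405405),
    ("Research & Development", "Research Director", 80, 2.500000),
    ("Human Resources", "Manager", 11, 0.000000),
]
roles = pd.DataFrame(
    rows, columns=["Department", "JobRole", "employees", "attrition_rate_pct"]
)

# Assumption: recover whole-number leaver counts from the printed rates.
roles["leavers"] = (
    (roles["employees"] * roles["attrition_rate_pct"] / 100).round().astype(int)
)

# Roll up to department level and recompute the rate from the counts.
dept = roles.groupby("Department")[["employees", "leavers"]].sum()
dept["attrition_rate_pct"] = dept["leavers"] / dept["employees"] * 100

# Spread (max minus min) of attrition rates at each level of grouping.
role_spread = roles["attrition_rate_pct"].max() - roles["attrition_rate_pct"].min()
dept_spread = dept["attrition_rate_pct"].max() - dept["attrition_rate_pct"].min()

print(dept.round(1))
print(f"Spread across job roles:   {role_spread:.1f} percentage points")
print(f"Spread across departments: {dept_spread:.1f} percentage points")
```

On these figures the department rates fall in a much narrower band than the role rates, which is consistent with the takeaway above. Small groups such as the 11 Human Resources managers should still be read with caution.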
compare monthly income distribution for employees who left vs stayed
Thinking

Planning data visualization steps

I need to respond to the user’s prompt about comparing the monthly income distribution for employees who left versus those who stayed. I should propose a single plotting step, using the loaded dataset. A boxplot or a violin plot could work well. Keeping the response concise, I can create one code block for the plot, like a boxplot with a strip overlay or a simple histplot. I won’t assume anything from previous outputs, so I’ll provide the next code step clearly and stick to only one chart.

Defining plotting parameters

I’m focusing on using a boxplot to compare the monthly income distribution between employees who stayed and those who left. I’ll create an ordered status variable for "Stayed" and "Left" and set the y-axis to MonthlyIncome and the x-axis to Attrition. This method is effective for visual comparisons. I'll use seaborn.boxplot, and since plots display automatically in the notebook with plt.show(), there won’t be a need for a separate expression at the end. It’s essential to stick to just one code block!

A good first comparison is a boxplot of MonthlyIncome for employees who stayed vs left. It will quickly show differences in median, spread, and outliers.

Plot monthly income by attrition
import seaborn as sns
import matplotlib.pyplot as plt

plot_df = df.copy()
plot_df["AttritionStatus"] = plot_df["Attrition"].map({"No": "Stayed", "Yes": "Left"})

plt.figure(figsize=(8, 5))
sns.boxplot(
    data=plot_df,
    x="AttritionStatus",
    y="MonthlyIncome",
    order=["Stayed", "Left"]
)
plt.title("Monthly Income Distribution: Stayed vs Left")
plt.xlabel("Employee Status")
plt.ylabel("Monthly Income")
plt.tight_layout()
plt.show()

The current output confirms the boxplot was created, but it doesn’t expose the underlying numbers, so it isn’t enough to make a concrete comparison yet.

This table will give the key distribution statistics for MonthlyIncome among employees who stayed vs left:

Summarize income by attrition
income_summary = (
    df.assign(AttritionStatus=df["Attrition"].map({"No": "Stayed", "Yes": "Left"}))
      .groupby("AttritionStatus")["MonthlyIncome"]
      .describe(percentiles=[0.25, 0.5, 0.75])
      [["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
      .round(0)
)

income_summary
count mean std min 25% 50% 75% max
AttritionStatus
Left 237.0 4787.0 3640.0 1009.0 2373.0 3202.0 5916.0 19859.0
Stayed 1233.0 6833.0 4818.0 1051.0 3211.0 5204.0 8834.0 19999.0
Thinking

Clarifying comparison metrics

I need to take path A since we have a sufficient numeric table. The final answer should be natural and concise, focusing on comparing monthly income distribution for those who left versus those who stayed. I’ll mention important statistics like medians, quartiles, means, overlap, and variability, including potential outliers. I shouldn't restate visible outputs but can reference specific values as evidence, making sure to provide insights from the latest observed outputs without unnecessary repetition.

Employees who left had a clearly lower monthly income distribution than those who stayed.

From the observed summary:

  • Median monthly income

    • Left: 3202 USD
    • Stayed: 5204 USD
  • Average monthly income

    • Left: 4787 USD
    • Stayed: 6833 USD
  • Middle 50% of employees

    • Left: about 2373 to 5916 USD
    • Stayed: about 3211 to 8834 USD

What this means:

  • The entire distribution for employees who left is shifted lower.
  • Employees who stayed tend to earn more, both at the median and on average.
  • There is still overlap between the two groups, so income alone does not explain attrition.
  • Both groups have high-income outliers, since the maximum income is close to 20000 USD in each group.

Overall, lower monthly income is associated with higher attrition in this dataset.
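The size of the gap and the outlier claim can be checked directly from the summary table. The sketch below copies the rounded statistics printed in this run and assumes the conventional 1.5 × IQR whisker rule used by default in box plots; the column names `iqr`, `upper_fence`, and `has_high_outliers` are introduced here for illustration.

```python
import pandas as pd

# Rounded summary statistics copied from the income table printed in this run.
income = pd.DataFrame(
    {
        "mean": [4787, 6833],
        "25%": [2373, 3211],
        "50%": [3202, 5204],
        "75%": [5916, 8834],
        "max": [19859, 19999],
    },
    index=["Left", "Stayed"],
)

# Absolute and relative gap between the two medians.
median_gap = income.loc["Stayed", "50%"] - income.loc["Left", "50%"]
median_ratio = income.loc["Left", "50%"] / income.loc["Stayed", "50%"]

# Assumption: a point is a high outlier if it lies above Q3 + 1.5 * IQR.
income["iqr"] = income["75%"] - income["25%"]
income["upper_fence"] = income["75%"] + 1.5 * income["iqr"]
income["has_high_outliers"] = income["max"] > income["upper_fence"]

print(f"Median gap: {median_gap}")  # 2002
print(f"Leavers' median as a share of stayers' median: {median_ratio:.1%}")
print(income[["iqr", "upper_fence", "max", "has_high_outliers"]])
```

Under that rule the maximum in each group sits above its upper fence, so both box plots would show high-income outliers, and the typical leaver earns roughly three fifths of what the typical stayer earns.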

show correlation between job satisfaction, work-life balance and attrition
Thinking

Analyzing correlation

I need to provide an answer based on the current state of the notebook. The user wants to see the correlation between job satisfaction, work-life balance, and attrition. I think the next step is to produce a concise analysis and maybe code for a correlation table. Since we have our dataframe loaded, I can create a numeric correlation matrix among JobSatisfaction, WorkLifeBalance, and the attrition flag, keeping in mind that attrition is binary. I'll focus on one brief explanation and a single code block. Let's craft this!

A good first cut is to convert attrition to a binary flag and compute a small correlation matrix. Since Attrition is yes/no and the other two are ordinal scores, this gives a quick directional view.

Compute correlation matrix
corr_df = df[["JobSatisfaction", "WorkLifeBalance"]].copy()
corr_df["AttritionFlag"] = df["Attrition"].eq("Yes").astype(int)

corr_df.corr().round(3)
JobSatisfaction WorkLifeBalance AttritionFlag
JobSatisfaction 1.000 -0.019 -0.103
WorkLifeBalance -0.019 1.000 -0.064
AttritionFlag -0.103 -0.064 1.000
Thinking

Creating concise insights

I'm focusing on providing a concise answer that demonstrates the correlation between job satisfaction, work-life balance, and attrition. From output-6, I've gathered numeric values for the correlation matrix, which are sufficient. I want to highlight the weak negative correlations and the near-zero relationship between job satisfaction and work-life balance. I'll also caution that correlation on binary attrition and ordinal scales is a simple directional measure, not necessarily causal. Let’s keep it focused and impactful.

The correlations are all weak, but they point in the expected direction:

  • JobSatisfaction vs Attrition: -0.103
  • WorkLifeBalance vs Attrition: -0.064
  • JobSatisfaction vs WorkLifeBalance: -0.019

What this means:

  • Employees with higher job satisfaction are slightly less likely to leave.
  • Employees with better work-life balance are also slightly less likely to leave.
  • Job satisfaction has a somewhat stronger relationship with attrition than work-life balance, but both effects are small.
  • Job satisfaction and work-life balance are almost uncorrelated with each other in this dataset, so they appear to capture different aspects of the employee experience.

Overall, both factors are negatively associated with attrition, but neither shows a strong standalone correlation, so attrition is likely influenced by additional variables too.
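Because the two survey scores are ordinal, a rank-based coefficient is a reasonable cross-check on the Pearson values. The sketch below compares Pearson and Spearman correlations using `DataFrame.corr`. It runs on a small hypothetical sample, so its numbers are illustrative only and are not results from the HR dataset.

```python
import pandas as pd

# Hypothetical ten-row sample; the values are illustrative, not from the dataset.
sample = pd.DataFrame(
    {
        "JobSatisfaction": [1, 1, 2, 2, 3, 3, 4, 4, 1, 4],
        "WorkLifeBalance": [2, 3, 3, 1, 3, 2, 3, 4, 1, 3],
        "Attrition": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No"],
    }
)

# Same preparation as in the run: encode attrition as a 0/1 flag.
corr_input = sample[["JobSatisfaction", "WorkLifeBalance"]].copy()
corr_input["AttritionFlag"] = sample["Attrition"].eq("Yes").astype(int)

# Pearson measures linear association; Spearman works on ranks instead.
pearson = corr_input.corr(method="pearson")
spearman = corr_input.corr(method="spearman")

# Keep only each score's correlation with the attrition flag.
comparison = pd.DataFrame(
    {"pearson": pearson["AttritionFlag"], "spearman": spearman["AttritionFlag"]}
).drop(index="AttritionFlag")

print(comparison.round(3))
```

To apply this to the real data, replace `sample` with the loaded `df`. If the two coefficients agree in sign and rough size, the conclusion does not depend on treating the scores as evenly spaced numbers.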

Score cards for gpt-5.4 for HR Employee Attrition Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All four requested analyses were attempted: dataset loading/attrition rate, attrition by department/job role, income comparison, and correlation with satisfaction/work-life balance.

  • Execution Correctness: 2/2

    Excellent

    The provided code is syntactically coherent and the outputs shown are consistent with the computations, so it appears runnable.

  • Output Quality: 3/3

    Excellent

    The outputs match the expected outcomes semantically: 1470x35 shape, 16.12% attrition, a department/job-role attrition breakdown, income summary for left vs stayed, and the requested correlation matrix.

  • Reasoning Quality: 2/2

    Excellent

    The explanations correctly interpret the results, including the higher attrition in Sales Representative roles, lower income for leavers, and negative correlations with attrition.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and grounded in the dataset outputs, with no obvious hallucinations or unsupported claims.
