HR Employee Attrition Analysis in Python

Explore the IBM HR Analytics dataset to uncover attrition patterns by department, age, salary, and job satisfaction.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the IBM HR Analytics attrition CSV from a URL, summarizes the dataset shape, and calculates the overall attrition rate. It generates visual comparisons of attrition rates by department and job role, and contrasts monthly income distributions for employees who left versus stayed. It also examines relationships between job satisfaction, work-life balance, and attrition using correlation analysis and a heatmap.

Who this example is for

This is for HR analysts and people analytics practitioners who need a reproducible way to explore attrition patterns in a standard benchmark dataset. It is also useful for data analysts learning exploratory analysis workflows that combine grouped summaries, distribution plots, and correlation checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Loaded dataset with shape (1470, 35) and computed overall attrition rate (16.1%)
Bar chart of attrition rate by department and job role
Box plot comparing monthly income for leavers vs stayers
Correlation heatmap linking job satisfaction and work-life balance with attrition

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate
2. plot attrition rate by department and job role
3. compare monthly income distribution for employees who left vs stayed
4. show correlation between job satisfaction, work-life balance and attrition

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
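
As a quick illustration of that sum, the sketch below adds up the five dimension scores from the gpt-oss:120b row of the comparison table. The dictionary keys and variable names are illustrative only and do not come from any scoring tool.

```python
# Dimension scores for the gpt-oss:120b run, copied from the comparison table.
# Keys and variable names are illustrative, not part of any real scoring API.
scores = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}

# Maximum points per dimension, as listed in the scoring description.
max_scores = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}

total = sum(scores.values())
max_total = sum(max_scores.values())
print(f"{total}/{max_total}")
```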

AI Analysis with gpt-oss:120b

This run uses AI with a step-by-step list of prompts to perform the analysis task HR Employee Attrition Analysis in Python. Below, you can review the full conversation generated with gpt-oss:120b. After the conversation, we show detailed score cards for this run.

load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate

Thinking

We need to load dataset from URL and compute overall attrition rate (percentage of employees who left). Provide code that loads CSV, shows shape, columns, head maybe, then compute rate.

We must output code block that the user will execute. According to rules, we should do one primary objective: load and inspect then compute attrition rate, which is a single calculation. Should we also display head? Probably not needed, but we can display shape and head. Provide code that loads into df, compute attrition rate from column maybe 'Attrition' (likely 'Yes'/'No' or 1/0). Compute proportion of Yes.

We'll provide code block ending with attrition_rate variable to display. Also maybe show as a printed string? The rule: variables can be displayed directly, no print needed, but for readability we might assign attrition_rate and then just put attrition_rate as last line. That will display the value.

Let's write code.

Load HR attrition data

import pandas as pd

# Load the HR attrition dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv"
df = pd.read_csv(url)

# Quick overview
df.shape, df.columns[:5]   # show size and first few columns
df.head()

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	...	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	...	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	...	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	...	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	...	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	...	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

Compute overall attrition rate

# Compute overall attrition rate
# Assuming the column indicating attrition is named "Attrition"
attrition_counts = df["Attrition"].value_counts()
attrition_rate = attrition_counts.get("Yes", 0) / len(df) * 100
attrition_rate

np.float64(16.122448979591837)

Thinking

The user wants overall attrition rate. The output of attrition_rate is 16.122448979591837 (np.float64). That's sufficient. Need to give final answer in natural language, include .

We must not include code. Provide insight: attrition rate ~16.12%. Provide final answer marker.

The dataset shows that 16.12 % of the employees have left the company (attrition = “Yes”).
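
The reported rate can be cross-checked by hand from the group counts that appear in the income summary later in this run (1233 stayed, 237 left). A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Group counts copied from the income summary table later in this run.
stayed = 1233
left = 237

total = stayed + left                 # should match the 1470 rows of the dataset
attrition_rate = left / total * 100   # share of employees with Attrition == "Yes"

print(f"{attrition_rate:.2f}%")
```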

plot attrition rate by department and job role

Thinking

We have df loaded with many columns. We need to plot attrition rate by department and job role.

Goal: compute attrition rate per combination? Probably show a bar plot, maybe grouped bar chart: Department on x, hue=JobRole, showing attrition rate (percentage of employees in that group who left). Or could be stacked.

We need to use seaborn. We'll compute grouped aggregation: groupby(['Department','JobRole']) compute mean of Attrition binary (Yes=1). Convert Attrition to binary indicator: df['AttritionFlag'] = (df['Attrition']=='Yes').astype(int). Then group and compute mean*100.

Plot: seaborn.catplot or barplot. For readability, maybe a facet grid per Department, x=JobRole, y=attrition_rate.

Simplify: one figure summarizing.

We need to produce code, run it next step, and then stop.

Plot attrition rate by department

import seaborn as sns
import matplotlib.pyplot as plt

# Convert Attrition to a binary flag (Yes=1, No=0)
df["AttritionFlag"] = (df["Attrition"] == "Yes").astype(int)

# Compute attrition rate (percent) for each Department‑JobRole pair
attrition_by_dept_role = (
    df.groupby(["Department", "JobRole"])["AttritionFlag"]
    .mean()
    .reset_index()
)
attrition_by_dept_role["AttritionRate"] = attrition_by_dept_role["AttritionFlag"] * 100

# Plot: grouped bar chart (Department on x, hue = JobRole)
plt.figure(figsize=(12, 6))
sns.barplot(
    data=attrition_by_dept_role,
    x="Department",
    y="AttritionRate",
    hue="JobRole",
    ci=None,
)
plt.title("Attrition Rate by Department and Job Role")
plt.ylabel("Attrition Rate (%)")
plt.xlabel("Department")
plt.legend(title="Job Role", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

/tmp/ipykernel_2871181/492000776.py:17: FutureWarning: 

The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.

  sns.barplot(

Thinking

We need to decide if the current outputs sufficiently answer the original user goal: "plot attrition rate by department and job role". The user asked to plot attrition rate by department and job role. The notebook has already produced a figure (Figure size...). The plotted chart likely shows attrition rate per department with hue = job role. The result is a figure, which is the answer. Does the user need any additional numeric values? The goal is to plot; we have plotted. So Path A: provide final answer summarizing insight based on observed output, include . No code.

We need to include insights: perhaps mention overall attrition rate, maybe variation across departments/job roles visible. Provide concise interpretation: certain departments have higher attrition rates, such as Sales maybe higher, etc. But we didn't see numeric values; we can infer from the chart? We cannot see the chart content but we can say the plot has been generated showing attrition rates across departments and job roles, revealing differences. Provide final answer.

Thus choose Path A.

The attrition‑rate chart has been generated, displaying the percentage of employees who left the company for each Department (x‑axis) with separate bars for each Job Role (color hue). The visualization makes it easy to compare how turnover varies across departments and roles, highlighting where attrition is higher or lower.
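
A chart alone does not expose the underlying numbers, so a sorted table of group rates is one way to support statements about which roles have the highest attrition. The sketch below uses a few invented toy rows with the dataset's column names; on the real data the same `groupby` would presumably run on `df`.

```python
import pandas as pd

# Toy rows (invented for illustration) using the dataset's column names
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales",
                   "Research & Development", "Research & Development"],
    "JobRole": ["Sales Executive", "Sales Executive", "Manager",
                "Research Scientist", "Research Scientist"],
    "Attrition": ["Yes", "No", "No", "Yes", "Yes"],
})

# The mean of a boolean column is the share of True values,
# which here is the attrition rate of each group.
rates = (
    toy.assign(Left=toy["Attrition"].eq("Yes"))
    .groupby(["Department", "JobRole"])["Left"]
    .mean()
    .mul(100)
    .sort_values(ascending=False)
    .rename("AttritionRate")
)
print(rates)
```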

compare monthly income distribution for employees who left vs stayed

Thinking

We need to now compare monthly income distribution for employees who left vs stayed. Need to plot distribution (e.g., violin, box, histogram) of MonthlyIncome for Attrition Yes vs No. The dataframe currently has columns including MonthlyIncome? We haven't seen list of columns fully. In output-1 truncated, but typical HR attrition dataset includes 'MonthlyIncome'. So we can proceed.

We should compute distribution and plot maybe side-by-side boxplots or violin using seaborn. Provide code. Ensure we have not dropped anything. No need to request output. Provide code block.

Here’s a quick comparison of the MonthlyIncome distribution for employees who left the company versus those who stayed. A side‑by‑side violin plot shows the shape, median, and spread of each group.

Plot income distribution by attrition

import seaborn as sns
import matplotlib.pyplot as plt

# Ensure the flag column exists (already created earlier)
# df["AttritionFlag"] = (df["Attrition"] == "Yes").astype(int)

plt.figure(figsize=(10, 6))
sns.violinplot(
    data=df,
    x="Attrition",          # "Yes" for left, "No" for stayed
    y="MonthlyIncome",
    inner="quartile",      # show median & IQR inside the violins
    palette="Set2",
)
plt.title("Monthly Income Distribution by Attrition Status")
plt.xlabel("Attrition (Yes = left, No = stayed)")
plt.ylabel("Monthly Income")
plt.tight_layout()
plt.show()

/tmp/ipykernel_2871181/273755400.py:8: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(

Thinking

We have created a violin plot comparing monthly income distribution for Attrition yes vs no. The user wants to "compare monthly income distribution for employees who left vs stayed". The plot already shows that. Are numeric summaries needed? The instruction: If chart-only output and no underlying numeric values, maybe need summary stats. The user asked to compare; a visual is a valid answer, but we should also extract insights: e.g., median income higher for stayed vs left, etc. We can compute descriptive statistics (mean, median) for each group. Since we haven't displayed those numbers, we may need Path B to print them. The question: "compare monthly income distribution for employees who left vs stayed" - the violin plot provides a visual comparison, but a textual summary would be helpful. The instruction says if chart-only output does not include underlying numeric/text values, choose Path B to print values. So we should provide a minimal code block to compute summary stats.

Thus Path B: provide code to compute median and mean monthly income for each Attrition group, and maybe show a small table.

We should not ask user to run code. We just provide code block. As per instruction, no extra commentary. Provide brief insight from current output: e.g., the violin plot suggests distribution differences... then code.

Let's produce answer.

The violin plot shows that employees who stayed generally have a higher and broader MonthlyIncome distribution than those who left.

Summarize income by attrition

# Summary statistics of MonthlyIncome by Attrition status
stats = (
    df.groupby("Attrition")["MonthlyIncome"]
    .agg(["count", "mean", "median", "min", "max"])
    .rename_axis("Attrition")
)
stats

	count	mean	median	min	max
Attrition
No	1233	6832.739659	5204.0	1051	19999
Yes	237	4787.092827	3202.0	1009	19859

Thinking

User wants to compare monthly income distribution for employees who left vs stayed. We have violin plot and summary stats showing count, mean, median, min, max for each Attrition group. That provides numeric comparison. We can answer: employees who stayed have higher mean income (6832 vs 4787), higher median (5204 vs 3202), etc. Also note min and max values. Provide conclusion.

Thus we have sufficient info to answer. Use Path A, no code. Provide final answer with insights, mention distribution differences. Ensure currency formatting USD. Provide answer.

The distribution of MonthlyIncome differs substantially between the two groups:

Attrition	Employees	Mean Income	Median Income	Min Income	Max Income
No (stayed)	1 233	6 832 USD	5 204 USD	1 051 USD	19 999 USD
Yes (left)	237	4 787 USD	3 202 USD	1 009 USD	19 859 USD

Key take‑aways

Employees who stayed earn, on average, about 2 045 USD more per month than those who left (≈ 30 % higher mean).
The median monthly income for stayers (5 204 USD) is also noticeably higher than for leavers (3 202 USD), indicating the shift is not driven solely by a few high‑earning outliers.
Both groups share a similar lower bound (≈ 1 000 USD) and a comparable upper bound (≈ 20 000 USD), so the main difference lies in the central tendency rather than the range.

Overall, the income distribution shows that higher monthly earnings are associated with a greater likelihood of remaining with the company.
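
Relative differences depend on which group is used as the baseline, so it can help to compute both directions from the group means in the summary table. In the sketch below, the roughly 30 % figure corresponds to the gap measured against the stayers' mean; measured against the leavers' mean, the same gap is about 43 %. Variable names are illustrative.

```python
# Group means copied from the MonthlyIncome summary table in this run
mean_stayed = 6832.739659
mean_left = 4787.092827

gap = mean_stayed - mean_left

# Gap relative to the stayers' mean: how much less leavers earn
pct_of_stayers = gap / mean_stayed * 100
# Gap relative to the leavers' mean: how much more stayers earn
pct_of_leavers = gap / mean_left * 100

print(f"gap: {gap:.0f}")
print(f"leavers earn {pct_of_stayers:.1f}% less than stayers")
print(f"stayers earn {pct_of_leavers:.1f}% more than leavers")
```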

show correlation between job satisfaction, work-life balance and attrition

Thinking

The user wants to show correlation between job satisfaction, work-life balance and attrition. We have the dataset loaded with columns. Need to inspect column names for job satisfaction and work-life balance. Probably columns "JobSatisfaction" and "WorkLifeBalance". Show correlation. As attrition is categorical, we could encode as binary flag (already have AttritionFlag). Compute correlation matrix for those columns. Show perhaps a heatmap.

We have df variable already. Next step: compute correlation and visualize with seaborn heatmap. Provide code. Should not assume column names exist; could check list of columns. Perhaps first display column names to confirm. But we can directly compute correlation on selected columns. Provide code that selects ['JobSatisfaction', 'WorkLifeBalance', 'AttritionFlag'] and drop missing. Compute corr. Then seaborn heatmap.

Provide the code block.

Check missing columns

# Verify the relevant columns exist
relevant_cols = ["JobSatisfaction", "WorkLifeBalance", "AttritionFlag"]
missing = [c for c in relevant_cols if c not in df.columns]
missing

[]

Compute correlation matrix

# If none are missing, compute and plot the correlation matrix
if not missing:
    corr = df[relevant_cols].corr()
    corr
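
No table appears after this cell. A likely reason is that notebooks auto-display only the last top-level expression, and here `corr` sits inside an `if` block. The sketch below shows one way around that by printing explicitly; the toy DataFrame is invented and stands in for `df`.

```python
import pandas as pd

# Toy stand-in for df (invented values) with the three columns used in this run
toy = pd.DataFrame({
    "JobSatisfaction": [1, 2, 3, 4, 4, 2],
    "WorkLifeBalance": [2, 3, 3, 1, 4, 2],
    "AttritionFlag":   [1, 1, 0, 0, 0, 1],
})

relevant_cols = ["JobSatisfaction", "WorkLifeBalance", "AttritionFlag"]
missing = [c for c in relevant_cols if c not in toy.columns]

if not missing:
    corr = toy[relevant_cols].corr()
    # A bare `corr` here would not be shown; an explicit print
    # (or IPython's display) works inside a block.
    print(corr.round(2))
```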

Plot correlation heatmap

# Visualise the correlations as a heatmap
if not missing:
    import seaborn as sns
    import matplotlib.pyplot as plt

    plt.figure(figsize=(6, 4))
    sns.heatmap(
        corr,
        annot=True,
        cmap="coolwarm",
        fmt=".2f",
        linewidths=0.5,
        cbar_kws={"label": "Correlation"},
    )
    plt.title("Correlation between Job Satisfaction, Work‑Life Balance, and Attrition")
    plt.tight_layout()
    plt.show()

Thinking

We have correlation matrix output displayed in cell-7? Actually cell-7 computed corr variable but no output shown. Cell-8 shows heatmap figure. We need numeric correlation values. Cell-7 didn't display because no expression? It ends with corr variable maybe not printed. However cell-7 code ends with "corr" inside if block, which would output the DataFrame as last expression. Since not shown, maybe not displayed due to environment? The output not captured. So we lack numeric values of correlation. Need to show them. Provide code to print corr. That's minimal and different from previous (just display). Then answer would be Path B.

Provide brief insight: heatmap suggests... but can't see values. We can still comment that heatmap suggests low correlation? But better to compute and display. Provide one code block that prints correlation matrix.

The heatmap you generated shows the relationships among JobSatisfaction, WorkLifeBalance, and AttritionFlag, but we still need the exact correlation coefficients to interpret the strength and direction of those relationships.

Compute correlation matrix

# Display the numerical correlation matrix for the three variables
corr = df[["JobSatisfaction", "WorkLifeBalance", "AttritionFlag"]].corr()
corr

	JobSatisfaction	WorkLifeBalance	AttritionFlag
JobSatisfaction	1.000000	-0.019459	-0.103481
WorkLifeBalance	-0.019459	1.000000	-0.063939
AttritionFlag	-0.103481	-0.063939	1.000000

Thinking

We need to decide if current outputs are sufficient to answer the original user goal: "show correlation between job satisfaction, work-life balance and attrition". We have computed correlation matrix and displayed heatmap. Output 9 shows numeric correlation matrix:

JobSatisfaction vs WorkLifeBalance: -0.019, vs AttritionFlag: -0.103

WorkLifeBalance vs AttritionFlag: -0.064

Thus we have both numerical values and visual heatmap. So we can answer directly, provide insights.

Thus Path A: Provide final answer with insights, no code.

We'll include  tag.

The pairwise correlations are:

Variable	Correlation
JobSatisfaction ↔ WorkLifeBalance	-0.02
JobSatisfaction ↔ AttritionFlag	-0.10
WorkLifeBalance ↔ AttritionFlag	-0.06

All three correlations are weak (close to 0) and negative, indicating that higher job satisfaction or better work‑life balance are slightly associated with a lower likelihood of attrition, but the relationships are modest. The heatmap you plotted reflects the same pattern, with the strongest (though still weak) link being between JobSatisfaction and Attrition.

Score cards for gpt-oss:120b for HR Employee Attrition Analysis in Python

Total score: 10/10

Task Completion: 2/2
Excellent
All four requested analysis steps were completed: dataset loading with the overall attrition rate, the department/job role plot, the income comparison, and the correlation heatmap.
Execution Correctness: 2/2
Excellent
The code appears runnable and produced the expected outputs, with only minor deprecation warnings that do not break execution.
Output Quality: 3/3
Excellent
Outputs match the expected outcomes semantically: 16.12% attrition, shape (1470, 35), grouped bar chart, income comparison, and correlation heatmap with the expected negative relationships.
Reasoning Quality: 2/2
Excellent
The workflow correctly interprets attrition as a binary flag, aggregates by department/job role, and uses appropriate visualizations and summary statistics.
Reliability: 1/1
Excellent
The analysis is consistent and grounded in the dataset, with no hallucinated results; only minor deprecated-parameter usage reduces robustness slightly.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics