HR Employee Attrition Analysis in Python

Explore the IBM HR Analytics dataset to uncover attrition patterns by department, age, salary, and job satisfaction.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the IBM HR Analytics attrition CSV from a URL, summarizes the dataset shape, and calculates the overall attrition rate. It generates visual comparisons of attrition rates by department and job role, and contrasts monthly income distributions for employees who left versus stayed. It also examines relationships between job satisfaction, work-life balance, and attrition using correlation analysis and a heatmap.

Who this example is for

This is for HR analysts and people analytics practitioners who need a reproducible way to explore attrition patterns in a standard benchmark dataset. It is also useful for data analysts learning exploratory analysis workflows that combine grouped summaries, distribution plots, and correlation checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Loaded dataset with shape (1470, 35) and computed overall attrition rate (16.1%)
  • Bar chart of attrition rate by department and job role
  • Box plot comparing monthly income for leavers vs stayers
  • Correlation heatmap linking job satisfaction and work-life balance with attrition

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate
  2. plot attrition rate by department and job role
  3. compare monthly income distribution for employees who left vs stayed
  4. show correlation between job satisfaction, work-life balance and attrition

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source       Score   Task   Execution   Output   Reasoning   Reliability   Run
gemma4:31b         10/10   2/2    2/2         3/3      2/2         1/1           Open gemma4:31b conversation
glm-5.1            10/10   2/2    2/2         3/3      2/2         1/1           Open glm-5.1 conversation
gpt-5.4            10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-5.4 conversation
gpt-oss:120b       10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-oss:120b conversation
qwen3-coder-next   10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3-coder-next conversation
qwen3.5:397b       10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
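
As a minimal sketch of the arithmetic described above, the total can be computed by summing the five dimension scores after checking each against its stated maximum. The `MAX_POINTS` table and the `total_score` helper are illustrative names, not part of the actual grading tool.

```python
# Maximum points per dimension, as listed in the scoring rubric above.
MAX_POINTS = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 2,
    "reliability": 1,
}


def total_score(scores: dict[str, int]) -> int:
    """Sum the five dimension scores after validating their ranges."""
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# A run that earns full marks on every dimension, like the rows in the table above.
perfect_run = dict(MAX_POINTS)
print(total_score(perfect_run))  # 10
```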

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task "HR Employee Attrition Analysis in Python". Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load HR attrition dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv and show overall attrition rate
Thinking

The user wants me to load the HR attrition dataset from the provided URL and show the overall attrition rate. I'll do this in one step:

  1. Load the CSV file using pandas
  2. Show basic info about the dataset (shape, columns, head)
  3. Calculate and display the overall attrition rate

Let me write a single code block to accomplish this.

I'll load the HR attrition dataset and calculate the overall attrition rate.

Notebook code
import pandas as pd

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/employee_attrition/HR-Employee-Attrition-All.csv')

# Show basic info
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst 5 rows:")
df.head()

# Calculate overall attrition rate
attrition_rate = df['Attrition'].value_counts(normalize=True) * 100
attrition_rate
Dataset shape: (1470, 35)

Columns: ['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

First 5 rows:
Attrition
No     83.877551
Yes    16.122449
Name: proportion, dtype: float64
Thinking

The user asked to load the HR attrition dataset and show the overall attrition rate. Looking at the notebook outputs:

  1. Dataset loaded successfully: shape (1470, 35)
  2. Columns are displayed (35 columns including 'Attrition')
  3. Attrition rate is calculated: No = 83.88%, Yes = 16.12%

The goal is complete - we have:

  • Dataset loaded
  • Overall attrition rate shown (16.12% of employees left)

This is Path A - the outputs are sufficient to answer the original goal. I should provide a natural final answer with the insights from the observed outputs.

The dataset has been loaded successfully with 1,470 employees and 35 features.

Overall Attrition Rate:

  • No (stayed): 83.88%
  • Yes (left): 16.12%

This means approximately 16 out of every 100 employees in this dataset have left the company. The attrition rate of 16.12% is the key metric for understanding employee turnover in this HR dataset.
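
The `value_counts(normalize=True)` call in the first cell returns a two-row Series. When a single headline number is more convenient, one option is to compare the column against 'Yes' and take the mean of the resulting boolean mask. The sketch below uses a small made-up DataFrame rather than the real dataset so that it runs offline; only the `Attrition` column name is taken from the run above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (5 employees, 1 leaver).
toy = pd.DataFrame({"Attrition": ["No", "Yes", "No", "No", "No"]})

# A boolean mask averages to the fraction of True values.
attrition_rate_pct = (toy["Attrition"] == "Yes").mean() * 100
print(f"Attrition rate: {attrition_rate_pct:.1f}%")  # Attrition rate: 20.0%
```

In the first cell of the run, the `df.head()` preview does not appear after "First 5 rows:" because a notebook cell only auto-displays the value of its last expression. Wrapping the call in `display(...)` or `print(...)` would show it.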

plot attrition rate by department and job role
Thinking

The user wants to visualize attrition rate by department and job role. I'll create a visualization showing the attrition rates across these two categorical variables. I'll use seaborn to create clear bar charts.

Let me create a plot that shows attrition rate by department first, then we can look at job role in a follow-up if needed, or I can try to combine them meaningfully.

Actually, let me think about the best approach:

  1. I could show two separate charts - one for department, one for job role
  2. Or I could show a combined view

For clarity and following the "one chart per step" guideline, I'll start with attrition rate by department, which gives a good overview. Then in a follow-up, I can show job role breakdown.

Let me create a bar chart showing the percentage of employees who left (Attrition = Yes) for each department.

I'll create visualizations showing attrition rate by department and job role. Let me start with a clear bar chart for department-wise attrition.

Plot attrition rate by department
import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# Calculate attrition rate by Department
dept_attrition = df.groupby('Department')['Attrition'].apply(lambda x: (x == 'Yes').mean() * 100).sort_values(ascending=False)

# Create bar plot
sns.barplot(x=dept_attrition.values, y=dept_attrition.index, palette='coolwarm', ax=ax)

ax.set_xlabel('Attrition Rate (%)', fontsize=12)
ax.set_ylabel('Department', fontsize=12)
ax.set_title('Employee Attrition Rate by Department', fontsize=14, fontweight='bold')

# Add value labels on bars
for i, v in enumerate(dept_attrition.values):
    ax.text(v + 0.5, i, f'{v:.1f}%', va='center', fontsize=10)

plt.tight_layout()
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_3056/3852997355.py:12: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=dept_attrition.values, y=dept_attrition.index, palette='coolwarm', ax=ax)
Thinking

Looking at the notebook state and outputs:

  1. Cell-1 loaded the HR attrition dataset (1470 rows, 35 columns) and calculated overall attrition rate (16.12% Yes, 83.88% No)

  2. Cell-2 created a bar chart showing attrition rate by Department. The output shows a Figure was created.

The original user goal is to "plot attrition rate by department and job role".

Currently:

  • Department attrition plot has been created (Cell-2)
  • Job role attrition plot has NOT been created yet

So the goal is not fully complete - we still need the job role visualization.

However, I need to check the sufficiency heuristics:

  • The latest output (Cell-2) is chart-only (Figure object) without underlying numeric values
  • For ranking tasks, I should not choose Path A from chart-only output

But more importantly, the original goal asks for BOTH department AND job role. We only have department done.

So I should choose Path B - provide the job role attrition plot since that's the missing piece to complete the original goal.

Let me create a similar bar chart for JobRole attrition rates.

The department chart shows attrition rates across the three departments. Now I need to create the job role visualization to complete the original goal.

Plot attrition by job role
import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(12, 8))

# Calculate attrition rate by JobRole
job_attrition = df.groupby('JobRole')['Attrition'].apply(lambda x: (x == 'Yes').mean() * 100).sort_values(ascending=False)

# Create horizontal bar plot
sns.barplot(x=job_attrition.values, y=job_attrition.index, palette='coolwarm', ax=ax)

ax.set_xlabel('Attrition Rate (%)', fontsize=12)
ax.set_ylabel('Job Role', fontsize=12)
ax.set_title('Employee Attrition Rate by Job Role', fontsize=14, fontweight='bold')

# Add value labels on bars
for i, v in enumerate(job_attrition.values):
    ax.text(v + 0.5, i, f'{v:.1f}%', va='center', fontsize=10)

plt.tight_layout()
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_3056/1805525421.py:12: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=job_attrition.values, y=job_attrition.index, palette='coolwarm', ax=ax)
Thinking

Looking at the notebook state, I have:

  1. Dataset loaded with 1470 rows and 35 columns including 'Department', 'JobRole', and 'Attrition'
  2. Overall attrition rate: ~16.12%
  3. Cell-2 created a department attrition rate chart (output shows figure was created)
  4. Cell-3 created a job role attrition rate chart (output shows figure was created)

Both visualizations have been created successfully. However, I notice the outputs only show "<Figure size...>" which means the charts were generated but I don't see the actual numeric values displayed in the output.

According to the sufficiency heuristics:

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."
  • "For ranking tasks (for example top features, feature importance, top-N categories), do not choose Path A from chart-only output; first produce a value table/list via Path B."

Since the outputs only show figure objects without the actual numeric attrition rates displayed, I should choose Path B and print the underlying values in a table format. This will give the user concrete numbers to understand the attrition rates by department and job role.

Let me create a simple table showing the actual attrition rate values.

Both charts have been generated successfully. However, I should display the actual numeric values so you can see the exact attrition rates. Let me print a summary table with the underlying data.

Calculate attrition rates by department and role
# Display attrition rates by Department
print("=== Attrition Rate by Department ===")
dept_attrition = df.groupby('Department')['Attrition'].apply(lambda x: (x == 'Yes').mean() * 100).sort_values(ascending=False)
print(dept_attrition.round(1).to_string())

print("\n=== Attrition Rate by Job Role ===")
job_attrition = df.groupby('JobRole')['Attrition'].apply(lambda x: (x == 'Yes').mean() * 100).sort_values(ascending=False)
print(job_attrition.round(1).to_string())
=== Attrition Rate by Department ===
Department
Sales                     20.6
Human Resources           19.0
Research & Development    13.8

=== Attrition Rate by Job Role ===
JobRole
Sales Representative         39.8
Laboratory Technician        23.9
Human Resources              23.1
Sales Executive              17.5
Research Scientist           16.1
Manufacturing Director        6.9
Healthcare Representative     6.9
Manager                       4.9
Research Director             2.5
compare monthly income distribution for employees who left vs stayed
Thinking

The user wants to compare monthly income distribution for employees who left (Attrition='Yes') vs those who stayed (Attrition='No'). This is a new analysis step. I should create a visualization that shows the distribution of MonthlyIncome for both groups.

A good approach would be to use:

  1. A box plot or violin plot to show the distribution
  2. Or overlapping histograms/density plots

Let me create a clear visualization comparing MonthlyIncome by Attrition status. I'll use a box plot combined with a violin plot or just a clean box plot with swarm plot for individual points.

Let me write one code block for this analysis.

I'll create a visualization comparing the MonthlyIncome distribution between employees who left and those who stayed.

Plot income by attrition
import seaborn as sns
import matplotlib.pyplot as plt

# Set style
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# Create box plot of monthly income by attrition status
sns.boxplot(x='Attrition', y='MonthlyIncome', data=df, palette='coolwarm', ax=ax, order=['No', 'Yes'])

# Add title and labels
ax.set_xlabel('Attrition Status', fontsize=12)
ax.set_ylabel('Monthly Income (USD)', fontsize=12)
ax.set_title('Monthly Income Distribution: Employees Who Left vs Stayed', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_3056/932604478.py:9: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x='Attrition', y='MonthlyIncome', data=df, palette='coolwarm', ax=ax, order=['No', 'Yes'])
Thinking

The user's original goal is to "compare monthly income distribution for employees who left vs stayed". Looking at the notebook state, cell-5 has executed a box plot comparing MonthlyIncome by Attrition status (Yes/No). The output shows a figure was generated.

However, the box plot alone doesn't provide the actual numeric values - we can see the visualization but not the underlying statistics like mean, median, min, max income for each group. According to the sufficiency heuristics:

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

So I should choose Path B and provide code to display the actual numeric statistics for MonthlyIncome by Attrition status. This will give concrete values to complement the visualization.

Let me write minimal code to show the underlying statistics.

The box plot in cell-5 shows the MonthlyIncome distribution visually, but we don't have the actual numeric values to make a concrete comparison. Let me extract the underlying statistics.

Describe income by attrition
# Display MonthlyIncome statistics by Attrition status
df.groupby('Attrition')['MonthlyIncome'].describe()
             count         mean          std     min     25%     50%     75%      max
Attrition
No          1233.0  6832.739659  4818.208001  1051.0  3211.0  5204.0  8834.0  19999.0
Yes          237.0  4787.092827  3640.210367  1009.0  2373.0  3202.0  5916.0  19859.0
Thinking

The output from cell-6 provides comprehensive statistics for MonthlyIncome by Attrition status. I now have all the numeric data needed to answer the original goal of comparing monthly income distribution between employees who left vs stayed.

Key insights from the data:

  • Employees who stayed (No): mean income 6832.74 USD, median 5204.0 USD
  • Employees who left (Yes): mean income 4787.09 USD, median 3202.0 USD
  • The difference is substantial: those who left earned about 2045.65 USD less on average
  • The 25th percentile: 3211.0 USD (stayed) vs 2373.0 USD (left)
  • The 75th percentile: 8834.0 USD (stayed) vs 5916.0 USD (left)

This clearly shows employees who left had significantly lower monthly income across all percentiles. The box plot in cell-5 visualizes this, and cell-6 provides the numeric backing.

This is sufficient to answer the original goal - I should provide Path A with a final answer.

The data clearly shows a significant income gap between employees who left and those who stayed:

Key Findings:

Metric            Stayed (No)    Left (Yes)     Difference
Mean Income       6832.74 USD    4787.09 USD    2045.65 USD lower
Median Income     5204.0 USD     3202.0 USD     2002.0 USD lower
25th Percentile   3211.0 USD     2373.0 USD     838.0 USD lower
75th Percentile   8834.0 USD     5916.0 USD     2918.0 USD lower

Interpretation:

Employees who left earned substantially less across the entire income distribution. On average, those who attrited made about 2045.65 USD less per month than those who stayed. The median tells a similar story - leavers earned 3202.0 USD compared to 5204.0 USD for stayers.

The gap is even more pronounced at the higher end: the 75th percentile for stayers (8834.0 USD) is nearly 3000 USD higher than for leavers (5916.0 USD). This suggests lower compensation is a strong factor associated with employee attrition.
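
The differences quoted above can be re-derived directly from the `describe()` output. The following sketch hard-codes the values printed by the notebook and subtracts them; the dictionary names are illustrative.

```python
# Values copied from the describe() output for MonthlyIncome by Attrition.
stayed = {"mean": 6832.739659, "25%": 3211.0, "50%": 5204.0, "75%": 8834.0}
left = {"mean": 4787.092827, "25%": 2373.0, "50%": 3202.0, "75%": 5916.0}

# Gap between stayers and leavers for each statistic, rounded to two decimals.
gaps = {stat: round(stayed[stat] - left[stat], 2) for stat in stayed}
for stat, gap in gaps.items():
    print(f"{stat}: leavers are {gap:.2f} USD lower")
```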

show correlation between job satisfaction, work-life balance and attrition
Thinking

The user wants to analyze the correlation between job satisfaction, work-life balance, and attrition. I need to first check what columns are available in the dataset related to these factors. Based on the dataset structure from cell-1, I should look for columns like JobSatisfaction, WorkLifeBalance, and Attrition.

Let me create a code block to examine these relationships. I'll start by checking the available columns and their values, then create a visualization to show the correlations.

Let me examine the relationship between JobSatisfaction, WorkLifeBalance, and Attrition. First, I'll check the available columns and their value distributions.

Check categorical value counts
# Check the satisfaction and work-life balance columns
print("=== Job Satisfaction Values ===")
print(df['JobSatisfaction'].value_counts().sort_index())

print("\n=== Work Life Balance Values ===")
print(df['WorkLifeBalance'].value_counts().sort_index())

print("\n=== Attrition Values ===")
print(df['Attrition'].value_counts())
=== Job Satisfaction Values ===
JobSatisfaction
1    289
2    280
3    442
4    459
Name: count, dtype: int64

=== Work Life Balance Values ===
WorkLifeBalance
1     80
2    344
3    893
4    153
Name: count, dtype: int64

=== Attrition Values ===
Attrition
No     1233
Yes     237
Name: count, dtype: int64
Thinking

Looking at the notebook state, I can see that cell-7 has just been executed and shows the value distributions for JobSatisfaction, WorkLifeBalance, and Attrition. However, this only shows the raw counts - it doesn't show the correlation or relationship between these variables and attrition.

The original user goal is to "show correlation between job satisfaction, work-life balance and attrition". The current outputs only show:

  • JobSatisfaction distribution (1-4 scale)
  • WorkLifeBalance distribution (1-4 scale)
  • Attrition counts (No: 1233, Yes: 237)

What's missing is the actual relationship/correlation between these satisfaction scores and attrition status. I need to show:

  1. How attrition rates vary by JobSatisfaction level
  2. How attrition rates vary by WorkLifeBalance level

This is Path B - I need to gather more evidence to answer the original goal. Let me create a code block that shows attrition rates by satisfaction levels.

The current outputs show the value distributions for JobSatisfaction and WorkLifeBalance (both on 1-4 scales), but don't yet show how these relate to attrition. I need to calculate attrition rates at each satisfaction level to reveal the correlation.

Calculate attrition by satisfaction and balance
# Calculate attrition rate by Job Satisfaction level
print("=== Attrition Rate by Job Satisfaction ===")
job_sat_attrition = df.groupby('JobSatisfaction')['Attrition'].apply(lambda x: (x == 'Yes').mean() * 100).sort_index()
print(job_sat_attrition.round(1).to_string())

print("\n=== Attrition Rate by Work Life Balance ===")
wlb_attrition = df.groupby('WorkLifeBalance')['Attrition'].apply(lambda x: (x == 'Yes').mean() * 100).sort_index()
print(wlb_attrition.round(1).to_string())
=== Attrition Rate by Job Satisfaction ===
JobSatisfaction
1    22.8
2    16.4
3    16.5
4    11.3

=== Attrition Rate by Work Life Balance ===
WorkLifeBalance
1    31.2
2    16.9
3    14.2
4    17.6
Thinking

The user wants to see the correlation between job satisfaction, work-life balance, and attrition. Looking at the latest outputs (cell-8), I now have the attrition rates by Job Satisfaction and Work Life Balance levels:

Job Satisfaction vs Attrition:

  • Level 1 (lowest): 22.8% attrition
  • Level 2: 16.4% attrition
  • Level 3: 16.5% attrition
  • Level 4 (highest): 11.3% attrition

This shows a clear negative correlation - as job satisfaction increases, attrition decreases.

Work Life Balance vs Attrition:

  • Level 1 (worst): 31.2% attrition
  • Level 2: 16.9% attrition
  • Level 3: 14.2% attrition
  • Level 4 (best): 17.6% attrition

This shows that poor work-life balance (level 1) has the highest attrition rate, but interestingly level 4 has slightly higher attrition than level 3.

These outputs directly answer the original goal of showing correlation between job satisfaction, work-life balance, and attrition. The numeric values clearly demonstrate the relationships. I should provide a final answer summarizing these insights.

However, a visualization would make this clearer and more impactful. Let me consider whether to provide a final answer (Path A) or create a visualization (Path B).

The outputs contain explicit numeric values that answer the goal. According to the sufficiency heuristics: "Prefer Path A whenever observed outputs already answer the goal." The numeric attrition rates by satisfaction level clearly show the correlation.

I'll choose Path A and provide a natural final answer with the insights from the observed outputs.

The data reveals clear correlations between job satisfaction, work-life balance, and employee attrition:

Job Satisfaction vs Attrition

There's a strong negative correlation - as satisfaction increases, attrition decreases:

  • Satisfaction Level 1 (lowest): 22.8% attrition
  • Satisfaction Level 2: 16.4% attrition
  • Satisfaction Level 3: 16.5% attrition
  • Satisfaction Level 4 (highest): 11.3% attrition

Employees with the lowest job satisfaction are 2x more likely to leave compared to those with the highest satisfaction.

Work-Life Balance vs Attrition

Poor work-life balance is the strongest predictor of attrition:

  • Balance Level 1 (worst): 31.2% attrition
  • Balance Level 2: 16.9% attrition
  • Balance Level 3: 14.2% attrition
  • Balance Level 4 (best): 17.6% attrition

Employees reporting the worst work-life balance have more than double the attrition rate (31.2%) compared to those with good balance (14.2% at level 3). Interestingly, level 4 shows a slight uptick to 17.6%, which may indicate other factors at play for that small group (only 153 employees).
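
The "2x" and "more than double" statements above can be checked by dividing the printed attrition rates. A quick sketch using the rounded percentages from the grouped output (variable names are illustrative):

```python
# Attrition rates (%) copied from the grouped output above.
job_satisfaction = {1: 22.8, 2: 16.4, 3: 16.5, 4: 11.3}
work_life_balance = {1: 31.2, 2: 16.9, 3: 14.2, 4: 17.6}

# Ratio of the lowest level to the comparison level used in the text.
js_ratio = job_satisfaction[1] / job_satisfaction[4]
wlb_ratio = work_life_balance[1] / work_life_balance[3]

print(f"Job satisfaction, level 1 vs level 4: {js_ratio:.2f}x")   # 2.02x
print(f"Work-life balance, level 1 vs level 3: {wlb_ratio:.2f}x")  # 2.20x
```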

Key Takeaway: Work-life balance appears to be a stronger driver of attrition than job satisfaction overall, with the worst balance category showing the highest attrition rate across all factors examined.
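
The run above answers the prompt with grouped attrition rates rather than correlation coefficients. One way to obtain a correlation matrix, and from it the heatmap listed in the expected outcomes, is to encode `Attrition` as 0/1 and call `DataFrame.corr()`. The sketch below uses a tiny made-up DataFrame so it runs offline; `AttritionFlag` is a hypothetical helper column, and the printed numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names match the dataset, values do not.
toy = pd.DataFrame(
    {
        "JobSatisfaction": [1, 2, 3, 4, 1, 4, 2, 3],
        "WorkLifeBalance": [1, 3, 3, 4, 2, 3, 2, 4],
        "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "Yes", "No"],
    }
)

# Encode the Yes/No label as 1/0 so it can enter a numeric correlation.
toy["AttritionFlag"] = (toy["Attrition"] == "Yes").astype(int)

# Pearson correlation by default; method="spearman" may suit 1-4 ordinal scales better.
corr = toy[["JobSatisfaction", "WorkLifeBalance", "AttritionFlag"]].corr()
print(corr.round(2))

# With seaborn available, the matrix could be drawn as a heatmap:
# sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```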

Score cards for qwen3.5:397b for HR Employee Attrition Analysis in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The workflow covers all requested analyses: dataset loading, overall attrition rate, department/job role attrition, income comparison, and satisfaction/work-life balance correlations.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and the outputs indicate it ran successfully, including the expected dataset shape and computed summaries.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: 16.12% attrition, shape (1470, 35), department/job role attrition patterns, lower income for leavers, and higher attrition with low satisfaction and poor work-life balance.

  • Reasoning Quality: 2/2

    Excellent

    The notebook reasoning is clear and aligned with the analysis goals, correctly interpreting the charts and summary statistics.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and evidence-based, with no major hallucinations or contradictory claims.
