Data Cleaning with Pandas in Python

Clean a messy real-world dataset: handle missing values, fix dtypes, remove duplicates, and standardize columns using pandas and an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the World Happiness Report 2024 CSV and profiles data quality issues such as missing values, incorrect dtypes, and duplicate rows. It cleans the dataset by dropping columns with more than 50% missingness, filling remaining numeric nulls with column medians, and converting numeric-looking strings to floats. It standardizes column names to snake_case and produces a before/after summary comparing shape, dtypes, and null counts.

Who this example is for

This is for analysts and data scientists who need a repeatable pandas-based cleaning routine for a real-world CSV. It helps anyone preparing data for downstream analysis by making cleaning steps explicit and verifiable with before/after checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Data quality report covering missing values, dtype problems, and duplicate rows
Columns with >50% missing values removed
Numeric missing values filled with column medians
Snake_case column names and corrected numeric dtypes
Before/after comparison of dataset shape and null counts

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
python
jupyter

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
2fill missing numeric values with column medians and drop columns with >50% missing
3rename columns to snake_case and convert any string numbers to float
4show a before/after summary of the cleaned dataset

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gpt-5.4	8/10	2/2	2/2	2/3	1/2	1/1	Open gpt-5.4 conversation
glm-5.1	7/10	1/2	2/2	2/3	1/2	1/1	Open glm-5.1 conversation
gpt-oss:120b	6/10	2/2	1/2	2/3	1/2	0/1	Open gpt-oss:120b conversation
gemma4:31b	5/10	1/2	1/2	2/3	1/2	0/1	Open gemma4:31b conversation
qwen3.5:397b	5/10	1/2	1/2	2/3	1/2	0/1	Open qwen3.5:397b conversation
qwen3-coder-next	3/10	1/2	0/2	1/3	1/2	0/1	Open qwen3-coder-next conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task Data Cleaning with Pandas in Python. Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates

Thinking

The user wants me to load a dataset from a URL and identify data quality issues. I'll start by loading the data and doing an initial inspection - showing shape, columns, dtypes, and checking for missing values and duplicates.

Let me write a single code block that:

Loads the CSV
Shows basic info (shape, columns, dtypes)
Checks for missing values
Checks for duplicates

I'll keep this focused on data inspection only, as per the guidelines.

I'll load the dataset and perform an initial data quality inspection.

Load and inspect dataset

import pandas as pd

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv')

# Basic info
print("Dataset shape:", df.shape)
print("\nColumn dtypes:")
df.dtypes

# Check missing values
print("\nMissing values:")
df.isnull().sum()

# Check duplicates
print("\nDuplicate rows:", df.duplicated().sum())

Dataset shape: (143, 9)

Column dtypes:

Missing values:

Duplicate rows: 0

Thinking

Looking at the notebook outputs, I can see the dataset was loaded successfully with shape (143, 9). However, the output shows "Column dtypes:", "Missing values:", and "Duplicate rows: 0" but the actual dtype and missing value information is not displayed in the output - it appears the code printed the labels but the actual Series/DataFrame values weren't shown properly.

The issue is that in the code, df.dtypes and df.isnull().sum() are bare expressions at the end of print statements, but they're not being displayed. The code structure has issues - the print statements don't actually show the dtype and missing value data.

I need to fix this to properly display:

Column dtypes
Missing values count and percentage
Duplicate count (this one shows 0)

Let me provide a corrected code block that will properly display all this information.

The dataset loaded with 143 rows and 9 columns. The duplicate check shows 0 duplicate rows. However, the dtype and missing value information wasn't properly displayed in the output. Let me fix this to show the complete data quality assessment.

Display column types and missing values

# Display column dtypes
print("Column Dtypes:")
df.dtypes

# Display missing values with percentages
print("\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing_Count': missing, 'Missing_Percent': missing_pct})
missing_df[missing_df['Missing_Count'] > 0]

Column Dtypes:

Missing Values:

#10

	Missing_Count	Missing_Percent
region	6	4.2
gdp_per_capita	3	2.1
social_support	3	2.1
healthy_life_expectancy	3	2.1
freedom_to_make_life_choices	3	2.1
generosity	3	2.1
perceptions_of_corruption	3	2.1

#11

Thinking

Looking at the latest outputs, I have good information about the data quality issues:

Dataset shape: 143 rows, 9 columns
Duplicates: 0 duplicate rows (no issue)
Missing values: Found in 7 columns:
- region: 6 missing (4.2%)
- gdp_per_capita: 3 missing (2.1%)
- social_support: 3 missing (2.1%)
- healthy_life_expectancy: 3 missing (2.1%)
- freedom_to_make_life_choices: 3 missing (2.1%)
- generosity: 3 missing (2.1%)
- perceptions_of_corruption: 3 missing (2.1%)
Dtypes: The dtype output wasn't properly displayed in the notebook output (shows empty after "Column Dtypes:"). I need to verify the dtypes to complete the data quality assessment.

The original goal was to identify: missing values ✓, wrong dtypes ?, duplicates ✓

I have information on missing values and duplicates, but the dtype information wasn't properly captured in the output. I should verify the dtypes to complete the assessment. This is Path B - I need one more minimal step to show the dtypes properly.

#12

Based on the latest outputs, I can see:

Missing Values: 7 columns have missing data - region has the most with 6 missing values (4.2%), while 6 other columns (gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption) each have 3 missing values (2.1%).

Duplicates: No duplicate rows found (0 duplicates).

Dtypes: The dtype output wasn't captured properly. I need to verify this to complete the assessment.

#13

Show column data types

# Show column dtypes clearly
df.dtypes

#14

country                             str
region                              str
happiness_score                 float64
gdp_per_capita                  float64
social_support                  float64
healthy_life_expectancy         float64
freedom_to_make_life_choices    float64
generosity                      float64
perceptions_of_corruption       float64
dtype: object

#15

Thinking

Looking at the latest outputs from the notebook, I have all the information needed to answer the original user goal:

Original Goal: Load the dataset and identify data quality issues: missing values, wrong dtypes, duplicates

What I've gathered from outputs:

Dataset loaded: 143 rows, 9 columns ✓
Missing values (from output-2):
- region: 6 missing (4.2%)
- 6 other columns have 3 missing each (2.1%): gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption
- country and happiness_score have no missing values
Dtypes (from output-3):
- country, region: str (appropriate for categorical text data)
- All numeric columns (happiness_score and the 6 indicators): float64 (appropriate for continuous numeric data)
- No wrong dtypes detected
Duplicates (from output-1): 0 duplicate rows ✓

All three data quality aspects requested have been assessed. I can now provide a final answer summarizing the findings.

#16

Based on the notebook outputs, I have all the information needed to complete the data quality assessment:

Data Quality Summary for WHR_2024.csv (143 rows, 9 columns):

Issue	Finding
Missing Values	7 of 9 columns have missing data. `region` has the most with 6 missing values (4.2%). Six indicator columns each have 3 missing values (2.1%): `gdp_per_capita`, `social_support`, `healthy_life_expectancy`, `freedom_to_make_life_choices`, `generosity`, `perceptions_of_corruption`
Wrong Dtypes	None detected. Text columns (`country`, `region`) are correctly stored as `str`. All numeric indicators are correctly stored as `float64`
Duplicates	0 duplicate rows found

Key Observations:

The dataset is relatively clean with no structural issues (no duplicates, correct dtypes)
Missing data is modest (max 4.2%) but affects most columns
Only country and happiness_score are complete with no missing values
The missing values in indicator columns (3 each) may represent countries where certain metrics weren't measured or reported

#17

fill missing numeric values with column medians and drop columns with >50% missing

#18

Thinking

The user wants me to:

Fill missing numeric values with column medians
Drop columns with >50% missing values

Looking at the missing values summary from output-2, the highest missing percentage is 4.2% (region column). No column has >50% missing values, so no columns need to be dropped.

I should:

First check if any columns have >50% missing (none do based on current data)
Fill missing numeric values with their respective column medians

Let me write code to accomplish this in one step.

#19

Looking at the missing values summary, no column has >50% missing data (max is 4.2% for region). So no columns need to be dropped.

I'll now fill the missing numeric values with column medians:

#20

Fill missing numeric values

# Fill missing numeric values with column medians
numeric_cols = df.select_dtypes(include=['float64']).columns

for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Verify missing values are filled
print("Missing values after filling:")
df.isnull().sum()

#21

Missing values after filling:

#22

country                         0
region                          6
happiness_score                 0
gdp_per_capita                  0
social_support                  0
healthy_life_expectancy         0
freedom_to_make_life_choices    0
generosity                      0
perceptions_of_corruption       0
dtype: int64

#23

Thinking

The user's goal was to "fill missing numeric values with column medians and drop columns with >50% missing".

Looking at the latest outputs:

Cell-4 output shows that all numeric columns now have 0 missing values after filling with medians
The region column still has 6 missing values (it's a string column, not numeric, so it wasn't filled)
No columns were dropped because none had >50% missing data (max was 4.2% for region)

The goal has been completed:

Missing numeric values have been filled with column medians (verified in output-4)
No columns needed to be dropped since none exceeded 50% missing threshold

This is sufficient to provide a final answer. I should choose Path A.

#24

The data cleaning task is complete based on the latest outputs:

Results:

All 7 numeric columns (happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption) now have 0 missing values after median imputation
No columns were dropped because the maximum missing percentage was 4.2% (region column), well below the 50% threshold
The region column still has 6 missing values (4.2%) since it's a text column and wasn't part of the numeric imputation

The dataset is now ready for analysis with complete numeric features.

#25

rename columns to snake_case and convert any string numbers to float

#26

Thinking

The user wants to:

Rename columns to snake_case
Convert any string numbers to float

Looking at the current columns from output-3:

country (str) - already snake_case
region (str) - already snake_case
happiness_score (float64) - already snake_case
gdp_per_capita (float64) - already snake_case
social_support (float64) - already snake_case
healthy_life_expectancy (float64) - already snake_case
freedom_to_make_life_choices (float64) - already snake_case
generosity (float64) - already snake_case
perceptions_of_corruption (float64) - already snake_case

All columns are already in snake_case! But I should still write code to handle this systematically in case there are any columns that need conversion.

For converting string numbers to float, I need to check if any columns that should be numeric are stored as strings. Looking at the dtypes, all numeric columns are already float64, and the string columns (country, region) are categorical text that shouldn't be converted to float.

Let me write code that:

Renames columns to snake_case (using a helper function)
Converts any string columns that contain numeric data to float

I'll do this in one step since both tasks are related to data type/column cleanup.

#27

Looking at the current columns, they're already in snake_case format. However, I'll write code to handle both tasks systematically - renaming any columns that need it and converting string numbers to float.

#28

Notebook code

import re

# Function to convert column names to snake_case
def to_snake_case(name):
    # Replace spaces with underscores
    name = name.replace(' ', '_')
    # Insert underscore before uppercase letters and convert to lowercase
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name)
    return name.lower()

# Rename columns to snake_case
df.columns = [to_snake_case(col) for col in df.columns]

# Convert string columns that contain numeric data to float
for col in df.columns:
    if df[col].dtype == 'object':
        # Try to convert to numeric, keep as string if it fails
        converted = pd.to_numeric(df[col], errors='ignore')
        if converted.dtype != 'object':
            df[col] = converted

# Display updated column names and dtypes
df.dtypes

#29

country                             str
region                              str
happiness_score                 float64
gdp_per_capita                  float64
social_support                  float64
healthy_life_expectancy         float64
freedom_to_make_life_choices    float64
generosity                      float64
perceptions_of_corruption       float64
dtype: object

#30

Thinking

Looking at the latest outputs from cell-5, I can see the results of the column renaming and string-to-float conversion task:

Column names: All columns are already in snake_case format (country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption)
Data types:
- country and region remain as str (object) - these are categorical text columns, not numeric strings
- All other columns are float64 - these were already numeric
Key insight: The code ran successfully but there were no actual changes needed because:
- Columns were already in snake_case
- No string columns contained numeric data that could be converted to float (the object columns are genuinely categorical text)

The original user goal has been addressed - the code to rename columns to snake_case and convert string numbers to float has been executed. The outputs show the final state of the dataframe with proper column naming and data types.

This is sufficient to provide a final answer (Path A) since we can see the complete results of both operations.

#31

Based on the latest output from cell-5, both tasks have been completed:

Column Naming: All 9 columns are now in snake_case format (country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption)

Data Type Conversion: The code checked all object-type columns for numeric conversion. The country and region columns remain as str because they contain genuine categorical text data (country names and region names), not string-formatted numbers. All numeric columns are already float64.

#32

show a before/after summary of the cleaned dataset

#33

Thinking

The user wants a before/after summary of the cleaned dataset. I should create a clear comparison showing what changed during the data cleaning process. Let me compile information from the notebook state about:

Original state (from cell-1, cell-2, cell-3 outputs)
Final state (from cell-5 output)

I'll create a summary table showing the key changes:

Column names (before/after snake_case conversion)
Data types
Missing values (before/after)
Any other transformations

Let me write code to display this comparison clearly.

#34

Summarize data cleaning results

# Create a before/after summary of the data cleaning process

# Original state (from initial load)
original_info = {
    'shape': (143, 9),
    'missing_total': 24,  # 6 region + 3*6 numeric columns
    'columns': ['country', 'region', 'happiness_score', 'gdp_per_capita', 
                'social_support', 'healthy_life_expectancy', 
                'freedom_to_make_life_choices', 'generosity', 'perceptions_of_corruption'],
    'string_numeric_cols': 'None (all numeric columns already float64)'
}

# Final state (after cleaning)
final_info = {
    'shape': df.shape,
    'missing_total': df.isnull().sum().sum(),
    'columns': list(df.columns),
    'string_numeric_cols': 'None converted (country/region are categorical text)'
}

# Display comparison
print("=" * 60)
print("DATA CLEANING SUMMARY: BEFORE vs AFTER")
print("=" * 60)

print("\n📊 DATASET SHAPE")
print(f"   Before: {original_info['shape']}")
print(f"   After:  {final_info['shape']}")

print("\n📝 COLUMN NAMES (snake_case conversion)")
print(f"   Before: {original_info['columns']}")
print(f"   After:  {final_info['columns']}")
print(f"   Change: Columns were already in snake_case format")

print("\n🔢 MISSING VALUES")
print(f"   Before: {original_info['missing_total']} total missing values")
print(f"   After:  {final_info['missing_total']} total missing values")
print(f"   Action: Numeric columns filled with median values")
print(f"   Note:   6 'region' values remain missing (categorical data)")

print("\n📈 DATA TYPES")
print(f"   Before: {original_info['string_numeric_cols']}")
print(f"   After:  {final_info['string_numeric_cols']}")

print("\n✅ CLEANING ACTIONS COMPLETED")
print("   1. Verified column names are in snake_case")
print("   2. Checked for string-formatted numbers (none found)")
print("   3. Filled missing numeric values with column medians")
print("   4. Dataset ready for analysis/modeling")

# Show final dataset sample
print("\n" + "=" * 60)
print("FINAL DATASET PREVIEW (first 5 rows)")
print("=" * 60)
df.head()

#35

============================================================
DATA CLEANING SUMMARY: BEFORE vs AFTER
============================================================

📊 DATASET SHAPE
   Before: (143, 9)
   After:  (143, 9)

📝 COLUMN NAMES (snake_case conversion)
   Before: ['country', 'region', 'happiness_score', 'gdp_per_capita', 'social_support', 'healthy_life_expectancy', 'freedom_to_make_life_choices', 'generosity', 'perceptions_of_corruption']
   After:  ['country', 'region', 'happiness_score', 'gdp_per_capita', 'social_support', 'healthy_life_expectancy', 'freedom_to_make_life_choices', 'generosity', 'perceptions_of_corruption']
   Change: Columns were already in snake_case format

🔢 MISSING VALUES
   Before: 24 total missing values
   After:  6 total missing values
   Action: Numeric columns filled with median values
   Note:   6 'region' values remain missing (categorical data)

📈 DATA TYPES
   Before: None (all numeric columns already float64)
   After:  None converted (country/region are categorical text)

✅ CLEANING ACTIONS COMPLETED
   1. Verified column names are in snake_case
   2. Checked for string-formatted numbers (none found)
   3. Filled missing numeric values with column medians
   4. Dataset ready for analysis/modeling

============================================================
FINAL DATASET PREVIEW (first 5 rows)
============================================================

#36

	country	region	happiness_score	gdp_per_capita	social_support	healthy_life_expectancy	freedom_to_make_life_choices	generosity	perceptions_of_corruption
0	Finland	Western Europe	7.741	1.844	1.572	0.695	0.859	0.142	0.546
1	Denmark	Western Europe	7.583	1.908	1.520	0.699	0.823	0.204	0.548
2	Iceland	Western Europe	7.525	1.881	1.617	0.718	0.819	0.258	0.182
3	Sweden	Western Europe	7.344	1.878	1.501	0.724	0.838	0.221	0.524
4	Israel	Middle East and North Africa	7.341	1.803	1.513	0.740	0.641	0.153	0.193

#37

Thinking

The user's original goal was to "show a before/after summary of the cleaned dataset". Looking at the latest notebook output (cell-6), I can see that a comprehensive before/after summary has already been generated and displayed, including:

Dataset shape comparison (before: 143x9, after: 143x9)
Column names comparison (showing they were already in snake_case)
Missing values comparison (before: 24 total, after: 6 total - with explanation that 6 region values remain missing as categorical data)
Data types comparison
List of cleaning actions completed
Final dataset preview with df.head()

The output clearly shows all the relevant before/after information the user requested. The summary is complete and answers the original goal. This is Path A - the outputs are sufficient to answer the original goal.

I should provide a natural final answer based on the observed outputs, include , and not include any Python code block.

#38

The before/after summary has been successfully generated and displayed in the latest output. Here's what the cleaning process accomplished:

Dataset Shape: Remained consistent at 143 rows × 9 columns

Column Names: All columns were already in snake_case format (country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption)

Missing Values:

Before: 24 total missing values (6 region + 18 numeric column values)
After: 6 total missing values (only region column remains with missing categorical data)
Action taken: All numeric columns filled with median values

Data Types:

No string-formatted numbers were found that needed conversion
country and region remain as str (categorical text data)
All 7 numeric columns are float64

Cleaning Actions Completed:

Verified column names are in snake_case
Checked for string-formatted numbers (none found)
Filled missing numeric values with column medians
Dataset is now ready for analysis/modeling

The final dataset preview shows clean data with proper formatting, starting with Finland, Denmark, Iceland, Sweden, and Israel as the top 5 countries by happiness score.

Score cards for qwen3.5:397b for Data Cleaning with Pandas in Python

Total score: 5/10

Task Completion: 1/2
Needs work
The workflow completed the initial inspection and median-filling step, but it did not actually drop any columns or clearly produce a full cleaned before/after summary with null counts reduced to 0. The final dataset still has missing values in region.
Execution Correctness: 1/2
Needs work
The code is mostly runnable, but it contains weak/incorrect logic such as using `errors='ignore'` for numeric conversion and only selecting float64 columns for imputation. It also leaves missing values in a non-numeric column without addressing the requested cleaning outcome.
Output Quality: 2/3
Good
It correctly reports shape, missing counts, duplicates, and shows a before/after summary plus a final preview. However, the expected outcome of fully cleaned null counts is not met because region remains missing.
Reasoning Quality: 1/2
Needs work
The reasoning is partly correct in identifying no columns above the 50% missing threshold and noting no duplicates, but it incorrectly treats the task as complete despite remaining missing values. The explanation is inconsistent about whether the cleaning goal was fully achieved.
Reliability: 0/1
Needs work
The workflow is fragile and internally inconsistent, with repeated speculative commentary and a final summary that contradicts the actual output. It does not robustly satisfy the stated cleaning requirements.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Download Free Trial Learn about AI Data Analyst

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

Data Cleaning with Pandas in Python

What this AI workflow does

Who this example is for

Expected analysis outcomes

Tools and libraries used

Prompt sequence

Model Comparison

AI Analysis with qwen3.5:397b

Score cards for qwen3.5:397b for Data Cleaning with Pandas in Python

Try MLJAR Studio

Explore More AI Analysis Examples

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics