Data Cleaning with Pandas in Python

Clean a messy real-world dataset: handle missing values, fix dtypes, remove duplicates, and standardize columns using pandas and an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the World Happiness Report 2024 CSV and profiles data quality issues such as missing values, incorrect dtypes, and duplicate rows. It cleans the dataset by dropping columns with more than 50% missingness, filling remaining numeric nulls with column medians, and converting numeric-looking strings to floats. It standardizes column names to snake_case and produces a before/after summary comparing shape, dtypes, and null counts.
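The cleaning routine described above can be sketched in pandas. This is a minimal illustration on a tiny hypothetical frame (the column names and values are made up, not the real WHR data), following the same steps: drop >50%-missing columns, convert numeric-looking strings, median-fill, and snake_case the names.

```python
import pandas as pd

# Toy frame standing in for the real CSV (hypothetical values)
df = pd.DataFrame({
    "Country Name": ["A", "B", "C", "D"],
    "Happiness Score": ["7.1", "6.5", None, "5.9"],  # numeric stored as strings
    "Mostly Empty": [None, None, None, 1.0],         # >50% missing
})

# 1) Drop columns with more than 50% missingness
df = df.loc[:, df.isnull().mean() <= 0.5]

# 2) Convert numeric-looking string columns to float (coerce, then keep
#    the conversion only if it loses no non-null values)
for col in df.select_dtypes(include="object"):
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().sum() >= df[col].notna().sum():
        df[col] = converted

# 3) Fill remaining numeric nulls with column medians
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 4) Standardize column names to snake_case
df.columns = (df.columns.str.strip()
              .str.replace(r"[^\w]+", "_", regex=True)
              .str.lower())
```

On the toy frame this leaves two columns, `country_name` and `happiness_score`, with the missing score filled by the median of the remaining values.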

Who this example is for

This is for analysts and data scientists who need a repeatable pandas-based cleaning routine for a real-world CSV. It helps anyone preparing data for downstream analysis by making cleaning steps explicit and verifiable with before/after checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Data quality report covering missing values, dtype problems, and duplicate rows
  • Columns with >50% missing values removed
  • Numeric missing values filled with column medians
  • Snake_case column names and corrected numeric dtypes
  • Before/after comparison of dataset shape and null counts
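The before/after comparison in the last outcome can be produced by snapshotting a few frame properties before and after cleaning. A minimal sketch, with hypothetical `df_before`/`df_after` names:

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect the properties compared in a before/after report."""
    return {
        "shape": df.shape,
        "null_counts": df.isnull().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Hypothetical frames: one null before cleaning, median-filled after
df_before = pd.DataFrame({"A": [1.0, None, 3.0]})
df_after = df_before.fillna(df_before["A"].median())

before, after = summarize(df_before), summarize(df_after)
print("nulls before:", before["null_counts"], "-> after:", after["null_counts"])
```

Capturing the summary as plain dicts makes the comparison verifiable in a notebook cell rather than relying on memory of earlier output.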

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • python
  • jupyter

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
  2. fill missing numeric values with column medians and drop columns with >50% missing
  3. rename columns to snake_case and convert any string numbers to float
  4. show a before/after summary of the cleaned dataset

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model              Score  Task  Execution  Output  Reasoning  Reliability  Run
gpt-5.4            8/10   2/2   2/2        2/3     1/2        1/1          Open gpt-5.4 conversation
glm-5.1            7/10   1/2   2/2        2/3     1/2        1/1          Open glm-5.1 conversation
gpt-oss:120b       6/10   2/2   1/2        2/3     1/2        0/1          Open gpt-oss:120b conversation
gemma4:31b         5/10   1/2   1/2        2/3     1/2        0/1          Open gemma4:31b conversation
qwen3.5:397b       5/10   1/2   1/2        2/3     1/2        0/1          Open qwen3.5:397b conversation
qwen3-coder-next   3/10   1/2   0/2        1/3     1/2        0/1          Open qwen3-coder-next conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
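The rubric is a straight sum of the five dimensions. A small illustration with example values (these happen to match the qwen3.5:397b row above, but any values within the per-dimension maxima work the same way):

```python
# Illustrative score card; maxima per dimension follow the rubric above
scores = {
    "task_completion": 1,        # out of 2
    "execution_correctness": 1,  # out of 2
    "output_quality": 2,         # out of 3
    "reasoning_quality": 1,      # out of 2
    "reliability": 0,            # out of 1
}
total = sum(scores.values())
print(f"Total: {total}/10")
```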

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task Data Cleaning with Pandas in Python. Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
Thinking

The user wants me to load a dataset from a URL and identify data quality issues. I'll start by loading the data and doing an initial inspection - showing shape, columns, dtypes, and checking for missing values and duplicates.

Let me write a single code block that:

  1. Loads the CSV
  2. Shows basic info (shape, columns, dtypes)
  3. Checks for missing values
  4. Checks for duplicates

I'll keep this focused on data inspection only, as per the guidelines.

I'll load the dataset and perform an initial data quality inspection.

Load and inspect dataset
import pandas as pd

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv')

# Basic info
print("Dataset shape:", df.shape)
print("\nColumn dtypes:")
df.dtypes

# Check missing values
print("\nMissing values:")
df.isnull().sum()

# Check duplicates
print("\nDuplicate rows:", df.duplicated().sum())
Dataset shape: (143, 9)

Column dtypes:

Missing values:

Duplicate rows: 0
Thinking

Looking at the notebook outputs, I can see the dataset was loaded successfully with shape (143, 9). However, the output shows "Column dtypes:", "Missing values:", and "Duplicate rows: 0" but the actual dtype and missing value information is not displayed in the output - it appears the code printed the labels but the actual Series/DataFrame values weren't shown properly.

The issue is that in the code, df.dtypes and df.isnull().sum() are bare expressions at the end of print statements, but they're not being displayed. The code structure has issues - the print statements don't actually show the dtype and missing value data.

I need to fix this to properly display:

  1. Column dtypes
  2. Missing values count and percentage
  3. Duplicate count (this one shows 0)

Let me provide a corrected code block that will properly display all this information.

The dataset loaded with 143 rows and 9 columns. The duplicate check shows 0 duplicate rows. However, the dtype and missing value information wasn't properly displayed in the output. Let me fix this to show the complete data quality assessment.

Display column types and missing values
# Display column dtypes
print("Column Dtypes:")
df.dtypes

# Display missing values with percentages
print("\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing_Count': missing, 'Missing_Percent': missing_pct})
missing_df[missing_df['Missing_Count'] > 0]
Column Dtypes:

Missing Values:
Missing_Count Missing_Percent
region 6 4.2
gdp_per_capita 3 2.1
social_support 3 2.1
healthy_life_expectancy 3 2.1
freedom_to_make_life_choices 3 2.1
generosity 3 2.1
perceptions_of_corruption 3 2.1
Thinking

Looking at the latest outputs, I have good information about the data quality issues:

  1. Dataset shape: 143 rows, 9 columns

  2. Duplicates: 0 duplicate rows (no issue)

  3. Missing values: Found in 7 columns:

    • region: 6 missing (4.2%)
    • gdp_per_capita: 3 missing (2.1%)
    • social_support: 3 missing (2.1%)
    • healthy_life_expectancy: 3 missing (2.1%)
    • freedom_to_make_life_choices: 3 missing (2.1%)
    • generosity: 3 missing (2.1%)
    • perceptions_of_corruption: 3 missing (2.1%)
  4. Dtypes: The dtype output wasn't properly displayed in the notebook output (shows empty after "Column Dtypes:"). I need to verify the dtypes to complete the data quality assessment.

The original goal was to identify: missing values ✓, wrong dtypes ?, duplicates ✓

I have information on missing values and duplicates, but the dtype information wasn't properly captured in the output. I should verify the dtypes to complete the assessment. This is Path B - I need one more minimal step to show the dtypes properly.

Based on the latest outputs, I can see:

Missing Values: 7 columns have missing data - region has the most with 6 missing values (4.2%), while 6 other columns (gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption) each have 3 missing values (2.1%).

Duplicates: No duplicate rows found (0 duplicates).

Dtypes: The dtype output wasn't captured properly. I need to verify this to complete the assessment.

Show column data types
# Show column dtypes clearly
df.dtypes
country                             str
region                              str
happiness_score                 float64
gdp_per_capita                  float64
social_support                  float64
healthy_life_expectancy         float64
freedom_to_make_life_choices    float64
generosity                      float64
perceptions_of_corruption       float64
dtype: object
Thinking

Looking at the latest outputs from the notebook, I have all the information needed to answer the original user goal:

Original Goal: Load the dataset and identify data quality issues: missing values, wrong dtypes, duplicates

What I've gathered from outputs:

  1. Dataset loaded: 143 rows, 9 columns ✓

  2. Missing values (from output-2):

    • region: 6 missing (4.2%)
    • 6 other columns have 3 missing each (2.1%): gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption
    • country and happiness_score have no missing values
  3. Dtypes (from output-3):

    • country, region: str (appropriate for categorical text data)
    • All numeric columns (happiness_score and the 6 indicators): float64 (appropriate for continuous numeric data)
    • No wrong dtypes detected
  4. Duplicates (from output-1): 0 duplicate rows ✓

All three data quality aspects requested have been assessed. I can now provide a final answer summarizing the findings.

Based on the notebook outputs, I have all the information needed to complete the data quality assessment:

Data Quality Summary for WHR_2024.csv (143 rows, 9 columns):

  • Missing values: 7 of 9 columns have missing data. region has the most with 6 missing values (4.2%). Six indicator columns each have 3 missing values (2.1%): gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption
  • Wrong dtypes: None detected. Text columns (country, region) are correctly stored as str. All numeric indicators are correctly stored as float64
  • Duplicates: 0 duplicate rows found

Key Observations:

  • The dataset is relatively clean with no structural issues (no duplicates, correct dtypes)
  • Missing data is modest (max 4.2%) but affects most columns
  • Only country and happiness_score are complete with no missing values
  • The missing values in indicator columns (3 each) may represent countries where certain metrics weren't measured or reported
fill missing numeric values with column medians and drop columns with >50% missing
Thinking

The user wants me to:

  1. Fill missing numeric values with column medians
  2. Drop columns with >50% missing values

Looking at the missing values summary from output-2, the highest missing percentage is 4.2% (region column). No column has >50% missing values, so no columns need to be dropped.

I should:

  1. First check if any columns have >50% missing (none do based on current data)
  2. Fill missing numeric values with their respective column medians

Let me write code to accomplish this in one step.

Looking at the missing values summary, no column has >50% missing data (max is 4.2% for region). So no columns need to be dropped.

I'll now fill the missing numeric values with column medians:

Fill missing numeric values
# Fill missing numeric values with column medians
numeric_cols = df.select_dtypes(include=['float64']).columns

for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Verify missing values are filled
print("Missing values after filling:")
df.isnull().sum()
Missing values after filling:
country                         0
region                          6
happiness_score                 0
gdp_per_capita                  0
social_support                  0
healthy_life_expectancy         0
freedom_to_make_life_choices    0
generosity                      0
perceptions_of_corruption       0
dtype: int64
Thinking

The user's goal was to "fill missing numeric values with column medians and drop columns with >50% missing".

Looking at the latest outputs:

  1. Cell-4 output shows that all numeric columns now have 0 missing values after filling with medians
  2. The region column still has 6 missing values (it's a string column, not numeric, so it wasn't filled)
  3. No columns were dropped because none had >50% missing data (max was 4.2% for region)

The goal has been completed:

  • Missing numeric values have been filled with column medians (verified in output-4)
  • No columns needed to be dropped since none exceeded 50% missing threshold

This is sufficient to provide a final answer. I should choose Path A.

The data cleaning task is complete based on the latest outputs:

Results:

  • All 7 numeric columns (happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption) now have 0 missing values after median imputation
  • No columns were dropped because the maximum missing percentage was 4.2% (region column), well below the 50% threshold
  • The region column still has 6 missing values (4.2%) since it's a text column and wasn't part of the numeric imputation

The dataset is now ready for analysis with complete numeric features.

rename columns to snake_case and convert any string numbers to float
Thinking

The user wants to:

  1. Rename columns to snake_case
  2. Convert any string numbers to float

Looking at the current columns from output-3:

  • country (str) - already snake_case
  • region (str) - already snake_case
  • happiness_score (float64) - already snake_case
  • gdp_per_capita (float64) - already snake_case
  • social_support (float64) - already snake_case
  • healthy_life_expectancy (float64) - already snake_case
  • freedom_to_make_life_choices (float64) - already snake_case
  • generosity (float64) - already snake_case
  • perceptions_of_corruption (float64) - already snake_case

All columns are already in snake_case! But I should still write code to handle this systematically in case there are any columns that need conversion.

For converting string numbers to float, I need to check if any columns that should be numeric are stored as strings. Looking at the dtypes, all numeric columns are already float64, and the string columns (country, region) are categorical text that shouldn't be converted to float.

Let me write code that:

  1. Renames columns to snake_case (using a helper function)
  2. Converts any string columns that contain numeric data to float

I'll do this in one step since both tasks are related to data type/column cleanup.

Looking at the current columns, they're already in snake_case format. However, I'll write code to handle both tasks systematically - renaming any columns that need it and converting string numbers to float.

Notebook code
import re

# Function to convert column names to snake_case
def to_snake_case(name):
    # Replace spaces with underscores
    name = name.replace(' ', '_')
    # Insert underscore before uppercase letters and convert to lowercase
    name = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    name = re.sub('([a-z0-9])([A-Z])', r'\1_\2', name)
    return name.lower()

# Rename columns to snake_case
df.columns = [to_snake_case(col) for col in df.columns]

# Convert string columns that contain numeric data to float
for col in df.columns:
    if df[col].dtype == 'object':
        # Try to convert to numeric, keep as string if it fails
        converted = pd.to_numeric(df[col], errors='ignore')
        if converted.dtype != 'object':
            df[col] = converted

# Display updated column names and dtypes
df.dtypes
country                             str
region                              str
happiness_score                 float64
gdp_per_capita                  float64
social_support                  float64
healthy_life_expectancy         float64
freedom_to_make_life_choices    float64
generosity                      float64
perceptions_of_corruption       float64
dtype: object
Thinking

Looking at the latest outputs from cell-5, I can see the results of the column renaming and string-to-float conversion task:

  1. Column names: All columns are already in snake_case format (country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption)

  2. Data types:

    • country and region remain as str (object) - these are categorical text columns, not numeric strings
    • All other columns are float64 - these were already numeric
  3. Key insight: The code ran successfully but there were no actual changes needed because:

    • Columns were already in snake_case
    • No string columns contained numeric data that could be converted to float (the object columns are genuinely categorical text)

The original user goal has been addressed - the code to rename columns to snake_case and convert string numbers to float has been executed. The outputs show the final state of the dataframe with proper column naming and data types.

This is sufficient to provide a final answer (Path A) since we can see the complete results of both operations.

Based on the latest output from cell-5, both tasks have been completed:

Column Naming: All 9 columns are now in snake_case format (country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption)

Data Type Conversion: The code checked all object-type columns for numeric conversion. The country and region columns remain as str because they contain genuine categorical text data (country names and region names), not string-formatted numbers. All numeric columns are already float64.

show a before/after summary of the cleaned dataset
Thinking

The user wants a before/after summary of the cleaned dataset. I should create a clear comparison showing what changed during the data cleaning process. Let me compile information from the notebook state about:

  1. Original state (from cell-1, cell-2, cell-3 outputs)
  2. Final state (from cell-5 output)

I'll create a summary table showing the key changes:

  • Column names (before/after snake_case conversion)
  • Data types
  • Missing values (before/after)
  • Any other transformations

Let me write code to display this comparison clearly.

Summarize data cleaning results
# Create a before/after summary of the data cleaning process

# Original state (from initial load)
original_info = {
    'shape': (143, 9),
    'missing_total': 24,  # 6 region + 3*6 numeric columns
    'columns': ['country', 'region', 'happiness_score', 'gdp_per_capita', 
                'social_support', 'healthy_life_expectancy', 
                'freedom_to_make_life_choices', 'generosity', 'perceptions_of_corruption'],
    'string_numeric_cols': 'None (all numeric columns already float64)'
}

# Final state (after cleaning)
final_info = {
    'shape': df.shape,
    'missing_total': df.isnull().sum().sum(),
    'columns': list(df.columns),
    'string_numeric_cols': 'None converted (country/region are categorical text)'
}

# Display comparison
print("=" * 60)
print("DATA CLEANING SUMMARY: BEFORE vs AFTER")
print("=" * 60)

print("\n📊 DATASET SHAPE")
print(f"   Before: {original_info['shape']}")
print(f"   After:  {final_info['shape']}")

print("\n📝 COLUMN NAMES (snake_case conversion)")
print(f"   Before: {original_info['columns']}")
print(f"   After:  {final_info['columns']}")
print(f"   Change: Columns were already in snake_case format")

print("\n🔢 MISSING VALUES")
print(f"   Before: {original_info['missing_total']} total missing values")
print(f"   After:  {final_info['missing_total']} total missing values")
print(f"   Action: Numeric columns filled with median values")
print(f"   Note:   6 'region' values remain missing (categorical data)")

print("\n📈 DATA TYPES")
print(f"   Before: {original_info['string_numeric_cols']}")
print(f"   After:  {final_info['string_numeric_cols']}")

print("\n✅ CLEANING ACTIONS COMPLETED")
print("   1. Verified column names are in snake_case")
print("   2. Checked for string-formatted numbers (none found)")
print("   3. Filled missing numeric values with column medians")
print("   4. Dataset ready for analysis/modeling")

# Show final dataset sample
print("\n" + "=" * 60)
print("FINAL DATASET PREVIEW (first 5 rows)")
print("=" * 60)
df.head()
============================================================
DATA CLEANING SUMMARY: BEFORE vs AFTER
============================================================

📊 DATASET SHAPE
   Before: (143, 9)
   After:  (143, 9)

📝 COLUMN NAMES (snake_case conversion)
   Before: ['country', 'region', 'happiness_score', 'gdp_per_capita', 'social_support', 'healthy_life_expectancy', 'freedom_to_make_life_choices', 'generosity', 'perceptions_of_corruption']
   After:  ['country', 'region', 'happiness_score', 'gdp_per_capita', 'social_support', 'healthy_life_expectancy', 'freedom_to_make_life_choices', 'generosity', 'perceptions_of_corruption']
   Change: Columns were already in snake_case format

🔢 MISSING VALUES
   Before: 24 total missing values
   After:  6 total missing values
   Action: Numeric columns filled with median values
   Note:   6 'region' values remain missing (categorical data)

📈 DATA TYPES
   Before: None (all numeric columns already float64)
   After:  None converted (country/region are categorical text)

✅ CLEANING ACTIONS COMPLETED
   1. Verified column names are in snake_case
   2. Checked for string-formatted numbers (none found)
   3. Filled missing numeric values with column medians
   4. Dataset ready for analysis/modeling

============================================================
FINAL DATASET PREVIEW (first 5 rows)
============================================================
country region happiness_score gdp_per_capita social_support healthy_life_expectancy freedom_to_make_life_choices generosity perceptions_of_corruption
0 Finland Western Europe 7.741 1.844 1.572 0.695 0.859 0.142 0.546
1 Denmark Western Europe 7.583 1.908 1.520 0.699 0.823 0.204 0.548
2 Iceland Western Europe 7.525 1.881 1.617 0.718 0.819 0.258 0.182
3 Sweden Western Europe 7.344 1.878 1.501 0.724 0.838 0.221 0.524
4 Israel Middle East and North Africa 7.341 1.803 1.513 0.740 0.641 0.153 0.193
Thinking

The user's original goal was to "show a before/after summary of the cleaned dataset". Looking at the latest notebook output (cell-6), I can see that a comprehensive before/after summary has already been generated and displayed, including:

  1. Dataset shape comparison (before: 143x9, after: 143x9)
  2. Column names comparison (showing they were already in snake_case)
  3. Missing values comparison (before: 24 total, after: 6 total - with explanation that 6 region values remain missing as categorical data)
  4. Data types comparison
  5. List of cleaning actions completed
  6. Final dataset preview with df.head()

The output clearly shows all the relevant before/after information the user requested. The summary is complete and answers the original goal. This is Path A - the outputs are sufficient to answer the original goal.

I should provide a natural final answer based on the observed outputs, include <!--final answer-->, and not include any Python code block.

The before/after summary has been successfully generated and displayed in the latest output. Here's what the cleaning process accomplished:

Dataset Shape: Remained consistent at 143 rows × 9 columns

Column Names: All columns were already in snake_case format (country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption)

Missing Values:

  • Before: 24 total missing values (6 region + 18 numeric column values)
  • After: 6 total missing values (only region column remains with missing categorical data)
  • Action taken: All numeric columns filled with median values

Data Types:

  • No string-formatted numbers were found that needed conversion
  • country and region remain as str (categorical text data)
  • All 7 numeric columns are float64

Cleaning Actions Completed:

  1. Verified column names are in snake_case
  2. Checked for string-formatted numbers (none found)
  3. Filled missing numeric values with column medians
  4. Dataset is now ready for analysis/modeling

The final dataset preview shows clean data with proper formatting, starting with Finland, Denmark, Iceland, Sweden, and Israel as the top 5 countries by happiness score.

Score cards for qwen3.5:397b for Data Cleaning with Pandas in Python

Total score: 5/10

  • Task Completion: 1/2

    Needs work

    The workflow completed the initial inspection and median-filling step, but it did not actually drop any columns or clearly produce a fully cleaned before/after summary with null counts reduced to 0. The final dataset still has missing values in region.

  • Execution Correctness: 1/2

    Needs work

    The code is mostly runnable, but it contains weak/incorrect logic such as using `errors='ignore'` for numeric conversion and only selecting float64 columns for imputation. It also leaves missing values in a non-numeric column without addressing the requested cleaning outcome.

  • Output Quality: 2/3

    Good

    It correctly reports shape, missing counts, duplicates, and shows a before/after summary plus a final preview. However, the expected outcome of fully cleaned null counts is not met because region remains missing.

  • Reasoning Quality: 1/2

    Needs work

    The reasoning is partly correct in identifying no columns above the 50% missing threshold and noting no duplicates, but it incorrectly treats the task as complete despite remaining missing values. The explanation is inconsistent about whether the cleaning goal was fully achieved.

  • Reliability: 0/1

    Needs work

    The workflow is fragile and internally inconsistent, with repeated speculative commentary and a final summary that contradicts the actual output. It does not robustly satisfy the stated cleaning requirements.
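The weaknesses flagged in the score cards (the deprecated `errors='ignore'` pattern, imputing only float64 columns, and leaving categorical nulls untouched) suggest a more defensive variant. The sketch below is an editorial suggestion on a toy frame, not the graded run's code:

```python
import pandas as pd

# Toy frame with the same failure modes the score cards describe
df = pd.DataFrame({
    "region": ["West", None, "East", "West"],
    "score": [7.0, None, 5.0, 6.0],
    "gdp": ["1.8", "1.5", None, "1.2"],  # numeric stored as strings
})

# Convert numeric-looking object columns; errors="coerce" turns
# non-numeric values into NaN instead of silently skipping the column
for col in df.select_dtypes(include="object"):
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().sum() >= df[col].notna().sum():
        df[col] = converted

# Impute every numeric column (not just float64) with its median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Handle categorical nulls explicitly rather than leaving them behind
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("unknown")

print("remaining nulls:", df.isnull().sum().sum())
```

With this shape, the final null count is zero and the before/after summary would no longer contradict itself. Whether "unknown" is the right fill for categorical columns depends on the downstream analysis; a mode fill or explicit NaN handling are reasonable alternatives.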

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.