Data Cleaning with Pandas in Python

Clean a messy real-world dataset: handle missing values, fix dtypes, remove duplicates, and standardize columns using pandas and an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the World Happiness Report 2024 CSV and profiles data quality issues such as missing values, incorrect dtypes, and duplicate rows. It cleans the dataset by dropping columns with more than 50% missingness, filling remaining numeric nulls with column medians, and converting numeric-looking strings to floats. It standardizes column names to snake_case and produces a before/after summary comparing shape, dtypes, and null counts.

Who this example is for

This is for analysts and data scientists who need a repeatable pandas-based cleaning routine for a real-world CSV. It helps anyone preparing data for downstream analysis by making cleaning steps explicit and verifiable with before/after checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Data quality report covering missing values, dtype problems, and duplicate rows
  • Columns with >50% missing values removed
  • Numeric missing values filled with column medians
  • Snake_case column names and corrected numeric dtypes
  • Before/after comparison of dataset shape and null counts
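
The expected outcomes above can be condensed into a single pandas routine. This is a minimal sketch, not the workflow's own code: `clean_dataframe`, `to_snake_case`, and the `max_missing` threshold parameter are illustrative names.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    # Replace non-alphanumeric runs with "_", collapse repeats, lowercase.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", str(name))
    return re.sub(r"_+", "_", name).strip("_").lower()

def clean_dataframe(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    # Drop columns whose missing share exceeds the threshold.
    out = df.loc[:, df.isna().mean() <= max_missing].copy()
    # Standardize column names to snake_case.
    out.columns = [to_snake_case(c) for c in out.columns]
    # Convert numeric-looking string columns to float, leaving real text alone.
    for col in out.select_dtypes(include=["object", "string"]).columns:
        s = out[col].astype("string").str.strip().str.replace(",", "", regex=False)
        converted = pd.to_numeric(s, errors="coerce")
        if converted.notna().sum() == out[col].notna().sum() and out[col].notna().sum() > 0:
            out[col] = converted.astype(float)
    # Fill remaining numeric nulls with column medians.
    num = out.select_dtypes(include="number").columns
    out[num] = out[num].fillna(out[num].median())
    return out
```

The order matters: string-to-float conversion runs before the median fill, so numeric values that arrive as text are included in the imputation.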

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • python
  • jupyter

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
  2. fill missing numeric values with column medians and drop columns with >50% missing
  3. rename columns to snake_case and convert any string numbers to float
  4. show a before/after summary of the cleaned dataset

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

  Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
  gpt-5.4           8/10   2/2   2/2        2/3     1/2        1/1          Open gpt-5.4 conversation
  glm-5.1           7/10   1/2   2/2        2/3     1/2        1/1          Open glm-5.1 conversation
  gpt-oss:120b      6/10   2/2   1/2        2/3     1/2        0/1          Open gpt-oss:120b conversation
  gemma4:31b        5/10   1/2   1/2        2/3     1/2        0/1          Open gemma4:31b conversation
  qwen3.5:397b      5/10   1/2   1/2        2/3     1/2        0/1          Open qwen3.5:397b conversation
  qwen3-coder-next  3/10   1/2   0/2        1/3     1/2        0/1          Open qwen3-coder-next conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these five dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
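
The summation is straightforward; as a sketch, here is the 8/10 total for one run recomputed from hypothetical per-dimension scores (the dict keys are illustrative names, not part of the scoring system):

```python
# Hypothetical per-dimension scores for one run; the total is their plain sum.
scores = {
    "task_completion": 2,        # 0-2
    "execution_correctness": 2,  # 0-2
    "output_quality": 2,         # 0-3
    "reasoning_quality": 1,      # 0-2
    "reliability": 1,            # 0-1
}
total = sum(scores.values())
print(f"{total}/10")
```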

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Data Cleaning with Pandas in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
Thinking

Inspecting dataset structure

I need to show one table per step by default, but the rule says to load and inspect the dataset first. I'll show the shape, columns, and the head of the DataFrame. It seems like multiple outputs are okay here. I can use display to show the columns and the head instead of print. My final expression might be df.head() and the shape printed separately. I’ll focus on crafting a concise output for clarity!

Let’s start by loading the CSV and checking the basic structure first. After that, we can inspect missing values, dtypes, and duplicates in the next step.

Load and inspect dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column_name": df.columns}))
df.head()
rows columns
0 143 9
column_name
0 country
1 region
2 happiness_score
3 gdp_per_capita
4 social_support
5 healthy_life_expectancy
6 freedom_to_make_life_choices
7 generosity
8 perceptions_of_corruption
country region happiness_score gdp_per_capita social_support healthy_life_expectancy freedom_to_make_life_choices generosity perceptions_of_corruption
0 Finland Western Europe 7.741 1.844 1.572 0.695 0.859 0.142 0.546
1 Denmark Western Europe 7.583 1.908 1.520 0.699 0.823 0.204 0.548
2 Iceland Western Europe 7.525 1.881 1.617 0.718 0.819 0.258 0.182
3 Sweden Western Europe 7.344 1.878 1.501 0.724 0.838 0.221 0.524
4 Israel Middle East and North Africa 7.341 1.803 1.513 0.740 0.641 0.153 0.193

From the current output, we can already confirm the dataset has 143 rows and 9 columns, and the visible sample suggests a tidy tabular structure with numeric indicator columns plus two text columns (country, region). But that is not enough yet to identify the requested data quality issues, because we still need explicit checks for:

  • missing values
  • actual dtypes
  • duplicate rows

Display missing values and duplicates
from IPython.display import display

display(df.isna().sum().rename("missing_values").to_frame())
display(df.dtypes.rename("dtype").to_frame())
pd.DataFrame({
    "duplicate_rows": [df.duplicated().sum()],
    "duplicate_countries": [df.duplicated(subset=["country"]).sum()]
})
missing_values
country 0
region 6
happiness_score 0
gdp_per_capita 3
social_support 3
healthy_life_expectancy 3
freedom_to_make_life_choices 3
generosity 3
perceptions_of_corruption 3
dtype
country str
region str
happiness_score float64
gdp_per_capita float64
social_support float64
healthy_life_expectancy float64
freedom_to_make_life_choices float64
generosity float64
perceptions_of_corruption float64
duplicate_rows duplicate_countries
0 0 0

Based on the observed outputs, the main data quality findings are:

  • Missing values are present.

    • region has 6 missing values.
    • Each of these numeric columns has 3 missing values:
      • gdp_per_capita
      • social_support
      • healthy_life_expectancy
      • freedom_to_make_life_choices
      • generosity
      • perceptions_of_corruption
    • country and happiness_score have no missing values.
  • Dtypes look appropriate.

    • country and region are text/string columns.
    • All score/indicator columns are float64.
    • From the current evidence, there are no obvious wrong dtypes.
  • No duplicates were detected.

    • 0 fully duplicated rows.
    • 0 duplicated values in country, so each country appears once in this file.

So the dataset is mostly clean, with the only clear data quality issue being missing values, especially in region and several numeric indicators. There is no evidence of wrong dtypes or duplicates from the current checks.
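
The three checks above (missing values, dtypes, duplicates) can also be condensed into one helper for reuse; `quality_report` is an illustrative name, not part of the workflow:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    # One row per column: dtype, null count, and null share.
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_share": df.isna().mean().round(3),
    })
```

Duplicate rows are a frame-level property rather than a per-column one, so they are best reported separately via `df.duplicated().sum()`.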

fill missing numeric values with column medians and drop columns with >50% missing

No columns exceed 50% missing in the current data, so this step will just keep all columns and fill the numeric gaps with each column’s median.

Clean missing values
missing_ratio = df.isna().mean()

cols_to_drop = missing_ratio[missing_ratio > 0.5].index.tolist()
df_clean = df.drop(columns=cols_to_drop).copy()

num_cols = df_clean.select_dtypes(include="number").columns
df_clean[num_cols] = df_clean[num_cols].fillna(df_clean[num_cols].median())

df_clean.isna().sum().rename("missing_after_cleaning").to_frame()
missing_after_cleaning
country 0
region 6
happiness_score 0
gdp_per_capita 0
social_support 0
healthy_life_expectancy 0
freedom_to_make_life_choices 0
generosity 0
perceptions_of_corruption 0
Thinking

Analyzing response path

I think response path A is likely sufficient. I need to analyze the latest outputs against the goal of filling in missing numeric values with medians while dropping columns with more than 50% missing data. Since no columns were dropped, that’s positive. All the numeric missing values have been filled, but there are still six missing in the region, which is okay because those are non-numeric. My goal is only related to numeric data. I should include <!--final answer--> and keep it concise without any code.

The latest output shows the requested cleaning was applied correctly:

  • No columns were dropped because none had more than 50% missing values.
    From the earlier missingness check, the highest missing share was region with 6 missing out of 143 rows, which is far below the 50% threshold.

  • All numeric missing values were filled successfully.
    In df_clean, every numeric column now has 0 missing values:

    • happiness_score
    • gdp_per_capita
    • social_support
    • healthy_life_expectancy
    • freedom_to_make_life_choices
    • generosity
    • perceptions_of_corruption
  • region still has 6 missing values, which is expected from this step because your instruction was to fill only numeric missing values with medians. region is not numeric, so it was left unchanged.

So the cleaned dataset now keeps all columns, removes missingness from numeric fields, and preserves the non-numeric region column with its original missing values.
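
The threshold reasoning above can be reproduced in isolation from the counts the notebook reported:

```python
# Counts reported by the notebook: 143 rows, 6 missing values in region.
rows, region_missing = 143, 6
missing_share = region_missing / rows
# Far below the 0.5 drop threshold, so no columns are removed.
assert missing_share < 0.5
print(round(missing_share, 3))
```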

rename columns to snake_case and convert any string numbers to float
Thinking

Formulating Data Transformation Steps

I'm thinking about how to respond to the user’s request to rename columns to snake_case and convert string numbers to float in their dataframe. It looks like most columns are already in snake_case, except for country and region. I’ll make sure to use one step of code that incorporates those changes while employing regex or pd.to_numeric. It’s essential to be careful not to mutate any non-numeric strings. I’ll brief the user on this process and provide the code block.

Structuring Data Transformations

I’m planning to write code that renames columns in snake_case and converts any numeric-like string columns. Most columns are already formatted, so there won’t be much change. I think I should show the data types afterward in one table rather than multiple displays, which the user prefers. I'll create a new DataFrame named df_tidy, and ensure the final expression is simple. I’ll also implement the snake_case transformation robustly using a helper function to handle the formatting correctly.

Normalizing Object Columns

I need to convert object columns by normalizing the strings—specifically, I can strip commas (like "1,234") using str.replace(",", "") and str.strip(). However, I’ll have to be careful with country and region; using to_numeric can result in NaN values. I should check if all non-null values can convert, ensuring consistency. I’ll store the converted column list as a comment, but ultimately, the output will just show data types. I’ll operate on df_clean since it exists in the notebook.

This will standardize the column names and then safely convert any fully numeric-looking text columns to float without changing real text fields like country or region.

Standardize columns and types
import re

def to_snake_case(name):
    name = re.sub(r"[^0-9a-zA-Z]+", "_", str(name))
    name = re.sub(r"_+", "_", name).strip("_")
    return name.lower()

df_tidy = df_clean.copy()
df_tidy.columns = [to_snake_case(col) for col in df_tidy.columns]

obj_cols = df_tidy.select_dtypes(include=["object", "string"]).columns

for col in obj_cols:
    s = df_tidy[col].astype("string").str.strip().str.replace(",", "", regex=False)
    converted = pd.to_numeric(s, errors="coerce")
    if s.notna().sum() > 0 and converted.notna().sum() == s.notna().sum():
        df_tidy[col] = converted.astype(float)

df_tidy.dtypes.rename("dtype").to_frame()
dtype
country str
region str
happiness_score float64
gdp_per_capita float64
social_support float64
healthy_life_expectancy float64
freedom_to_make_life_choices float64
generosity float64
perceptions_of_corruption float64

The latest output supports the requested transformation.

  • The column names shown in the dtype table are already in snake_case, such as happiness_score, gdp_per_capita, healthy_life_expectancy, and perceptions_of_corruption.
  • The text columns country and region remain as string type, which is appropriate because they are categorical text, not numeric text.
  • All measurement columns are float64, so there were no string-number columns left to convert among the object/string fields. The conversion logic correctly avoided forcing non-numeric text columns into floats.

So based on the observed notebook state, the renaming worked, and there were no numeric-looking string columns that needed conversion beyond what was already numeric.
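
The guard described above (convert a text column only when every non-null value parses as a number) can be illustrated on toy data; the column names here are invented for the example:

```python
import pandas as pd

toy = pd.DataFrame({
    "numeric_text": ["1,234", " 5.6 ", None],   # every non-null value parses
    "real_text": ["Finland", "Denmark", "12"],  # mixed, so it must stay text
})

for col in toy.columns:
    s = toy[col].astype("string").str.strip().str.replace(",", "", regex=False)
    converted = pd.to_numeric(s, errors="coerce")
    # Convert only when every non-null value survived parsing.
    if s.notna().sum() > 0 and converted.notna().sum() == s.notna().sum():
        toy[col] = converted.astype(float)

print(toy.dtypes)  # numeric_text becomes float64; real_text is left as text
```

Without the equality check, `errors="coerce"` would silently turn "Finland" and "Denmark" into NaN, destroying real text data.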

show a before/after summary of the cleaned dataset
Thinking

Creating summary tables

I need to prepare a before/after summary of the cleaned dataset for the user. I should compare the original dataframe and the cleaned one, likely focusing on things like rows, columns, missing values, and duplicates. A compact summary table will work well. I can create a DataFrame to compare the original and cleaned datasets, noting any changes in column names and whether snake_case is used. I'll keep it concise, with just one code block for clarity.

Summarizing data types and columns

I need to use df and df_tidy from the notebook, ensuring that my string dtype count includes both object and string types. Pandas dtypes display as str, but I want to confirm this by using select_dtypes(include=["object", "string"]). There’s a chance it won’t capture the Python str correctly, so maybe the safest route is to use pd.api.types.is_string_dtype.

I should also summarize columns with renamed counts and check for any that changed names, though since they're in snake_case already, it’ll likely be zero. Let's construct a clear summary for the user.

A compact before/after summary table is the clearest next step here.

Summarize datasets before after
def dataset_summary(frame, label):
    return {
        "dataset": label,
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "total_missing": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "string_columns": sum(pd.api.types.is_string_dtype(frame[c]) for c in frame.columns),
        "numeric_columns": sum(pd.api.types.is_numeric_dtype(frame[c]) for c in frame.columns),
        "snake_case_columns": sum(
            c == re.sub(r"_+", "_", re.sub(r"[^0-9a-zA-Z]+", "_", str(c))).strip("_").lower()
            for c in frame.columns
        ),
    }

before_after = pd.DataFrame([
    dataset_summary(df, "before_cleaning"),
    dataset_summary(df_tidy, "after_cleaning"),
])

before_after
dataset rows columns total_missing duplicate_rows string_columns numeric_columns snake_case_columns
0 before_cleaning 143 9 24 0 2 7 9
1 after_cleaning 143 9 6 0 2 7 9

Here’s the before/after summary based on the notebook output:

  • The dataset stayed the same size: 143 rows and 9 columns before and after cleaning.
  • Missing values dropped from 24 to 6.
  • Duplicate rows stayed at 0.
  • Column type mix stayed the same: 2 string columns and 7 numeric columns.
  • All 9 columns are in snake_case both before and after, so the renaming step did not change names further in this file.

What changed in practice:

  • Numeric missing values were filled, which removed all missingness from the 7 numeric columns.
  • The remaining 6 missing values are in region, a string column that was not imputed.
  • No object/string columns contained fully numeric text, so no additional string-to-float conversions were needed.

A compact interpretation:

  • Before cleaning: complete schema, but some missing values in numeric fields and region
  • After cleaning: numeric data is complete, schema unchanged, only region still has missing values
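
The drop from 24 to 6 total missing values is consistent with the earlier per-column counts; a quick check using those reported numbers:

```python
# Per-column counts from the first quality check.
numeric_missing = 6 * 3   # six numeric indicator columns, 3 nulls each
region_missing = 6        # text column, left unimputed
before = numeric_missing + region_missing
after = before - numeric_missing  # medians fill only the numeric nulls
print(before, after)
```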

Score cards for gpt-5.4 for Data Cleaning with Pandas in Python

Total score: 8/10

  • Task Completion: 2/2

    Excellent

    The workflow covers the requested checks, cleaning, renaming, dtype conversion logic, and a before/after summary. It also reports missing values, dtypes, duplicates, and shape changes.

  • Execution Correctness: 2/2

    Excellent

    The code is coherent and likely runnable: it loads the CSV, computes missingness, fills numeric NaNs with medians, renames columns, and builds a summary table without obvious syntax issues.

  • Output Quality: 2/3

    Good

    The outputs include the required quality report and before/after comparison, but the final cleaned dataset still has 6 missing values in `region`, so the null counts were not reduced to 0 overall. The semantic cleaning goals are mostly met, but not fully.

  • Reasoning Quality: 1/2

    Needs work

    The reasoning is mostly correct about missing values, duplicates, and dtypes, but it overstates completeness by implying the dataset is fully cleaned while non-numeric missing values remain. The explanation is helpful but somewhat shallow.

  • Reliability: 1/1

    Excellent

    The workflow is reasonably consistent and avoids forcing text columns into numeric types, but it relies on assumptions about what counts as wrong dtypes and does not fully eliminate missingness. It is usable, though not fully robust to the stated outcome.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.