Data Analysis

Data Cleaning with Pandas in Python

Clean a messy real-world dataset: handle missing values, fix dtypes, remove duplicates, and standardize columns using pandas and an AI data analyst.

What

This AI Data Analyst workflow loads the World Happiness Report 2024 CSV and profiles data quality issues such as missing values, incorrect dtypes, and duplicate rows. It cleans the dataset by dropping columns with more than 50% missingness, filling remaining numeric nulls with column medians, and converting numeric-looking strings to floats. It standardizes column names to snake_case and produces a before/after summary comparing shape, dtypes, and null counts.

Who

This is for analysts and data scientists who need a repeatable pandas-based cleaning routine for a real-world CSV. It helps anyone preparing data for downstream analysis by making cleaning steps explicit and verifiable with before/after checks.

Tools

  • pandas
  • numpy
  • python
  • jupyter

Outcomes

  • Data quality report covering missing values, dtype problems, and duplicate rows
  • Columns with >50% missing values removed
  • Numeric missing values filled with column medians
  • Snake_case column names and corrected numeric dtypes
  • Before/after comparison of dataset shape and null counts
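Taken together, the outcomes above amount to a short pandas routine. A minimal sketch (the `clean` helper and the demo frame are illustrative, not part of the workflow itself):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Sketch of the workflow's cleaning steps, applied to a copy of the input.
    out = df.copy()
    # 1. Drop columns with more than 50% missing values.
    out = out.loc[:, out.isna().mean() <= 0.5]
    # 2. Fill remaining numeric nulls with column medians.
    num = out.select_dtypes(include="number").columns
    out[num] = out[num].fillna(out[num].median())
    # 3. Standardize column names to snake_case (simple variant).
    out.columns = (
        out.columns.str.strip()
        .str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)
        .str.strip("_")
        .str.lower()
    )
    return out

demo = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Happiness Score": [7.0, None, 6.0],
    "Mostly Empty": [None, None, 1.0],  # 67% missing: gets dropped
})
cleaned = clean(demo)
```

On the demo frame, `Mostly Empty` is dropped (67% missing), the one missing score is filled with the median 6.5, and the headers become `country` and `happiness_score`.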

Quality Score

9/10

gpt-5.2 · mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    Covers all requested steps: loads data, reports missing/dtypes/duplicates, drops >50% missing columns (none), imputes numeric medians, standardizes column names to snake_case, attempts numeric conversion, and provides before/after summary.

  • Execution Correctness: 2/2

    Excellent

    Code is coherent and likely runnable end-to-end: reads CSV, computes summaries, performs median imputation on numeric columns, renames columns, and conditionally converts object columns using pd.to_numeric without breaking non-numeric text.

  • Output Quality: 2/3

    Good

Outputs include a quality table (missing counts/pct and dtypes), duplicate count, remaining missing after imputation, dtypes after renaming/conversion, and a before/after summary. However, the expected outcome of null counts reduced to 0 is not fully met (region still has 6 missing values), and no dtype fixes are demonstrated beyond confirming there were none to make.

  • Reasoning Quality: 2/2

    Excellent

    Reasoning aligns with the evidence: correctly notes no >50% missing columns, numeric NaNs filled by medians, and remaining missingness is in non-numeric 'region'. Explains dtype assessment and conversion approach appropriately.

  • Reliability: 1/1

    Excellent

    Workflow is reasonably robust (median imputation limited to numeric columns; safe conditional conversion of object columns). Minor fragility: the 'nulls to 0' expectation is unmet because categorical missing values are not handled.

load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
Thinking

Loading and inspecting dataset

I need to take this step by step since the user wants to load a dataset and identify data quality issues like missing values, incorrect data types, and duplicates. First off, I have to load and inspect the dataset because it’s not yet in the notebook context. I’ll focus on showing the shape, columns, and first few rows in one code block.

I’ll ensure the code is minimal, displaying the shape and contents cleanly.

Loading and displaying dataset overview

I’m considering how to inspect the dataset effectively. I could create a one-row DataFrame to show the shape and column count. Then, I could display the list of columns followed by the head of the DataFrame. I think it’s better to end with df.head() as my last expression because it shows the data.

I’ll use pd.DataFrame to display the shape and columns, making sure it’s all straightforward and satisfies the inspection needs.

Let’s start by loading the file and doing a quick inspection of its structure before checking quality issues.

Load and inspect dataset
import pandas as pd

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))
df.head()
   rows  columns
0   143        9
                         column
0                       country
1                        region
2               happiness_score
3                gdp_per_capita
4                social_support
5       healthy_life_expectancy
6  freedom_to_make_life_choices
7                    generosity
8     perceptions_of_corruption
country region happiness_score gdp_per_capita social_support healthy_life_expectancy freedom_to_make_life_choices generosity perceptions_of_corruption
0 Finland Western Europe 7.741 1.844 1.572 0.695 0.859 0.142 0.546
1 Denmark Western Europe 7.583 1.908 1.520 0.699 0.823 0.204 0.548
2 Iceland Western Europe 7.525 1.881 1.617 0.718 0.819 0.258 0.182
3 Sweden Western Europe 7.344 1.878 1.501 0.724 0.838 0.221 0.524
4 Israel Middle East and North Africa 7.341 1.803 1.513 0.740 0.641 0.153 0.193

From the current output, we can already confirm the dataset structure: it has 143 rows and 9 columns, with country and region appearing to be text fields and the remaining columns looking numeric from the sample. But that is not enough yet to identify the actual data quality issues you asked for, because we still need explicit checks for missing values, dtypes, and duplicate rows.

Summarize data quality
quality = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_values": df.isna().sum(),
    "missing_pct": (df.isna().mean() * 100).round(2),
})

display(quality)
display(pd.DataFrame({
    "duplicate_rows": [df.duplicated().sum()],
    "duplicate_countries": [df["country"].duplicated().sum()]
}))
                                dtype  missing_values  missing_pct
country                           str               0          0.0
region                            str               6          4.2
happiness_score               float64               0          0.0
gdp_per_capita                float64               3          2.1
social_support                float64               3          2.1
healthy_life_expectancy       float64               3          2.1
freedom_to_make_life_choices  float64               3          2.1
generosity                    float64               3          2.1
perceptions_of_corruption     float64               3          2.1
   duplicate_rows  duplicate_countries
0               0                    0
Thinking

Identifying data issues

I need to provide a final answer that includes insights, clarifying that the dtype for country and region appears as "str." The dtype usually shows as object unless it's related to pandas’ string dtype. There are missing values in region, and while dtypes look appropriate, some columns show numeric values as float64, and text as str. The user asked about issues like missing values, wrong dtypes, and duplicates. Overall, it seems there are missing values in region and six numeric columns.

From the observed checks:

  • Missing values are present

    • region: 6 missing values, which is 4.2% of the dataset.
    • Each of these numeric columns has 3 missing values (2.1% each):
      • gdp_per_capita
      • social_support
      • healthy_life_expectancy
      • freedom_to_make_life_choices
      • generosity
      • perceptions_of_corruption
    • country and happiness_score have no missing values.
  • Wrong dtypes do not appear to be an issue

    • country and region are text columns.
    • The score/indicator columns are float64, which is appropriate for continuous numeric data.
    • Based on this inspection, there are no obvious dtype problems.
  • Duplicates are not present

    • 0 exact duplicate rows
    • 0 duplicate country names

So the data quality issues found are mainly missing values, while dtypes look correct and duplicates were not found.
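For reuse across datasets, the quality checks above can also be bundled into a small helper function (a sketch; `profile` is a made-up name, and the demo frame is illustrative):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    # One row per column: dtype, missing count, missing percentage.
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_values": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
    })

demo = pd.DataFrame({
    "region": ["A", None, "B", None],   # 2 of 4 values missing
    "score": [1.0, 2.0, 3.0, 4.0],      # complete numeric column
})
report = profile(demo)
```

Wrapping the checks this way lets the same profiling step run unchanged on any DataFrame before and after cleaning.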

fill missing numeric values with column medians and drop columns with >50% missing
Thinking

Calculating missing data ratios

I need to calculate the missing data ratio and drop columns with more than 50% missing values. Since there are no columns exceeding that threshold, none will be dropped. I plan to copy the DataFrame and then identify numeric columns to fill in with medians. It could be useful to display the missing values after that. I think one summary table per step will keep things clear, showing before and after counts for each column. Let's make sure to generate a summary DataFrame reflecting this.

Creating a summary DataFrame

I might consider showing dropped columns separately, but that could complicate things with another table, so I'll avoid that. Instead, I can create one summary DataFrame that includes columns for before_missing, after_missing, and the action taken. That sounds more streamlined! I want to ensure I have only one final expression to display, along with a clean code structure. Keeping things minimal in the code is key here. Let's get started on that!

No columns are above the 50% missing threshold in the current data, so this step will only fill missing values in numeric columns using each column’s median and keep all columns.

Clean missing values
missing_pct = df.isna().mean() * 100
drop_cols = missing_pct[missing_pct > 50].index.tolist()

df_clean = df.drop(columns=drop_cols).copy()

numeric_cols = df_clean.select_dtypes(include="number").columns
df_clean[numeric_cols] = df_clean[numeric_cols].fillna(df_clean[numeric_cols].median())

pd.DataFrame({
    "before_missing": df.isna().sum(),
    "after_missing": df_clean.reindex(columns=df.columns).isna().sum(),
    "dropped_50pct_plus": [col in drop_cols for col in df.columns]
})
                              before_missing  after_missing  dropped_50pct_plus
country                                    0              0               False
region                                     6              6               False
happiness_score                            0              0               False
gdp_per_capita                             3              0               False
social_support                             3              0               False
healthy_life_expectancy                    3              0               False
freedom_to_make_life_choices               3              0               False
generosity                                 3              0               False
perceptions_of_corruption                  3              0               False
Thinking

Finalizing response format

I need to ensure my final answer is clear without resorting to code. I'll include insights from the latest outputs instead of just restating them. It's important to mention any numeric data that is missing and how it’s been imputed, while noting that no columns were dropped since none had over 50% missing values. I’ll aim for a brief paragraph or bullet points without mentioning that the goal is complete. Let’s make this concise!

Based on the latest output:

  • No columns were dropped because none exceeded the 50% missing threshold. In fact, the highest missing rate shown earlier was only 4.2% for region.
  • All numeric missing values were filled successfully with medians:
    • gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, and perceptions_of_corruption each went from 3 missing values to 0.
    • happiness_score already had no missing values.
  • region still has 6 missing values, which is expected here because your instruction was only to fill numeric missing values, and region is non-numeric.

So after this cleaning step, the dataset keeps all original columns, and the only remaining missing data is in the non-numeric region column.
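If the remaining region gap also needs to be closed, a common follow-up (not part of this workflow; the placeholder value is an assumption) is to fill text columns explicitly:

```python
import pandas as pd

# Illustrative frame mirroring the situation above: numeric columns are
# already imputed, but a text column still has gaps.
df_demo = pd.DataFrame({
    "country": ["Finland", "Denmark", "Iceland"],
    "region": ["Western Europe", None, "Western Europe"],
    "happiness_score": [7.741, 7.583, 7.525],
})

# Fill missing values in text columns with an explicit placeholder.
text_cols = df_demo.select_dtypes(include=["object", "string"]).columns
df_demo[text_cols] = df_demo[text_cols].fillna("Unknown")
```

Alternatives include filling with the column mode or dropping the affected rows; which is appropriate depends on how the region field is used downstream.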

rename columns to snake_case and convert any string numbers to float
Thinking

Transforming data columns

I need to respond by checking the notebook state for just one step. The user asked to rename columns to snake_case and convert string numbers to floats. I likely have df_clean to work with, since the original goal was data cleaning. It seems I should detect which object/string columns can convert numeric-like strings and apply a safe conversion with pd.to_numeric(errors='coerce'). I’ll compare the number of non-null values for accuracy before and after the conversion.

Renaming and transforming columns

It seems like I need to rename columns to snake_case, but the current ones are already in that format—like country, region, and others. I’ll implement a general rename function for consistency, but since no string-number columns are detected, the dataset likely won’t change much. I should create a new df_clean or mutate in place, using df_clean.copy(). I’ll include a helper function and ensure everything's in one code block, displaying the data types afterward. Minimizing code is key here!

These columns already look like snake_case, but this will standardize them anyway and safely convert any fully numeric string columns to floats.

Clean column names and types
import re

df_clean = df_clean.copy()

def to_snake(name):
    name = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    name = re.sub(r"_+", "_", name).strip("_").lower()
    return name

df_clean.columns = [to_snake(col) for col in df_clean.columns]

converted_cols = []
for col in df_clean.select_dtypes(include=["object", "string"]).columns:
    s = df_clean[col].astype("string").str.strip()
    numeric = pd.to_numeric(s, errors="coerce")
    if s.notna().sum() > 0 and numeric.notna().sum() == s.notna().sum():
        df_clean[col] = numeric.astype(float)
        converted_cols.append(col)

pd.DataFrame({
    "column": df_clean.columns,
    "dtype": df_clean.dtypes.astype(str).values
})
                         column    dtype
0                       country      str
1                        region      str
2               happiness_score  float64
3                gdp_per_capita  float64
4                social_support  float64
5       healthy_life_expectancy  float64
6  freedom_to_make_life_choices  float64
7                    generosity  float64
8     perceptions_of_corruption  float64

From the latest output, the column names are already in snake_case after standardization:

country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption.

The dtype check also shows that only country and region remain as strings, while all metric columns are float64. That means there were no string-number columns left to convert beyond what was already numeric. In other words:

  • Snake_case renaming: applied successfully.
  • String numbers to float: nothing additional needed; numeric fields are already floats.
  • Text columns preserved correctly: country and region stayed as strings because they contain categorical text, not numeric values.
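Since the real headers were already clean, the renamer was effectively a no-op here. On messier headers, the same to_snake helper would behave like this (the example names are made up):

```python
import re

def to_snake(name):
    # Same helper as above: collapse non-alphanumerics to "_",
    # split camelCase boundaries, collapse repeats, then lowercase.
    name = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    name = re.sub(r"_+", "_", name).strip("_").lower()
    return name

examples = ["Country Name", "GDP per Capita", "happinessScore", "  Freedom!!  "]
snake = [to_snake(n) for n in examples]
# → ['country_name', 'gdp_per_capita', 'happiness_score', 'freedom']
```
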
show a before/after summary of the cleaned dataset
Thinking

Analyzing summary needs

I need to create a summary table to compare the original dataset with the cleaned version based on the user's goal. The summary should show columns, data types, missing values, and possibly shapes. Displaying one concise table may suffice, especially if I can make a compact comparison of metrics like rows and missing data. Since the user asked for a before/after summary, I should note that columns are already in snake_case and no conversions were necessary. I'll keep it straightforward!

Building summary DataFrame

I need to create a summary DataFrame that captures key metrics like rows, columns, total missing values, and specifics like numeric and string columns. Since the cleaned dataset only has median-imputed numeric missing values and 6 remaining in the region, I should mention the unchanged snake_case format. I can track changes in column names and any string-to-float conversions, referencing converted_cols from a previous cell. Overall, this summary aims to provide a clear before-and-after comparison for the cleaned dataset!

Here’s a compact before/after summary comparing df and df_clean. It also confirms that column names did not actually need changing, and no string-number columns were converted.

Create data quality summary
summary = pd.DataFrame({
    "metric": [
        "rows",
        "columns",
        "total_missing",
        "numeric_missing",
        "string_missing",
        "numeric_columns",
        "string_columns",
        "duplicate_rows",
        "duplicate_countries",
        "renamed_columns_count",
        "string_number_columns_converted",
    ],
    "before": [
        df.shape[0],
        df.shape[1],
        int(df.isna().sum().sum()),
        int(df.select_dtypes(include="number").isna().sum().sum()),
        int(df.select_dtypes(include=["object", "string"]).isna().sum().sum()),
        int(len(df.select_dtypes(include="number").columns)),
        int(len(df.select_dtypes(include=["object", "string"]).columns)),
        int(df.duplicated().sum()),
        int(df["country"].duplicated().sum()),
        int((df.columns != df_clean.columns).sum()),
        0,
    ],
    "after": [
        df_clean.shape[0],
        df_clean.shape[1],
        int(df_clean.isna().sum().sum()),
        int(df_clean.select_dtypes(include="number").isna().sum().sum()),
        int(df_clean.select_dtypes(include=["object", "string"]).isna().sum().sum()),
        int(len(df_clean.select_dtypes(include="number").columns)),
        int(len(df_clean.select_dtypes(include=["object", "string"]).columns)),
        int(df_clean.duplicated().sum()),
        int(df_clean["country"].duplicated().sum()),
        int((df.columns != df_clean.columns).sum()),
        int(len(converted_cols)),
    ],
})

summary
                             metric  before  after
 0                             rows     143    143
 1                          columns       9      9
 2                    total_missing      24      6
 3                  numeric_missing      18      0
 4                   string_missing       6      6
 5                  numeric_columns       7      7
 6                   string_columns       2      2
 7                   duplicate_rows       0      0
 8              duplicate_countries       0      0
 9            renamed_columns_count       0      0
10  string_number_columns_converted       0      0
Thinking

Summarizing analysis process

It looks like we need to choose path A since a summary output should suffice. I’ll provide a natural final answer based on previous outputs and include a clear before/after summary. We have an output-5 summary to work with. It's worth mentioning that column renaming count is 0 and the string number conversions are 0 since everything is already clean. Also, the total missing has decreased from 24 to 6, while the numeric missing is down to 0. The shape remains 143x9 with no duplicates. I’m wondering if I should avoid headings but could still use bullet points instead.

Here’s the before/after summary of the cleaned dataset based on the notebook outputs:

  • The dataset stayed the same size: 143 rows and 9 columns before and after cleaning.
  • Total missing values dropped from 24 to 6.
  • All numeric missing values were resolved: 18 → 0.
  • String missing values stayed at 6 → 6, which means the remaining missing values are all in text fields, specifically region.
  • The dataset structure did not change:
    • 7 numeric columns before and after
    • 2 string columns before and after
  • There were no duplicate rows and no duplicate countries either before or after cleaning.
  • No columns were renamed, because they were already in snake_case.
  • No string-number columns were converted, because the numeric fields were already stored as floats.

In short, the cleaning step mainly improved missing numeric data while preserving the dataset’s shape and column structure. The only remaining issue is the 6 missing values in region.
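For use outside a notebook, the before/after comparison can be wrapped in a small reusable helper (a sketch; `summarize` is a made-up name, and the tiny frames are illustrative):

```python
import numpy as np
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    # Collect the same headline metrics the summary table reports.
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "total_missing": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

before = pd.DataFrame({"score": [1.0, np.nan, 3.0], "region": ["A", "B", "C"]})
after = before.assign(score=before["score"].fillna(before["score"].median()))
report = pd.DataFrame({"before": summarize(before), "after": summarize(after)})
```

Building the report from two calls to the same function guarantees the before and after columns are computed identically, which is what makes the comparison trustworthy.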
