Data Cleaning with Pandas in Python

Clean a messy real-world dataset: handle missing values, fix dtypes, remove duplicates, and standardize columns using pandas and an AI data analyst.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the World Happiness Report 2024 CSV and profiles data quality issues such as missing values, incorrect dtypes, and duplicate rows. It cleans the dataset by dropping columns with more than 50% missingness, filling remaining numeric nulls with column medians, and converting numeric-looking strings to floats. It standardizes column names to snake_case and produces a before/after summary comparing shape, dtypes, and null counts.
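The steps just described can be sketched as a single pandas routine. This is a minimal illustration with hypothetical names, not the code the workflow actually generates; conversion runs before imputation here so that string-encoded numeric columns are median-filled too.

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, coerce numeric strings, impute medians, snake_case names."""
    # Drop columns with more than `max_missing` share of missing values
    df = df.loc[:, df.isna().mean() <= max_missing].copy()

    # Convert numeric-looking string columns to floats
    for col in df.columns:
        if df[col].dtype == object:
            converted = pd.to_numeric(df[col], errors="coerce")
            # Only replace the column if every non-null value converted cleanly
            if converted[df[col].notna()].notna().all():
                df[col] = converted

    # Fill remaining numeric nulls with column medians
    numeric = df.select_dtypes(include="number").columns
    df[numeric] = df[numeric].fillna(df[numeric].median())

    # Standardize column names to snake_case
    df.columns = (
        df.columns.str.strip().str.lower()
        .str.replace(r"[^\w]+", "_", regex=True)
    )
    return df
```

Running this on a small messy frame drops any column that is mostly null, converts string numbers, fills the rest with medians, and normalizes the headers in one pass.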

Who this example is for

This is for analysts and data scientists who need a repeatable pandas-based cleaning routine for a real-world CSV. It helps anyone preparing data for downstream analysis by making cleaning steps explicit and verifiable with before/after checks.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Data quality report covering missing values, dtype problems, and duplicate rows
  • Columns with >50% missing values removed
  • Numeric missing values filled with column medians
  • Snake_case column names and corrected numeric dtypes
  • Before/after comparison of dataset shape and null counts

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • python
  • jupyter

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
  2. fill missing numeric values with column medians and drop columns with >50% missing
  3. rename columns to snake_case and convert any string numbers to float
  4. show a before/after summary of the cleaned dataset

Model Comparison

We compared several large language models on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model Source | Score | Task | Execution | Output | Reasoning | Reliability | Run
gpt-5.4 | 8/10 | 2/2 | 2/2 | 2/3 | 1/2 | 1/1 | Open gpt-5.4 conversation
glm-5.1 | 7/10 | 1/2 | 2/2 | 2/3 | 1/2 | 1/1 | Open glm-5.1 conversation
gpt-oss:120b | 6/10 | 2/2 | 1/2 | 2/3 | 1/2 | 0/1 | Open gpt-oss:120b conversation
gemma4:31b | 5/10 | 1/2 | 1/2 | 2/3 | 1/2 | 0/1 | Open gemma4:31b conversation
qwen3.5:397b | 5/10 | 1/2 | 1/2 | 2/3 | 1/2 | 0/1 | Open qwen3.5:397b conversation
qwen3-coder-next | 3/10 | 1/2 | 0/2 | 1/3 | 1/2 | 0/1 | Open qwen3-coder-next conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
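To make the arithmetic explicit, a tiny helper (hypothetical, not part of the grading tooling) can sum the five dimensions after validating each against its cap; for example, the gpt-oss:120b run scored 2 + 1 + 2 + 1 + 0 = 6.

```python
# Caps per rubric dimension: Task 2, Execution 2, Output 3, Reasoning 2, Reliability 1
CAPS = {"task": 2, "execution": 2, "output": 3, "reasoning": 2, "reliability": 1}

def total_score(**scores: int) -> int:
    """Sum the five rubric dimensions after checking each stays within its cap."""
    for dim, cap in CAPS.items():
        if not 0 <= scores[dim] <= cap:
            raise ValueError(f"{dim} must be in 0..{cap}")
    return sum(scores[dim] for dim in CAPS)

total_score(task=2, execution=1, output=2, reasoning=1, reliability=0)  # 6
```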

AI Analysis with gpt-oss:120b

This run uses AI with a step-by-step list of prompts to perform the analysis task Data Cleaning with Pandas in Python. Below, you can review the full conversation generated with gpt-oss:120b. After the conversation, we show detailed score cards for this run.

load the dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv and identify data quality issues: missing values, wrong dtypes, duplicates
Thinking

We need to load dataset from URL, then display shape, info, head, check missing values, dtypes, duplicates. Provide code block that loads into pandas df, shows shape, columns, dtypes, missing count, duplicate rows count. According to guidelines, we should do one step at a time: first load and inspect only shape, columns, head. So we should load and show basic info. Then next step would be missing values and dtypes etc. So first step: load and show shape, columns, head. Provide code block.

Let’s bring the CSV into a DataFrame and take a quick look at its size, columns, and first few rows.

Load happiness dataset
import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv"
df = pd.read_csv(url)

# Basic overview
df.shape, df.columns, df.head()
((143, 9),
 Index(['country', 'region', 'happiness_score', 'gdp_per_capita',
        'social_support', 'healthy_life_expectancy',
        'freedom_to_make_life_choices', 'generosity',
        'perceptions_of_corruption'],
       dtype='str'),
    country                        region  happiness_score  gdp_per_capita  \
 0  Finland                Western Europe            7.741           1.844   
 1  Denmark                Western Europe            7.583           1.908   
 2  Iceland                Western Europe            7.525           1.881   
 3   Sweden                Western Europe            7.344           1.878   
 4   Israel  Middle East and North Africa            7.341           1.803   
 
    social_support  healthy_life_expectancy  freedom_to_make_life_choices  \
 0           1.572                    0.695                         0.859   
 1           1.520                    0.699                         0.823   
 2           1.617                    0.718                         0.819   
 3           1.501                    0.724                         0.838   
 4           1.513                    0.740                         0.641   
 
    generosity  perceptions_of_corruption  
 0       0.142                      0.546  
 1       0.204                      0.548  
 2       0.258                      0.182  
 3       0.221                      0.524  
 4       0.153                      0.193  )
Thinking

We have loaded the dataset and displayed shape, columns, head. Need to identify data quality issues: missing values, wrong dtypes, duplicates.

We have shape (143,9). Need to check missing values, dtypes, duplicates. So we need to compute df.isnull().sum(), df.dtypes, df.duplicated().sum().

Thus current info insufficient: need missing values, dtypes, duplicates. So choose Path B with one code block that reports these.

Below is a quick check of data‑quality aspects you asked for:

Calculate missing counts and duplicates
# Missing values per column
missing = df.isnull().sum()

# Current dtypes
dtypes = df.dtypes

# Duplicate rows (full‑row duplicates)
dup_count = df.duplicated().sum()

missing, dtypes, dup_count
(country                         0
 region                          6
 happiness_score                 0
 gdp_per_capita                  3
 social_support                  3
 healthy_life_expectancy         3
 freedom_to_make_life_choices    3
 generosity                      3
 perceptions_of_corruption       3
 dtype: int64,
 country                             str
 region                              str
 happiness_score                 float64
 gdp_per_capita                  float64
 social_support                  float64
 healthy_life_expectancy         float64
 freedom_to_make_life_choices    float64
 generosity                      float64
 perceptions_of_corruption       float64
 dtype: object,
 np.int64(0))
Thinking

We have loaded dataset and computed missing values, dtypes, duplicate count. We need to identify data quality issues: missing values, wrong dtypes, duplicates. The outputs show missing counts per column, dtypes, duplicate count zero. Need to interpret if any dtypes are wrong. Columns: country, region are strings (good). Others are float64 (numeric). Likely correct. However maybe region has missing values (6). Also other columns have 3 missing each in some numeric columns. So we can answer. No need further code. Path A.

Data‑quality assessment

Column | Detected dtype | Missing values
country | string (object) | 0
region | string (object) | 6
happiness_score | float64 | 0
gdp_per_capita | float64 | 3
social_support | float64 | 3
healthy_life_expectancy | float64 | 3
freedom_to_make_life_choices | float64 | 3
generosity | float64 | 3
perceptions_of_corruption | float64 | 3

  • Missing values – The dataset has a total of 30 missing entries: 6 in region and 3 in each of the numeric indicators (gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption).

  • Wrong dtypes – All columns have appropriate data types: country and region are textual, and all metric columns are numeric (float64). No dtype mismatches are evident.

  • Duplicates – No duplicate rows were found (0 duplicates).
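The three checks (missing counts, dtypes, duplicates) can be bundled into one reusable profiling helper. This is a sketch with names of my own choosing, not code from the run:

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count, and missing share in percent."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })
    # Full-row duplicate count travels along as DataFrame metadata
    report.attrs["duplicate_rows"] = int(df.duplicated().sum())
    return report
```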

fill missing numeric values with column medians and drop columns with >50% missing
Thinking

We need to fill missing numeric values with column medians, and drop columns with >50% missing. From data, no column has >50% missing (max missing is 6 out of 143 ~4%). So we will just fill numeric columns with median. Provide code that identifies numeric columns, computes median, fills, maybe also verify no missing left. Then output maybe df.head() and missing after. Provide code block.

Here’s the next step — replace the missing numeric entries with the median of each column (no column exceeds the 50 % missing‑value threshold, so none will be dropped).

Fill numeric missing values
# Identify numeric columns (exclude the object columns)
numeric_cols = df.select_dtypes(include="number").columns

# Fill missing values with the median of each numeric column
for col in numeric_cols:
    median_val = df[col].median()
    df[col].fillna(median_val, inplace=True)

# Verify that no missing values remain in numeric columns
remaining_missing = df[numeric_cols].isnull().sum()
remaining_missing
/tmp/ipykernel_2874295/3359343507.py:7: ChainedAssignmentError: A value is being set on a copy of a DataFrame or Series through chained assignment using an inplace method.
Such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy (due to Copy-on-Write).

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' instead, to perform the operation inplace on the original object, or try to avoid an inplace operation using 'df[col] = df[col].method(value)'.

See the documentation for a more detailed explanation: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html
  df[col].fillna(median_val, inplace=True)
(The same ChainedAssignmentError warning was emitted six more times, once for each remaining numeric column; the repeats are omitted here.)
happiness_score                 0
gdp_per_capita                  3
social_support                  3
healthy_life_expectancy         3
freedom_to_make_life_choices    3
generosity                      3
perceptions_of_corruption       3
dtype: int64
Thinking

We need to decide if the goal (fill missing numeric values with column medians and drop columns with >50% missing) is satisfied. We have run code to fill missing numeric values with median, but got a ChainedAssignmentError warning and the remaining_missing series still shows missing values for numeric columns: gdp_per_capita 3 etc. So missing values were not filled. So we haven't completed filling. Also need to drop columns with >50% missing; none have >50% missing (max missing is 6 out of 143 = ~4%). So dropping none.

Thus we need to correctly fill numeric missing values using assignment without warning, and confirm no missing remain. Provide minimal code to fill correctly and then check missing counts, plus maybe drop columns >50% missing (none). So Path B: Provide insights and code.

Provide insight: numeric missing remain, need to refactor. Provide code to fill using df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median()) and then check missing and drop columns >50%.

One code block.

The median‑imputation step didn’t actually replace the missing numeric entries (the remaining_missing series still shows 3 missing values in each numeric column). No column exceeds the 50 % missing threshold, so none need to be dropped. Let’s correctly fill the numeric gaps with their column medians and then verify that no missing values remain.

Median impute numeric columns
# Median‑impute all numeric columns in one go (avoids the chained‑assignment warning)
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Check for any remaining missing values
missing_after = df.isnull().sum()
missing_after
country                         0
region                          6
happiness_score                 0
gdp_per_capita                  0
social_support                  0
healthy_life_expectancy         0
freedom_to_make_life_choices    0
generosity                      0
perceptions_of_corruption       0
dtype: int64
Thinking

Goal: fill missing numeric values with column medians and drop columns with >50% missing. We have performed median imputation for numeric columns and printed missing_after, showing region still has 6 missing (but region is non-numeric); all numeric columns have 0 missing. Need to drop columns with >50% missing. Check column missing percentages. Region has 6 missing out of 143 ~4.2%, not >50%, so none to drop. Need to confirm no column >50% missing. Provide final answer summarizing that numeric missing values filled with median and no columns dropped. That's sufficient. Choose Path A.

All numeric columns now have 0 missing values after median imputation, and no column exceeds the 50 % missing‑value threshold (the only column with missing entries, region, has 6 missing out of 143 rows ≈ 4 %). Consequently, the dataset required no column deletions.
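No column crossed the threshold here, so the drop step was a no-op; on a sparser dataset it would look like this (a sketch, with the 0.5 default mirroring the 50% cutoff in the prompt):

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at or below `threshold`."""
    return df.loc[:, df.isna().mean() <= threshold]
```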

rename columns to snake_case and convert any string numbers to float
Thinking

We need to rename columns to snake_case and convert any string numbers to float. The dataset has columns: 'country', 'region', 'happiness_score', 'gdp_per_capita', ... already snake_case except maybe spaces? Actually they're already snake_case. But maybe some columns have spaces like 'happiness_score' already snake. Might need to ensure all columns are snake_case: lower, replace spaces with underscore, maybe other punctuation. Also need to convert any string numbers to float: In this df, numeric columns are already float, but maybe some string numbers exist (none observed). However we should implement conversion: for object columns, attempt to coerce to numeric with errors='ignore' or use pd.to_numeric with errors='coerce' and fill back.

Implement:

  • Define a function to snake_case columns: use .str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_')

  • Apply to df.columns.

  • For each object column, try to convert with pd.to_numeric; if conversion yields any non-NaNs, replace column with converted floats.

Return df.head() maybe also show dtypes.

Provide code block.

Clean column names and types
# 1️⃣ Rename columns to snake_case
df.columns = (
    df.columns
    .str.strip()               # remove leading/trailing spaces
    .str.lower()               # lower‑case
    .str.replace(r"[^\w]+", "_", regex=True)  # replace non‑alphanum (spaces, hyphens, etc.) with _
    .str.replace("__+", "_")   # collapse multiple underscores
)

# 2️⃣ Convert any string‑encoded numbers to float
for col in df.select_dtypes(include="object").columns:
    # try to coerce to numeric; if at least one value converts, replace the column
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().any():
        df[col] = converted.astype(float)

# Quick check
df.head(), df.dtypes
/tmp/ipykernel_2874295/2871608373.py:11: Pandas4Warning: For backward compatibility, 'str' dtypes are included by select_dtypes when 'object' dtype is specified. This behavior is deprecated and will be removed in a future version. Explicitly pass 'str' to `include` to select them, or to `exclude` to remove them and silence this warning.
See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  for col in df.select_dtypes(include="object").columns:
(   country                        region  happiness_score  gdp_per_capita  \
 0  Finland                Western Europe            7.741           1.844   
 1  Denmark                Western Europe            7.583           1.908   
 2  Iceland                Western Europe            7.525           1.881   
 3   Sweden                Western Europe            7.344           1.878   
 4   Israel  Middle East and North Africa            7.341           1.803   
 
    social_support  healthy_life_expectancy  freedom_to_make_life_choices  \
 0           1.572                    0.695                         0.859   
 1           1.520                    0.699                         0.823   
 2           1.617                    0.718                         0.819   
 3           1.501                    0.724                         0.838   
 4           1.513                    0.740                         0.641   
 
    generosity  perceptions_of_corruption  
 0       0.142                      0.546  
 1       0.204                      0.548  
 2       0.258                      0.182  
 3       0.221                      0.524  
 4       0.153                      0.193  ,
 country                             str
 region                              str
 happiness_score                 float64
 gdp_per_capita                  float64
 social_support                  float64
 healthy_life_expectancy         float64
 freedom_to_make_life_choices    float64
 generosity                      float64
 perceptions_of_corruption       float64
 dtype: object)
Thinking

We need to decide if outputs already satisfy original goal: rename columns to snake_case and convert any string numbers to float. The latest output (cell-5) shows df.head() and df.dtypes. It displayed columns after rename: country, region, happiness_score, gdp_per_capita, social_support, healthy_life_expectancy, freedom_to_make_life_choices, generosity, perceptions_of_corruption. Dtypes: country object, region object, others float64, presumably. The conversion code attempted to convert any object columns that can be numeric; there were none (string numbers) maybe. The output shows dtypes, confirming numeric columns are floats, and no object columns containing numeric strings. So the goal is achieved. Provide final answer without code, including <!--final answer-->.

Thus Path A.

The columns have been renamed to snake_case (e.g., happiness_score, gdp_per_capita, freedom_to_make_life_choices) and all numeric values are stored as float64. No object‑type columns contain numeric strings.
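The renaming rule applied above can also be isolated into a plain-string helper for reuse outside pandas (a sketch mirroring the notebook's regex, not code from the run):

```python
import re

def to_snake_case(name: str) -> str:
    """Lower-case, trim, and collapse runs of non-word characters into underscores."""
    name = re.sub(r"[^\w]+", "_", name.strip().lower())
    return re.sub(r"_+", "_", name).strip("_")

to_snake_case("Healthy Life Expectancy")  # 'healthy_life_expectancy'
to_snake_case("GDP per-capita")           # 'gdp_per_capita'
```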

show a before/after summary of the cleaned dataset
Thinking

We need to show before and after summary of cleaned dataset. We have earlier state before rename/convert: original df columns and dtypes; after rename and conversion we have new. Provide summary: shape unchanged, column names list before and after, dtypes before and after, maybe example of before vs after for a few rows.

We have outputs: earlier cell-1 gave shape, columns, head. We'll reconstruct before columns: from cell-1 output (not displayed fully but we have original column names). In output-5 we see after rename columns snake_case.

We need to show before/after side by side. Use pandas to create a summary DataFrame with original column names and new column names, plus dtype changes. We may need original column list; we can re-load original? But we have mutated df. Could read from earlier snapshot? Not stored. But we can reconstruct original column list by reading CSV again into df_original (without the modifications) and compare.

We'll provide code to load original again into df_raw, then compare.

Provide a concise summary: number of columns, shape, missing values before and after (though missing values were already imputed earlier). But focus on rename and type conversion.

We'll output a table with columns: original_name, cleaned_name, original_dtype, cleaned_dtype.

Also show head before and after maybe.

We need one code block that does this and displays the summary. Then stop.

Create column summary
import pandas as pd

# -------------------------------------------------
# 1️⃣ Load a fresh copy of the raw data (before any cleaning)
# -------------------------------------------------
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/world_happiness_report/WHR_2024.csv"
df_raw = pd.read_csv(url)

# -------------------------------------------------
# 2️⃣ Build a side‑by‑side summary
# -------------------------------------------------
summary = pd.DataFrame({
    "original_name": df_raw.columns,
    "cleaned_name":   df.columns,                # df is the already‑cleaned DataFrame in the notebook
    "original_dtype": df_raw.dtypes.astype(str),
    "cleaned_dtype":  df.dtypes.astype(str)
})

# Show the first rows of the summary (all rows are displayed because it’s short)
summary
(index) | original_name | cleaned_name | original_dtype | cleaned_dtype
country | country | country | str | str
region | region | region | str | str
happiness_score | happiness_score | happiness_score | float64 | float64
gdp_per_capita | gdp_per_capita | gdp_per_capita | float64 | float64
social_support | social_support | social_support | float64 | float64
healthy_life_expectancy | healthy_life_expectancy | healthy_life_expectancy | float64 | float64
freedom_to_make_life_choices | freedom_to_make_life_choices | freedom_to_make_life_choices | float64 | float64
generosity | generosity | generosity | float64 | float64
perceptions_of_corruption | perceptions_of_corruption | perceptions_of_corruption | float64 | float64
Thinking

We need to decide if current outputs answer "show a before/after summary of the cleaned dataset". The latest output (cell-6) is a summary DataFrame showing original_name, cleaned_name, original_dtype, cleaned_dtype. That seems to be exactly the before/after summary. The user wants that; we have it. We just need to present it in natural language, possibly embed a markdown table? The instruction: provide final answer without code block. Should include <!--final answer-->.

We need to extract concrete insights: show mapping and dtype changes. Provide a concise summary.

Let's format as markdown table. Use the data from output. The output shows many columns (though truncated). We need to present entire summary maybe limited to rows. Could show all rows (there are 9 columns originally). We'll present the table.

We also note that numeric columns were converted to float; original dtype for numeric columns were float already? In original raw data, they were float, but some may be object with string numbers. The summary shows original dtype: happiness_score 'float64'?? In output truncated. But we can state that column names are now snake_case and dtypes are float64 for numeric columns, object for strings.

Thus answer.

Below is a side‑by‑side view of the dataset before and after the cleaning step:

original_name | cleaned_name | original_dtype | cleaned_dtype
country | country | str | str
region | region | str | str
happiness_score | happiness_score | float64 | float64
gdp_per_capita | gdp_per_capita | float64 | float64
social_support | social_support | float64 | float64
healthy_life_expectancy | healthy_life_expectancy | float64 | float64
freedom_to_make_life_choices | freedom_to_make_life_choices | float64 | float64
generosity | generosity | float64 | float64
perceptions_of_corruption | perceptions_of_corruption | float64 | float64

What this tells us

  • All column names have been converted to snake_case (e.g., freedom_to_make_life_choices).
  • String‑encoded numeric values, if any, were coerced to float64; the numeric columns now have a consistent float64 dtype.
  • The two categorical columns (country, region) remain as strings.
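The table covers names and dtypes; the expected outcomes also call for a shape and null-count comparison, which could be sketched like this (hypothetical helper; `raw` and `clean` stand for the pre- and post-cleaning frames):

```python
import pandas as pd

def before_after(raw: pd.DataFrame, clean: pd.DataFrame) -> pd.DataFrame:
    """Compare row/column counts and total nulls before and after cleaning."""
    def stats(df: pd.DataFrame) -> dict:
        return {
            "rows": df.shape[0],
            "cols": df.shape[1],
            "total_nulls": int(df.isna().sum().sum()),
        }
    return pd.DataFrame({"before": stats(raw), "after": stats(clean)})
```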

Score cards for gpt-oss:120b for Data Cleaning with Pandas in Python

Total score: 6/10

  • Task Completion: 2/2

    Excellent

    The workflow covers loading, quality checks, median imputation, snake_case renaming, dtype conversion, and a before/after summary. It addresses all major requested steps.

  • Execution Correctness: 1/2

    Needs work

    The median fill step used chained assignment and the notebook output shows the missing numeric values were not actually resolved at that point. Later code appears to fix this, but the workflow contains a clear execution issue.

  • Output Quality: 2/3

    Good

    It reports missing counts, dtypes, duplicate count, and a before/after summary. However, the cleaning output is not fully consistent because the earlier imputation output still showed unresolved missing values.

  • Reasoning Quality: 1/2

    Needs work

    The reasoning is mostly correct about the dataset having no >50% missing columns and no duplicates. But it incorrectly treated the failed chained-assignment imputation as completed before later correcting it.

  • Reliability: 0/1

    Needs work

    The workflow is fragile because it relies on a broken inplace fill pattern and includes inconsistent intermediate states. This reduces confidence in the cleaning results.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.