Data Analysis
Data Cleaning with Pandas in Python
Clean a messy real-world dataset: handle missing values, fix dtypes, remove duplicates, and standardize columns using pandas and an AI data analyst.
What
This AI Data Analyst workflow loads the World Happiness Report 2024 CSV and profiles data quality issues such as missing values, incorrect dtypes, and duplicate rows. It cleans the dataset by dropping columns with more than 50% missingness, filling remaining numeric nulls with column medians, and converting numeric-looking strings to floats. It standardizes column names to snake_case and produces a before/after summary comparing shape, dtypes, and null counts.
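The steps above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the `raw` DataFrame is an invented stand-in for the World Happiness Report CSV, and the names `profile`, `to_snake_case`, `clean`, and `max_missing` are hypothetical. The order of operations (converting strings before median filling) is also an assumption, chosen so that numeric columns stored as text are imputed too.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in data; column names and values are invented for illustration.
raw = pd.DataFrame(
    {
        "Country name": ["A", "B", "C", "C", "D"],
        "Ladder score": ["7.1", "6.4", "5.0", "5.0", None],  # numbers stored as strings
        "Social support": [1.2, np.nan, 0.9, 0.9, 1.0],
        "Mostly Empty": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
        "Region": ["North", None, "South", "South", "East"],
    }
)


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count, missing percentage, and dtype."""
    return pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(1),
            "dtype": df.dtypes.astype(str),
        }
    )


def to_snake_case(name: str) -> str:
    """Turn a label such as 'Ladder score' into 'ladder_score'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.strip("_").lower()


def clean(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    out = df.copy()

    # Drop columns whose share of missing values exceeds the threshold.
    missing_share = out.isna().mean()
    out = out.drop(columns=missing_share[missing_share > max_missing].index)

    # Remove exact duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)

    # Convert a non-numeric column only if every non-null value parses as a number,
    # so genuine text columns are left untouched.
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue
        converted = pd.to_numeric(out[col], errors="coerce")
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted.astype(float)

    # Fill remaining numeric nulls with each column's median.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Standardize column names to snake_case.
    out.columns = [to_snake_case(c) for c in out.columns]
    return out


print(profile(raw))
print("duplicate rows:", int(raw.duplicated().sum()))

cleaned = clean(raw)
print(cleaned)
```

In this sketch the text column `region` keeps its null, because median imputation applies only to numeric columns.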
Who
This is for analysts and data scientists who need a repeatable pandas-based cleaning routine for a real-world CSV. It helps anyone preparing data for downstream analysis by making cleaning steps explicit and verifiable with before/after checks.
Tools
- pandas
- numpy
- python
- jupyter
Outcomes
- Data quality report covering missing values, dtype problems, and duplicate rows
- Columns with >50% missing values removed
- Numeric missing values filled with column medians
- Snake_case column names and corrected numeric dtypes
- Before/after comparison of dataset shape and null counts
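A before/after comparison like the one in the last outcome could be built along these lines. The `summarize` and `compare` helpers and the two toy frames are hypothetical, and the choice of metrics is an assumption rather than the workflow's exact report.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect a few headline data-quality numbers for one DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "total_nulls": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "numeric_columns": df.select_dtypes(include="number").shape[1],
    }


def compare(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Place the two summaries side by side, one metric per row."""
    return pd.DataFrame({"before": summarize(before), "after": summarize(after)})


# Toy frames invented for illustration: only the numeric null is filled.
before = pd.DataFrame({"score": [1.0, np.nan, 3.0], "region": ["x", None, "y"]})
after = before.assign(score=before["score"].fillna(before["score"].median()))

report = compare(before, after)
print(report)
```

Keeping the summary as a DataFrame makes it easy to spot rows where the two columns differ, such as a null count that did not reach zero.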
Quality Score
9/10
Last scored: Apr 7, 2026
Task Completion: 2/2
Excellent. Covers all requested steps: loads data, reports missing values, dtypes, and duplicates, drops columns with >50% missing values (none qualified), imputes numeric medians, standardizes column names to snake_case, attempts numeric conversion, and provides a before/after summary.
Execution Correctness: 2/2
Excellent. Code is coherent and likely runnable end-to-end: it reads the CSV, computes summaries, performs median imputation on numeric columns, renames columns, and conditionally converts object columns using pd.to_numeric without breaking non-numeric text.
Output Quality: 2/3
Good. Outputs include a quality table (missing counts, missing percentages, and dtypes), the duplicate count, remaining missing values after imputation, dtypes after renaming and conversion, and a before/after summary. However, the expected outcome 'null counts reduced to 0' is not met (region still has 6 missing values), and no explicit dtype issues are demonstrated beyond confirming that none exist.
Reasoning Quality: 2/2
Excellent. Reasoning aligns with the evidence: it correctly notes that no columns exceed 50% missingness, that numeric NaNs are filled with medians, and that the remaining missingness is in the non-numeric 'region' column. It explains the dtype assessment and conversion approach appropriately.
Reliability: 1/1
Excellent. Workflow is reasonably robust (median imputation is limited to numeric columns, and object columns are converted only conditionally). Minor fragility: the 'nulls to 0' expectation is unmet because categorical missing values are not handled.
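One way the remaining categorical nulls might be handled is to fill non-numeric columns with an explicit placeholder. This is a sketch under assumptions, not part of the scored workflow: the toy data and the "Unknown" label are invented, and a placeholder is only one option (dropping the rows or looking up the true region are others).

```python
import pandas as pd

# Toy data invented for illustration; the real dataset has 6 missing region values.
df = pd.DataFrame(
    {
        "country_name": ["A", "B", "C"],
        "region": ["North", None, "South"],
        "ladder_score": [7.1, 6.4, 5.0],
    }
)

# Select columns that are not numeric, so numeric imputation stays separate.
non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

filled = df.copy()
# "Unknown" is an assumed placeholder label, not a value from the source data.
filled[non_numeric] = filled[non_numeric].fillna("Unknown")

print(filled)
print("remaining nulls:", int(filled.isna().sum().sum()))
```

An explicit placeholder keeps every row and makes the imputed values easy to filter out later, at the cost of adding an artificial category.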