How to Analyze a CSV File in Python

A step-by-step AI data analyst session: load a CSV, inspect structure, handle missing values, and generate a full exploratory summary.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Superstore Sales CSV from a URL and inspects its structure with shape, column dtypes, and a preview of the first rows. It checks for missing values and reports counts by column, then computes summary statistics for numeric fields. It generates distribution plots for Sales, Profit, and Shipping Cost to support exploratory analysis.

Who this example is for

This is for analysts and students who need a repeatable way to profile a new CSV dataset in Python. It helps anyone who wants an AI-assisted notebook that produces both tabular summaries and basic distribution visualizations.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Dataset shape, dtypes, and first 5 rows
Missing value counts by column
Summary statistics for numeric columns via describe()
Histograms with KDE for Sales, Profit, and Shipping Cost

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
numpy
matplotlib
seaborn
requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
2check for missing values and show counts by column
3show summary statistics for all numeric columns
4plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
glm-5.1	10/10	2/2	2/2	3/3	2/2	1/1	Open glm-5.1 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
gpt-5.4	9/10	2/2	2/2	3/3	1/2	1/1	Open gpt-5.4 conversation
qwen3-coder-next	9/10	2/2	2/2	3/3	1/2	1/1	Open qwen3-coder-next conversation
qwen3.5:397b	4/10	1/2	1/2	1/3	1/2	0/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task How to Analyze a CSV File in Python. Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows

Thinking

The user wants me to load a CSV file from a URL and show basic information about it: shape, data types, and first 5 rows. I'll do this in one code block since it's all data loading and inspection (no plotting or modeling).

Let me write code to:

Load the CSV using pandas
Display shape
Display dtypes
Display first 5 rows

I'll use display() for the tables and just have the shape as a simple output.

I'll load the CSV file and inspect its basic structure.

Load and inspect sales data

import pandas as pd

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv', encoding='latin-1')

# Show shape
df.shape

# Show data types
df.dtypes

# Show first 5 rows
df.head()

	Row ID	Order ID	Order Date	Ship Date	Ship Mode	Customer ID	Customer Name	Segment	City	State	...	Product ID	Category	Sub-Category	Product Name	Sales	Quantity	Discount	Profit	Shipping Cost	Order Priority
0	42433	AG-2011-2040	1/1/2011	6/1/2011	Standard Class	TB-11280	Toby Braunhardt	Consumer	Constantine	Constantine	...	OFF-TEN-10000025	Office Supplies	Storage	Tenex Lockers, Blue	408.300	2	0.0	106.140	35.46	Medium
1	22253	IN-2011-47883	1/1/2011	8/1/2011	Standard Class	JH-15985	Joseph Holt	Consumer	Wagga Wagga	New South Wales	...	OFF-SU-10000618	Office Supplies	Supplies	Acme Trimmer, High Speed	120.366	3	0.1	36.036	9.72	Medium
2	48883	HU-2011-1220	1/1/2011	5/1/2011	Second Class	AT-735	Annie Thurman	Consumer	Budapest	Budapest	...	OFF-TEN-10001585	Office Supplies	Storage	Tenex Box, Single Width	66.120	4	0.0	29.640	8.17	High
3	11731	IT-2011-3647632	1/1/2011	5/1/2011	Second Class	EM-14140	Eugene Moren	Home Office	Stockholm	Stockholm	...	OFF-PA-10001492	Office Supplies	Paper	Enermax Note Cards, Premium	44.865	3	0.5	-26.055	4.82	High
4	22255	IN-2011-47883	1/1/2011	8/1/2011	Standard Class	JH-15985	Joseph Holt	Consumer	Wagga Wagga	New South Wales	...	FUR-FU-10003447	Furniture	Furnishings	Eldon Light Bulb, Duo Pack	113.670	5	0.1	37.770	4.70	Medium

5 rows × 24 columns

Thinking

The user's original goal was to load the CSV file and show shape, dtypes, and first 5 rows. Looking at the notebook outputs:

Shape: The dataframe shows 51290 rows x 24 columns
Dtypes: The output was truncated but we can see the columns exist
First 5 rows: We can see the head() output with sample data

However, looking at the actual output shown, I notice that the dtypes weren't explicitly displayed in the output - only the shape and head() are visible. The user specifically asked for shape, dtypes, AND first 5 rows. The dtypes output seems to be missing from what's shown in the notebook output.

Let me check the output again - the [output-1] shows the head() result but I don't see the dtypes explicitly printed. The code had df.dtypes but it seems like only the last expression (df.head()) was displayed, since in a notebook only the last bare expression shows unless you use print() or display().

So I should provide the dtypes explicitly since that was part of the original goal and isn't clearly visible in the output.

Actually, looking more carefully at the output structure - in Jupyter notebooks, when you have multiple bare expressions, only the last one is displayed. So df.shape and df.dtypes wouldn't show unless we print them or use display().

I should provide a code block that properly displays all three: shape, dtypes, and head().

The dataset loaded successfully with 51,290 rows and 24 columns. The head() shows order transaction data with customer, product, and sales information.

However, the dtypes weren't displayed (only the last expression shows in notebooks). Let me properly show all three requested pieces of information:

Load and inspect sales data

import pandas as pd

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv', encoding='latin-1')

# Show shape
print(f"Shape: {df.shape}")

# Show data types
print("\nData Types:")
df.dtypes

# Show first 5 rows
df.head()

Shape: (51290, 24)

Data Types:

#10

	Row ID	Order ID	Order Date	Ship Date	Ship Mode	Customer ID	Customer Name	Segment	City	State	...	Product ID	Category	Sub-Category	Product Name	Sales	Quantity	Discount	Profit	Shipping Cost	Order Priority
0	42433	AG-2011-2040	1/1/2011	6/1/2011	Standard Class	TB-11280	Toby Braunhardt	Consumer	Constantine	Constantine	...	OFF-TEN-10000025	Office Supplies	Storage	Tenex Lockers, Blue	408.300	2	0.0	106.140	35.46	Medium
1	22253	IN-2011-47883	1/1/2011	8/1/2011	Standard Class	JH-15985	Joseph Holt	Consumer	Wagga Wagga	New South Wales	...	OFF-SU-10000618	Office Supplies	Supplies	Acme Trimmer, High Speed	120.366	3	0.1	36.036	9.72	Medium
2	48883	HU-2011-1220	1/1/2011	5/1/2011	Second Class	AT-735	Annie Thurman	Consumer	Budapest	Budapest	...	OFF-TEN-10001585	Office Supplies	Storage	Tenex Box, Single Width	66.120	4	0.0	29.640	8.17	High
3	11731	IT-2011-3647632	1/1/2011	5/1/2011	Second Class	EM-14140	Eugene Moren	Home Office	Stockholm	Stockholm	...	OFF-PA-10001492	Office Supplies	Paper	Enermax Note Cards, Premium	44.865	3	0.5	-26.055	4.82	High
4	22255	IN-2011-47883	1/1/2011	8/1/2011	Standard Class	JH-15985	Joseph Holt	Consumer	Wagga Wagga	New South Wales	...	FUR-FU-10003447	Furniture	Furnishings	Eldon Light Bulb, Duo Pack	113.670	5	0.1	37.770	4.70	Medium

5 rows × 24 columns

#11

Thinking

Looking at the latest outputs, I can see:

Shape: (51290, 24) - This is clearly shown in output-2
First 5 rows: Displayed in both outputs via df.head()
Data types (dtypes): NOT properly displayed - the output shows df.head() again instead of the actual dtypes

The issue is that in cell-2, after print("\nData Types:"), the df.dtypes expression is there but the output shows the head() table instead. This is because in a notebook, only the last bare expression is displayed, and df.head() was the last line.

So the original user goal asked for:

Shape ✓ (shown: 51290 rows, 24 columns)
Dtypes ✗ (not properly shown in output)
First 5 rows ✓ (shown)

I need to choose Path B because the dtypes weren't actually displayed. I should provide code that properly shows all three pieces of information, with dtypes being the key missing piece.

#12

The dataset has 51,290 rows and 24 columns. The first 5 rows show order transaction data with customer, product, and sales information.

However, the data types weren't properly displayed in the output (the head() table appeared instead). Let me show the dtypes correctly:

#13

Show data types

# Show data types properly
df.dtypes

#14

Row ID              int64
Order ID              str
Order Date            str
Ship Date             str
Ship Mode             str
Customer ID           str
Customer Name         str
Segment               str
City                  str
State                 str
Country               str
Postal Code       float64
Market                str
Region                str
Product ID            str
Category              str
Sub-Category          str
Product Name          str
Sales             float64
Quantity            int64
Discount          float64
Profit            float64
Shipping Cost     float64
Order Priority        str
dtype: object

#15

Thinking

The user's original goal was to load the CSV file and show shape, dtypes, and first 5 rows. Looking at the notebook outputs:

Shape: Output from cell-2 shows (51290, 24) - 51,290 rows and 24 columns ✓
Dtypes: Output from cell-3 shows all 24 column data types clearly ✓
First 5 rows: Output from cell-1 and cell-2 show the head() with first 5 rows (though truncated in display) ✓

All three requested pieces of information are now available in the notebook outputs. The goal is complete. I should provide a natural final answer summarizing what was found and include the `` tag without any Python code block.

Let me summarize the key findings:

Shape: 51,290 rows × 24 columns
Data types: Mix of int64 (Row ID, Quantity), float64 (Postal Code, Sales, Discount, Profit, Shipping Cost), and str (most other columns)
First 5 rows show order transaction data from January 1, 2011 with customer, product, and sales information

This is sufficient to answer the original goal, so I'll choose Path A.

#16

The dataset has been successfully loaded and inspected. Here's what we found:

Shape: 51,290 rows × 24 columns

Data Types:

Numeric: Row ID and Quantity (int64); Postal Code, Sales, Discount, Profit, Shipping Cost (float64)
Text/Categorical: All remaining 17 columns are strings (Order ID, dates, customer info, product info, shipping details, etc.)

First 5 Rows Preview: The data contains order transactions starting from 1/1/2011 with fields like Customer Name (e.g., Toby Braunhardt, Joseph Holt), Segment (Consumer, Home Office), locations across different countries (Constantine, Wagga Wagga, Budapest, Stockholm), and product/sales information.

The dataset represents superstore sales transactions with customer, product, geographic, and financial details spanning 2011-2015.

#17

check for missing values and show counts by column

#18

Thinking

The user wants to check for missing values and show counts by column. This is a straightforward data quality check. I'll use pandas to check for null/missing values in each column and display the counts.

#19

Let me check for missing values across all columns:

#20

Count missing values

# Check missing values per column
missing_counts = df.isnull().sum()

# Show columns with missing values
missing_counts[missing_counts > 0]

#21

Postal Code    41296
dtype: int64

#22

Thinking

The user's goal is to "check for missing values and show counts by column". Looking at the latest output (cell-4), we can see that the missing values check has been completed and shows:

Postal Code    41296
dtype: int64

This tells us:

Only one column has missing values: Postal Code
It has 41,296 missing values out of 51,290 total rows

However, the original goal asks to "show counts by column" - this could mean showing the missing value counts for ALL columns (not just those with missing values). The current output only shows columns with missing values > 0.

To fully satisfy the goal of showing counts by column, I should show the complete missing value summary for all 24 columns. This would give a clearer picture of the data quality across all columns.

Let me choose Path B to show the complete missing values table for all columns, which better fulfills "show counts by column".

#23

The missing values check reveals that only Postal Code has missing data, with 41,296 null values out of 51,290 rows (about 80.5% missing). All other 23 columns have complete data.

To fully satisfy the goal of showing counts by column, let me display the complete missing value summary for all columns:

#24

Notebook code

# Show missing value counts for ALL columns
missing_summary = pd.DataFrame({
    'Missing_Count': df.isnull().sum(),
    'Missing_Percent': (df.isnull().sum() / len(df) * 100).round(2)
})

missing_summary

#25

	Missing_Count	Missing_Percent
Row ID	0	0.00
Order ID	0	0.00
Order Date	0	0.00
Ship Date	0	0.00
Ship Mode	0	0.00
Customer ID	0	0.00
Customer Name	0	0.00
Segment	0	0.00
City	0	0.00
State	0	0.00
Country	0	0.00
Postal Code	41296	80.51
Market	0	0.00
Region	0	0.00
Product ID	0	0.00
Category	0	0.00
Sub-Category	0	0.00
Product Name	0	0.00
Sales	0	0.00
Quantity	0	0.00
Discount	0	0.00
Profit	0	0.00
Shipping Cost	0	0.00
Order Priority	0	0.00

#26

The LLM provider timed out after 1 retries. Please try again.

Score cards for qwen3.5:397b for How to Analyze a CSV File in Python

Total score: 4/10

Task Completion: 1/2
Needs work
The workflow completed the initial inspection and missing-value check, but it did not perform the numeric summary statistics or the required 3 histogram/KDE plots.
Execution Correctness: 1/2
Needs work
The shown code is mostly runnable, but one cell relies on bare `df.dtypes` after a print statement, which would not display as intended in a notebook. The later missing-value summary code is correct.
Output Quality: 1/3
Needs work
Shape, head, and missing-value counts are shown, but the dtypes output is not clearly presented in the early attempt and the required describe() and distribution plots are absent.
Reasoning Quality: 1/2
Needs work
The assistant correctly noticed the notebook display issue for bare expressions and identified the missing-value reporting gap. However, it repeatedly overexplained and never advanced to the remaining required analysis steps.
Reliability: 0/1
Needs work
The workflow is incomplete relative to the requested analysis and includes inconsistent self-corrections without finishing the full task.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Download Studio Learn about AI Data Analyst

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

How to Analyze a CSV File in Python

What this AI workflow does

Who this example is for

Expected analysis outcomes

Tools and libraries used

Prompt sequence

Model Comparison

AI Analysis with qwen3.5:397b

Score cards for qwen3.5:397b for How to Analyze a CSV File in Python

Try MLJAR Studio

Explore More AI Analysis Examples

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics