How to Analyze a CSV File in Python

A step-by-step AI data analyst session: load a CSV, inspect structure, handle missing values, and generate a full exploratory summary.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Superstore Sales CSV from a URL and inspects its structure with shape, column dtypes, and a preview of the first rows. It checks for missing values and reports counts by column, then computes summary statistics for numeric fields. It generates distribution plots for Sales, Profit, and Shipping Cost to support exploratory analysis.
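The load-and-inspect step described above can be sketched as follows. To keep the sketch self-contained and runnable offline, an inline CSV snippet (with a few of the real columns) stands in for the remote Superstore file; in the actual workflow, the URL from the prompt sequence is passed to pd.read_csv instead.

```python
import io
import pandas as pd

# Inline sample standing in for the remote Superstore CSV (same idea, fewer rows).
csv_text = """Order ID,Sales,Profit,Shipping Cost
AG-2011-2040,408.30,106.14,35.46
IN-2011-47883,120.37,36.04,9.72
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # per-column types
print(df.head())  # first rows
```

For the real file, replace io.StringIO(csv_text) with the dataset URL (the full run below also passes encoding="latin1").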

Who this example is for

This is for analysts and students who need a repeatable way to profile a new CSV dataset in Python. It helps anyone who wants an AI-assisted notebook that produces both tabular summaries and basic distribution visualizations.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape, dtypes, and first 5 rows
  • Missing value counts by column
  • Summary statistics for numeric columns via describe()
  • Histograms with KDE for Sales, Profit, and Shipping Cost

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
  2. check for missing values and show counts by column
  3. show summary statistics for all numeric columns
  4. plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source     | Score | Task | Execution | Output | Reasoning | Reliability | Run
gemma4:31b       | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gemma4:31b conversation
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
glm-5.1          | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open glm-5.1 conversation
gpt-5.4          | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open gpt-5.4 conversation
qwen3-coder-next | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open qwen3-coder-next conversation
qwen3.5:397b     | 4/10  | 1/2  | 1/2       | 1/3    | 1/2       | 0/1         | Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
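The additive scoring described above can be expressed in a couple of lines; the dictionary below is a hypothetical score card using the dimension maxima from the list.

```python
# Hypothetical score card mirroring the five dimensions above.
scores = {
    "task_completion": 2,  # 0-2
    "execution": 2,        # 0-2
    "output_quality": 3,   # 0-3
    "reasoning": 1,        # 0-2
    "reliability": 1,      # 0-1
}

# Final score is the plain sum of the five dimensions (max 2+2+3+2+1 = 10).
total = sum(scores.values())
print(f"{total}/10")  # → 9/10
```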

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task "How to Analyze a CSV File in Python". Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
Thinking

Refining code structure

I need to keep my approach iterative, loading and inspecting data in one go. I should return just one code block that shows shape, data types, and the head of the DataFrame, likely using pandas. I shouldn’t combine tables, so if multiple tables are needed, I should display them sequentially. I might create a shape_df or use display to show the DataFrame effectively. Keeping it concise with no extra questions is key! Let's get this done!

Here’s the load-and-inspect step.

Load and inspect dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv"

df = pd.read_csv(url, encoding="latin1")

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(df.dtypes.rename("dtype").reset_index().rename(columns={"index": "column"}))
df.head()
rows columns
0 51290 24
column dtype
0 Row ID int64
1 Order ID str
2 Order Date str
3 Ship Date str
4 Ship Mode str
5 Customer ID str
6 Customer Name str
7 Segment str
8 City str
9 State str
10 Country str
11 Postal Code float64
12 Market str
13 Region str
14 Product ID str
15 Category str
16 Sub-Category str
17 Product Name str
18 Sales float64
19 Quantity int64
20 Discount float64
21 Profit float64
22 Shipping Cost float64
23 Order Priority str
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment City State ... Product ID Category Sub-Category Product Name Sales Quantity Discount Profit Shipping Cost Order Priority
0 42433 AG-2011-2040 1/1/2011 6/1/2011 Standard Class TB-11280 Toby Braunhardt Consumer Constantine Constantine ... OFF-TEN-10000025 Office Supplies Storage Tenex Lockers, Blue 408.300 2 0.0 106.140 35.46 Medium
1 22253 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... OFF-SU-10000618 Office Supplies Supplies Acme Trimmer, High Speed 120.366 3 0.1 36.036 9.72 Medium
2 48883 HU-2011-1220 1/1/2011 5/1/2011 Second Class AT-735 Annie Thurman Consumer Budapest Budapest ... OFF-TEN-10001585 Office Supplies Storage Tenex Box, Single Width 66.120 4 0.0 29.640 8.17 High
3 11731 IT-2011-3647632 1/1/2011 5/1/2011 Second Class EM-14140 Eugene Moren Home Office Stockholm Stockholm ... OFF-PA-10001492 Office Supplies Paper Enermax Note Cards, Premium 44.865 3 0.5 -26.055 4.82 High
4 22255 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... FUR-FU-10003447 Furniture Furnishings Eldon Light Bulb, Duo Pack 113.670 5 0.1 37.770 4.70 Medium

5 rows × 24 columns

The file contents are visible and match the requested inspection.

  • The dataset has 51,290 rows and 24 columns.
  • The schema is mixed:
    • numeric columns include Row ID, Postal Code, Sales, Quantity, Discount, Profit, and Shipping Cost
    • the rest are text-like fields such as order, customer, location, and product information
  • Order Date and Ship Date are currently loaded as strings, not datetime types yet.
  • The first 5 rows show this is a transactional sales table with identifiers, customer/location fields, product details, and sales metrics.

From the observed dtypes:

  • int64: Row ID, Quantity
  • float64: Postal Code, Sales, Discount, Profit, Shipping Cost
  • string/object-like: all remaining columns, including Order ID, Customer ID, Category, Sub-Category, and Order Priority

The first 5 rows confirm columns such as: Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name, Segment, City, State, Country, Market, Region, Product ID, Category, Sub-Category, Product Name, Sales, Quantity, Discount, Profit, Shipping Cost, and Order Priority.
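The note about Order Date and Ship Date loading as strings suggests an obvious follow-up (an illustrative sketch, not part of the recorded run): parse them as datetimes at load time. The dayfirst=True flag is an assumption based on the D/M/Y style visible in the preview rows.

```python
import io
import pandas as pd

# Tiny stand-in for the real file, using the date format seen in the preview.
csv_text = "Order Date,Ship Date\n1/1/2011,6/1/2011\n"

# parse_dates converts the named columns to datetime64 during the read;
# dayfirst=True assumes day/month/year ordering (an assumption here).
df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Order Date", "Ship Date"],
    dayfirst=True,
)

print(df.dtypes)
```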

check for missing values and show counts by column
Thinking

Formulating code for missing values

I’m preparing to answer the user about checking missing values and showing counts by column. I'll need to confirm if the DataFrame exists; the notebook state variables only show limited info. The output suggests the DataFrame was created, so it's probably safe to include code that uses it. I aim to provide a concise introduction and the code in one block, focusing on displaying missing counts for all columns, or just those with more than zero missing values. I'll finalize it with a simple display command.

Creating a concise missing values table

In the notebook, I want to display a focused and minimal result. I’ll start by calculating the missing values using df.isna().sum() and rename the count, then reset the index to get the correct format. I’ll also compute the percentage of missing values by dividing the count by the total number of rows in the DataFrame, rounded to four decimal places. Finally, I’ll sort the table by missing count and present it cleanly, ensuring everything fits into a single table without extra prints.

Here’s the missing-value count by column.

Summarize missing values
missing = df.isna().sum().rename("missing_count").reset_index().rename(columns={"index": "column"})
missing["missing_pct"] = (missing["missing_count"] / len(df)).round(4)
missing.sort_values("missing_count", ascending=False).reset_index(drop=True)
column missing_count missing_pct
0 Postal Code 41296 0.8051
1 Row ID 0 0.0000
2 Region 0 0.0000
3 Shipping Cost 0 0.0000
4 Profit 0 0.0000
5 Discount 0 0.0000
6 Quantity 0 0.0000
7 Sales 0 0.0000
8 Product Name 0 0.0000
9 Sub-Category 0 0.0000
10 Category 0 0.0000
11 Product ID 0 0.0000
12 Market 0 0.0000
13 Order ID 0 0.0000
14 Country 0 0.0000
15 State 0 0.0000
16 City 0 0.0000
17 Segment 0 0.0000
18 Customer Name 0 0.0000
19 Customer ID 0 0.0000
20 Ship Mode 0 0.0000
21 Ship Date 0 0.0000
22 Order Date 0 0.0000
23 Order Priority 0 0.0000
Thinking

Determining the answer path

I'm deciding that we should take answer path A because it outputs sufficient information. It’s important for me to include a final answer comment as well. I'll ensure I provide concise insights based on what I've observed. Also, I need to mention that Postal Code is the only column with missing values: 41,296 missing, or 80.51% of rows. All other columns are at 0. I'll avoid including any code in my final response.

The missing-value check shows that only one column has missing data:

  • Postal Code: 41,296 missing values, which is 80.51% of all rows

All other columns have 0 missing values.

So for this dataset, missing-data handling is only needed for Postal Code; the rest of the columns are complete.
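Building on that finding, here is an illustrative sketch (not part of the recorded run) of two common ways to handle a column that is mostly missing; the toy frame below mimics the ~80% missing Postal Code situation.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a column that is mostly missing.
df = pd.DataFrame({
    "Postal Code": [90008.0, np.nan, np.nan, np.nan, 23223.0],
    "Sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Option 1: drop the mostly-missing column outright.
dropped = df.drop(columns=["Postal Code"])

# Option 2: keep it, but add an explicit missingness flag for analysis.
flagged = df.assign(postal_code_known=df["Postal Code"].notna())

print(dropped.columns.tolist())
print(flagged["postal_code_known"].tolist())
```

Which option fits depends on the question at hand; for geographic analysis you would instead investigate why most rows lack a postal code before imputing anything.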

show summary statistics for all numeric columns
Thinking

Planning summary statistics

I'm focusing on providing summary statistics for all numeric columns as the user requested. Since we have the notebook state loaded, I’ll create one concise code block that uses the describe function on numeric columns. I don't think I need to ask for anything more. I should also mention that the numeric columns might include something like Row ID. I’ll make sure to display the results in a readable table format. Let's keep it minimal and clear!

Here are summary statistics for all numeric columns.

Summarize numeric columns
numeric_summary = df.select_dtypes(include="number").describe().T
numeric_summary
count mean std min 25% 50% 75% max
Row ID 51290.0 25645.500000 14806.291990 1.000 12823.250000 25645.500 38467.7500 51290.000
Postal Code 9994.0 55190.379428 32063.693350 1040.000 23223.000000 56430.500 90008.0000 99301.000
Sales 51290.0 246.490581 487.565361 0.444 30.758625 85.053 251.0532 22638.480
Quantity 51290.0 3.476545 2.278766 1.000 2.000000 3.000 5.0000 14.000
Discount 51290.0 0.142908 0.212280 0.000 0.000000 0.000 0.2000 0.850
Profit 51290.0 28.610982 174.340972 -6599.978 0.000000 9.240 36.8100 8399.976
Shipping Cost 51290.0 26.375915 57.296804 0.000 2.610000 7.790 24.4500 933.570

Here are the key summary statistics from the numeric columns:

  • Row ID ranges from 1 to 51,290 with a mean of 25,645.5, which looks like a sequential identifier rather than a business measure.
  • Postal Code has only 9,994 non-missing values, confirming it is heavily incomplete; among available values, the median is 56,430.5.
  • Sales are strongly right-skewed: median sales are 85.05, while mean sales are 246.49, and the maximum reaches 22,638.48. That gap suggests a small number of very large orders.
  • Quantity is fairly concentrated: median 3, interquartile range from 2 to 5, with a maximum of 14.
  • Discount is usually low: the 25th and 50th percentiles are both 0, and the mean is 0.143, so many orders had no discount.
  • Profit shows substantial spread: median profit is 0, mean profit is 28.61, maximum is 8,399.98 USD, and minimum is -6,599.98 USD. This indicates many low-profit or break-even orders, with some large gains and losses.
  • Shipping Cost is also right-skewed: median 7.79, mean 26.38, and maximum 933.57, so a minority of shipments are much more expensive than typical.

Overall, the numeric summary shows:

  • one identifier-like numeric field (Row ID),
  • one numeric field with many missing values (Postal Code),
  • and several business metrics (Sales, Quantity, Discount, Profit, Shipping Cost) with noticeable skew and outliers, especially for sales, profit, and shipping cost.

plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost
Thinking

Considering histogram options

I'm exploring the best way to represent a summary with strong skewness. I think using histograms with KDE might be effective, but I need to avoid altering the data. A log-scaled x-axis for profit won't work due to negative values. So, separate histograms with many bins could be better, especially for right-skewed distributions like Sales and Shipping Cost. I’ll create a single figure with three histograms, including titles and axis labels, while keeping it concise in notebook style.

These three are all skewed, so a side-by-side histogram view is a good first look.

Plot feature distributions
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["Sales", "Profit", "Shipping Cost"]

fig, axes = plt.subplots(1, 3, figsize=(18, 4))

for ax, col in zip(axes, cols):
    sns.histplot(data=df, x=col, bins=50, ax=ax, color="steelblue")
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")

plt.tight_layout()
plt.show()
Summarize sales profit shipping stats
df[["Sales", "Profit", "Shipping Cost"]].describe().T[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
count mean std min 25% 50% 75% max
Sales 51290.0 246.490581 487.565361 0.444 30.758625 85.053 251.0532 22638.480
Profit 51290.0 28.610982 174.340972 -6599.978 0.000000 9.240 36.8100 8399.976
Shipping Cost 51290.0 26.375915 57.296804 0.000 2.610000 7.790 24.4500 933.570

The distributions for Sales, Profit, and Shipping Cost are now plotted, and the summary stats help interpret what those histograms show:

  • Sales is strongly right-skewed. The median is about 85.05, while the mean is much higher at 246.49, with a maximum of 22638.48. That pattern indicates many smaller orders and a relatively small number of very large sales pulling the distribution to the right.
  • Profit is also right-skewed, but with an important twist: it includes negative values. The median profit is 9.24, the mean is 28.61, the minimum is -6599.98 USD, and the maximum is 8399.98 USD. So most orders appear to have modest profit, but there are both large gains and some substantial losses.
  • Shipping Cost is right-skewed as well. The median is 7.79, the mean is 26.38, and the maximum is 933.57. That suggests most shipments are inexpensive, with a smaller number of much more expensive shipments creating a long upper tail.

Overall, the three histograms likely show:

  • a heavy concentration near lower values,
  • long right tails,
  • and for Profit, a spread crossing below zero.

These are typical non-normal business distributions with outliers, especially for transaction-level data.
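Given those long right tails, one common next step (an illustrative sketch, not part of the scored run) is a log1p transform, which compresses the tail for non-negative metrics such as Sales and Shipping Cost; the synthetic series below stands in for the real Sales column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Lognormal draws mimic a right-skewed Sales distribution.
sales = pd.Series(rng.lognormal(4, 1, 1000), name="Sales")

# log1p = log(1 + x), safe at zero; it sharply reduces right skew.
log_sales = np.log1p(sales)

print(round(float(sales.skew()), 2), round(float(log_sales.skew()), 2))
```

Note this would not work for Profit, which contains negative values; a signed transform or a plain linear scale is needed there.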

Score cards for gpt-5.4 for How to Analyze a CSV File in Python

Total score: 9/10

  • Task Completion: 2/2

    Excellent

    All requested steps were completed: load/inspect the CSV, check missing values, show numeric summary statistics, and plot the three requested distributions.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and consistent with the outputs provided, including the dataframe load, missing-value table, describe output, and histogram plotting.

  • Output Quality: 3/3

    Excellent

    The outputs match the expected outcomes semantically: shape, dtypes, head, missing-value counts, numeric describe statistics, and three histograms are all present and relevant.

  • Reasoning Quality: 1/2

    Needs work

    The narrative is mostly correct and helpful, but it contains some inconsistencies and over-interpretation, such as stating a median profit of 0 in the numeric-summary narrative while the describe() table (and the later histogram commentary) show a median of 9.24.

  • Reliability: 1/1

    Excellent

    The workflow is generally consistent and reproducible, though it includes some contradictory commentary and a truncated final explanation, which slightly reduces robustness.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.