How to Analyze a CSV File in Python

A step-by-step AI data analyst session: load a CSV, inspect structure, handle missing values, and generate a full exploratory summary.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Superstore Sales CSV from a URL and inspects its structure with shape, column dtypes, and a preview of the first rows. It checks for missing values and reports counts by column, then computes summary statistics for numeric fields. It generates distribution plots for Sales, Profit, and Shipping Cost to support exploratory analysis.

Who this example is for

This is for analysts and students who need a repeatable way to profile a new CSV dataset in Python. It helps anyone who wants an AI-assisted notebook that produces both tabular summaries and basic distribution visualizations.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape, dtypes, and first 5 rows
  • Missing value counts by column
  • Summary statistics for numeric columns via describe()
  • Histograms with KDE for Sales, Profit, and Shipping Cost
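The four expected outcomes above boil down to a handful of pandas calls. The following is a minimal sketch, not the recorded run: it reads a tiny inline CSV sample instead of the Superstore URL (which the workflow loads with `encoding='latin1'`), but the profiling steps are the same.

```python
import io
import pandas as pd

# Tiny inline sample standing in for the Superstore CSV; the real workflow
# reads the raw.githubusercontent.com URL with encoding='latin1'.
csv_text = """Order ID,Sales,Quantity,Profit,Shipping Cost
AG-2011-2040,408.30,2,106.14,35.46
IN-2011-47883,120.366,3,36.036,9.72
HU-2011-1220,66.12,4,29.64,8.17
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)           # dataset shape (rows, columns)
print(df.dtypes)          # column dtypes
print(df.head())          # first rows
print(df.isnull().sum())  # missing-value counts by column
print(df.describe())      # summary statistics for numeric columns
```

The distribution plots in the last expected outcome are the only step that needs anything beyond pandas (matplotlib/seaborn, as shown later in the conversation).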

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
  2. check for missing values and show counts by column
  3. show summary statistics for all numeric columns
  4. plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
glm-5.1            9/10  2/2   2/2        3/3     1/2        1/1          Open glm-5.1 conversation
gpt-5.4            9/10  2/2   2/2        3/3     1/2        1/1          Open gpt-5.4 conversation
qwen3-coder-next   9/10  2/2   2/2        3/3     1/2        1/1          Open qwen3-coder-next conversation
qwen3.5:397b       4/10  1/2   1/2        1/3     1/2        0/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these five dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
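To make the arithmetic concrete, here is a hypothetical run (the dimension values are illustrative, not taken from the table above) scoring 2, 2, 3, 1, 1:

```python
# Hypothetical per-dimension scores; maxima are 2, 2, 3, 2, 1.
scores = {
    "task_completion": 2,
    "execution_correctness": 2,
    "output_quality": 3,
    "reasoning_quality": 1,
    "reliability": 1,
}

# Total score is the plain sum of the five dimensions.
total = sum(scores.values())
print(f"{total}/10")  # -> 9/10
```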

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task "How to Analyze a CSV File in Python". Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
Thinking

The user wants to load a specific CSV file from a URL and inspect it (shape, dtypes, and head).

Plan:

  1. Load the CSV using pandas.
  2. Display the shape of the dataframe.
  3. Display the dtypes.
  4. Display the first 5 rows.

Constraints:

  • Return at most one Python code block.
  • No comma-separated expressions like df.head(), df.shape.
  • Use display() for multiple tables/outputs.
  • End code block with a bare expression if possible, but since I need shape and dtypes, I'll use display() or just print them. Actually, df.shape and df.dtypes are not dataframes, so I'll print them or display them.

Step: Load and inspect the data.

I'll start by loading the dataset from the provided URL and inspecting its basic structure.

Load and inspect data
import pandas as pd

url = 'https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv'
df = pd.read_csv(url, encoding='latin1')

print(f"Shape: {df.shape}")
display(df.dtypes)
df.head()
Shape: (51290, 24)
Row ID              int64
Order ID              str
Order Date            str
Ship Date             str
Ship Mode             str
Customer ID           str
Customer Name         str
Segment               str
City                  str
State                 str
Country               str
Postal Code       float64
Market                str
Region                str
Product ID            str
Category              str
Sub-Category          str
Product Name          str
Sales             float64
Quantity            int64
Discount          float64
Profit            float64
Shipping Cost     float64
Order Priority        str
dtype: object
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment City State ... Product ID Category Sub-Category Product Name Sales Quantity Discount Profit Shipping Cost Order Priority
0 42433 AG-2011-2040 1/1/2011 6/1/2011 Standard Class TB-11280 Toby Braunhardt Consumer Constantine Constantine ... OFF-TEN-10000025 Office Supplies Storage Tenex Lockers, Blue 408.300 2 0.0 106.140 35.46 Medium
1 22253 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... OFF-SU-10000618 Office Supplies Supplies Acme Trimmer, High Speed 120.366 3 0.1 36.036 9.72 Medium
2 48883 HU-2011-1220 1/1/2011 5/1/2011 Second Class AT-735 Annie Thurman Consumer Budapest Budapest ... OFF-TEN-10001585 Office Supplies Storage Tenex Box, Single Width 66.120 4 0.0 29.640 8.17 High
3 11731 IT-2011-3647632 1/1/2011 5/1/2011 Second Class EM-14140 Eugene Moren Home Office Stockholm Stockholm ... OFF-PA-10001492 Office Supplies Paper Enermax Note Cards, Premium 44.865 3 0.5 -26.055 4.82 High
4 22255 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... FUR-FU-10003447 Furniture Furnishings Eldon Light Bulb, Duo Pack 113.670 5 0.1 37.770 4.70 Medium

5 rows × 24 columns

Thinking

The user wants to load a specific CSV file and display its shape, dtypes, and the first 5 rows. Looking at the notebook state:

  • [cell-1] executed pd.read_csv(url, encoding='latin1').
  • [output-1] shows:
    • Shape: (51290, 24)
    • Dtypes: A list of 24 columns with their types (int64, str, float64, etc.).
    • Head: The first 5 rows are displayed.

The original goal was: "load the CSV file ... and show shape, dtypes and first 5 rows".

All three requirements (shape, dtypes, head) have been met and are visible in the output.

Therefore, Path A is the correct choice. I should provide a natural final answer based on these outputs.

Insights from output:

  • The dataset contains 51,290 rows and 24 columns.
  • Column types vary between integers (e.g., Row ID, Quantity), floats (e.g., Sales, Profit, Shipping Cost), and strings (e.g., Order ID, Customer Name).
  • The data includes a mix of temporal (Order Date, Ship Date), categorical (Segment, Region, Category), and numerical metrics.

Final Answer Construction: The dataset has been successfully loaded with 51,290 rows and 24 columns. The columns consist of mixed data types, including identifiers (strings), temporal dates (strings), and numerical values for Sales, Quantity, Discount, Profit, and Shipping Cost (floats and integers).

Wait, the instructions say "provide a natural final answer based on observed outputs... do not include any Python code block."

I will summarize the findings.

The dataset has been loaded successfully, containing 51,290 rows and 24 columns. The columns include a mix of identifiers (like Order ID and Customer ID), categorical information (Segment, Market, Region), and numerical metrics such as Sales, Quantity, Discount, Profit, and Shipping Cost.

check for missing values and show counts by column
Thinking

The user wants to check for missing values in the dataframe df and display the counts of these missing values for each column.

  • df is already loaded and has 51,290 rows and 24 columns.
  • I need to calculate the sum of null values per column.

I will calculate the total number of missing values for each column to see if any data cleaning is required.

Count missing values
df.isnull().sum()
Row ID                0
Order ID              0
Order Date            0
Ship Date             0
Ship Mode             0
Customer ID           0
Customer Name         0
Segment               0
City                  0
State                 0
Country               0
Postal Code       41296
Market                0
Region                0
Product ID            0
Category              0
Sub-Category          0
Product Name          0
Sales                 0
Quantity              0
Discount              0
Profit                0
Shipping Cost         0
Order Priority        0
dtype: int64
Thinking

The user wants to check for missing values and show counts by column. The latest output (output-2) shows the result of df.isnull().sum().

  • Postal Code has 41,296 missing values.
  • All other columns have 0 missing values. The original goal was "check for missing values and show counts by column". The output directly provides these counts for every column. This is sufficient to answer the goal.

Path A.

The check for missing values reveals that the dataset is largely complete, with the exception of the Postal Code column, which has 41,296 missing values. All other 23 columns contain no missing data.
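A natural follow-up, not part of the recorded run, is deciding what to do with the mostly-empty Postal Code column. A minimal sketch of two common options, using a small stand-in frame (the real `df` has 51,290 rows with 41,296 missing Postal Code values):

```python
import pandas as pd

# Small stand-in for the loaded Superstore frame.
df = pd.DataFrame({
    "Country": ["Algeria", "Australia", "United States"],
    "Postal Code": [None, None, 10024.0],
})

# Option 1: drop the mostly-empty column entirely.
dropped = df.drop(columns=["Postal Code"])

# Option 2: keep it, but replace missing entries with a sentinel value.
filled = df.assign(**{"Postal Code": df["Postal Code"].fillna(-1)})

print(dropped.columns.tolist())
print(filled["Postal Code"].tolist())
```

Which option is appropriate depends on whether postal-level geography matters for the downstream analysis.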

show summary statistics for all numeric columns
Thinking

The user wants to see summary statistics for all numeric columns in the DataFrame df.

  • `df` is already loaded.
  • Numeric columns identified from previous output: `Row ID` (int64), `Postal Code` (float64), `Sales` (float64), `Quantity` (int64), `Discount` (float64), `Profit` (float64), `Shipping Cost` (float64).
  • Wait, `Row ID` is an identifier and likely shouldn't be in the summary statistics, but the user asked for "all numeric columns". Usually, `describe()` handles this.
  • The standard way to get summary statistics for numeric columns in pandas is `df.describe()`.

  • Call `df.describe()`.
  • Since the user specifically asked for numeric columns, `describe()` by default focuses on numeric columns unless `include='all'` is specified.

  • One runnable code block.
  • No `print` if not needed (though `describe()` returns a DataFrame, so it can be displayed directly).
  • Keep it minimal.

I will generate the summary statistics for the numeric columns using the describe method.

Show summary statistics
df.describe()
Row ID Postal Code Sales Quantity Discount Profit Shipping Cost
count 51290.00000 9994.000000 51290.000000 51290.000000 51290.000000 51290.000000 51290.000000
mean 25645.50000 55190.379428 246.490581 3.476545 0.142908 28.610982 26.375915
std 14806.29199 32063.693350 487.565361 2.278766 0.212280 174.340972 57.296804
min 1.00000 1040.000000 0.444000 1.000000 0.000000 -6599.978000 0.000000
25% 12823.25000 23223.000000 30.758625 2.000000 0.000000 0.000000 2.610000
50% 25645.50000 56430.500000 85.053000 3.000000 0.000000 9.240000 7.790000
75% 38467.75000 90008.000000 251.053200 5.000000 0.200000 36.810000 24.450000
max 51290.00000 99301.000000 22638.480000 14.000000 0.850000 8399.976000 933.570000
Thinking

The user wants summary statistics for all numeric columns. Looking at the notebook state:

  • cell-1 loaded the data.
  • cell-2 checked for nulls.
  • cell-3 ran df.describe().
  • output-3 contains the summary statistics for Row ID, Postal Code, Sales, Quantity, Discount, Profit, and Shipping Cost.

The original goal was "show summary statistics for all numeric columns". The describe() method provides count, mean, std, min, 25%, 50%, 75%, and max for numeric columns. The output is present.

Insights from output-3:

  • Sales: Average is 246.49 USD, ranging from 0.44 USD to 22,638.48 USD.
  • Quantity: Average is 3.48 units, with a maximum of 14 units.
  • Discount: Ranges from 0 to 0.85 (85%), with a mean of 0.14.
  • Profit: Average is 28.61 USD, but there are significant losses (min is -6,599.98 USD).
  • Shipping Cost: Average is 26.38 USD.
  • Postal Code and Row ID are technically numeric but are identifiers, not metrics.

The goal is satisfied. Path A.

Constraint checklist & Confidence score:

  1. No $? Yes. Use USD.
  2. No chain-of-thought? Yes.
  3. No headings? Yes.
  4. Path A (final answer)? Yes.
  5. Use latest output? Yes.
  6. No coaching phrases? Yes.
  7. Conciseness? Yes.

Confidence Score: 5/5

Wait, output-3 is truncated. I can see most of it, but it ends with ...[truncated]. However, for a "show summary statistics" request, the describe() output is the standard answer. The provided output snippet shows all the key financial metrics (Sales, Quantity, Discount, Profit, Shipping Cost).

Final answer formulation: The summary statistics for the numeric columns are as follows:

  • Sales: Mean of 246.49 USD, with a wide range from 0.44 USD up to 22,638.48 USD.
  • Quantity: Average order quantity is approximately 3.48 units, peaking at 14 units.
  • Discount: Average discount is 14.29%, with a maximum of 85%.
  • Profit: Average profit is 28.61 USD, although there is a minimum value of -6,599.98 USD, indicating substantial losses in some cases.
  • Shipping Cost: Average shipping cost is 26.38 USD.
  • Row ID and Postal Code are also present as numeric types, though they serve as identifiers.
<!--final answer-->

The summary statistics for the numeric columns show the following:

  • Sales: Average sales are 246.49 USD, with values ranging from a minimum of 0.44 USD to a maximum of 22,638.48 USD.
  • Quantity: The average quantity per order is approximately 3.48 units, with a maximum of 14 units.
  • Discount: The average discount is 14.29%, with a maximum discount of 85%.
  • Profit: The average profit is 28.61 USD, though the minimum value of -6,599.98 USD indicates significant losses on some orders.
  • Shipping Cost: The average shipping cost is 26.38 USD.
  • Row ID and Postal Code are also treated as numeric, though they function as identifiers.
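Since Row ID and Postal Code are identifiers rather than metrics, one refinement (not executed in this run) is to restrict `describe()` to the metric columns. A sketch with stand-in data:

```python
import pandas as pd

# Stand-in for the loaded Superstore frame.
df = pd.DataFrame({
    "Row ID": [1, 2, 3, 4],
    "Sales": [408.30, 120.366, 66.12, 44.865],
    "Profit": [106.14, 36.036, 29.64, -26.055],
    "Shipping Cost": [35.46, 9.72, 8.17, 4.82],
})

# Describe only the business metrics, excluding identifier columns.
metrics = ["Sales", "Profit", "Shipping Cost"]
summary = df[metrics].describe()
print(summary)
```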
plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost
Thinking

The user wants to plot distributions for three numeric columns: Sales, Profit, and Shipping Cost.

  • Dataset is loaded as `df`.
  • `Sales`, `Profit`, and `Shipping Cost` are float64.
  • From `df.describe()`, I see that `Sales` has a max of 22,638.48 and `Profit` has a range from -6,599.98 to a high value. This suggests heavy skewness or outliers.

  • Visualizing distributions of multiple columns is best done with histograms or KDE plots.
  • Since there are 3 columns and they likely have different scales, I should either use subplots or a grid.
  • Seaborn is the default library. `sns.histplot` or `sns.displot` are good choices.
  • Because of the potential outliers (e.g., Sales max 22k vs mean 246), a standard histogram might be squashed. I should probably stick to a standard plot first and see, or use subplots to keep them distinct.

  • Step: Create a figure with 3 subplots (1 row, 3 columns).
  • Column 1: `Sales` distribution.
  • Column 2: `Profit` distribution.
  • Column 3: `Shipping Cost` distribution.
  • Library: `seaborn` and `matplotlib.pyplot`.

  • Use `sns.histplot(..., kde=True)` for each.
  • Ensure labels and titles are clear.
  • No interactive Altair unless requested.

  • One runnable code block.
  • Explain briefly.
  • Wait for results.

I'll create a grid of histograms with Kernel Density Estimate (KDE) overlays to visualize the distributions of Sales, Profit, and Shipping Cost.

Plot distribution histograms
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the figure with 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

cols = ['Sales', 'Profit', 'Shipping Cost']
colors = ['blue', 'green', 'orange']

for i, col in enumerate(cols):
    sns.histplot(df[col], kde=True, ax=axes[i], color=colors[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
Thinking

The user wants to plot distributions for Sales, Profit, and Shipping Cost. Looking at the notebook state:

  • Cell 1: Loaded data.
  • Cell 2: Checked for nulls.
  • Cell 3: df.describe() provides numeric summaries.
  • Cell 4: Executed the plotting code using sns.histplot with KDE.

The output-4 shows <Figure size 1800x500 with 3 Axes>, confirming the charts were generated. The goal was "plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost". The user has seen the plots (via the figure object) and the summary statistics (via df.describe() in cell 3).

According to the "Sufficiency heuristics":

  • "If the goal is to show/plot/chart/visualize and latest output contains both a valid chart/figure object and explicit numeric/text values supporting the conclusion, choose Path A".
  • In this case:
    • Valid chart/figure object exists in output-4.
    • Explicit numeric values supporting the conclusion (min, max, mean, etc.) are in output-3.

Therefore, I should choose Path A.

Insights from the outputs:

  • Sales: Heavy right-skew. Mean is 246.49 USD, but max is 22,638.48 USD, indicating a few very large orders.
  • Profit: Distribution contains both positive and negative values, with a significant minimum of -6,599.98 USD, suggesting some highly unprofitable transactions.
  • Shipping Cost: Generally lower range compared to sales, with a mean of 26.38 USD.

Wait, the user just wants the plots. The plots are done. I need to provide a natural final answer based on the observed outputs.

Currency formatting: Use USD. No $ signs.

Final Answer content:

  • Confirm the distributions have been plotted.
  • Describe the observations from the plots and the summary stats.
  • Sales: Right-skewed, most values are small, few are very high (up to 22,638.48 USD).
  • Profit: Centered around 28.61 USD but with a wide range, including deep losses (-6,599.98 USD).
  • Shipping Cost: Relatively stable with a mean of 26.38 USD.

Checking rules:

  • Path A: natural final answer, include <!--final answer-->, no Python code.
  • No chain-of-thought.
  • No headings.
  • Use USD.

One detail: The user's original goal was "plot distributions...". Since the code was executed and the figure was generated, and the statistics were already provided in cell 3, the goal is fulfilled.

The distributions for Sales, Profit, and Shipping Cost reveal the following:

  • Sales: The distribution is heavily right-skewed, meaning most orders have relatively low values, while a few exceptionally large transactions reach up to 22,638.48 USD.
  • Profit: This column shows a wide variance. While the average profit is 28.61 USD, there are significant outliers on the negative side, with losses as deep as -6,599.98 USD.
  • Shipping Cost: The distribution is more concentrated, with an average cost of 26.38 USD, though it also exhibits a right skew with some high-cost shipments.
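For heavily right-skewed columns like Sales, one common refinement (not part of the recorded run) is a log transform before plotting, which keeps the long tail from squashing the histogram; `sns.histplot` also accepts a `log_scale` argument for the same purpose. A sketch with synthetic sales-like data:

```python
import numpy as np

# Synthetic right-skewed values standing in for df['Sales'].
rng = np.random.default_rng(0)
sales = rng.lognormal(mean=4.0, sigma=1.2, size=10_000)

# log1p compresses the long right tail; plotting log_sales (or passing
# log_scale=True to sns.histplot) yields a far more readable histogram.
log_sales = np.log1p(sales)

# The max/mean ratio is a rough skew indicator: large before, small after.
print(round(sales.max() / sales.mean(), 1))
print(round(log_sales.max() / log_sales.mean(), 1))
```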

Score cards for gemma4:31b for How to Analyze a CSV File in Python

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    All four requested steps were completed: loading the CSV, checking missing values, showing numeric summary statistics, and plotting three distributions.

  • Execution Correctness: 2/2

    Excellent

    The provided code is syntactically valid and the outputs shown are consistent with successful execution.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: shape, dtypes, head, missing-value counts, describe() statistics, and three KDE histograms are all present.

  • Reasoning Quality: 2/2

    Excellent

    The workflow reasoning is clear and aligned with each prompt, with appropriate interpretation of the outputs.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and robust overall, with no evident hallucinations or unsafe behavior.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.