Data Analysis

How to Analyze a CSV File in Python

A step-by-step AI data analyst session: load a CSV, inspect structure, handle missing values, and generate a full exploratory summary.

What

This AI Data Analyst workflow loads the Superstore Sales CSV from a URL and inspects its structure with shape, column dtypes, and a preview of the first rows. It checks for missing values and reports counts by column, then computes summary statistics for numeric fields. It generates distribution plots for Sales, Profit, and Shipping Cost to support exploratory analysis.

Who

This is for analysts and students who need a repeatable way to profile a new CSV dataset in Python. It helps anyone who wants an AI-assisted notebook that produces both tabular summaries and basic distribution visualizations.

Tools

pandas
numpy
matplotlib
seaborn
requests

Outcomes

Dataset shape, dtypes, and first 5 rows
Missing value counts by column
Summary statistics for numeric columns via describe()
Histograms with KDE for Sales, Profit, and Shipping Cost

Quality Score

9/10

gpt-5.2mljar-ai

Last scored: Apr 7, 2026

Task Completion: 2/2
Excellent
All required steps are present: CSV loaded with shape/dtypes/head, missing-value counts by column, numeric describe output, and distributions plotted for Sales/Profit/Shipping Cost.
Execution Correctness: 2/2
Excellent
Code is syntactically correct and consistent with the shown outputs (df loads, missing counts compute, describe runs, seaborn histplot with kde renders a 3-panel figure).
Output Quality: 3/3
Excellent
Outputs match expected outcomes: shape table, dtypes table, head(5), missing-value table, numeric summary statistics, and 3 histograms with KDE for the specified columns.
Reasoning Quality: 1/2
Needs work
Reasoning is mostly correct (e.g., skew/outliers, Postal Code missingness), but includes some unnecessary meta-commentary and the final narrative is truncated in the last section.
Reliability: 1/1
Excellent
Workflow is reasonably robust for this dataset (explicit column list, standard pandas/seaborn methods), though it assumes the three columns exist without validation.

load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows

Thinking

Planning code for data inspection

I need to follow an iterative, step-by-step process to provide code for loading and inspecting data. I should show the shape, data types, and head of the DataFrame using pandas in a notebook. It's essential to keep the code minimal and display information sequentially. I might use import pandas as pd and from IPython.display import display, loading the CSV and displaying the shape, data types, and head sequentially. This keeps everything clear and organized!

Here’s the first notebook step to load the CSV and inspect its shape, data types, and first 5 rows.

Inspect dataset shape and types

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv"

df = pd.read_csv(url, encoding="latin1")

shape_df = pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]})
display(shape_df)

dtypes_df = df.dtypes.rename("dtype").reset_index().rename(columns={"index": "column"})
display(dtypes_df)

df.head()

	rows	columns
0	51290	24

	column	dtype
0	Row ID	int64
1	Order ID	str
2	Order Date	str
3	Ship Date	str
4	Ship Mode	str
5	Customer ID	str
6	Customer Name	str
7	Segment	str
8	City	str
9	State	str
10	Country	str
11	Postal Code	float64
12	Market	str
13	Region	str
14	Product ID	str
15	Category	str
16	Sub-Category	str
17	Product Name	str
18	Sales	float64
19	Quantity	int64
20	Discount	float64
21	Profit	float64
22	Shipping Cost	float64
23	Order Priority	str

	Row ID	Order ID	Order Date	Ship Date	Ship Mode	Customer ID	Customer Name	Segment	City	State	...	Product ID	Category	Sub-Category	Product Name	Sales	Quantity	Discount	Profit	Shipping Cost	Order Priority
0	42433	AG-2011-2040	1/1/2011	6/1/2011	Standard Class	TB-11280	Toby Braunhardt	Consumer	Constantine	Constantine	...	OFF-TEN-10000025	Office Supplies	Storage	Tenex Lockers, Blue	408.300	2	0.0	106.140	35.46	Medium
1	22253	IN-2011-47883	1/1/2011	8/1/2011	Standard Class	JH-15985	Joseph Holt	Consumer	Wagga Wagga	New South Wales	...	OFF-SU-10000618	Office Supplies	Supplies	Acme Trimmer, High Speed	120.366	3	0.1	36.036	9.72	Medium
2	48883	HU-2011-1220	1/1/2011	5/1/2011	Second Class	AT-735	Annie Thurman	Consumer	Budapest	Budapest	...	OFF-TEN-10001585	Office Supplies	Storage	Tenex Box, Single Width	66.120	4	0.0	29.640	8.17	High
3	11731	IT-2011-3647632	1/1/2011	5/1/2011	Second Class	EM-14140	Eugene Moren	Home Office	Stockholm	Stockholm	...	OFF-PA-10001492	Office Supplies	Paper	Enermax Note Cards, Premium	44.865	3	0.5	-26.055	4.82	High
4	22255	IN-2011-47883	1/1/2011	8/1/2011	Standard Class	JH-15985	Joseph Holt	Consumer	Wagga Wagga	New South Wales	...	FUR-FU-10003447	Furniture	Furnishings	Eldon Light Bulb, Duo Pack	113.670	5	0.1	37.770	4.70	Medium

5 rows × 24 columns

Thinking

Summarizing data insights

I need to focus on delivering a final answer based on the outputs while aligning with the original goal of loading the CSV and showing its shape, data types, and the first five rows. I should incorporate insights from the outputs rather than just restating them. I might summarize the shape, highlight that there are mixed data types with date columns still as strings, and note that the first five rows show the expected sales fields. I'll keep it concise without using headings for a more natural flow.

The file is loaded and the requested inspection is available.

It contains 51,290 rows and 24 columns.

From the observed dtypes:

Numeric columns include Row ID, Quantity, Sales, Discount, Profit, and Shipping Cost
Postal Code is float64
The date fields Order Date and Ship Date are currently loaded as strings, not datetime yet
Most of the remaining fields are text columns such as customer, location, market, product, and category information

The first 5 rows show typical order-level records with fields like:

identifiers: Row ID, Order ID, Customer ID, Product ID
dates and shipping info: Order Date, Ship Date, Ship Mode
customer and geography: Customer Name, Segment, City, State, Country, Market, Region
product and transaction values: Category, Sub-Category, Product Name, Sales, Quantity, Discount, Profit, Shipping Cost, Order Priority

So the requested shape, dtypes, and first 5 rows have all been shown.

#10

check for missing values and show counts by column

#11

Thinking

Analyzing missing values

I need to provide a concise code snippet based on the current notebook's state. The user wants to check for missing values and show counts by column. There’s a conflict because the dataset shows two different DataFrame shapes. Based on the instructions, I'll rely on the notebook cell output as the current truth and make code that counts missing values per column. I’ll ensure it’s minimal and ends with a simple display of the table.

#12

Summarize missing values

missing_df = (
    df.isna()
      .sum()
      .rename("missing_count")
      .reset_index()
      .rename(columns={"index": "column"})
      .sort_values("missing_count", ascending=False)
      .reset_index(drop=True)
)

missing_df

#13

	column	missing_count
0	Postal Code	41296
1	Row ID	0
2	Region	0
3	Shipping Cost	0
4	Profit	0
5	Discount	0
6	Quantity	0
7	Sales	0
8	Product Name	0
9	Sub-Category	0
10	Category	0
11	Product ID	0
12	Market	0
13	Order ID	0
14	Country	0
15	State	0
16	City	0
17	Segment	0
18	Customer Name	0
19	Customer ID	0
20	Ship Mode	0
21	Ship Date	0
22	Order Date	0
23	Order Priority	0

#14

The missing-value check shows that only one column has missing data:

Postal Code: 41,296 missing values

All other 23 columns have 0 missing values.

This means the dataset is otherwise complete, and the only missingness to handle is in Postal Code.

#15

show summary statistics for all numeric columns

#16

Thinking

Generating summary statistics

Okay, the user wants to see summary statistics for all numeric columns. I need to take a step-by-step approach here. I can provide code to select and describe numeric columns directly in a notebook. There’s no need to ask for clarification, since it's straightforward.

I'll use a code block and keep it concise:

numeric_summary = df.select_dtypes(include="number").describe().T
numeric_summary

I might consider rounding the values, and I’ll include integer columns as well. It’s important to deliver a clear answer with a brief explanation along with the code.

#17

Here are the summary statistics for all numeric columns.

#18

Summarize numeric columns

numeric_summary = df.select_dtypes(include="number").describe().T

numeric_summary

#19

	count	mean	std	min	25%	50%	75%	max
Row ID	51290.0	25645.500000	14806.291990	1.000	12823.250000	25645.500	38467.7500	51290.000
Postal Code	9994.0	55190.379428	32063.693350	1040.000	23223.000000	56430.500	90008.0000	99301.000
Sales	51290.0	246.490581	487.565361	0.444	30.758625	85.053	251.0532	22638.480
Quantity	51290.0	3.476545	2.278766	1.000	2.000000	3.000	5.0000	14.000
Discount	51290.0	0.142908	0.212280	0.000	0.000000	0.000	0.2000	0.850
Profit	51290.0	28.610982	174.340972	-6599.978	0.000000	9.240	36.8100	8399.976
Shipping Cost	51290.0	26.375915	57.296804	0.000	2.610000	7.790	24.4500	933.570

#20

Thinking

Considering DataFrame Structure

I’m looking at the DataFrame, which has rows and columns. After the "Discount" row, I should avoid using exact values since they’re truncated. Instead, it might be better to focus on means that are visible before truncation. The table includes columns like count, mean, standard deviation, min, and max, with the first seven rows shown. I think all rows might be visible across the lines, but it’s worth double-checking for clarity.

#21

The numeric summary covers these 7 columns: Row ID, Postal Code, Sales, Quantity, Discount, Profit, Shipping Cost.

A few useful takeaways from the summary:

Row ID runs from 1 to 51,290 with a midpoint of 25,645.5. This is an identifier, so its summary is not analytically meaningful beyond confirming the range.
Postal Code has only 9,994 non-missing values, which matches the earlier missing-value check and explains why its count is much lower than the other numeric columns.
Sales are strongly right-skewed:
- mean: 246.49 USD
- median: 85.05 USD
- max: 22,638.48 USD
The mean being much higher than the median suggests a small number of very large orders.
Quantity is fairly concentrated:
- median: 3
- 25th percentile: 2
- 75th percentile: 5
- max: 14
So most orders contain only a few items.
Discount is generally low:
- mean: 0.143
- 25th percentile: 0
This suggests many rows have no discount at all.
Profit has a positive average but large downside:
- mean: 28.61 USD
- minimum: -6599.978 USD
So while orders are profitable on average, some orders incur very large losses.
Shipping Cost also appears right-skewed:
- mean: 26.38 USD
- 25th percentile: 2.61 USD
This suggests many shipments are inexpensive, with a smaller number of costly shipments pulling the mean upward.

Overall, the summary shows a dataset with typical small-to-moderate orders, occasional very large sales, mostly low discounts, and some extreme negative-profit cases.

#22

plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost

#23

Thinking

Planning distribution plots

I need to create an answer based on the notebook's current state and the user's prompt. The user asked to plot distributions for the top three numeric columns: Sales, Profit, and Shipping Cost. I should probably do one chart per step, as per the instructions. Although the user requested three distributions, I can start with one figure containing three histograms. I'll utilize libraries like seaborn or matplotlib and provide code for plotting the distributions, analyzing any skewness or missing data afterward.Proposing code for distributions

I’m at the point where I need to propose code for the distribution plots without any fresh output yet. Since there aren't any missing outputs, I'll aim for a concise approach. I'll suggest using histplot with a KDE overlay. Because Sales and Shipping Cost are highly skewed, I might need to compress those scales. However, Profit contains negative values, so I can’t apply the same method there. I believe it makes sense to present three histograms within one figure while keeping it minimal. I’ll end with plt.show() as the final expression.

#24

Based on the summary stats, all three are likely skewed, so a side-by-side histogram view is a good first check.

#25

Plot feature distributions

import matplotlib.pyplot as plt
import seaborn as sns

cols = ["Sales", "Profit", "Shipping Cost"]

fig, axes = plt.subplots(1, 3, figsize=(18, 4))

for ax, col in zip(axes, cols):
    sns.histplot(df[col], bins=40, kde=True, ax=ax)
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")

plt.tight_layout()
plt.show()

#26

#27

Summarize distribution statistics

distribution_stats = df[["Sales", "Profit", "Shipping Cost"]].agg(
    ["count", "mean", "std", "min", "median", "max"]
).T

distribution_stats

#28

	count	mean	std	min	median	max
Sales	51290.0	246.490581	487.565361	0.444	85.053	22638.480
Profit	51290.0	28.610982	174.340972	-6599.978	9.240	8399.976
Shipping Cost	51290.0	26.375915	57.296804	0.000	7.790	933.570

#29

The distribution plots for Sales, Profit, and Shipping Cost are supported by the summary values and point to three different shapes:

Sales is strongly right-skewed: the median is only 85.053 USD while the mean is 246.49 USD, with a maximum of 22,638.48 USD. That means most orders are relatively small, with a smaller number of very large sales stretching the distribution to the right.
Profit is much more spread out and includes both gains and losses: the median is 9.24 USD, the mean is 28.61 USD, the minimum is -6,599.978 USD, and the maximum is 8,399.976 USD. So the distribution is not only skewed, but also has substantial negative values, which suggests a long left tail from loss-making orders and a long right tail from highly profitable ones.
Shipping Cost is also right-skewed: the median is 7.79 USD versus a mean of 26.38 USD, with values reaching 933.57 USD. Most shipments appear low-cost, with a smaller number of expensive shipments creating the long right tail.

So, the plotted distributions are consistent with a dataset where Sales and Shipping Cost are dominated by many low-to-moderate values plus a few extreme high values, while Profit is more volatile because it spans both negative and positive outcomes.

How to Analyze a CSV File in Python

What

Who

Tools

Outcomes

Quality Score

Browse All Analysis Examples

Machine Learning

Time Series

Data Analysis

Exploratory Data Analysis

Finance

Nlp

Statistics