Data Analysis

How to Analyze a CSV File in Python

A step-by-step AI data analyst session: load a CSV, inspect structure, handle missing values, and generate a full exploratory summary.

What

This AI Data Analyst workflow loads the Superstore Sales CSV from a URL and inspects its structure with shape, column dtypes, and a preview of the first rows. It checks for missing values and reports counts by column, then computes summary statistics for numeric fields. It generates distribution plots for Sales, Profit, and Shipping Cost to support exploratory analysis.

Who

This is for analysts and students who need a repeatable way to profile a new CSV dataset in Python. It helps anyone who wants an AI-assisted notebook that produces both tabular summaries and basic distribution visualizations.

Tools

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Outcomes

  • Dataset shape, dtypes, and first 5 rows
  • Missing value counts by column
  • Summary statistics for numeric columns via describe()
  • Histograms with KDE for Sales, Profit, and Shipping Cost

Quality Score

9/10

gpt-5.2mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All required steps are present: CSV loaded with shape/dtypes/head, missing-value counts by column, numeric describe output, and distributions plotted for Sales/Profit/Shipping Cost.

  • Execution Correctness: 2/2

    Excellent

    Code is syntactically correct and consistent with the shown outputs (df loads, missing counts compute, describe runs, seaborn histplot with kde renders a 3-panel figure).

  • Output Quality: 3/3

    Excellent

    Outputs match expected outcomes: shape table, dtypes table, head(5), missing-value table, numeric summary statistics, and 3 histograms with KDE for the specified columns.

  • Reasoning Quality: 1/2

    Needs work

    Reasoning is mostly correct (e.g., skew/outliers, Postal Code missingness), but includes some unnecessary meta-commentary and the final narrative is truncated in the last section.

  • Reliability: 1/1

    Excellent

    Workflow is reasonably robust for this dataset (explicit column list, standard pandas/seaborn methods), though it assumes the three columns exist without validation.

load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
Thinking

Planning code for data inspection

I need to follow an iterative, step-by-step process to provide code for loading and inspecting data. I should show the shape, data types, and head of the DataFrame using pandas in a notebook. It's essential to keep the code minimal and display information sequentially. I might use import pandas as pd and from IPython.display import display, loading the CSV and displaying the shape, data types, and head sequentially. This keeps everything clear and organized!

Here’s the first notebook step to load the CSV and inspect its shape, data types, and first 5 rows.

Inspect dataset shape and types
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv"

df = pd.read_csv(url, encoding="latin1")

shape_df = pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]})
display(shape_df)

dtypes_df = df.dtypes.rename("dtype").reset_index().rename(columns={"index": "column"})
display(dtypes_df)

df.head()
rows columns
0 51290 24
column dtype
0 Row ID int64
1 Order ID str
2 Order Date str
3 Ship Date str
4 Ship Mode str
5 Customer ID str
6 Customer Name str
7 Segment str
8 City str
9 State str
10 Country str
11 Postal Code float64
12 Market str
13 Region str
14 Product ID str
15 Category str
16 Sub-Category str
17 Product Name str
18 Sales float64
19 Quantity int64
20 Discount float64
21 Profit float64
22 Shipping Cost float64
23 Order Priority str
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment City State ... Product ID Category Sub-Category Product Name Sales Quantity Discount Profit Shipping Cost Order Priority
0 42433 AG-2011-2040 1/1/2011 6/1/2011 Standard Class TB-11280 Toby Braunhardt Consumer Constantine Constantine ... OFF-TEN-10000025 Office Supplies Storage Tenex Lockers, Blue 408.300 2 0.0 106.140 35.46 Medium
1 22253 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... OFF-SU-10000618 Office Supplies Supplies Acme Trimmer, High Speed 120.366 3 0.1 36.036 9.72 Medium
2 48883 HU-2011-1220 1/1/2011 5/1/2011 Second Class AT-735 Annie Thurman Consumer Budapest Budapest ... OFF-TEN-10001585 Office Supplies Storage Tenex Box, Single Width 66.120 4 0.0 29.640 8.17 High
3 11731 IT-2011-3647632 1/1/2011 5/1/2011 Second Class EM-14140 Eugene Moren Home Office Stockholm Stockholm ... OFF-PA-10001492 Office Supplies Paper Enermax Note Cards, Premium 44.865 3 0.5 -26.055 4.82 High
4 22255 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... FUR-FU-10003447 Furniture Furnishings Eldon Light Bulb, Duo Pack 113.670 5 0.1 37.770 4.70 Medium

5 rows × 24 columns

Thinking

Summarizing data insights

I need to focus on delivering a final answer based on the outputs while aligning with the original goal of loading the CSV and showing its shape, data types, and the first five rows. I should incorporate insights from the outputs rather than just restating them. I might summarize the shape, highlight that there are mixed data types with date columns still as strings, and note that the first five rows show the expected sales fields. I'll keep it concise without using headings for a more natural flow.

The file is loaded and the requested inspection is available.

It contains 51,290 rows and 24 columns.

From the observed dtypes:

  • Numeric columns include Row ID, Quantity, Sales, Discount, Profit, and Shipping Cost
  • Postal Code is float64
  • The date fields Order Date and Ship Date are currently loaded as strings, not datetime yet
  • Most of the remaining fields are text columns such as customer, location, market, product, and category information

The first 5 rows show typical order-level records with fields like:

  • identifiers: Row ID, Order ID, Customer ID, Product ID
  • dates and shipping info: Order Date, Ship Date, Ship Mode
  • customer and geography: Customer Name, Segment, City, State, Country, Market, Region
  • product and transaction values: Category, Sub-Category, Product Name, Sales, Quantity, Discount, Profit, Shipping Cost, Order Priority

So the requested shape, dtypes, and first 5 rows have all been shown.

check for missing values and show counts by column
Thinking

Analyzing missing values

I need to provide a concise code snippet based on the current notebook's state. The user wants to check for missing values and show counts by column. There’s a conflict because the dataset shows two different DataFrame shapes. Based on the instructions, I'll rely on the notebook cell output as the current truth and make code that counts missing values per column. I’ll ensure it’s minimal and ends with a simple display of the table.

Summarize missing values
missing_df = (
    df.isna()
      .sum()
      .rename("missing_count")
      .reset_index()
      .rename(columns={"index": "column"})
      .sort_values("missing_count", ascending=False)
      .reset_index(drop=True)
)

missing_df
column missing_count
0 Postal Code 41296
1 Row ID 0
2 Region 0
3 Shipping Cost 0
4 Profit 0
5 Discount 0
6 Quantity 0
7 Sales 0
8 Product Name 0
9 Sub-Category 0
10 Category 0
11 Product ID 0
12 Market 0
13 Order ID 0
14 Country 0
15 State 0
16 City 0
17 Segment 0
18 Customer Name 0
19 Customer ID 0
20 Ship Mode 0
21 Ship Date 0
22 Order Date 0
23 Order Priority 0

The missing-value check shows that only one column has missing data:

  • Postal Code: 41,296 missing values

All other 23 columns have 0 missing values.

This means the dataset is otherwise complete, and the only missingness to handle is in Postal Code.

show summary statistics for all numeric columns
Thinking

Generating summary statistics

Okay, the user wants to see summary statistics for all numeric columns. I need to take a step-by-step approach here. I can provide code to select and describe numeric columns directly in a notebook. There’s no need to ask for clarification, since it's straightforward.

I'll use a code block and keep it concise:

numeric_summary = df.select_dtypes(include="number").describe().T
numeric_summary

I might consider rounding the values, and I’ll include integer columns as well. It’s important to deliver a clear answer with a brief explanation along with the code.

Here are the summary statistics for all numeric columns.

Summarize numeric columns
numeric_summary = df.select_dtypes(include="number").describe().T

numeric_summary
count mean std min 25% 50% 75% max
Row ID 51290.0 25645.500000 14806.291990 1.000 12823.250000 25645.500 38467.7500 51290.000
Postal Code 9994.0 55190.379428 32063.693350 1040.000 23223.000000 56430.500 90008.0000 99301.000
Sales 51290.0 246.490581 487.565361 0.444 30.758625 85.053 251.0532 22638.480
Quantity 51290.0 3.476545 2.278766 1.000 2.000000 3.000 5.0000 14.000
Discount 51290.0 0.142908 0.212280 0.000 0.000000 0.000 0.2000 0.850
Profit 51290.0 28.610982 174.340972 -6599.978 0.000000 9.240 36.8100 8399.976
Shipping Cost 51290.0 26.375915 57.296804 0.000 2.610000 7.790 24.4500 933.570
Thinking

Considering DataFrame Structure

I’m looking at the DataFrame, which has rows and columns. After the "Discount" row, I should avoid using exact values since they’re truncated. Instead, it might be better to focus on means that are visible before truncation. The table includes columns like count, mean, standard deviation, min, and max, with the first seven rows shown. I think all rows might be visible across the lines, but it’s worth double-checking for clarity.

The numeric summary covers these 7 columns: Row ID, Postal Code, Sales, Quantity, Discount, Profit, Shipping Cost.

A few useful takeaways from the summary:

  • Row ID runs from 1 to 51,290 with a midpoint of 25,645.5. This is an identifier, so its summary is not analytically meaningful beyond confirming the range.

  • Postal Code has only 9,994 non-missing values, which matches the earlier missing-value check and explains why its count is much lower than the other numeric columns.

  • Sales are strongly right-skewed:

    • mean: 246.49 USD
    • median: 85.05 USD
    • max: 22,638.48 USD

    The mean being much higher than the median suggests a small number of very large orders.

  • Quantity is fairly concentrated:

    • median: 3
    • 25th percentile: 2
    • 75th percentile: 5
    • max: 14

    So most orders contain only a few items.

  • Discount is generally low:

    • mean: 0.143
    • 25th percentile: 0

    This suggests many rows have no discount at all.

  • Profit has a positive average but large downside:

    • mean: 28.61 USD
    • minimum: -6599.978 USD

    So while orders are profitable on average, some orders incur very large losses.

  • Shipping Cost also appears right-skewed:

    • mean: 26.38 USD
    • 25th percentile: 2.61 USD

    This suggests many shipments are inexpensive, with a smaller number of costly shipments pulling the mean upward.

Overall, the summary shows a dataset with typical small-to-moderate orders, occasional very large sales, mostly low discounts, and some extreme negative-profit cases.

plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost
Thinking

Planning distribution plots

I need to create an answer based on the notebook's current state and the user's prompt. The user asked to plot distributions for the top three numeric columns: Sales, Profit, and Shipping Cost. I should probably do one chart per step, as per the instructions. Although the user requested three distributions, I can start with one figure containing three histograms. I'll utilize libraries like seaborn or matplotlib and provide code for plotting the distributions, analyzing any skewness or missing data afterward.Proposing code for distributions

I’m at the point where I need to propose code for the distribution plots without any fresh output yet. Since there aren't any missing outputs, I'll aim for a concise approach. I'll suggest using histplot with a KDE overlay. Because Sales and Shipping Cost are highly skewed, I might need to compress those scales. However, Profit contains negative values, so I can’t apply the same method there. I believe it makes sense to present three histograms within one figure while keeping it minimal. I’ll end with plt.show() as the final expression.

Based on the summary stats, all three are likely skewed, so a side-by-side histogram view is a good first check.

Plot feature distributions
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["Sales", "Profit", "Shipping Cost"]

fig, axes = plt.subplots(1, 3, figsize=(18, 4))

for ax, col in zip(axes, cols):
    sns.histplot(df[col], bins=40, kde=True, ax=ax)
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")

plt.tight_layout()
plt.show()
Summarize distribution statistics
distribution_stats = df[["Sales", "Profit", "Shipping Cost"]].agg(
    ["count", "mean", "std", "min", "median", "max"]
).T

distribution_stats
count mean std min median max
Sales 51290.0 246.490581 487.565361 0.444 85.053 22638.480
Profit 51290.0 28.610982 174.340972 -6599.978 9.240 8399.976
Shipping Cost 51290.0 26.375915 57.296804 0.000 7.790 933.570

The distribution plots for Sales, Profit, and Shipping Cost are supported by the summary values and point to three different shapes:

  • Sales is strongly right-skewed: the median is only 85.053 USD while the mean is 246.49 USD, with a maximum of 22,638.48 USD. That means most orders are relatively small, with a smaller number of very large sales stretching the distribution to the right.
  • Profit is much more spread out and includes both gains and losses: the median is 9.24 USD, the mean is 28.61 USD, the minimum is -6,599.978 USD, and the maximum is 8,399.976 USD. So the distribution is not only skewed, but also has substantial negative values, which suggests a long left tail from loss-making orders and a long right tail from highly profitable ones.
  • Shipping Cost is also right-skewed: the median is 7.79 USD versus a mean of 26.38 USD, with values reaching 933.57 USD. Most shipments appear low-cost, with a smaller number of expensive shipments creating the long right tail.

So, the plotted distributions are consistent with a dataset where Sales and Shipping Cost are dominated by many low-to-moderate values plus a few extreme high values, while Profit is more volatile because it spans both negative and positive outcomes.

Browse All Analysis Examples