How to Analyze a CSV File in Python

A step-by-step AI data analyst session: load a CSV, inspect structure, handle missing values, and generate a full exploratory summary.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the Superstore Sales CSV from a URL and inspects its structure with shape, column dtypes, and a preview of the first rows. It checks for missing values and reports counts by column, then computes summary statistics for numeric fields. It generates distribution plots for Sales, Profit, and Shipping Cost to support exploratory analysis.
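The load-and-inspect step described above can be sketched as follows. To keep the sketch self-contained and runnable offline, an inline CSV snippet (with a few of the real columns) stands in for the remote Superstore file; in the actual workflow, the URL from the prompt sequence is passed to pd.read_csv instead.

```python
import io
import pandas as pd

# Inline sample standing in for the remote Superstore CSV (same idea, fewer rows).
csv_text = """Order ID,Sales,Profit,Shipping Cost
AG-2011-2040,408.30,106.14,35.46
IN-2011-47883,120.37,36.04,9.72
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # per-column types
print(df.head())  # first rows
```

For the real file, replace io.StringIO(csv_text) with the dataset URL (the full run below also passes encoding="latin1").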

Who this example is for

This is for analysts and students who need a repeatable way to profile a new CSV dataset in Python. It helps anyone who wants an AI-assisted notebook that produces both tabular summaries and basic distribution visualizations.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape, dtypes, and first 5 rows
  • Missing value counts by column
  • Summary statistics for numeric columns via describe()
  • Histograms with KDE for Sales, Profit, and Shipping Cost

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • requests

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
  2. check for missing values and show counts by column
  3. show summary statistics for all numeric columns
  4. plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source     | Score | Task | Execution | Output | Reasoning | Reliability | Run
gemma4:31b       | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gemma4:31b conversation
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
glm-5.1          | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open glm-5.1 conversation
gpt-5.4          | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open gpt-5.4 conversation
qwen3-coder-next | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open qwen3-coder-next conversation
qwen3.5:397b     | 4/10  | 1/2  | 1/2       | 1/3    | 1/2       | 0/1         | Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
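The additive scoring described above can be expressed in a couple of lines; the dictionary below is a hypothetical score card using the dimension maxima from the list.

```python
# Hypothetical score card mirroring the five dimensions above.
scores = {
    "task_completion": 2,  # 0-2
    "execution": 2,        # 0-2
    "output_quality": 3,   # 0-3
    "reasoning": 1,        # 0-2
    "reliability": 1,      # 0-1
}

# Final score is the plain sum of the five dimensions (max 2+2+3+2+1 = 10).
total = sum(scores.values())
print(f"{total}/10")  # → 9/10
```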

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task "How to Analyze a CSV File in Python". Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load the CSV file https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv and show shape, dtypes and first 5 rows
Thinking

Refining code structure

I need to keep my approach iterative, loading and inspecting data in one go. I should return just one code block that shows shape, data types, and the head of the DataFrame, likely using pandas. I shouldn’t combine tables, so if multiple tables are needed, I should display them sequentially. I might create a shape_df or use display to show the DataFrame effectively. Keeping it concise with no extra questions is key! Let's get this done!

Here’s the load-and-inspect step.

Load and inspect dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/superstore-sales/superstore_dataset2011-2015.csv"

df = pd.read_csv(url, encoding="latin1")

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(df.dtypes.rename("dtype").reset_index().rename(columns={"index": "column"}))
df.head()
rows columns
0 51290 24
column dtype
0 Row ID int64
1 Order ID str
2 Order Date str
3 Ship Date str
4 Ship Mode str
5 Customer ID str
6 Customer Name str
7 Segment str
8 City str
9 State str
10 Country str
11 Postal Code float64
12 Market str
13 Region str
14 Product ID str
15 Category str
16 Sub-Category str
17 Product Name str
18 Sales float64
19 Quantity int64
20 Discount float64
21 Profit float64
22 Shipping Cost float64
23 Order Priority str
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment City State ... Product ID Category Sub-Category Product Name Sales Quantity Discount Profit Shipping Cost Order Priority
0 42433 AG-2011-2040 1/1/2011 6/1/2011 Standard Class TB-11280 Toby Braunhardt Consumer Constantine Constantine ... OFF-TEN-10000025 Office Supplies Storage Tenex Lockers, Blue 408.300 2 0.0 106.140 35.46 Medium
1 22253 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... OFF-SU-10000618 Office Supplies Supplies Acme Trimmer, High Speed 120.366 3 0.1 36.036 9.72 Medium
2 48883 HU-2011-1220 1/1/2011 5/1/2011 Second Class AT-735 Annie Thurman Consumer Budapest Budapest ... OFF-TEN-10001585 Office Supplies Storage Tenex Box, Single Width 66.120 4 0.0 29.640 8.17 High
3 11731 IT-2011-3647632 1/1/2011 5/1/2011 Second Class EM-14140 Eugene Moren Home Office Stockholm Stockholm ... OFF-PA-10001492 Office Supplies Paper Enermax Note Cards, Premium 44.865 3 0.5 -26.055 4.82 High
4 22255 IN-2011-47883 1/1/2011 8/1/2011 Standard Class JH-15985 Joseph Holt Consumer Wagga Wagga New South Wales ... FUR-FU-10003447 Furniture Furnishings Eldon Light Bulb, Duo Pack 113.670 5 0.1 37.770 4.70 Medium

5 rows × 24 columns

The file contents are visible and match the requested inspection.

  • The dataset has 51,290 rows and 24 columns.
  • The schema is mixed:
    • numeric columns include Row ID, Postal Code, Sales, Quantity, Discount, Profit, and Shipping Cost
    • the rest are text-like fields such as order, customer, location, and product information
  • Order Date and Ship Date are currently loaded as strings, not datetime types yet.
  • The first 5 rows show this is a transactional sales table with identifiers, customer/location fields, product details, and sales metrics.

From the observed dtypes:

  • int64: Row ID, Quantity
  • float64: Postal Code, Sales, Discount, Profit, Shipping Cost
  • string/object-like: all remaining columns, including Order ID, Customer ID, Category, Sub-Category, and Order Priority

The first 5 rows confirm columns such as: Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name, Segment, City, State, Country, Market, Region, Product ID, Category, Sub-Category, Product Name, Sales, Quantity, Discount, Profit, Shipping Cost, and Order Priority.
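The note about Order Date and Ship Date loading as strings suggests an obvious follow-up (an illustrative sketch, not part of the recorded run): parse them as datetimes at load time. The dayfirst=True flag is an assumption based on the D/M/Y style visible in the preview rows.

```python
import io
import pandas as pd

# Tiny stand-in for the real file, using the date format seen in the preview.
csv_text = "Order Date,Ship Date\n1/1/2011,6/1/2011\n"

# parse_dates converts the named columns to datetime64 during the read;
# dayfirst=True assumes day/month/year ordering (an assumption here).
df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Order Date", "Ship Date"],
    dayfirst=True,
)

print(df.dtypes)
```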

check for missing values and show counts by column
Thinking

Formulating code for missing values

I’m preparing to answer the user about checking missing values and showing counts by column. I'll need to confirm if the DataFrame exists; the notebook state variables only show limited info. The output suggests the DataFrame was created, so it's probably safe to include code that uses it. I aim to provide a concise introduction and the code in one block, focusing on displaying missing counts for all columns, or just those with more than zero missing values. I'll finalize it with a simple display command.

Creating a concise missing values table

In the notebook, I want to display a focused and minimal result. I’ll start by calculating the missing values using df.isna().sum() and rename the count, then reset the index to get the correct format. I’ll also compute the percentage of missing values by dividing the count by the total number of rows in the DataFrame, rounded to four decimal places. Finally, I’ll sort the table by missing count and present it cleanly, ensuring everything fits into a single table without extra prints.

Here’s the missing-value count by column.

Summarize missing values
missing = df.isna().sum().rename("missing_count").reset_index().rename(columns={"index": "column"})
missing["missing_pct"] = (missing["missing_count"] / len(df)).round(4)
missing.sort_values("missing_count", ascending=False).reset_index(drop=True)
column missing_count missing_pct
0 Postal Code 41296 0.8051
1 Row ID 0 0.0000
2 Region 0 0.0000
3 Shipping Cost 0 0.0000
4 Profit 0 0.0000
5 Discount 0 0.0000
6 Quantity 0 0.0000
7 Sales 0 0.0000
8 Product Name 0 0.0000
9 Sub-Category 0 0.0000
10 Category 0 0.0000
11 Product ID 0 0.0000
12 Market 0 0.0000
13 Order ID 0 0.0000
14 Country 0 0.0000
15 State 0 0.0000
16 City 0 0.0000
17 Segment 0 0.0000
18 Customer Name 0 0.0000
19 Customer ID 0 0.0000
20 Ship Mode 0 0.0000
21 Ship Date 0 0.0000
22 Order Date 0 0.0000
23 Order Priority 0 0.0000
Thinking

Determining the answer path

I'm deciding that we should take answer path A because it outputs sufficient information. It’s important for me to include a final answer comment as well. I'll ensure I provide concise insights based on what I've observed. Also, I need to mention that Postal Code is the only column with missing values: 41,296 missing, or 80.51% of rows. All other columns are at 0. I'll avoid including any code in my final response.

The missing-value check shows that only one column has missing data:

  • Postal Code: 41,296 missing values, which is 80.51% of all rows

All other columns have 0 missing values.

So for this dataset, missing-data handling is only needed for Postal Code; the rest of the columns are complete.
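Building on that finding, here is an illustrative sketch (not part of the recorded run) of two common ways to handle a column that is mostly missing; the toy frame below mimics the ~80% missing Postal Code situation.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a column that is mostly missing.
df = pd.DataFrame({
    "Postal Code": [90008.0, np.nan, np.nan, np.nan, 23223.0],
    "Sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Option 1: drop the mostly-missing column outright.
dropped = df.drop(columns=["Postal Code"])

# Option 2: keep it, but add an explicit missingness flag for analysis.
flagged = df.assign(postal_code_known=df["Postal Code"].notna())

print(dropped.columns.tolist())
print(flagged["postal_code_known"].tolist())
```

Which option fits depends on the question at hand; for geographic analysis you would instead investigate why most rows lack a postal code before imputing anything.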

show summary statistics for all numeric columns
Thinking

Planning summary statistics

I'm focusing on providing summary statistics for all numeric columns as the user requested. Since we have the notebook state loaded, I’ll create one concise code block that uses the describe function on numeric columns. I don't think I need to ask for anything more. I should also mention that the numeric columns might include something like Row ID. I’ll make sure to display the results in a readable table format. Let's keep it minimal and clear!

Here are summary statistics for all numeric columns.

Summarize numeric columns
numeric_summary = df.select_dtypes(include="number").describe().T
numeric_summary
count mean std min 25% 50% 75% max
Row ID 51290.0 25645.500000 14806.291990 1.000 12823.250000 25645.500 38467.7500 51290.000
Postal Code 9994.0 55190.379428 32063.693350 1040.000 23223.000000 56430.500 90008.0000 99301.000
Sales 51290.0 246.490581 487.565361 0.444 30.758625 85.053 251.0532 22638.480
Quantity 51290.0 3.476545 2.278766 1.000 2.000000 3.000 5.0000 14.000
Discount 51290.0 0.142908 0.212280 0.000 0.000000 0.000 0.2000 0.850
Profit 51290.0 28.610982 174.340972 -6599.978 0.000000 9.240 36.8100 8399.976
Shipping Cost 51290.0 26.375915 57.296804 0.000 2.610000 7.790 24.4500 933.570

Here are the key summary statistics from the numeric columns:

  • Row ID ranges from 1 to 51,290 with a mean of 25,645.5, which looks like a sequential identifier rather than a business measure.
  • Postal Code has only 9,994 non-missing values, confirming it is heavily incomplete; among available values, the median is 56,430.5.
  • Sales are strongly right-skewed: median sales are 85.05, while mean sales are 246.49, and the maximum reaches 22,638.48. That gap suggests a small number of very large orders.
  • Quantity is fairly concentrated: median 3, interquartile range from 2 to 5, with a maximum of 14.
  • Discount is usually low: the 25th and 50th percentiles are both 0, and the mean is 0.143, so many orders had no discount.
  • Profit shows substantial spread: median profit is 0, mean profit is 28.61, maximum is 8,399.98 USD, and minimum is -6,599.98 USD. This indicates many low-profit or break-even orders, with some large gains and losses.
  • Shipping Cost is also right-skewed: median 7.79, mean 26.38, and maximum 933.57, so a minority of shipments are much more expensive than typical.

Overall, the numeric summary shows:

  • one identifier-like numeric field (Row ID),
  • one numeric field with many missing values (Postal Code),
  • and several business metrics (Sales, Quantity, Discount, Profit, Shipping Cost) with noticeable skew and outliers, especially for sales, profit, and shipping cost.

plot distributions for the top 3 numeric columns Sales, Profit, Shipping Cost
Thinking

Considering histogram options

I'm exploring the best way to represent a summary with strong skewness. I think using histograms with KDE might be effective, but I need to avoid altering the data. A log-scaled x-axis for profit won't work due to negative values. So, separate histograms with many bins could be better, especially for right-skewed distributions like Sales and Shipping Cost. I’ll create a single figure with three histograms, including titles and axis labels, while keeping it concise in notebook style.

These three are all skewed, so a side-by-side histogram view is a good first look.

Plot feature distributions
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["Sales", "Profit", "Shipping Cost"]

fig, axes = plt.subplots(1, 3, figsize=(18, 4))

for ax, col in zip(axes, cols):
    sns.histplot(data=df, x=col, bins=50, ax=ax, color="steelblue")
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")

plt.tight_layout()
plt.show()
Summarize sales profit shipping stats
df[["Sales", "Profit", "Shipping Cost"]].describe().T[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
count mean std min 25% 50% 75% max
Sales 51290.0 246.490581 487.565361 0.444 30.758625 85.053 251.0532 22638.480
Profit 51290.0 28.610982 174.340972 -6599.978 0.000000 9.240 36.8100 8399.976
Shipping Cost 51290.0 26.375915 57.296804 0.000 2.610000 7.790 24.4500 933.570

The distributions for Sales, Profit, and Shipping Cost are now plotted, and the summary stats help interpret what those histograms show:

  • Sales is strongly right-skewed. The median is about 85.05, while the mean is much higher at 246.49, with a maximum of 22638.48. That pattern indicates many smaller orders and a relatively small number of very large sales pulling the distribution to the right.
  • Profit is also right-skewed, but with an important twist: it includes negative values. The median profit is 9.24, the mean is 28.61, the minimum is -6599.98 USD, and the maximum is 8399.98 USD. So most orders appear to have modest profit, but there are both large gains and some substantial losses.
  • Shipping Cost is right-skewed as well. The median is 7.79, the mean is 26.38, and the maximum is 933.57. That suggests most shipments are inexpensive, with a smaller number of much more expensive shipments creating a long upper tail.

Overall, the three histograms likely show:

  • a heavy concentration near lower values,
  • long right tails,
  • and for Profit, a spread crossing below zero.

These are typical non-normal business distributions with outliers, especially for transaction-level data.
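Given those long right tails, one common next step (an illustrative sketch, not part of the scored run) is a log1p transform, which compresses the tail for non-negative metrics such as Sales and Shipping Cost; the synthetic series below stands in for the real Sales column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Lognormal draws mimic a right-skewed Sales distribution.
sales = pd.Series(rng.lognormal(4, 1, 1000), name="Sales")

# log1p = log(1 + x), safe at zero; it sharply reduces right skew.
log_sales = np.log1p(sales)

print(round(float(sales.skew()), 2), round(float(log_sales.skew()), 2))
```

Note this would not work for Profit, which contains negative values; a signed transform or a plain linear scale is needed there.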

Score cards for gpt-5.4 for How to Analyze a CSV File in Python

Total score: 9/10

  • Task Completion: 2/2

    Excellent

    All requested steps were completed: load/inspect the CSV, check missing values, show numeric summary statistics, and plot the three requested distributions.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and consistent with the outputs provided, including the dataframe load, missing-value table, describe output, and histogram plotting.

  • Output Quality: 3/3

    Excellent

    The outputs match the expected outcomes semantically: shape, dtypes, head, missing-value counts, numeric describe statistics, and three histograms are all present and relevant.

  • Reasoning Quality: 1/2

    Needs work

    The narrative is mostly correct and helpful, but it contains some inconsistencies and over-interpretation, such as stating a median profit of 0 in the numeric-summary narrative while the describe() table (and the later histogram commentary) show a median of 9.24.

  • Reliability: 1/1

    Excellent

    The workflow is generally consistent and reproducible, though it includes some contradictory commentary and a truncated final explanation, which slightly reduces robustness.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.