Exploratory Data Analysis

E-commerce Sales Analysis in Python

Explore an e-commerce sales dataset with monthly trends, top products, category breakdowns, and average order value analysis.

What

This AI Data Analyst workflow loads the Online Retail Excel dataset from a URL and summarizes its structure with shape, dtypes, and missing values. It generates exploratory visuals for monthly revenue trends and identifies the top 10 products by total revenue. It computes average order value and compares AOV across countries with a focused country-level breakdown.

Who

This is for analysts and data scientists who want a conversational, code-generating EDA template for e-commerce transaction data in Python. It helps users quickly validate data quality and produce standard sales KPIs and plots for reporting or further modeling.

Tools

pandas
numpy
matplotlib
seaborn
openpyxl

Outcomes

Dataset shape, dtypes, and null counts
Monthly revenue trend line chart
Top 10 products by total revenue bar chart
Average order value overall and by country
Bar chart of top countries by AOV

Quality Score

1/10

gpt-5.2mljar-ai

Last scored: Apr 7, 2026

Task Completion: 0/2
Needs work
The workflow only attempted the first step (loading data and showing info/shape) and did not produce results; none of the required EDA plots or AOV-by-country analysis were attempted.
Execution Correctness: 0/2
Needs work
The code fails at `pd.read_excel(url)` with an HTTPError due to an incorrect URL/path, so nothing downstream (info/shape) executes.
Output Quality: 0/3
Needs work
No valid outputs were generated (only an exception traceback), so expected outcomes like dtypes/null counts and the specified charts are missing.
Reasoning Quality: 1/2
Needs work
The plan to use pandas and `df.info()`/`df.shape` is reasonable, but it misses key robustness steps (verifying the provided URL, handling download errors) and does not address the remaining required analyses.
Reliability: 0/1
Needs work
The workflow is fragile because it uses an incorrect dataset URL and provides no fallback (e.g., corrected link, retry with the exact provided URL, or error handling).

load the retail sales dataset from https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/online-retail/Online-Retail.xlsx and show basic info and shape

Thinking

Considering table display

I’m thinking about how to handle multiple tables. They mentioned that if several are needed, I should display them sequentially using display(...). That sounds like a good plan. I also need to remember to import pandas for data handling. Keeping the code minimal is key! Let’s make sure everything flows smoothly while adhering to those guidelines. I’m excited to see how it all comes together!

Let’s start by loading the file and inspecting the dataset shape, columns, and first rows.

Load and inspect dataset

import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/online-retail/Online-Retail.xlsx"
df = pd.read_excel(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column_name": df.columns}))
df.head()

	rows	columns
0	541909	8

	column_name
0	InvoiceNo
1	StockCode
2	Description
3	Quantity
4	InvoiceDate
5	UnitPrice
6	CustomerID
7	Country

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom

The dataset is loaded and the basic structure is clear from the notebook output:

It has 541,909 rows and 8 columns.
The columns are:
- InvoiceNo
- StockCode
- Description
- Quantity
- InvoiceDate
- UnitPrice
- CustomerID
- Country

From the first few rows, this looks like transaction-level retail data where each row is one purchased item line. You can already see:

invoice identifiers in InvoiceNo
product identifiers in StockCode
product names in Description
item counts in Quantity
purchase timestamps in InvoiceDate
per-item prices in UnitPrice
customer identifiers in CustomerID
customer location in Country

The sample rows shown are all from United Kingdom, and InvoiceDate was parsed correctly as a datetime field.

plot monthly revenue trend

#10

Thinking

Planning monthly revenue plot

I need to respond with just one concise step. The user has asked for a plot of the monthly revenue trend, so I should compute the monthly revenue and generate the chart in one go. Revenue could be calculated as Quantity times UnitPrice, while potentially also factoring in returns. I’ll plot raw monthly net revenue, using InvoiceDate and the Revenue column but keeping it straightforward. For the code, I’ll ensure the final expression ends with either ax or plt.show.Creating monthly revenue plot

I should provide a single code block that ends with one final expression, likely just a plot. I can aggregate the data into a temporary variable called monthly_revenue. While I could use df.assign to create a revenue column, it’s better to avoid mutating the original DataFrame. Instead, I’ll create monthly_revenue using the assigned Revenue and Month, using dt.to_period('M').dt.to_timestamp(). I’ll include a brief explanation along with the code.

#11

Here’s a simple monthly net revenue trend using Quantity * UnitPrice. This includes returns/cancellations if they exist, so negative values will reduce revenue.

#12

Plot monthly revenue trend

import seaborn as sns
import matplotlib.pyplot as plt

monthly_revenue = (
    df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
      .groupby(df["InvoiceDate"].dt.to_period("M"))["Revenue"]
      .sum()
      .reset_index()
)

monthly_revenue["InvoiceDate"] = monthly_revenue["InvoiceDate"].dt.to_timestamp()

plt.figure(figsize=(12, 5))
ax = sns.lineplot(data=monthly_revenue, x="InvoiceDate", y="Revenue", marker="o")
ax.set_title("Monthly Revenue Trend")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
plt.xticks(rotation=45)
plt.tight_layout()
ax

#13

<Axes: title={'center': 'Monthly Revenue Trend'}, xlabel='Month', ylabel='Revenue'>

#14

#15

Thinking

Crafting concise code insights

I'm thinking the code should focus on displaying values clearly, perhaps rounding revenue to make it neat. I wonder if using copy in the presentation would enhance clarity. By crafting insight along with the code, I can keep everything concise and easy to understand. My goal is to provide the user with a clear and helpful output while ensuring that all pieces fit together smoothly. Let's make it effective and straightforward!

#16

The line chart rendered correctly, and the notebook state shows monthly_revenue already contains 13 monthly points, so the monthly aggregation worked. To make the trend interpretable in numbers as well, here are the values behind that plot:

#17

Sort monthly revenue

display(monthly_revenue.sort_values("InvoiceDate").reset_index(drop=True))

#18

	InvoiceDate	Revenue
0	2010-12-01	748957.020
1	2011-01-01	560000.260
2	2011-02-01	498062.650
3	2011-03-01	683267.080
4	2011-04-01	493207.121
5	2011-05-01	723333.510
6	2011-06-01	691123.120
7	2011-07-01	681300.111
8	2011-08-01	682680.510
9	2011-09-01	1019687.622
10	2011-10-01	1070704.670
11	2011-11-01	1461756.250
12	2011-12-01	433668.010

#19

The monthly revenue trend is clear from the plotted series and the month-level values behind it.

Revenue starts high in December 2010 at 748,957.02 USD, drops through January–April 2011 with a low point around April 2011 at 493,207.121 USD, then recovers through the middle of the year. The strongest growth happens in the fall:

September 2011: 1,019,687.622 USD
October 2011: 1,070,704.670 USD
November 2011: 1,461,756.250 USD — the peak month

After that, revenue falls sharply in December 2011 to 433,668.010 USD, which is much lower than November. Since the dataset only contains one month-level point for December 2011, that drop is often due to partial-month data rather than a true collapse in sales.

So the plotted monthly revenue trend shows:

an early decline after December 2010,
relatively steady mid-year revenue around 680k–723k USD,
a strong surge in September to November 2011,
and a final December dip that likely reflects incomplete coverage.

#20

show top 10 products by total revenue

#21

Thinking

Preparing to show top products by revenue

I need to answer the user who asked for the top 10 products by total revenue. I know I have the dataframe ready, so I’ll create a concise code block for that. First, I'll compute revenue as Quantity times UnitPrice. It's likely I should group by Description or a combination of StockCode and Description to avoid duplicates. However, I’ll need to decide whether to include returns or cancellations since they result in negative quantities. This is all about aggregating net revenue based on the request!Crafting code for top products

I want to use both Description and StockCode since duplicates can vary, keeping things simple. The developer recommends one table per step, so I’ll use display(top_products) to show the results. It seems better to end without an additional expression, focusing on just that display. I’ll aim for a minimal code block and create a concise explanation to go along with it. Keeping everything tidy and straightforward is key here!

#22

Let’s aggregate net revenue by product and display the top 10. I’m grouping by both StockCode and Description so similarly named items stay distinct if needed.

#23

Find top products by revenue

top_products = (
    df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
      .dropna(subset=["Description"])
      .groupby(["StockCode", "Description"], as_index=False)["Revenue"]
      .sum()
      .sort_values("Revenue", ascending=False)
      .head(10)
      .reset_index(drop=True)
)

display(top_products)

#24

	StockCode	Description	Revenue
0	DOT	DOTCOM POSTAGE	206245.48
1	22423	REGENCY CAKESTAND 3 TIER	164762.19
2	47566	PARTY BUNTING	98302.98
3	85123A	WHITE HANGING HEART T-LIGHT HOLDER	97715.99
4	85099B	JUMBO BAG RED RETROSPOT	92356.03
5	23084	RABBIT NIGHT LIGHT	66756.59
6	POST	POSTAGE	66230.64
7	22086	PAPER CHAIN KIT 50'S CHRISTMAS	63791.94
8	84879	ASSORTED COLOUR BIRD ORNAMENT	58959.73
9	79321	CHILLI LIGHTS	53768.06

#25

The top 10 products by total revenue are led by a mix of actual items and service-related charges.

DOTCOM POSTAGE (DOT) — 206,245.48 USD
REGENCY CAKESTAND 3 TIER (22423) — 164,762.19 USD
PARTY BUNTING (47566) — 98,302.98 USD
WHITE HANGING HEART T-LIGHT HOLDER (85123A) — 97,715.99 USD
JUMBO BAG RED RETROSPOT (85099B) — 92,356.03 USD
RABBIT NIGHT LIGHT (23084) — 66,756.59 USD
POSTAGE (POST) — 66,230.64 USD
PAPER CHAIN KIT 50'S CHRISTMAS (22086) — 63,791.94 USD
ASSORTED COLOUR BIRD ORNAMENT (84879) — 58,959.73 USD
CHILLI LIGHTS (79321) — 53,768.06 USD

A few useful takeaways from this ranking:

DOTCOM POSTAGE is the highest revenue entry by a wide margin, so shipping/service charges are materially contributing to revenue.
Among physical products, REGENCY CAKESTAND 3 TIER is the strongest performer.
Home décor, gift, and seasonal party items dominate the list, suggesting those categories are major revenue drivers in this dataset.

#26

what is the average order value and how does it vary by country?

#27

Thinking

Considering order calculations

I'm thinking about compiling a single table that shows overall averages, potentially including country averages and order totals. I might sort the display in descending order and include a plain language overall average order value. However, I can't know the output until I run the code, so I need to provide an explanation alongside a code block. I need to decide on whether to include canceled orders or exclude negative quantities, as AOV is typically based on completed orders.

I'll compute order value as invoice-level revenue and focus on positive order totals only. I need to create order_totals, overall_aov, and country_aov sorted, then stop after the code.Finalizing table displays

I'm considering the final output and whether to use multiple tables or just one. The developer mentioned that multiple tables can be displayed sequentially, but maybe one combined table would be more effective. I could create a country_aov table and include an overall row labeled "ALL" in the first row. This way, it can still be presented as one table, which seems like a nice solution! Let's go with that approach.

#28

I’d calculate order value at the invoice level, then average those invoice totals by country. To keep returns/cancellations from skewing AOV, I’m excluding invoices with non-positive total revenue.

#29

Summarize average order value by country

order_totals = (
    df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
      .groupby(["InvoiceNo", "Country"], as_index=False)["Revenue"]
      .sum()
)

order_totals = order_totals[order_totals["Revenue"] > 0].copy()

country_aov = (
    order_totals.groupby("Country", as_index=False)
      .agg(
          average_order_value=("Revenue", "mean"),
          orders=("InvoiceNo", "nunique")
      )
      .sort_values("average_order_value", ascending=False)
      .reset_index(drop=True)
)

overall_row = pd.DataFrame({
    "Country": ["Overall"],
    "average_order_value": [order_totals["Revenue"].mean()],
    "orders": [order_totals["InvoiceNo"].nunique()]
})

aov_summary = pd.concat([overall_row, country_aov], ignore_index=True)
display(aov_summary.head(15))

#30

	Country	average_order_value	orders
0	Overall	534.403033	19960
1	Singapore	3039.898571	7
2	Netherlands	3036.663191	94
3	Australia	2430.198421	57
4	Japan	1969.282632	19
5	Lebanon	1693.880000	1
6	Hong Kong	1426.527273	11
7	Brazil	1143.600000	1
8	Sweden	1066.064722	36
9	Switzerland	1057.220370	54
10	Denmark	1053.074444	18
11	Israel	1016.907500	8
12	Norway	1004.595556	36
13	RSA	1002.310000	1
14	EIRE	984.215139	288

#31

Thinking

Analyzing output data

I need to base the answer on the notebook output. I should include a final answer comment without any code. Overall AOV is about 534.40 USD across 19,960 orders, with notable variations by country. For instance, Singapore leads at 3,039.90 USD, followed closely by the Netherlands at 3,036.66 USD and Australia at 2,430.20 USD. However, countries like Lebanon and Brazil have unreliable single-order data. Also, the UK is not shown as it's not in the top 15. I need to evaluate if these outputs sufficiently address the user's question.Evaluating output sufficiency

Since the output only includes a head of 15, it might not be sufficient to fully address how AOV varies across all countries. I think it's still possible to mention insights from the shown countries, but I need to consider if my answer meets the original goal of providing a general overview. The top countries are represented, but excluding significant ones like the UK could be a problem. To fully answer the question, I should gather more data and include a minimal code block for clarity.Determining next steps

I need to take the next step that directly addresses the goal. Displaying the full country AOV along with the bottom 10 would be helpful since the user wants to understand variation by country. I have aov_summary in state, which could be useful. I think showing the overall AOV plus the top and bottom 10, while excluding countries with very few orders, could provide clarity. However, to truly understand variation, I should also add a note about the order count filter. Showing all countries sorted in descending order might be what's needed.

#32

The current result already gives the overall average order value and shows that AOV differs a lot across countries.

From the observed output:

The overall average order value is 534.40 USD across 19,960 orders.
Among the countries shown, AOV is much higher than the overall average in several markets:
- Singapore: 3039.90 USD across 7 orders
- Netherlands: 3036.66 USD across 94 orders
- Australia: 2430.20 USD across 57 orders
- Japan: 1969.28 USD across 19 orders
Some very high AOV values come from countries with only 1 order shown, such as Lebanon, Brazil, and RSA, so those are less stable than countries with larger order counts.
This suggests a strong country effect: some international markets place much larger orders on average, while the global average is pulled down by countries with smaller basket sizes.

To fully answer how AOV varies by country, the missing piece is the complete country-level table, not just the top 15.

#33

Display country aov

display(country_aov)

#34

	Country	average_order_value	orders
0	Singapore	3039.898571	7
1	Netherlands	3036.663191	94
2	Australia	2430.198421	57
3	Japan	1969.282632	19
4	Lebanon	1693.880000	1
5	Hong Kong	1426.527273	11
6	Brazil	1143.600000	1
7	Sweden	1066.064722	36
8	Switzerland	1057.220370	54
9	Denmark	1053.074444	18
10	Israel	1016.907500	8
11	Norway	1004.595556	36
12	RSA	1002.310000	1
13	EIRE	984.215139	288
14	Greece	952.104000	5
15	Cyprus	849.398750	16
16	Channel Islands	786.555385	26
17	USA	716.078000	5
18	Spain	684.190111	90
19	United Arab Emirates	634.093333	3
20	Iceland	615.714286	7
21	Canada	611.063333	6
22	Austria	599.922353	17
23	Portugal	581.846552	58
24	Finland	549.904390	41
25	Malta	545.118000	5
26	France	534.987526	392
27	United Kingdom	500.872528	18019
28	Germany	500.803370	457
29	Italy	460.085263	38
30	Belgium	420.370816	98
31	Lithuania	415.265000	4
32	Czech Republic	413.370000	2
33	Poland	386.034211	19
34	Unspecified	365.368462	13
35	European Community	325.062500	4
36	Bahrain	251.380000	3
37	Saudi Arabia	145.920000	1

#35

The average order value is 534.40 USD overall, based on 19,960 orders.

It varies a lot by country. The highest average order values in the observed output are:

Singapore: 3039.90 USD across 7 orders
Netherlands: 3036.66 USD across 94 orders
Australia: 2430.20 USD across 57 orders
Japan: 1969.28 USD across 19 orders
Hong Kong: 1426.53 USD across 11 orders
Sweden: 1066.06 USD across 36 orders
Switzerland: 1057.22 USD across 54 orders
Denmark: 1053.07 USD across 18 orders
EIRE: 984.22 USD across 288 orders

A few countries also show very high AOV with only 1 order:

Lebanon: 1693.88 USD
Brazil: 1143.60 USD
RSA: 1002.31 USD

Those single-order countries should be treated cautiously because one order can make the average look unusually high.

The strongest takeaway is that AOV is much higher in several international markets than the overall average, especially Singapore, the Netherlands, and Australia. Among countries with a more meaningful number of orders, the Netherlands, Australia, Japan, Switzerland, Sweden, and EIRE stand out as having relatively high order values.

E-commerce Sales Analysis in Python

What

Who

Tools

Outcomes

Quality Score

Browse All Analysis Examples

Machine Learning

Time Series

Data Analysis

Exploratory Data Analysis

Finance

Nlp

Statistics