Exploratory Data Analysis
E-commerce Sales Analysis in Python
Explore an e-commerce sales dataset with monthly trends, top products, category breakdowns, and average order value analysis.
What
This AI Data Analyst workflow loads the Online Retail Excel dataset from a URL and summarizes its structure with shape, dtypes, and missing values. It generates exploratory visuals for monthly revenue trends and identifies the top 10 products by total revenue. It computes average order value and compares AOV across countries with a focused country-level breakdown.
Who
This is for analysts and data scientists who want a conversational, code-generating EDA template for e-commerce transaction data in Python. It helps users quickly validate data quality and produce standard sales KPIs and plots for reporting or further modeling.
Tools
- pandas
- numpy
- matplotlib
- seaborn
- openpyxl
Outcomes
- Dataset shape, dtypes, and null counts
- Monthly revenue trend line chart
- Top 10 products by total revenue bar chart
- Average order value overall and by country
- Bar chart of top countries by AOV
Quality Score
1/10
Last scored: Apr 7, 2026
Task Completion: 0/2
Needs workThe workflow only attempted the first step (loading data and showing info/shape) and did not produce results; none of the required EDA plots or AOV-by-country analysis were attempted.
Execution Correctness: 0/2
Needs workThe code fails at `pd.read_excel(url)` with an HTTPError due to an incorrect URL/path, so nothing downstream (info/shape) executes.
Output Quality: 0/3
Needs workNo valid outputs were generated (only an exception traceback), so expected outcomes like dtypes/null counts and the specified charts are missing.
Reasoning Quality: 1/2
Needs workThe plan to use pandas and `df.info()`/`df.shape` is reasonable, but it misses key robustness steps (verifying the provided URL, handling download errors) and does not address the remaining required analyses.
Reliability: 0/1
Needs workThe workflow is fragile because it uses an incorrect dataset URL and provides no fallback (e.g., corrected link, retry with the exact provided URL, or error handling).
Thinking
Considering table display
I’m thinking about how to handle multiple tables. They mentioned that if several are needed, I should display them sequentially using display(...). That sounds like a good plan. I also need to remember to import pandas for data handling. Keeping the code minimal is key! Let’s make sure everything flows smoothly while adhering to those guidelines. I’m excited to see how it all comes together!
Let’s start by loading the file and inspecting the dataset shape, columns, and first rows.
Load and inspect dataset
import pandas as pd
from IPython.display import display
url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/online-retail/Online-Retail.xlsx"
df = pd.read_excel(url)
display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column_name": df.columns}))
df.head()| rows | columns | |
|---|---|---|
| 0 | 541909 | 8 |
| column_name | |
|---|---|
| 0 | InvoiceNo |
| 1 | StockCode |
| 2 | Description |
| 3 | Quantity |
| 4 | InvoiceDate |
| 5 | UnitPrice |
| 6 | CustomerID |
| 7 | Country |
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
The dataset is loaded and the basic structure is clear from the notebook output:
- It has 541,909 rows and 8 columns.
- The columns are:
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountry
From the first few rows, this looks like transaction-level retail data where each row is one purchased item line. You can already see:
- invoice identifiers in
InvoiceNo - product identifiers in
StockCode - product names in
Description - item counts in
Quantity - purchase timestamps in
InvoiceDate - per-item prices in
UnitPrice - customer identifiers in
CustomerID - customer location in
Country
The sample rows shown are all from United Kingdom, and InvoiceDate was parsed correctly as a datetime field.
Thinking
Planning monthly revenue plot
I need to respond with just one concise step. The user has asked for a plot of the monthly revenue trend, so I should compute the monthly revenue and generate the chart in one go. Revenue could be calculated as Quantity times UnitPrice, while potentially also factoring in returns. I’ll plot raw monthly net revenue, using InvoiceDate and the Revenue column but keeping it straightforward. For the code, I’ll ensure the final expression ends with either ax or plt.show.Creating monthly revenue plot
I should provide a single code block that ends with one final expression, likely just a plot. I can aggregate the data into a temporary variable called monthly_revenue. While I could use df.assign to create a revenue column, it’s better to avoid mutating the original DataFrame. Instead, I’ll create monthly_revenue using the assigned Revenue and Month, using dt.to_period('M').dt.to_timestamp(). I’ll include a brief explanation along with the code.
Here’s a simple monthly net revenue trend using Quantity * UnitPrice. This includes returns/cancellations if they exist, so negative values will reduce revenue.
Plot monthly revenue trend
import seaborn as sns
import matplotlib.pyplot as plt
monthly_revenue = (
df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
.groupby(df["InvoiceDate"].dt.to_period("M"))["Revenue"]
.sum()
.reset_index()
)
monthly_revenue["InvoiceDate"] = monthly_revenue["InvoiceDate"].dt.to_timestamp()
plt.figure(figsize=(12, 5))
ax = sns.lineplot(data=monthly_revenue, x="InvoiceDate", y="Revenue", marker="o")
ax.set_title("Monthly Revenue Trend")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
plt.xticks(rotation=45)
plt.tight_layout()
ax<Axes: title={'center': 'Monthly Revenue Trend'}, xlabel='Month', ylabel='Revenue'>Thinking
Crafting concise code insights
I'm thinking the code should focus on displaying values clearly, perhaps rounding revenue to make it neat. I wonder if using copy in the presentation would enhance clarity. By crafting insight along with the code, I can keep everything concise and easy to understand. My goal is to provide the user with a clear and helpful output while ensuring that all pieces fit together smoothly. Let's make it effective and straightforward!
The line chart rendered correctly, and the notebook state shows monthly_revenue already contains 13 monthly points, so the monthly aggregation worked. To make the trend interpretable in numbers as well, here are the values behind that plot:
Sort monthly revenue
display(monthly_revenue.sort_values("InvoiceDate").reset_index(drop=True))| InvoiceDate | Revenue | |
|---|---|---|
| 0 | 2010-12-01 | 748957.020 |
| 1 | 2011-01-01 | 560000.260 |
| 2 | 2011-02-01 | 498062.650 |
| 3 | 2011-03-01 | 683267.080 |
| 4 | 2011-04-01 | 493207.121 |
| 5 | 2011-05-01 | 723333.510 |
| 6 | 2011-06-01 | 691123.120 |
| 7 | 2011-07-01 | 681300.111 |
| 8 | 2011-08-01 | 682680.510 |
| 9 | 2011-09-01 | 1019687.622 |
| 10 | 2011-10-01 | 1070704.670 |
| 11 | 2011-11-01 | 1461756.250 |
| 12 | 2011-12-01 | 433668.010 |
The monthly revenue trend is clear from the plotted series and the month-level values behind it.
Revenue starts high in December 2010 at 748,957.02 USD, drops through January–April 2011 with a low point around April 2011 at 493,207.121 USD, then recovers through the middle of the year. The strongest growth happens in the fall:
- September 2011: 1,019,687.622 USD
- October 2011: 1,070,704.670 USD
- November 2011: 1,461,756.250 USD — the peak month
After that, revenue falls sharply in December 2011 to 433,668.010 USD, which is much lower than November. Since the dataset only contains one month-level point for December 2011, that drop is often due to partial-month data rather than a true collapse in sales.
So the plotted monthly revenue trend shows:
- an early decline after December 2010,
- relatively steady mid-year revenue around 680k–723k USD,
- a strong surge in September to November 2011,
- and a final December dip that likely reflects incomplete coverage.
Thinking
Preparing to show top products by revenue
I need to answer the user who asked for the top 10 products by total revenue. I know I have the dataframe ready, so I’ll create a concise code block for that. First, I'll compute revenue as Quantity times UnitPrice. It's likely I should group by Description or a combination of StockCode and Description to avoid duplicates. However, I’ll need to decide whether to include returns or cancellations since they result in negative quantities. This is all about aggregating net revenue based on the request!Crafting code for top products
I want to use both Description and StockCode since duplicates can vary, keeping things simple. The developer recommends one table per step, so I’ll use display(top_products) to show the results. It seems better to end without an additional expression, focusing on just that display. I’ll aim for a minimal code block and create a concise explanation to go along with it. Keeping everything tidy and straightforward is key here!
Let’s aggregate net revenue by product and display the top 10. I’m grouping by both StockCode and Description so similarly named items stay distinct if needed.
Find top products by revenue
top_products = (
df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
.dropna(subset=["Description"])
.groupby(["StockCode", "Description"], as_index=False)["Revenue"]
.sum()
.sort_values("Revenue", ascending=False)
.head(10)
.reset_index(drop=True)
)
display(top_products)| StockCode | Description | Revenue | |
|---|---|---|---|
| 0 | DOT | DOTCOM POSTAGE | 206245.48 |
| 1 | 22423 | REGENCY CAKESTAND 3 TIER | 164762.19 |
| 2 | 47566 | PARTY BUNTING | 98302.98 |
| 3 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 97715.99 |
| 4 | 85099B | JUMBO BAG RED RETROSPOT | 92356.03 |
| 5 | 23084 | RABBIT NIGHT LIGHT | 66756.59 |
| 6 | POST | POSTAGE | 66230.64 |
| 7 | 22086 | PAPER CHAIN KIT 50'S CHRISTMAS | 63791.94 |
| 8 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 58959.73 |
| 9 | 79321 | CHILLI LIGHTS | 53768.06 |
The top 10 products by total revenue are led by a mix of actual items and service-related charges.
- DOTCOM POSTAGE (
DOT) — 206,245.48 USD - REGENCY CAKESTAND 3 TIER (
22423) — 164,762.19 USD - PARTY BUNTING (
47566) — 98,302.98 USD - WHITE HANGING HEART T-LIGHT HOLDER (
85123A) — 97,715.99 USD - JUMBO BAG RED RETROSPOT (
85099B) — 92,356.03 USD - RABBIT NIGHT LIGHT (
23084) — 66,756.59 USD - POSTAGE (
POST) — 66,230.64 USD - PAPER CHAIN KIT 50'S CHRISTMAS (
22086) — 63,791.94 USD - ASSORTED COLOUR BIRD ORNAMENT (
84879) — 58,959.73 USD - CHILLI LIGHTS (
79321) — 53,768.06 USD
A few useful takeaways from this ranking:
- DOTCOM POSTAGE is the highest revenue entry by a wide margin, so shipping/service charges are materially contributing to revenue.
- Among physical products, REGENCY CAKESTAND 3 TIER is the strongest performer.
- Home décor, gift, and seasonal party items dominate the list, suggesting those categories are major revenue drivers in this dataset.
Thinking
Considering order calculations
I'm thinking about compiling a single table that shows overall averages, potentially including country averages and order totals. I might sort the display in descending order and include a plain language overall average order value. However, I can't know the output until I run the code, so I need to provide an explanation alongside a code block. I need to decide on whether to include canceled orders or exclude negative quantities, as AOV is typically based on completed orders.
I'll compute order value as invoice-level revenue and focus on positive order totals only. I need to create order_totals, overall_aov, and country_aov sorted, then stop after the code.Finalizing table displays
I'm considering the final output and whether to use multiple tables or just one. The developer mentioned that multiple tables can be displayed sequentially, but maybe one combined table would be more effective. I could create a country_aov table and include an overall row labeled "ALL" in the first row. This way, it can still be presented as one table, which seems like a nice solution! Let's go with that approach.
I’d calculate order value at the invoice level, then average those invoice totals by country. To keep returns/cancellations from skewing AOV, I’m excluding invoices with non-positive total revenue.
Summarize average order value by country
order_totals = (
df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
.groupby(["InvoiceNo", "Country"], as_index=False)["Revenue"]
.sum()
)
order_totals = order_totals[order_totals["Revenue"] > 0].copy()
country_aov = (
order_totals.groupby("Country", as_index=False)
.agg(
average_order_value=("Revenue", "mean"),
orders=("InvoiceNo", "nunique")
)
.sort_values("average_order_value", ascending=False)
.reset_index(drop=True)
)
overall_row = pd.DataFrame({
"Country": ["Overall"],
"average_order_value": [order_totals["Revenue"].mean()],
"orders": [order_totals["InvoiceNo"].nunique()]
})
aov_summary = pd.concat([overall_row, country_aov], ignore_index=True)
display(aov_summary.head(15))| Country | average_order_value | orders | |
|---|---|---|---|
| 0 | Overall | 534.403033 | 19960 |
| 1 | Singapore | 3039.898571 | 7 |
| 2 | Netherlands | 3036.663191 | 94 |
| 3 | Australia | 2430.198421 | 57 |
| 4 | Japan | 1969.282632 | 19 |
| 5 | Lebanon | 1693.880000 | 1 |
| 6 | Hong Kong | 1426.527273 | 11 |
| 7 | Brazil | 1143.600000 | 1 |
| 8 | Sweden | 1066.064722 | 36 |
| 9 | Switzerland | 1057.220370 | 54 |
| 10 | Denmark | 1053.074444 | 18 |
| 11 | Israel | 1016.907500 | 8 |
| 12 | Norway | 1004.595556 | 36 |
| 13 | RSA | 1002.310000 | 1 |
| 14 | EIRE | 984.215139 | 288 |
Thinking
Analyzing output data
I need to base the answer on the notebook output. I should include a final answer comment without any code. Overall AOV is about 534.40 USD across 19,960 orders, with notable variations by country. For instance, Singapore leads at 3,039.90 USD, followed closely by the Netherlands at 3,036.66 USD and Australia at 2,430.20 USD. However, countries like Lebanon and Brazil have unreliable single-order data. Also, the UK is not shown as it's not in the top 15. I need to evaluate if these outputs sufficiently address the user's question.Evaluating output sufficiency
Since the output only includes a head of 15, it might not be sufficient to fully address how AOV varies across all countries. I think it's still possible to mention insights from the shown countries, but I need to consider if my answer meets the original goal of providing a general overview. The top countries are represented, but excluding significant ones like the UK could be a problem. To fully answer the question, I should gather more data and include a minimal code block for clarity.Determining next steps
I need to take the next step that directly addresses the goal. Displaying the full country AOV along with the bottom 10 would be helpful since the user wants to understand variation by country. I have aov_summary in state, which could be useful. I think showing the overall AOV plus the top and bottom 10, while excluding countries with very few orders, could provide clarity. However, to truly understand variation, I should also add a note about the order count filter. Showing all countries sorted in descending order might be what's needed.
The current result already gives the overall average order value and shows that AOV differs a lot across countries.
From the observed output:
- The overall average order value is 534.40 USD across 19,960 orders.
- Among the countries shown, AOV is much higher than the overall average in several markets:
- Singapore: 3039.90 USD across 7 orders
- Netherlands: 3036.66 USD across 94 orders
- Australia: 2430.20 USD across 57 orders
- Japan: 1969.28 USD across 19 orders
- Some very high AOV values come from countries with only 1 order shown, such as Lebanon, Brazil, and RSA, so those are less stable than countries with larger order counts.
- This suggests a strong country effect: some international markets place much larger orders on average, while the global average is pulled down by countries with smaller basket sizes.
To fully answer how AOV varies by country, the missing piece is the complete country-level table, not just the top 15.
Display country aov
display(country_aov)| Country | average_order_value | orders | |
|---|---|---|---|
| 0 | Singapore | 3039.898571 | 7 |
| 1 | Netherlands | 3036.663191 | 94 |
| 2 | Australia | 2430.198421 | 57 |
| 3 | Japan | 1969.282632 | 19 |
| 4 | Lebanon | 1693.880000 | 1 |
| 5 | Hong Kong | 1426.527273 | 11 |
| 6 | Brazil | 1143.600000 | 1 |
| 7 | Sweden | 1066.064722 | 36 |
| 8 | Switzerland | 1057.220370 | 54 |
| 9 | Denmark | 1053.074444 | 18 |
| 10 | Israel | 1016.907500 | 8 |
| 11 | Norway | 1004.595556 | 36 |
| 12 | RSA | 1002.310000 | 1 |
| 13 | EIRE | 984.215139 | 288 |
| 14 | Greece | 952.104000 | 5 |
| 15 | Cyprus | 849.398750 | 16 |
| 16 | Channel Islands | 786.555385 | 26 |
| 17 | USA | 716.078000 | 5 |
| 18 | Spain | 684.190111 | 90 |
| 19 | United Arab Emirates | 634.093333 | 3 |
| 20 | Iceland | 615.714286 | 7 |
| 21 | Canada | 611.063333 | 6 |
| 22 | Austria | 599.922353 | 17 |
| 23 | Portugal | 581.846552 | 58 |
| 24 | Finland | 549.904390 | 41 |
| 25 | Malta | 545.118000 | 5 |
| 26 | France | 534.987526 | 392 |
| 27 | United Kingdom | 500.872528 | 18019 |
| 28 | Germany | 500.803370 | 457 |
| 29 | Italy | 460.085263 | 38 |
| 30 | Belgium | 420.370816 | 98 |
| 31 | Lithuania | 415.265000 | 4 |
| 32 | Czech Republic | 413.370000 | 2 |
| 33 | Poland | 386.034211 | 19 |
| 34 | Unspecified | 365.368462 | 13 |
| 35 | European Community | 325.062500 | 4 |
| 36 | Bahrain | 251.380000 | 3 |
| 37 | Saudi Arabia | 145.920000 | 1 |
The average order value is 534.40 USD overall, based on 19,960 orders.
It varies a lot by country. The highest average order values in the observed output are:
- Singapore: 3039.90 USD across 7 orders
- Netherlands: 3036.66 USD across 94 orders
- Australia: 2430.20 USD across 57 orders
- Japan: 1969.28 USD across 19 orders
- Hong Kong: 1426.53 USD across 11 orders
- Sweden: 1066.06 USD across 36 orders
- Switzerland: 1057.22 USD across 54 orders
- Denmark: 1053.07 USD across 18 orders
- EIRE: 984.22 USD across 288 orders
A few countries also show very high AOV with only 1 order:
- Lebanon: 1693.88 USD
- Brazil: 1143.60 USD
- RSA: 1002.31 USD
Those single-order countries should be treated cautiously because one order can make the average look unusually high.
The strongest takeaway is that AOV is much higher in several international markets than the overall average, especially Singapore, the Netherlands, and Australia. Among countries with a more meaningful number of orders, the Netherlands, Australia, Japan, Switzerland, Sweden, and EIRE stand out as having relatively high order values.