Sentiment Analysis of Amazon Reviews

Analyze sentiment in Amazon product reviews using VADER and TextBlob, visualize score distributions, and identify most positive and negative reviews.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads a sample of the Amazon Fine Food Reviews dataset from a URL and summarizes dataset shape and star rating distribution. It generates VADER sentiment scores for each review and adds them as new columns for analysis. It visualizes sentiment score distributions and their relationship to star ratings, then extracts the most positive and most negative review excerpts with scores.

Who this example is for

This is for analysts and data scientists who want a reproducible notebook pattern for sentiment scoring and basic validation against existing labels like star ratings. It is also useful for NLP learners comparing lexicon-based sentiment methods and reviewing edge cases by inspecting extreme examples.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

Dataset shape and star rating histogram
VADER sentiment scores appended as new columns
Histogram of compound sentiment scores
Scatter plot comparing sentiment scores to star ratings
Three most positive and three most negative review excerpts with sentiment scores
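
The last expected outcome can be sketched with pandas `nlargest` and `nsmallest`. This is a minimal illustration on made-up data, not the code produced by any of the compared models. The `toy` frame and the `excerpt` helper are hypothetical, and the `compound` column name assumes VADER's compound score has already been added.

```python
import pandas as pd

# Hypothetical toy data standing in for reviews that already carry a compound score.
toy = pd.DataFrame({
    "Text": ["Great coffee", "Awful taste", "Okay snack", "Love it",
             "Never again", "Fine", "Best tea ever", "Stale and bland"],
    "compound": [0.62, -0.46, 0.23, 0.64, -0.30, 0.20, 0.80, -0.55],
})

def excerpt(text, width=80):
    """Shorten a review to at most `width` characters for display."""
    text = str(text)
    return text if len(text) <= width else text[: width - 3] + "..."

# Three highest and three lowest compound scores.
top_positive = toy.nlargest(3, "compound")
top_negative = toy.nsmallest(3, "compound")

for label, subset in [("Most positive", top_positive), ("Most negative", top_negative)]:
    print(label)
    for _, row in subset.iterrows():
        print(f"  {row['compound']:+.2f}  {excerpt(row['Text'])}")
```

On the real dataset the same two calls would run on the reviews frame once the sentiment columns exist.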

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

pandas
nltk
vaderSentiment
textblob
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
2. compute sentiment scores using VADER for each review
3. plot sentiment score distribution and compare with star ratings
4. show the 3 most positive and 3 most negative reviews

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
gpt-5.4	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-5.4 conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3-coder-next	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3-coder-next conversation
glm-5.1	8/10	2/2	2/2	3/3	1/2	0/1	Open glm-5.1 conversation
qwen3.5:397b	4/10	1/2	1/2	1/3	1/2	0/1	Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
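
As a worked example of this sum, the sketch below adds up the per-dimension scores for two rows of the comparison table. The dictionary keys and the `total_score` helper are illustrative names, not part of the scoring tool itself.

```python
# Per-dimension maxima as listed in the scoring description.
MAX_POINTS = {"task": 2, "execution": 2, "output": 3, "reasoning": 2, "reliability": 1}

def total_score(scores):
    """Sum the five dimension scores after checking each one is within range."""
    for name, limit in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= limit:
            raise ValueError(f"{name} must be between 0 and {limit}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)

# Dimension scores copied from the comparison table.
glm = {"task": 2, "execution": 2, "output": 3, "reasoning": 1, "reliability": 0}
qwen = {"task": 1, "execution": 1, "output": 1, "reasoning": 1, "reliability": 0}

print(total_score(glm))   # 8
print(total_score(qwen))  # 4
```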

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task Sentiment Analysis of Amazon Reviews. Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution

Thinking

The user wants me to load a reviews dataset from a URL and show:

The shape of the dataset
The rating distribution

I'll load the CSV file using pandas, then display the shape and create a rating distribution visualization. Since they asked for rating distribution, I'll use seaborn for a simple bar plot or histogram.

Let me write one code block that:

Loads the data
Shows the shape
Shows the rating distribution (using seaborn)

I'll keep it focused on this single task.

I'll load the Amazon Fine Food Reviews dataset and show its shape along with the rating distribution.

Notebook code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv')

# Show shape
df.shape

# Show rating distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='Score', palette='Blues_d')
plt.title('Rating Distribution')
plt.xlabel('Rating (Score)')
plt.ylabel('Count')
plt.tight_layout()

Thinking

The notebook state shows that the code was executed, but there's a mismatch. The variables section shows df has columns like "Row ID", "Order ID", "Order Date" etc. - these look like a retail/orders dataset, NOT the Amazon Fine Food Reviews dataset.

The URL in the code points to amazon_fine_food_reviews_10k.csv but the actual loaded data appears to be from a different dataset (looks like Superstore orders data with columns like Customer ID, Product ID, Sales, Profit, etc.).

This is a problem - the dataset didn't load correctly from the URL, or there's some caching issue, or the URL is returning different data than expected.

The user's goal was to load the Amazon Fine Food Reviews dataset and show shape and rating distribution. The expected columns for that dataset should include things like: Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text.

But the actual df shows retail/order columns instead.

I need to verify what actually got loaded and potentially re-load from the correct URL. Let me check the actual columns and see if there's a 'Score' column for ratings.

Looking at the columns: [Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,City,State,Country,Postal Code,Market,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority]

There's no 'Score' column here! This is definitely not the Amazon Fine Food Reviews dataset.

The chart executed but it would have failed or shown nothing since there's no 'Score' column. The output just shows <Figure size 800x500 with 1 Axes> which doesn't tell us if the plot actually worked.

I need to:

First check what columns are actually in the dataframe
Re-load from the correct URL if needed

Let me take Path B - I need to verify the actual data loaded and show the columns to understand what happened.

There's an issue here. The dataframe loaded has 51,290 rows and 24 columns, but the columns shown (Row ID, Order ID, Customer Name, Sales, Profit, etc.) indicate this is not the Amazon Fine Food Reviews dataset - it appears to be a retail/orders dataset instead.

The Amazon Fine Food Reviews dataset should have columns like: Id, ProductId, UserId, Score, Time, Summary, Text, etc.

Let me verify the actual columns and re-load from the correct source:

Check loaded columns

# Check actual columns loaded
df.columns.tolist()

['Id',
 'ProductId',
 'UserId',
 'ProfileName',
 'HelpfulnessNumerator',
 'HelpfulnessDenominator',
 'Score',
 'Time',
 'Summary',
 'Text']

Thinking

The user's original goal was to load the Amazon Fine Food Reviews dataset and show shape and rating distribution.

Looking at the notebook outputs:

Cell-1 loaded the dataset and created a rating distribution plot
Cell-2 confirmed the columns are correct for Amazon Fine Food Reviews: ['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text']

However, I notice that the shape output from cell-1 is not explicitly shown in the outputs. The notebook state shows df has shape 51290 x 24, but that seems inconsistent with the columns shown in output-2 which only has 10 columns.

Looking at output-2, the columns are correct for the Amazon Fine Food Reviews dataset. The rating distribution chart was created in cell-1, but I don't see the actual shape output or the numeric counts for the rating distribution.

For the original goal:

Load dataset - Done (columns confirmed correct)
Show shape - The shape was requested but the output isn't visible in the provided outputs
Show rating distribution - A chart was created but the actual numeric distribution values aren't shown

Since the chart was created but the underlying numeric values for the rating distribution aren't displayed, and the shape output isn't visible, I should choose Path B to display these values explicitly.

Let me provide code to show the shape and the actual rating distribution counts.

The dataset loaded correctly with the expected Amazon Fine Food Reviews columns (Id, ProductId, UserId, Score, Text, etc.). The rating distribution chart was created, but let me display the actual shape and numeric rating counts for clarity:

Show dataset shape and rating distribution

# Show shape
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")

# Show rating distribution counts
df['Score'].value_counts().sort_index()

Dataset shape: 10000 rows × 10 columns

Score
1     932
2     590
3     862
4    1433
5    6183
Name: count, dtype: int64

Thinking

The original user goal was to:

Load the reviews dataset from the specified URL
Show shape
Show rating distribution

Looking at the notebook outputs:

Cell 1: Loaded the dataset, showed shape (via df.shape), and created a rating distribution chart
Cell 2: Verified columns - confirmed it's the correct Amazon Fine Food Reviews dataset with 10 columns (Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text)
Cell 3: Showed shape (10000 rows × 10 columns) and rating distribution counts

The outputs clearly show:

Dataset shape: 10,000 rows × 10 columns
Rating distribution with actual counts:
- Score 1: 932 reviews
- Score 2: 590 reviews
- Score 3: 862 reviews
- Score 4: 1,433 reviews
- Score 5: 6,183 reviews

The chart was also created in Cell 1 (Figure output shown).

All three parts of the original goal have been completed with concrete outputs visible. The shape is shown, the rating distribution chart exists, and the numeric counts are displayed. This meets the sufficiency criteria for Path A.

I should provide a natural final answer summarizing what was accomplished, include the tag, and not include any Python code block.

The original goal has been completed. Here's what the outputs show:

Dataset Loaded Successfully

Shape: 10,000 rows × 10 columns
Columns: Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text

Rating Distribution

Score	Count
1	932
2	590
3	862
4	1,433
5	6,183

Key insight: The dataset is heavily skewed toward positive reviews. 5-star ratings dominate with 6,183 reviews (61.8% of all reviews), while 1-star and 2-star ratings combined account for only 1,522 reviews (15.2%). This is a common pattern in e-commerce review datasets where satisfied customers are more likely to leave reviews.

The rating distribution chart was also generated showing this distribution visually.
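
The percentages quoted in the key insight can be checked directly from the counts shown in the output. This is a small arithmetic sketch using only those counts; the variable names are illustrative.

```python
# Rating counts copied from the value_counts output.
counts = {1: 932, 2: 590, 3: 862, 4: 1433, 5: 6183}

total = sum(counts.values())
shares = {score: 100 * n / total for score, n in counts.items()}
low = counts[1] + counts[2]  # 1-star and 2-star reviews combined

print(f"Total reviews: {total}")
print(f"5-star share: {shares[5]:.1f}%")
print(f"1-2 star reviews: {low} ({100 * low / total:.1f}%)")
```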

compute sentiment scores using VADER for each review

Thinking

The user wants to compute sentiment scores using VADER for each review in the Amazon Fine Food Reviews dataset. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that's particularly good for social media and review text.

I need to:

First check if VADER is installed, if not install it
Apply VADER sentiment analysis to the review text (likely the 'Text' column)
Add the sentiment scores (compound, positive, neutral, negative) to the dataframe

Let me start with installing nltk (which contains VADER) and then applying the sentiment analysis.

I'll compute VADER sentiment scores for each review text. First, let me install the required library and apply sentiment analysis to the 'Text' column:

Compute sentiment scores for text

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download VADER lexicon if not already available
nltk.download('vader_lexicon', quiet=True)

# Initialize VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Function to get sentiment scores
def get_sentiment_scores(text):
    if pd.isna(text):
        return {'compound': 0, 'pos': 0, 'neu': 0, 'neg': 0}
    return sia.polarity_scores(str(text))

# Apply VADER to each review text
sentiment_df = df['Text'].apply(get_sentiment_scores).apply(pd.Series)
sentiment_df.columns = ['compound', 'pos', 'neu', 'neg']

# Add sentiment scores to main dataframe
df = pd.concat([df, sentiment_df], axis=1)

# Show first few rows with sentiment scores
df[['Score', 'Text', 'compound', 'pos', 'neu', 'neg']].head()
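
One detail in the cell above is worth checking when reusing it: assigning a list to `sentiment_df.columns` relabels columns by position, so the labels are only right if the dictionary keys arrive in that same order. VADER's `polarity_scores` returns keys named `neg`, `neu`, `pos`, and `compound`, and the sketch below assumes they arrive in that order. It builds the columns from the keys and then selects by name. To stay runnable without NLTK, it uses `stub_polarity_scores`, a hypothetical stand-in with fixed values rather than the real analyzer.

```python
import pandas as pd

def stub_polarity_scores(text):
    """Hypothetical stand-in for sia.polarity_scores; fixed values for illustration only."""
    return {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8}

def get_sentiment_scores(text):
    # Missing reviews get an all-zero score, as in the cell above.
    if pd.isna(text):
        return {"neg": 0.0, "neu": 0.0, "pos": 0.0, "compound": 0.0}
    return stub_polarity_scores(str(text))

texts = pd.Series(["Great coffee", None], name="Text")

# Column names come from the dictionary keys; reorder by name, not by position.
scores = pd.DataFrame(texts.apply(get_sentiment_scores).tolist(), index=texts.index)
scores = scores[["compound", "pos", "neu", "neg"]]
print(scores)

# For contrast: positional relabeling attaches the wrong names to the values.
mislabeled = pd.DataFrame([stub_polarity_scores("Great coffee")])
mislabeled.columns = ["compound", "pos", "neu", "neg"]
print(mislabeled)  # the "compound" column now holds the original "neg" value
```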

Thinking

Looking at the notebook state and outputs, I can see that cell-4 has been executed which computes VADER sentiment scores for each review. However, I don't see the actual output from cell-4 in the provided notebook outputs. The output section shows outputs for cells 1-3, but cell-4's output is not visible.

The original goal was to "compute sentiment scores using VADER for each review" and the code in cell-4 does exactly that - it:

Downloads VADER lexicon
Initializes the sentiment analyzer
Applies it to each review text
Adds compound, pos, neu, neg columns to the dataframe
Shows the first few rows with sentiment scores

However, I don't see the actual output from cell-4. Without seeing the output, I cannot verify that the sentiment scores were computed correctly. I should choose Path B to display the results that were computed.

Let me provide a code block to show the sentiment scores that were computed, including some summary statistics to give concrete insights about the sentiment analysis results.

The VADER sentiment analysis code has been executed, but I need to see the actual computed results to confirm the sentiment scores were properly added to the dataframe. Let me display the sentiment score results and some summary statistics:

Notebook code

# The error indicates the sentiment columns are missing from the dataframe.
# This often happens if the previous cell (cell-4) failed or wasn't executed successfully,
# or if the dataframe 'df' was reset. We will re-run the VADER analysis to ensure
# the columns exist before trying to access them.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Ensure VADER lexicon is available
nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

def get_sentiment_scores(text):
    if pd.isna(text):
        return {'compound': 0, 'pos': 0, 'neu': 0, 'neg': 0}
    return sia.polarity_scores(str(text))

# Re-compute sentiment scores and add to dataframe
# We create a new dataframe for scores to avoid modifying the original 'Text' column logic
sentiment_df = df['Text'].apply(get_sentiment_scores).apply(pd.Series)
sentiment_df.columns = ['compound', 'pos', 'neu', 'neg']

# Concatenate to the main dataframe
df = pd.concat([df, sentiment_df], axis=1)

# Verify the columns are now present and display the requested data
print("Columns after update:", df.columns.tolist())
df[['Score', 'Summary', 'compound', 'pos', 'neu', 'neg']].head(10)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[4], line 2
      1 # Show the sentiment scores added to dataframe
----> 2 df[['Score', 'Summary', 'compound', 'pos', 'neu', 'neg']].head(10)
      3
      4 # Show summary statistics of sentiment scores
      5 df[['compound', 'pos', 'neu', 'neg']].describe()

File ~/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/pandas/core/frame.py:4384, in DataFrame.__getitem__(self, key)
   4380                 indexer = [indexer]
   4381         else:
   4382             if is_iterator(key):
   4383                 key = list(key)
-> 4384             indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4385
   4386         # take() does not accept boolean indexers
   4387         if getattr(indexer, "dtype", None) == bool:

File ~/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/pandas/core/indexes/base.py:6302, in Index._get_indexer_strict(self, key, axis_name)
   6299 else:
   6300     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6302 self._raise_if_missing(keyarr, indexer, axis_name)
   6304 keyarr = self.take(indexer)
   6305 if isinstance(key, Index):
   6306     # GH 42790 - Preserve name from an Index

File ~/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/pandas/core/indexes/base.py:6355, in Index._raise_if_missing(self, key, indexer, axis_name)
   6352     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6354 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6355 raise KeyError(f"{not_found} not in index")

KeyError: "['compound', 'pos', 'neu', 'neg'] not in index"

Score cards for qwen3.5:397b for Sentiment Analysis of Amazon Reviews

Total score: 4/10

Task Completion: 1/2
Needs work
The notebook completed the initial load/shape/rating distribution step, but it did not successfully complete the VADER sentiment analysis or the requested review excerpts. The expected outcomes beyond the first step are missing.
Execution Correctness: 1/2
Needs work
The dataset load and rating count code ran, but the sentiment-analysis cell ended in a KeyError when accessing missing columns. This indicates the workflow is only partially runnable.
Output Quality: 1/3
Needs work
The shape and rating distribution outputs are present, but the required sentiment-score outputs, plots, and 6 review excerpts are absent. The final visible error prevents the workflow from matching the expected outcomes.
Reasoning Quality: 1/2
Needs work
The assistant correctly identified the dataset and the initial outputs, but the reasoning became inconsistent around the failed sentiment columns and did not resolve the issue. It shows partial understanding but not a complete, correct analysis flow.
Reliability: 0/1
Needs work
The workflow is fragile because it ends with an unhandled KeyError and does not robustly verify that the sentiment columns exist before using them.
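
The kind of check this score card refers to could look like the sketch below: before selecting the sentiment columns, list which required columns are absent and report them instead of raising a `KeyError`. The `missing_columns` helper and the `df_demo` frame are hypothetical and only illustrate the pattern.

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are not present in the frame."""
    return [name for name in required if name not in frame.columns]

# Hypothetical frame in the state the failing cell encountered: no sentiment columns yet.
df_demo = pd.DataFrame({"Score": [5, 1], "Summary": ["Good", "Bad"]})
required = ["Score", "Summary", "compound", "pos", "neu", "neg"]

missing = missing_columns(df_demo, required)
if missing:
    # Report the gap so the scoring step can be re-run before display.
    print(f"Sentiment columns not found, recompute before display: {missing}")
else:
    print(df_demo[required].head(10))
```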

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.

Exploratory Data Analysis

Time Series

Data Analysis

Machine Learning

NLP

Finance

Statistics