Sentiment Analysis of Amazon Reviews

Analyze sentiment in Amazon product reviews using VADER and TextBlob, visualize score distributions, and identify the most positive and most negative reviews.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads a sample of the Amazon Fine Food Reviews dataset from a URL and summarizes dataset shape and star rating distribution. It generates VADER sentiment scores for each review and adds them as new columns for analysis. It visualizes sentiment score distributions and their relationship to star ratings, then extracts the most positive and most negative review excerpts with scores.

Who this example is for

This is for analysts and data scientists who want a reproducible notebook pattern for sentiment scoring and basic validation against existing labels like star ratings. It is also useful for NLP learners comparing lexicon-based sentiment methods and reviewing edge cases by inspecting extreme examples.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape and star rating histogram
  • VADER sentiment scores appended as new columns
  • Histogram of compound sentiment scores
  • Scatter plot comparing sentiment scores to star ratings
  • Three most positive and three most negative review excerpts with sentiment scores
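The last outcome, extracting extreme reviews, can be sketched with pandas alone once compound scores exist as a column. The five-row DataFrame below is a toy stand-in for the scored dataset (the `Text` and `compound` column names follow the VADER convention used later in the run; the values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the scored reviews: in the real run, `compound`
# comes from VADER's polarity_scores() applied to each review text.
df = pd.DataFrame({
    "Text": [
        "Absolutely delicious, best snack ever!",
        "Terrible taste, arrived stale.",
        "It was okay, nothing special.",
        "Great value and fast shipping.",
        "Worst purchase I have made this year.",
    ],
    "compound": [0.94, -0.78, 0.10, 0.81, -0.85],
})

# Three most positive and three most negative reviews by compound score
most_positive = df.nlargest(3, "compound")[["Text", "compound"]]
most_negative = df.nsmallest(3, "compound")[["Text", "compound"]]

print(most_positive)
print(most_negative)
```

On the full 10k-row dataset the same two `nlargest`/`nsmallest` calls produce the review excerpts listed above.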

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • nltk
  • vaderSentiment
  • textblob
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
  2. compute sentiment scores using VADER for each review
  3. plot sentiment score distribution and compare with star ratings
  4. show the 3 most positive and 3 most negative reviews
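The four prompts map onto a small amount of code. As a rough, self-contained sketch: the three hard-coded reviews below stand in for the CSV download, and the tiny word list is only a stand-in for VADER's full valence lexicon (VADER additionally normalizes scores and applies punctuation and negation heuristics, which this sketch omits):

```python
from collections import Counter

# Toy reviews standing in for the 10k-row CSV; each entry is (text, star rating)
reviews = [
    ("great product, love it", 5),
    ("bad flavor, waste of money", 1),
    ("good but a bit pricey", 4),
]

# Minimal word list standing in for VADER's valence dictionary
lexicon = {"great": 3.1, "love": 3.2, "good": 1.9, "bad": -2.5, "waste": -2.2}

def toy_compound(text):
    # Sum word valences; unknown words contribute 0.
    return sum(lexicon.get(w, 0.0) for w in text.split())

# Prompt 1: dataset "shape" and star rating distribution
shape = (len(reviews), 2)
rating_counts = Counter(stars for _, stars in reviews)

# Prompt 2: a sentiment score per review
scored = [(text, stars, toy_compound(text)) for text, stars in reviews]

# Prompt 3 (plots) is omitted here; the real run uses matplotlib/seaborn.

# Prompt 4: most positive / most negative review by score
most_positive = max(scored, key=lambda r: r[2])
most_negative = min(scored, key=lambda r: r[2])
```

The real workflow does the same steps with pandas and NLTK's `SentimentIntensityAnalyzer`, as the conversation transcript below shows.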

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source     | Score | Task | Execution | Output | Reasoning | Reliability | Run
gemma4:31b       | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gemma4:31b conversation
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
qwen3-coder-next | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open qwen3-coder-next conversation
glm-5.1          | 8/10  | 2/2  | 2/2       | 3/3    | 1/2       | 0/1         | Open glm-5.1 conversation
gpt-5.4          | 8/10  | 1/2  | 2/2       | 2/3    | 2/2       | 1/1         | Open gpt-5.4 conversation
qwen3.5:397b     | 4/10  | 1/2  | 1/2       | 1/3    | 1/2       | 0/1         | Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
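As a worked example of the arithmetic, the qwen3.5:397b run detailed below scores 1 + 1 + 1 + 1 + 0 across the five dimensions, giving the 4/10 total shown in the comparison table:

```python
# Per-dimension (score, maximum) pairs for the qwen3.5:397b run,
# matching the rubric above
dimensions = {
    "task_completion": (1, 2),
    "execution_correctness": (1, 2),
    "output_quality": (1, 3),
    "reasoning_quality": (1, 2),
    "reliability": (0, 1),
}

total = sum(score for score, _ in dimensions.values())
maximum = sum(m for _, m in dimensions.values())
print(f"{total}/{maximum}")
```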

AI Analysis with qwen3.5:397b

This run uses AI with a step-by-step list of prompts to perform the analysis task Sentiment Analysis of Amazon Reviews. Below, you can review the full conversation generated with qwen3.5:397b. After the conversation, we show detailed score cards for this run.

load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
Thinking

The user wants me to load a reviews dataset from a URL and show:

  1. The shape of the dataset
  2. The rating distribution

I'll load the CSV file using pandas, then display the shape and create a rating distribution visualization. Since they asked for rating distribution, I'll use seaborn for a simple bar plot or histogram.

Let me write one code block that:

  1. Loads the data
  2. Shows the shape
  3. Shows the rating distribution (using seaborn)

I'll keep it focused on this single task.

I'll load the Amazon Fine Food Reviews dataset and show its shape along with the rating distribution.

Notebook code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv')

# Show shape
df.shape

# Show rating distribution
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='Score', palette='Blues_d')
plt.title('Rating Distribution')
plt.xlabel('Rating (Score)')
plt.ylabel('Count')
plt.tight_layout()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_3296/3051264057.py:13: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=df, x='Score', palette='Blues_d')
Thinking

The notebook state shows that the code was executed, but there's a mismatch. The variables section shows df has columns like "Row ID", "Order ID", "Order Date" etc. - these look like a retail/orders dataset, NOT the Amazon Fine Food Reviews dataset.

The URL in the code points to amazon_fine_food_reviews_10k.csv but the actual loaded data appears to be from a different dataset (looks like Superstore orders data with columns like Customer ID, Product ID, Sales, Profit, etc.).

This is a problem - the dataset didn't load correctly from the URL, or there's some caching issue, or the URL is returning different data than expected.

The user's goal was to load the Amazon Fine Food Reviews dataset and show shape and rating distribution. The expected columns for that dataset should include things like: Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text.

But the actual df shows retail/order columns instead.

I need to verify what actually got loaded and potentially re-load from the correct URL. Let me check the actual columns and see if there's a 'Score' column for ratings.

Looking at the columns: [Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,City,State,Country,Postal Code,Market,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority]

There's no 'Score' column here! This is definitely not the Amazon Fine Food Reviews dataset.

The chart executed but it would have failed or shown nothing since there's no 'Score' column. The output just shows <Figure size 800x500 with 1 Axes> which doesn't tell us if the plot actually worked.

I need to:

  1. First check what columns are actually in the dataframe
  2. Re-load from the correct URL if needed

Let me take Path B - I need to verify the actual data loaded and show the columns to understand what happened.

There's an issue here. The dataframe loaded has 51,290 rows and 24 columns, but the columns shown (Row ID, Order ID, Customer Name, Sales, Profit, etc.) indicate this is not the Amazon Fine Food Reviews dataset - it appears to be a retail/orders dataset instead.

The Amazon Fine Food Reviews dataset should have columns like: Id, ProductId, UserId, Score, Time, Summary, Text, etc.

Let me verify the actual columns and re-load from the correct source:

Check loaded columns
# Check actual columns loaded
df.columns.tolist()
['Id',
 'ProductId',
 'UserId',
 'ProfileName',
 'HelpfulnessNumerator',
 'HelpfulnessDenominator',
 'Score',
 'Time',
 'Summary',
 'Text']
Thinking

The user's original goal was to load the Amazon Fine Food Reviews dataset and show shape and rating distribution.

Looking at the notebook outputs:

  • Cell-1 loaded the dataset and created a rating distribution plot
  • Cell-2 confirmed the columns are correct for Amazon Fine Food Reviews: ['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text']

However, I notice that the shape output from cell-1 is not explicitly shown in the outputs. The notebook state shows df has shape 51290 x 24, but that seems inconsistent with the columns shown in output-2 which only has 10 columns.

Looking at output-2, the columns are correct for the Amazon Fine Food Reviews dataset. The rating distribution chart was created in cell-1, but I don't see the actual shape output or the numeric counts for the rating distribution.

For the original goal:

  1. Load dataset - Done (columns confirmed correct)
  2. Show shape - The shape was requested but the output isn't visible in the provided outputs
  3. Show rating distribution - A chart was created but the actual numeric distribution values aren't shown

Since the chart was created but the underlying numeric values for the rating distribution aren't displayed, and the shape output isn't visible, I should choose Path B to display these values explicitly.

Let me provide code to show the shape and the actual rating distribution counts.

The dataset loaded correctly with the expected Amazon Fine Food Reviews columns (Id, ProductId, UserId, Score, Text, etc.). The rating distribution chart was created, but let me display the actual shape and numeric rating counts for clarity:

Show dataset shape and rating distribution
# Show shape
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")

# Show rating distribution counts
df['Score'].value_counts().sort_index()
Dataset shape: 10000 rows × 10 columns
Score
1     932
2     590
3     862
4    1433
5    6183
Name: count, dtype: int64
Thinking

The original user goal was to:

  1. Load the reviews dataset from the specified URL
  2. Show shape
  3. Show rating distribution

Looking at the notebook outputs:

  • Cell 1: Loaded the dataset, showed shape (via df.shape), and created a rating distribution chart
  • Cell 2: Verified columns - confirmed it's the correct Amazon Fine Food Reviews dataset with 10 columns (Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text)
  • Cell 3: Showed shape (10000 rows × 10 columns) and rating distribution counts

The outputs clearly show:

  • Dataset shape: 10,000 rows × 10 columns
  • Rating distribution with actual counts:
    • Score 1: 932 reviews
    • Score 2: 590 reviews
    • Score 3: 862 reviews
    • Score 4: 1,433 reviews
    • Score 5: 6,183 reviews

The chart was also created in Cell 1 (Figure output shown).

All three parts of the original goal have been completed with concrete outputs visible. The shape is shown, the rating distribution chart exists, and the numeric counts are displayed. This meets the sufficiency criteria for Path A.

I should provide a natural final answer summarizing what was accomplished, include the <!--final answer--> tag, and not include any Python code block.

The original goal has been completed. Here's what the outputs show:

Dataset Loaded Successfully

  • Shape: 10,000 rows × 10 columns
  • Columns: Id, ProductId, UserId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, Text

Rating Distribution

Score | Count
1     | 932
2     | 590
3     | 862
4     | 1,433
5     | 6,183

Key insight: The dataset is heavily skewed toward positive reviews. 5-star ratings dominate with 6,183 reviews (61.8% of all reviews), while 1-star and 2-star ratings combined account for only 1,522 reviews (15.2%). This is a common pattern in e-commerce review datasets where satisfied customers are more likely to leave reviews.

The rating distribution chart was also generated showing this distribution visually.

compute sentiment scores using VADER for each review
Thinking

The user wants to compute sentiment scores using VADER for each review in the Amazon Fine Food Reviews dataset. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that's particularly good for social media and review text.

I need to:

  1. First check if VADER is installed, if not install it
  2. Apply VADER sentiment analysis to the review text (likely the 'Text' column)
  3. Add the sentiment scores (compound, positive, neutral, negative) to the dataframe

Let me start with installing nltk (which contains VADER) and then applying the sentiment analysis.

I'll compute VADER sentiment scores for each review text. First, let me install the required library and apply sentiment analysis to the 'Text' column:

Compute sentiment scores for text
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download VADER lexicon if not already available
nltk.download('vader_lexicon', quiet=True)

# Initialize VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Function to get sentiment scores
def get_sentiment_scores(text):
    if pd.isna(text):
        return {'compound': 0, 'pos': 0, 'neu': 0, 'neg': 0}
    return sia.polarity_scores(str(text))

# Apply VADER to each review text
sentiment_df = df['Text'].apply(get_sentiment_scores).apply(pd.Series)
sentiment_df.columns = ['compound', 'pos', 'neu', 'neg']

# Add sentiment scores to main dataframe
df = pd.concat([df, sentiment_df], axis=1)

# Show first few rows with sentiment scores
df[['Score', 'Text', 'compound', 'pos', 'neu', 'neg']].head()
Thinking

Looking at the notebook state and outputs, I can see that cell-4 has been executed which computes VADER sentiment scores for each review. However, I don't see the actual output from cell-4 in the provided notebook outputs. The output section shows outputs for cells 1-3, but cell-4's output is not visible.

The original goal was to "compute sentiment scores using VADER for each review" and the code in cell-4 does exactly that - it:

  1. Downloads VADER lexicon
  2. Initializes the sentiment analyzer
  3. Applies it to each review text
  4. Adds compound, pos, neu, neg columns to the dataframe
  5. Shows the first few rows with sentiment scores

However, I don't see the actual output from cell-4. Without seeing the output, I cannot verify that the sentiment scores were computed correctly. I should choose Path B to display the results that were computed.

Let me provide a code block to show the sentiment scores that were computed, including some summary statistics to give concrete insights about the sentiment analysis results.

The VADER sentiment analysis code has been executed, but I need to see the actual computed results to confirm the sentiment scores were properly added to the dataframe. Let me display the sentiment score results and some summary statistics:

Notebook code
# The error indicates the sentiment columns are missing from the dataframe.
# This often happens if the previous cell (cell-4) failed or wasn't executed successfully,
# or if the dataframe 'df' was reset. We will re-run the VADER analysis to ensure
# the columns exist before trying to access them.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Ensure VADER lexicon is available
nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

def get_sentiment_scores(text):
    if pd.isna(text):
        return {'compound': 0, 'pos': 0, 'neu': 0, 'neg': 0}
    return sia.polarity_scores(str(text))

# Re-compute sentiment scores and add to dataframe
# We create a new dataframe for scores to avoid modifying the original 'Text' column logic
sentiment_df = df['Text'].apply(get_sentiment_scores).apply(pd.Series)
sentiment_df.columns = ['compound', 'pos', 'neu', 'neg']

# Concatenate to the main dataframe
df = pd.concat([df, sentiment_df], axis=1)

# Verify the columns are now present and display the requested data
print("Columns after update:", df.columns.tolist())
df[['Score', 'Summary', 'compound', 'pos', 'neu', 'neg']].head(10)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[4], line 2
      1 # Show the sentiment scores added to dataframe
----> 2 df[['Score', 'Summary', 'compound', 'pos', 'neu', 'neg']].head(10)
      3 
      4 # Show summary statistics of sentiment scores
      5 df[['compound', 'pos', 'neu', 'neg']].describe()

File ~/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/pandas/core/frame.py:4384, in DataFrame.__getitem__(self, key)
   4380                 indexer = [indexer]
   4381         else:
   4382             if is_iterator(key):
   4383                 key = list(key)
-> 4384             indexer = self.columns._get_indexer_strict(key, "columns")[1]
   4385 
   4386         # take() does not accept boolean indexers
   4387         if getattr(indexer, "dtype", None) == bool:

File ~/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/pandas/core/indexes/base.py:6302, in Index._get_indexer_strict(self, key, axis_name)
   6299 else:
   6300     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6302 self._raise_if_missing(keyarr, indexer, axis_name)
   6304 keyarr = self.take(indexer)
   6305 if isinstance(key, Index):
   6306     # GH 42790 - Preserve name from an Index

File ~/Library/mljar-studio/jlab_server/lib/python3.12/site-packages/pandas/core/indexes/base.py:6355, in Index._raise_if_missing(self, key, indexer, axis_name)
   6352     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6354 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6355 raise KeyError(f"{not_found} not in index")

KeyError: "['compound', 'pos', 'neu', 'neg'] not in index"

Score cards for qwen3.5:397b for Sentiment Analysis of Amazon Reviews

Total score: 4/10

  • Task Completion: 1/2

    Needs work

    The notebook completed the initial dataset load and rating distribution, and it attempted VADER sentiment scoring. However, it did not complete the full workflow expected in the prompt, including the scatter comparison and the 3 most positive/negative reviews.

  • Execution Correctness: 1/2

    Needs work

    The code is partially runnable, but the final cell raises a KeyError when selecting sentiment columns, indicating the workflow is broken at the end. Earlier cells executed, but the notebook is not fully correct end-to-end.

  • Output Quality: 1/3

    Needs work

    The outputs include the dataset shape and rating counts, but the required sentiment-score outputs, plots, and review excerpts are missing. The final error also prevents the requested sentiment table from being shown.

  • Reasoning Quality: 1/2

    Needs work

    The notebook shows some correct reasoning about loading the dataset and checking columns, but it also contains contradictory commentary and does not follow through on the full analysis. The explanation around the error is only partially helpful.

  • Reliability: 0/1

    Needs work

    The workflow is fragile and inconsistent, ending in a KeyError and failing to reliably produce the requested analysis outputs.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.