Sentiment Analysis of Amazon Reviews

Analyze sentiment in Amazon product reviews using VADER and TextBlob, visualize score distributions, and identify the most positive and most negative reviews.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads a sample of the Amazon Fine Food Reviews dataset from a URL and summarizes dataset shape and star rating distribution. It generates VADER sentiment scores for each review and adds them as new columns for analysis. It visualizes sentiment score distributions and their relationship to star ratings, then extracts the most positive and most negative review excerpts with scores.

Who this example is for

This is for analysts and data scientists who want a reproducible notebook pattern for sentiment scoring and basic validation against existing labels like star ratings. It is also useful for NLP learners comparing lexicon-based sentiment methods and reviewing edge cases by inspecting extreme examples.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape and star rating histogram
  • VADER sentiment scores appended as new columns
  • Histogram of compound sentiment scores
  • Scatter plot comparing sentiment scores to star ratings
  • Three most positive and three most negative review excerpts with sentiment scores

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • nltk
  • vaderSentiment
  • textblob
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
  2. compute sentiment scores using VADER for each review
  3. plot sentiment score distribution and compare with star ratings
  4. show the 3 most positive and 3 most negative reviews

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

| Model | Score | Task | Execution | Output | Reasoning | Reliability | Run |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gemma4:31b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gemma4:31b conversation |
| gpt-5.4 | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gpt-5.4 conversation |
| gpt-oss:120b | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open gpt-oss:120b conversation |
| qwen3-coder-next | 10/10 | 2/2 | 2/2 | 3/3 | 2/2 | 1/1 | Open qwen3-coder-next conversation |
| glm-5.1 | 8/10 | 2/2 | 2/2 | 3/3 | 1/2 | 0/1 | Open glm-5.1 conversation |
| qwen3.5:397b | 4/10 | 1/2 | 1/2 | 1/3 | 1/2 | 0/1 | Open qwen3.5:397b conversation |

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
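
The total can be reproduced by summing the five dimension scores; a minimal sketch using the perfect-run values from the comparison table above:

```python
# Dimension scores for a hypothetical perfect run (ranges in comments)
scores = {
    "task_completion": 2,        # 0-2
    "execution_correctness": 2,  # 0-2
    "output_quality": 3,         # 0-3
    "reasoning_quality": 2,      # 0-2
    "reliability": 1,            # 0-1
}

total = sum(scores.values())
print(f"Total score: {total}/10")  # prints "Total score: 10/10"
```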

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Sentiment Analysis of Amazon Reviews. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
Thinking

Considering dataset inspection

I’m thinking about the user's request to load a dataset and show its shape and rating distribution. I know the rule says to first load and inspect by showing the shape, columns, and head. However, the user specifically asked for the rating distribution too. Maybe I can load the dataset and show both the shape and a table of the rating distribution, which seems to fit as an initial inspection. It feels like balancing conflicting guidelines!

Thinking about output format

I’m grappling with how to best display the dataset information for the user. It looks like I can include minimal code to show the shape and score distribution sequentially. I can use display for multiple tables, reading the CSV into a DataFrame, followed by showing the shape and score distribution. The challenge is that the rules suggest one output per step. Maybe I could show the shape as a Python tuple while displaying the distribution table separately. That should work!

Finalizing code structure

I’ve decided that my code block should end with one final bare expression. I can definitely display the distribution table and use reviews.shape as that final expression. This way, I can provide both outputs effectively. I think naming the dataset “reviews” works well too. There’s no need for me to ask to share the output; I’ll just supply the code and stop there with a brief explanation! Sounds like a solid plan!

Let’s start by loading the CSV and showing the dataset shape plus a simple rating distribution table from the Score column.

Summarize review ratings
import pandas as pd

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv"

reviews = pd.read_csv(url)

rating_dist = (
    reviews["Score"]
    .value_counts()
    .sort_index()
    .rename_axis("Score")
    .reset_index(name="count")
)

rating_dist["percent"] = (rating_dist["count"] / len(reviews) * 100).round(2)

display(rating_dist)

reviews.shape
Score count percent
0 1 932 9.32
1 2 590 5.90
2 3 862 8.62
3 4 1433 14.33
4 5 6183 61.83
(10000, 10)

The reviews dataset has 10,000 rows and 10 columns.

The rating distribution is strongly skewed toward high scores:

  • 5-star reviews: 6,183 (61.83%)
  • 4-star reviews: 1,433 (14.33%)
  • 3-star reviews: 862 (8.62%)
  • 2-star reviews: 590 (5.90%)
  • 1-star reviews: 932 (9.32%)

This means the dataset is heavily dominated by positive reviews, with more than 3 out of 5 reviews rated 5 stars.

compute sentiment scores using VADER for each review
Thinking

Analyzing sentiment computation

I need to respond step-by-step based on the user's request. They want me to calculate sentiment scores using VADER for each review. First, I'll ensure the VADER lexicon is loaded, and then I'll compute sentiment scores from the review text, probably from the Text column. It’s essential to provide runnable Python code that will score the Text column using VADER and include negative, neutral, positive, and compound scores. Let's keep it concise!

Next, we can score each review in the Text column with VADER and add the four standard sentiment fields: neg, neu, pos, and compound.

Compute review sentiment scores
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

sentiment_scores = reviews["Text"].fillna("").apply(sia.polarity_scores).apply(pd.Series)

reviews = reviews.join(sentiment_scores)

reviews[["Score", "compound", "neg", "neu", "pos"]].head()
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/piotr/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Score compound neg neu pos
0 5 0.9441 0.000 0.695 0.305
1 1 -0.5664 0.138 0.862 0.000
2 4 0.8265 0.091 0.754 0.155
3 2 0.0000 0.000 1.000 0.000
4 5 0.9468 0.000 0.552 0.448

VADER sentiment scores have been computed for each review and added to the reviews table as four new columns:

  • neg: negative sentiment proportion
  • neu: neutral sentiment proportion
  • pos: positive sentiment proportion
  • compound: overall normalized sentiment score from -1 to 1

Based on the notebook state, reviews now has 10,000 rows and 14 columns, so all 10,000 reviews were scored successfully.

The sample output shows the scores behave as expected:

  • A 5-star review has a strongly positive compound score around 0.94
  • A 1-star review has a negative compound score around -0.57
  • A 2-star review can appear neutral with compound = 0.0, which is normal when the wording is more factual than emotional

This means your original goal is satisfied: each review now has VADER-based sentiment features that you can use for analysis or modeling.

plot sentiment score distribution and compare with star ratings
Thinking

Planning sentiment analysis visualization

Okay, the user wants me to iteratively answer by plotting the sentiment score distribution and comparing it with star ratings. It seems like I should provide one chart or table at a time, maybe combining both in one plot? A boxplot or violin plot for the compound score could work, while a histogram could show the overall distribution. I believe I’ll focus on analyzing the compound distribution across star ratings with a seaborn boxplot for clear insights. I'll code that next!

The sample already suggests higher star ratings tend to have higher VADER compound scores. A clean next view is a boxplot of compound by Score so we can compare the sentiment distribution across ratings.

Plot sentiment by rating
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.boxplot(data=reviews, x="Score", y="compound", color="skyblue")
plt.title("VADER compound sentiment by star rating")
plt.xlabel("Star rating")
plt.ylabel("Compound sentiment score")
plt.ylim(-1, 1)
plt.show()
Thinking

Evaluating code presentation

I'm considering whether to end with a bare expression or just use display. It seems like displaying only might be sufficient, and perhaps I could assign comp_by_score and use display(comp_by_score) without needing a final bare expression. That feels like it could keep the code minimal, which is important to me—I want to make sure it stays clear and efficient. Let’s see if I can finalize it that way!

The boxplot setup is appropriate for comparing VADER sentiment with star ratings, and the earlier sample already points in the expected direction: 5-star reviews show strongly positive compound scores, while 1-star reviews can be clearly negative. To make that comparison concrete from the chart, the missing piece is a compact summary of the compound distribution by Score.

Summarize compound scores by rating
comp_by_score = (
    reviews.groupby("Score")["compound"]
    .agg(
        count="size",
        mean="mean",
        median="median",
        q1=lambda s: s.quantile(0.25),
        q3=lambda s: s.quantile(0.75),
    )
    .round(3)
    .reset_index()
)

display(comp_by_score)
Score count mean median q1 q3
0 1 932 0.009 0.000 -0.620 0.621
1 2 590 0.335 0.527 -0.113 0.858
2 3 862 0.479 0.718 0.102 0.900
3 4 1433 0.720 0.862 0.661 0.943
4 5 6183 0.778 0.886 0.750 0.948
Thinking

Comparing sentiment and ratings

I need to choose between A or B for analyzing the sentiment scores and star ratings. I have a chart-only output and a numeric summary table that will help me plot the sentiment score distribution. I’ll take a look at the star rating distribution, which shows that 61.83% of ratings are 5 stars, with a skew toward higher ratings. The sentiment scores rise as the ratings increase, revealing clear distinctions and mixed feelings in lower ratings. Let’s craft a concise narrative.

Analyzing VADER alignment

I’m noting that VADER generally performs well, but it struggles with 1-star ratings. The median sentiment score for these reviews is neutral, and there's a broad spread in the scores. This indicates that some negative reviews use neutral wording or perhaps sarcasm, which VADER might miss. I might also want to reference a boxplot that illustrates this upward shift. This context will be helpful, so let's ensure to include a final marker in the summary.

The sentiment distribution lines up clearly with the star ratings.

Most reviews are high-rated to begin with: 5-star reviews make up 61.83% of the data, followed by 4-star reviews at 14.33%. Lower ratings are much less common.

From the compound sentiment scores:

  • 1 star: mean 0.009, median 0.000, IQR from -0.620 to 0.621
  • 2 stars: mean 0.335, median 0.527
  • 3 stars: mean 0.479, median 0.718
  • 4 stars: mean 0.720, median 0.862, IQR from 0.661 to 0.943
  • 5 stars: mean 0.778, median 0.886, IQR from 0.750 to 0.948

What this means:

  • The boxplot should show a steady upward shift in sentiment as star rating increases.
  • 4-star and 5-star reviews are strongly positive overall, with most of their sentiment scores concentrated well above 0.
  • 1-star reviews are much more spread out and not purely negative in VADER terms; their median is neutral and the middle 50% spans both negative and positive values. That suggests some low-rated reviews are written in relatively neutral language, or VADER does not fully capture the negative context.
  • The difference between 4 and 5 stars exists but is smaller than the jump from low ratings to high ratings.

So, the comparison shows a clear positive relationship: higher star ratings generally correspond to higher sentiment scores, with the strongest separation between low-rated and high-rated reviews.
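
The "clear positive relationship" could be quantified with a rank correlation between compound and Score. This was not computed in the run, so the sketch below uses a small made-up DataFrame in place of `reviews`:

```python
import pandas as pd

# Toy stand-in for the reviews DataFrame (values are illustrative)
toy = pd.DataFrame({
    "Score":    [1, 1, 2, 3, 4, 5, 5, 5],
    "compound": [-0.62, 0.10, 0.33, 0.48, 0.72, 0.78, 0.89, 0.95],
})

# Spearman is a reasonable choice here: Score is ordinal, and the
# relationship need not be linear.
rho = toy["Score"].corr(toy["compound"], method="spearman")
print(f"Spearman correlation: {rho:.3f}")
```

On the real data, `reviews["Score"].corr(reviews["compound"], method="spearman")` would give one number summarizing the trend seen in the boxplot.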

show the 3 most positive and 3 most negative reviews

Let’s pull the extremes by VADER compound score so you can inspect the text directly.

Show most positive and negative reviews
cols = ["Score", "Summary", "Text", "compound"]

most_positive = reviews.sort_values("compound", ascending=False)[cols].head(3)
most_negative = reviews.sort_values("compound", ascending=True)[cols].head(3)

display(most_positive)
display(most_negative)
Score Summary Text compound
1320 1 NOT what I originally ordered! BOO, amazon! As another reviewer stated, many of the review... 0.9997
8756 5 At last we find a food fit for human consumption And we find her here upon this grand amazon!<b... 0.9997
539 5 C H I P.....C H I P.....H O O R A Y....!!!!!  ... I was getting VERY worried, when I read some o... 0.9989
Score Summary Text compound
6504 1 NOT GLUTEN FREE! After eating this product, I had a stomach ach... -0.9875
1113 1 PLEASE!!!!!Don't waste your money Bought Uncle Ray's Wavy's 16 oz potato chips b... -0.9824
3044 5 The best bad breath and no fowl after taste Took me years to find the ultimate chewing gum... -0.9803

Here are the 3 most positive and 3 most negative reviews by VADER compound score from the notebook output.

Most positive:

  1. Score 1, “NOT what I originally ordered! BOO, amazon!”, compound = 0.9997
    This is a strong example of sentiment mismatch: the written language is emotionally intense enough for VADER to score it as extremely positive, even though the star rating is 1.

  2. Score 5, “At last we find a food fit for human consumption”, compound = 0.9997
    Extremely positive language and a matching 5-star rating.

  3. Score 5, “C H I P.....C H I P.....H O O R A Y....!!!!!”, compound = 0.9989
    Very enthusiastic wording, punctuation, and emphasis likely pushed the sentiment score near the maximum.

Most negative:

  1. Score 1, “NOT GLUTEN FREE!”, compound = -0.9875
    Very strongly negative wording, and the star rating aligns with that sentiment.

  2. Score 1, “PLEASE!!!!!Don't waste your money”, compound = -0.9824
    Clear negative language with strong emphasis, again consistent with the 1-star rating.

  3. Score 5, “The best bad breath and no fowl after taste”, compound = -0.9803
    Another mismatch: despite the 5-star rating, the wording contains terms like “bad” that VADER likely interprets negatively without understanding the product context.

These results show that the ranking worked, but they also highlight a limitation of lexicon-based sentiment: a few reviews with sarcasm, unusual phrasing, or domain-specific wording can look inconsistent with their star ratings.
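
This limitation is inherent to lexicon-based scoring: each word carries a fixed valence, so domain phrases like "bad breath" (a product benefit here) pull the score down. A minimal toy scorer with a tiny made-up lexicon illustrates the mechanics; this is not VADER's actual algorithm, which additionally handles negation, punctuation, and intensifiers:

```python
# Tiny illustrative lexicon (valences are made up, not VADER's)
LEXICON = {"best": 2.0, "good": 1.5, "bad": -2.0, "waste": -2.5}

def toy_sentiment(text: str) -> float:
    """Sum word valences found in a tiny lexicon; 0.0 means neutral."""
    words = text.lower().replace("!", "").replace(".", "").split()
    return sum(LEXICON.get(w, 0.0) for w in words)

# "bad breath" is a benefit in context (gum that fights bad breath),
# but a word-level lexicon still counts "bad" as negative.
print(toy_sentiment("It fights bad breath"))  # "bad" is the only lexicon hit, so -2.0
```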

Score cards for gpt-5.4 for Sentiment Analysis of Amazon Reviews

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The workflow loads the dataset, shows shape and rating distribution, computes VADER sentiment scores, visualizes sentiment vs. ratings, and identifies the 3 most positive and 3 most negative reviews.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically valid and the outputs indicate it ran successfully, including dataset loading, sentiment scoring, plotting, and review ranking.

  • Output Quality: 3/3

    Excellent

    All expected outcomes are present: shape, rating histogram/table, sentiment columns added, sentiment comparison plot, and six review excerpts with scores. The outputs are semantically aligned with the task.

  • Reasoning Quality: 2/2

    Excellent

    The explanations correctly connect the rating skew and sentiment trends, and the interpretation of VADER scores is consistent with the outputs shown.

  • Reliability: 1/1

    Excellent

    The workflow is reasonably robust and uses standard libraries and straightforward logic, with no obvious hallucinated steps or unstable assumptions.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.