Sentiment Analysis of Amazon Reviews

Analyze sentiment in Amazon product reviews using VADER and TextBlob, visualize score distributions, and identify most positive and negative reviews.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads a sample of the Amazon Fine Food Reviews dataset from a URL and summarizes dataset shape and star rating distribution. It generates VADER sentiment scores for each review and adds them as new columns for analysis. It visualizes sentiment score distributions and their relationship to star ratings, then extracts the most positive and most negative review excerpts with scores.

Who this example is for

This is for analysts and data scientists who want a reproducible notebook pattern for sentiment scoring and basic validation against existing labels like star ratings. It is also useful for NLP learners comparing lexicon-based sentiment methods and reviewing edge cases by inspecting extreme examples.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape and star rating histogram
  • VADER sentiment scores appended as new columns
  • Histogram of compound sentiment scores
  • Scatter plot comparing sentiment scores to star ratings
  • Three most positive and three most negative review excerpts with sentiment scores
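The last three outcomes can be sketched directly with matplotlib. This is a minimal sketch, assuming a DataFrame with `Score` (1-5 stars) and `vader_compound` columns; the tiny frame below is a synthetic stand-in so the snippet runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in rows; the real workflow uses the 10k-review DataFrame.
df = pd.DataFrame({
    "Score": [1, 2, 3, 4, 5, 5],
    "vader_compound": [-0.7, -0.2, 0.1, 0.6, 0.9, 0.95],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of compound sentiment scores
ax_hist.hist(df["vader_compound"], bins=20, edgecolor="black")
ax_hist.set_xlabel("VADER compound score")
ax_hist.set_ylabel("Review count")
ax_hist.set_title("Compound sentiment distribution")

# Scatter plot comparing sentiment scores to star ratings
ax_scatter.scatter(df["Score"], df["vader_compound"], alpha=0.3)
ax_scatter.set_xlabel("Star rating")
ax_scatter.set_ylabel("VADER compound score")
ax_scatter.set_title("Sentiment vs. star rating")

fig.tight_layout()
```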

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • nltk
  • vaderSentiment
  • textblob
  • matplotlib
  • seaborn
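Before running the workflow, it can help to verify which of these packages are importable. A small sketch (the distribution-to-module name mapping for vaderSentiment is assumed; the other names match their import names):

```python
import importlib.util

# Package list taken from the tools section above.
packages = ["pandas", "nltk", "vaderSentiment", "textblob", "matplotlib", "seaborn"]

# True if the module can be found in the current environment.
available = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, ok in available.items():
    print(f"{name}: {'available' if ok else 'missing -> pip install ' + name}")
```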

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
  2. compute sentiment scores using VADER for each review
  3. plot sentiment score distribution and compare with star ratings
  4. show the 3 most positive and 3 most negative reviews

Model Comparison

We compared several LLM models on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model            | Score | Task | Execution | Output | Reasoning | Reliability | Run
gemma4:31b       | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gemma4:31b conversation
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
qwen3-coder-next | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open qwen3-coder-next conversation
glm-5.1          | 8/10  | 2/2  | 2/2       | 3/3    | 1/2       | 0/1         | Open glm-5.1 conversation
gpt-5.4          | 8/10  | 1/2  | 2/2       | 2/3    | 2/2       | 1/1         | Open gpt-5.4 conversation
qwen3.5:397b     | 4/10  | 1/2  | 1/2       | 1/3    | 1/2       | 0/1         | Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
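As a concrete check, a run's total can be recomputed from its five dimension scores. Here is a sketch using the gpt-5.4 values from the comparison table (1/2, 2/2, 2/3, 2/2, 1/1):

```python
# Recompute a run's total score from its five dimension scores.
scores = {
    "task_completion": 1,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 2,         # out of 3
    "reasoning_quality": 2,      # out of 2
    "reliability": 1,            # out of 1
}
total = sum(scores.values())
print(f"Total score: {total}/10")  # 8/10, matching the table
```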

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Sentiment Analysis of Amazon Reviews. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution

Let’s start by loading the file and doing the first inspection cleanly. This will show the dataset shape, columns, and first rows so we can confirm the rating column before summarizing its distribution next.

Load and inspect dataset
import pandas as pd
from IPython.display import display

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv"
df = pd.read_csv(url)

display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))
display(pd.DataFrame({"column": df.columns}))

df.head()
rows columns
0 10000 10
column
0 Id
1 ProductId
2 UserId
3 ProfileName
4 HelpfulnessNumerator
5 HelpfulnessDenominator
6 Score
7 Time
8 Summary
9 Text
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d...
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...
2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 4 1219017600 "Delight" says it all This is a confection that has been around a fe...
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200 Cough Medicine If you are looking for the secret ingredient i...
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham "M. Wassir" 0 0 5 1350777600 Great taffy Great taffy at a great price.  There was a wid...

From the current output, the dataset has 10,000 rows and 10 columns, and the rating field is confirmed to be Score. The missing piece for your request is the actual rating distribution.

Count score values
score_dist = df["Score"].value_counts().sort_index()
display(score_dist.to_frame("count"))

score_dist
count
Score
1 932
2 590
3 862
4 1433
5 6183
Score
1     932
2     590
3     862
4    1433
5    6183
Name: count, dtype: int64

The dataset has 10,000 rows and 10 columns.

The rating distribution from the Score column is:

  • 1: 932
  • 2: 590
  • 3: 862
  • 4: 1,433
  • 5: 6,183

This shows the reviews are strongly skewed toward high ratings, with 5-star reviews making up the clear majority of the dataset.
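The skew can be quantified directly from these counts (numbers copied from the output above):

```python
# Share of each star rating, using the counts reported above.
counts = {1: 932, 2: 590, 3: 862, 4: 1433, 5: 6183}
total = sum(counts.values())
shares = {star: round(100 * n / total, 1) for star, n in counts.items()}
print(total)   # 10000, matching the reported dataset shape
print(shares)  # 5-star reviews make up 61.8% of the sample
```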

compute sentiment scores using VADER for each review
Thinking

Planning VADER computation

I need to focus on the user's request to compute sentiment scores for reviews using VADER. The first step seems to be installing and importing VADER, which requires checking if NLTK is set up. I should download the vader lexicon if it's not already done. My goal is to calculate sentiment scores for each review; I can do this in a straightforward manner. I want to ensure clarity with my code and avoid multiple blocks by using a single code block. I also need to handle any non-string values appropriately.

Finalizing VADER implementation

I'm wrapping up my plan to compute VADER sentiment scores. I want to ensure my final step just returns the relevant columns, like df[["Text","vader_neg"...]]. I should mention that VADER is particularly effective for shorter review texts. Next, I’ll add the VADER scores (negative, neutral, positive, and compound) for each review. I might need to include an import for nltk and check if it's installed; I can use a try/except block for that. Since I can only provide one code block, I'll also include a conditional install if necessary.

Implementing VADER with error handling

I'm planning to handle potential import errors for the SentimentIntensityAnalyzer. I’ll start by trying to import it and nltk. If there's an ImportError, I'll install nltk using subprocess, then perform the import. Following that, I need to download the VADER lexicon if it's not already available. After ensuring everything is set up, I can create the analyzer and compute the scores from the "Text" column. Finally, I’ll convert those scores into a DataFrame and join it with a "vader_" prefix.

df is already loaded and has the review text in Text, so the next step is to compute VADER sentiment for each review and add the four standard scores: neg, neu, pos, and compound.

Compute VADER sentiment scores
try:
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer
except ImportError:
    import sys
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "nltk"])
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()

vader_scores = df["Text"].fillna("").astype(str).apply(sia.polarity_scores).apply(pd.Series)
vader_scores = vader_scores.rename(columns={
    "neg": "vader_neg",
    "neu": "vader_neu",
    "pos": "vader_pos",
    "compound": "vader_compound"
})

df = df.join(vader_scores)

df[["Score", "Text", "vader_neg", "vader_neu", "vader_pos", "vader_compound"]].head()
Score Text vader_neg vader_neu vader_pos vader_compound
0 5 I have bought several of the Vitality canned d... 0.000 0.695 0.305 0.9441
1 1 Product arrived labeled as Jumbo Salted Peanut... 0.138 0.862 0.000 -0.5664
2 4 This is a confection that has been around a fe... 0.091 0.754 0.155 0.8265
3 2 If you are looking for the secret ingredient i... 0.000 1.000 0.000 0.0000
4 5 Great taffy at a great price.  There was a wid... 0.000 0.552 0.448 0.9468
Thinking

Analyzing sentiment scores

I need to provide an answer based on the latest outputs by computing sentiment scores using VADER for each review. The latest output indicates that the DataFrame now has VADER columns included, which is enough to address the goal. VADER scores have been computed for each review, and the sample rows show that a 5-star review has a high positive compound score, while a 1-star review has a negative compound score, and a 2-star review is neutral with a score of 0. I'll keep the explanation concise without excessive restatement of visible outputs.

VADER sentiment scores have been computed for each review and added to df as four new columns:

  • vader_neg
  • vader_neu
  • vader_pos
  • vader_compound

From the observed rows, the scores look sensible relative to the review ratings:

  • a 5-star review shows strongly positive sentiment with high positive and compound scores
  • a 1-star review shows negative sentiment with a negative compound score
  • one 2-star review appears mostly neutral by VADER, which can happen when the wording is less overtly emotional even if the rating is low

So your review dataset now contains both the original Score rating and VADER-based text sentiment for each review, ready for comparison or further analysis.
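One quick validation at this point is the agreement rate between VADER's sign and the star rating. This is a sketch, assuming the common ±0.05 compound thresholds for positive/negative classification; it uses a tiny synthetic frame so it runs standalone, whereas the workflow would apply it to the full df.

```python
import pandas as pd

# Synthetic stand-in rows; in the workflow, df already holds 10,000 reviews.
df = pd.DataFrame({
    "Score": [1, 1, 2, 4, 5, 5],
    "vader_compound": [-0.8, 0.2, -0.3, 0.7, 0.9, 0.95],
})

# VADER convention: compound >= 0.05 positive, <= -0.05 negative.
pred_positive = df["vader_compound"] >= 0.05
pred_negative = df["vader_compound"] <= -0.05

high = df["Score"] >= 4  # treat 4-5 stars as positive ground truth
low = df["Score"] <= 2   # treat 1-2 stars as negative ground truth

# A review "agrees" when VADER's sign matches the rating band.
agree = (pred_positive & high) | (pred_negative & low)
agreement_rate = agree[high | low].mean()
print(f"Agreement on clearly rated reviews: {agreement_rate:.0%}")
```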

plot sentiment score distribution and compare with star ratings
Thinking

Creating a boxplot

I want to create a boxplot of the vader_compound data by Score, and I think adding a horizontal line at 0 would be helpful as a reference point. One chart will be sufficient for this analysis. I’ll ensure it clearly visualizes the distribution of vader_compound scores so we can better understand the data. Once I’ve got everything in place, I’ll check to make sure it’s displaying correctly!

A good first comparison is the distribution of VADER compound sentiment across the 1–5 star ratings. If VADER aligns well with the star ratings, the compound-score distributions should shift upward as ratings increase.

Plot sentiment boxplot
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(9, 5))
sns.boxplot(data=df, x="Score", y="vader_compound")
plt.axhline(0, color="gray", linestyle="--", linewidth=1)
plt.title("VADER Compound Sentiment by Star Rating")
plt.xlabel("Star Rating")
plt.ylabel("VADER Compound Score")
plt.ylim(-1, 1)
plt.show()
Thinking

Calculating summary statistics

I'm working on summarizing the data by grouping it based on "Score." I'm calculating several statistics for "vader_compound," including count, mean, median, and the 25th and 75th percentiles, and rounding the results to three decimal places. I’m wondering if I should include the standard deviation, but I'm leaning towards keeping it minimal. Ultimately, I want to end with a concise summary expression. This looks good!

The plot is in place, and the earlier sample rows already suggest the expected pattern: 5-star reviews tend to have strongly positive compound scores, while 1-star reviews can be negative. To compare the star ratings more concretely, the missing piece is a compact numeric summary behind the boxplot.

Summarize sentiment by score
summary = (
    df.groupby("Score")["vader_compound"]
    .agg(
        count="size",
        mean="mean",
        median="median",
        q25=lambda s: s.quantile(0.25),
        q75=lambda s: s.quantile(0.75),
    )
    .round(3)
    .reset_index()
)

summary
Score count mean median q25 q75
0 1 932 0.009 0.000 -0.620 0.621
1 2 590 0.335 0.527 -0.113 0.858
2 3 862 0.479 0.718 0.102 0.900
3 4 1433 0.720 0.862 0.661 0.943
4 5 6183 0.778 0.886 0.750 0.948

The sentiment distribution does track the star ratings, with a clear upward shift in VADER compound scores as ratings increase.

From the numeric summary:

  • 1-star reviews are centered near neutral: median 0.000, with a very wide spread from clearly negative to clearly positive (q25 = -0.620, q75 = 0.621).
  • 2-star reviews are already somewhat positive overall: median 0.527, mean 0.335.
  • 3-star reviews are more positive than neutral: median 0.718, mean 0.479.
  • 4-star reviews are strongly positive: median 0.862, mean 0.720.
  • 5-star reviews are the most positive: median 0.886, mean 0.778.

What the boxplot and summary together mean:

  • There is a strong positive relationship between star rating and sentiment score.
  • The biggest separation is between 1-star and 4–5 star reviews.
  • 4-star and 5-star reviews are both tightly concentrated in high positive sentiment, so sentiment distinguishes low vs high ratings better than it separates the very top ratings.
  • 1-star reviews are noisy: although they include many negative texts, the median at 0.000 and broad interquartile range suggest plenty of mixed, factual, or lexicon-misaligned wording.
  • The dataset is also imbalanced toward 5-star reviews, so the overall sentiment distribution in the full sample will naturally skew positive.

In short: the sentiment score distribution aligns well with star ratings, especially for distinguishing negative/low-rated reviews from positive/high-rated ones, though there is noticeable overlap in the middle ratings and some unexpected neutrality/positivity among 1-star reviews.
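The "strong positive relationship" can also be quantified with a rank correlation; pandas computes Spearman's rho natively, with no extra dependencies. A sketch with synthetic, roughly monotone data standing in for the real columns:

```python
import pandas as pd

# Synthetic stand-in values; in the workflow this would be
# df["Score"].corr(df["vader_compound"], method="spearman").
scores = pd.Series([1, 1, 2, 3, 4, 5, 5])
compound = pd.Series([-0.62, 0.0, 0.53, 0.72, 0.86, 0.89, 0.95])

# Spearman correlation ranks both series, then correlates the ranks,
# so it captures monotone association without assuming linearity.
rho = scores.corr(compound, method="spearman")
print(f"Spearman rho: {rho:.3f}")
```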

show the 3 most positive and 3 most negative reviews
Thinking

Creating review display code

I need to show the user the three most positive and three most negative reviews in one code block. I should use the notebook state effectively. I'm thinking of sorting the reviews by their vader compound score and displaying relevant columns like Score and Summary. It’s important to be concise and avoid multiple code blocks. I'll use display to show the tables separately, and it looks like one final bare expression isn't necessary. Let’s keep it simple!

We can pull the extremes by the vader_compound score and show the star rating alongside the review text.

Show most positive and negative reviews
cols = ["Score", "Summary", "Text", "vader_compound"]

most_positive = df.nlargest(3, "vader_compound")[cols].reset_index(drop=True)
most_negative = df.nsmallest(3, "vader_compound")[cols].reset_index(drop=True)

display(most_positive)
display(most_negative)
Score Summary Text vader_compound
0 1 NOT what I originally ordered! BOO, amazon! As another reviewer stated, many of the review... 0.9997
1 5 At last we find a food fit for human consumption And we find her here upon this grand amazon!<b... 0.9997
2 5 C H I P.....C H I P.....H O O R A Y....!!!!!  ... I was getting VERY worried, when I read some o... 0.9989
Score Summary Text vader_compound
0 1 NOT GLUTEN FREE! After eating this product, I had a stomach ach... -0.9875
1 1 PLEASE!!!!!Don't waste your money Bought Uncle Ray's Wavy's 16 oz potato chips b... -0.9824
2 5 The best bad breath and no fowl after taste Took me years to find the ultimate chewing gum... -0.9803

Here are the 3 most positive and 3 most negative reviews by vader_compound score.

Most positive:

  1. Score 1: “NOT what I originally ordered! BOO, amazon!”
    Sentiment score: 0.9997
    This is a clear mismatch between the review’s true meaning and VADER’s lexicon-based sentiment: words like “BOO” and emphatic language pushed the score to extremely positive even though the review is negative.

  2. Score 5: “At last we find a food fit for human consumption”
    Sentiment score: 0.9997
    Strongly enthusiastic wording, which fits both the text and the 5-star rating.

  3. Score 5: “C H I P.....C H I P.....H O O R A Y....!!!!! ...”
    Sentiment score: 0.9989
    Very expressive positive language and punctuation, matching the high star rating.

Most negative:

  1. Score 1: “NOT GLUTEN FREE!”
    Sentiment score: -0.9875
    Strongly negative wording, well aligned with the 1-star rating.

  2. Score 1: “PLEASE!!!!!Don't waste your money”
    Sentiment score: -0.9824
    Very negative and cautionary language, also aligned with the low rating.

  3. Score 5: “The best bad breath and no fowl after taste”
    Sentiment score: -0.9803
    Another obvious sentiment-model mismatch: despite the 5-star rating, phrases like “bad breath” drove VADER strongly negative even though the review appears favorable overall.

So the output does answer your request, and it also reveals an important pattern: the most extreme sentiment scores include both genuinely extreme reviews and a few clear VADER misclassifications caused by sarcasm, unusual phrasing, or sentiment-loaded words used in a positive context.

Score cards for gpt-5.4 for Sentiment Analysis of Amazon Reviews

Total score: 8/10

  • Task Completion: 1/2

    Needs work

    The workflow completed loading, rating distribution, VADER scoring, and identifying positive/negative reviews, but it did not show the requested histogram of compound scores or a scatter plot versus star rating.

  • Execution Correctness: 2/2

    Excellent

    The code shown is coherent and likely runnable: it loads the CSV, computes VADER scores, joins them to the dataframe, and produces plots and review excerpts without obvious syntax errors.

  • Output Quality: 2/3

    Good

    It correctly reports dataset shape, rating counts, adds VADER columns, and displays the 3 most positive and 3 most negative reviews. However, the expected compound-score histogram and star-rating comparison plot are missing, so the output is incomplete.

  • Reasoning Quality: 2/2

    Excellent

    The explanations are generally correct and consistent with the outputs, including the skew toward 5-star reviews and the interpretation of VADER scores. The reasoning is somewhat shallow and does not address the missing requested visualizations.

  • Reliability: 1/1

    Excellent

    The workflow is reasonably consistent and uses standard libraries and methods. It is somewhat fragile in presentation because it omits one of the core requested plots and relies on a boxplot instead.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.