Sentiment Analysis of Amazon Reviews

Analyze sentiment in Amazon product reviews using VADER and TextBlob, visualize score distributions, and identify most positive and negative reviews.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads a sample of the Amazon Fine Food Reviews dataset from a URL and summarizes dataset shape and star rating distribution. It generates VADER sentiment scores for each review and adds them as new columns for analysis. It visualizes sentiment score distributions and their relationship to star ratings, then extracts the most positive and most negative review excerpts with scores.

Who this example is for

This is for analysts and data scientists who want a reproducible notebook pattern for sentiment scoring and basic validation against existing labels like star ratings. It is also useful for NLP learners comparing lexicon-based sentiment methods and reviewing edge cases by inspecting extreme examples.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Dataset shape and star rating histogram
  • VADER sentiment scores appended as new columns
  • Histogram of compound sentiment scores
  • Scatter plot comparing sentiment scores to star ratings
  • Three most positive and three most negative review excerpts with sentiment scores

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • pandas
  • nltk
  • vaderSentiment
  • textblob
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
  2. compute sentiment scores using VADER for each review
  3. plot sentiment score distribution and compare with star ratings
  4. show the 3 most positive and 3 most negative reviews

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model             Score  Task  Execution  Output  Reasoning  Reliability  Run
gemma4:31b        10/10  2/2   2/2        3/3     2/2        1/1          Open gemma4:31b conversation
gpt-oss:120b      10/10  2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
qwen3-coder-next   9/10  2/2   2/2        3/3     1/2        1/1          Open qwen3-coder-next conversation
glm-5.1            8/10  2/2   2/2        3/3     1/2        0/1          Open glm-5.1 conversation
gpt-5.4            8/10  1/2   2/2        2/3     2/2        1/1          Open gpt-5.4 conversation
qwen3.5:397b       4/10  1/2   1/2        1/3     1/2        0/1          Open qwen3.5:397b conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
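The arithmetic is a straight sum of the five dimensions. A one-line sanity check, using the gemma4:31b run's scores from the comparison table as example values:

```python
# Dimension scores for one run (values taken from the comparison table)
dimensions = {"task": 2, "execution": 2, "output": 3, "reasoning": 2, "reliability": 1}
total = sum(dimensions.values())
print(f"Total score: {total}/10")  # → Total score: 10/10
```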

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Sentiment Analysis of Amazon Reviews. Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load the reviews dataset https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv and show shape and rating distribution
Thinking
  • Task: Load the amazon fine food reviews dataset from a provided URL.
    • Requirements: Show the shape of the dataframe and the rating distribution.

    • Constraints:

      • Iterative loop.
      • One code block per response.
      • No $ for currency (though not relevant here).
      • Load and inspect first (shape, columns, head) before doing more complex things like plotting, but the user explicitly asked for shape and rating distribution.
      • Wait, the system prompt says: "When a dataset is not yet loaded in current notebook context, first load and inspect only: show shape, columns, and head(). Do not combine first-load data inspection with plotting/modeling in the same response."
    • User Request: Load + Shape + Rating Distribution.

    • System Prompt Rule: "When a dataset is not yet loaded... first load and inspect only... Do not combine first-load data inspection with plotting/modeling in the same response."

    • The user wants the rating distribution (a plot/analysis), but the system rules forbid combining the first load/inspect with plotting.

    • Therefore, I must first load the data, show the shape, columns, and head.

    • Then, in the next turn, I will handle the rating distribution.

    • Load CSV from URL.

    • Display shape.

    • Display columns.

    • Display head.

    • Dataset URL: https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv

    • Variable name: df (simple).

    • Explain that I'll start by loading the data and inspecting the structure first, then I'll handle the rating distribution in the next step.

I'll start by loading the dataset and inspecting its basic structure to ensure everything is correct before we move on to the rating distribution.

Load and preview data
import pandas as pd

url = "https://raw.githubusercontent.com/pplonski/datasets-for-start/refs/heads/master/amazon-fine-food-reviews/amazon_fine_food_reviews_10k.csv"
df = pd.read_csv(url)

print(f"Dataset Shape: {df.shape}")
display(df.head())
display(df.columns)
Dataset Shape: (10000, 10)
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d...
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...
2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 4 1219017600 "Delight" says it all This is a confection that has been around a fe...
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200 Cough Medicine If you are looking for the secret ingredient i...
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham "M. Wassir" 0 0 5 1350777600 Great taffy Great taffy at a great price.  There was a wid...
Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='str')
Thinking

The user wants to load the dataset, show its shape, and show the rating distribution. From the previous output:

  • Dataset loaded.
  • Shape is (10000, 10).
  • Columns are identified, and Score appears to be the rating column.

The "rating distribution" part of the goal has not been addressed yet. I need to visualize or count the distribution of the Score column.

Decision: Path B. I need to create the rating distribution.

The dataset contains 10,000 reviews across 10 columns, with the Score column representing the user ratings. To complete the request, I will now plot the distribution of these ratings.

Plot rating distribution
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='Score', palette='viridis')
plt.title('Distribution of Ratings')
plt.xlabel('Rating Score')
plt.ylabel('Count')
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_6771/3531474681.py:5: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=df, x='Score', palette='viridis')
Thinking
  • Original goal: Load dataset, show shape, and show rating distribution.
    • Executed steps:

      1. Loaded dataset, printed shape (10000, 10), showed head(), and listed columns.
      2. Plotted a count plot of the Score column.
    • Latest state: Shape is known, chart is created.

    • Sufficiency Heuristic: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

    • Observation: The plot shows the distribution, but the exact counts for each rating score are not printed as text/table. To be thorough and satisfy the sufficiency heuristic, I should provide the numerical distribution.

    • Path B: Need the numerical counts of the Score column to complement the chart.

    • Insight: The dataset has 10,000 entries. The rating distribution is visualized, but the precise counts for each score are not yet explicitly listed.

    • Code: df['Score'].value_counts().sort_index()

The dataset contains 10,000 reviews. While the plot shows the general distribution of ratings, providing the exact counts for each score will give a precise understanding of the balance between positive and negative reviews.

Count scores by value
df['Score'].value_counts().sort_index()
Score
1     932
2     590
3     862
4    1433
5    6183
Name: count, dtype: int64
Thinking
  • Original goal: "load the reviews dataset [URL] and show shape and rating distribution".
    • Latest state:

      • Cell 1: Loaded dataset, printed shape (10000, 10), displayed head() and columns.
      • Cell 2: Plotted the distribution of Score using sns.countplot.
      • Cell 3: Printed df['Score'].value_counts().sort_index(), providing exact counts for each rating (1: 932, 2: 590, 3: 862, 4: 1433, 5: 6183).
    • "Load the reviews dataset": Done (Cell 1).

    • "Show shape": Done (Cell 1 output: (10000, 10)).

    • "Rating distribution": Done (Cell 2 plot, Cell 3 numeric counts).

    • The latest output provides both the visual distribution and the precise numeric breakdown of ratings.

    • The shape has been explicitly printed.

    • Everything requested in the original goal has been delivered.

    • The dataset consists of 10,000 reviews.

    • The ratings are heavily skewed towards positive feedback, with 5-star reviews being the most frequent by a significant margin (6,183 reviews).

    • Lower ratings (1-3 stars) are much less common, totaling 2,384 reviews.

    • Path A: Provide a natural final answer based on observed outputs.

    • Include <!--final answer-->.

    • No Python code.

The dataset was successfully loaded with a shape of 10,000 rows and 10 columns. The rating distribution is heavily skewed toward positive reviews, with 5-star ratings being the most frequent at 6,183 entries, followed by 4-star ratings at 1,433. Lower ratings (1-3 stars) are significantly less common, accounting for the remainder of the dataset.

compute sentiment scores using VADER for each review
Thinking

The user wants to compute sentiment scores using VADER for each review in the dataset. The dataset df is already loaded and contains a column Text (which likely contains the review content).

Plan:

  1. Install/Import nltk and the SentimentIntensityAnalyzer from nltk.sentiment.vader.
  2. Download the VADER lexicon.
  3. Apply the analyzer to the Text column to get the compound score.
  4. Add this score as a new column to the dataframe.

Constraints check:

  • One execution step at a time.
  • Markdown for text, Python for code.
  • Currency formatting rule: not applicable here.
  • No $ in prose for money.
  • ML training safeguards: not training yet, just feature engineering.
  • Visualization: not requested yet.

Step 1: Initialize VADER and calculate the sentiment scores for the Text column.

To compute sentiment scores, I'll use the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon from the nltk library, which is specifically designed for social media and product reviews.

Calculate sentiment scores
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon
nltk.download('vader_lexicon')

# Initialize VADER analyzer
analyzer = SentimentIntensityAnalyzer()

# Define a function to get the compound sentiment score
def get_sentiment(text):
    return analyzer.polarity_scores(str(text))['compound']

# Apply the function to the Text column
df['Sentiment_Score'] = df['Text'].apply(get_sentiment)

df[['Text', 'Sentiment_Score']].head()
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/olunia/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
Text Sentiment_Score
0 I have bought several of the Vitality canned d... 0.9441
1 Product arrived labeled as Jumbo Salted Peanut... -0.5664
2 This is a confection that has been around a fe... 0.8265
3 If you are looking for the secret ingredient i... 0.0000
4 Great taffy at a great price.  There was a wid... 0.9468
Thinking

The user wants to compute sentiment scores using VADER for each review. Looking at the notebook state:

  • cell-4 contains the code to initialize VADER, define get_sentiment, and apply it to the Text column to create df['Sentiment_Score'].
  • output-4 shows the results of df[['Text', 'Sentiment_Score']].head(), demonstrating that sentiment scores have been computed and assigned to the reviews (e.g., 0.9441, -0.5664, 0.8265, 0.0000, 0.9468).

The original goal was "compute sentiment scores using VADER for each review". This has been achieved. The Sentiment_Score column now exists in the dataframe for all 10,000 rows.

Decision: Path A.

The sentiment scores have been successfully computed for all reviews using VADER. The results show a range of values from -1 (most negative) to 1 (most positive), with 0 indicating neutral sentiment. For example, the first review has a high positive score of 0.9441, while the second review is negative at -0.5664.

plot sentiment score distribution and compare with star ratings
Thinking

The user wants to plot the distribution of the Sentiment_Score (calculated using VADER) and compare it with the star ratings (Score).

Looking at the notebook state:

  • df is already loaded.
  • Score (star ratings) exists.
  • Sentiment_Score was created in cell-4.

The goal is to visualize the relationship/comparison between these two. A good way to do this is to use a joint plot or two subplots: one for the distribution of sentiment scores (histogram/KDE) and one for the relationship between Score and Sentiment_Score (boxplot or scatter plot). A boxplot of Sentiment_Score grouped by Score is usually the most effective way to compare sentiment across different star ratings.

Plan:

  1. Create a boxplot where x-axis is Score and y-axis is Sentiment_Score.
  2. This will show if higher star ratings correlate with higher sentiment scores.

Wait, the user asked to "plot sentiment score distribution AND compare with star ratings". Maybe a combination: a histogram for the distribution and a boxplot for the comparison. Or a violin plot.

Let's go with a boxplot for the comparison and a histogram/kde for the distribution. To keep it to one chart as per policy, a boxplot is the most direct way to "compare" the two. However, a distribution plot of Sentiment_Score is also requested.

I'll use a figure with two subplots: one for the distribution (hist) and one for the comparison (boxplot). But the instructions say "Default to one chart OR one table per step". Let's start with the boxplot as it serves the "compare" part and implicitly shows the distribution per rating. Actually, a violin plot might be better to see the distribution across ratings.

Let's choose a boxplot for clarity.

Wait, "plot sentiment score distribution" (singular) suggests a histogram. "and compare with star ratings" suggests a relationship. I will provide a boxplot of Sentiment_Score by Score. This allows seeing the distribution of sentiment for each star rating.

Actually, let's stick to the "one chart" rule. I'll create a boxplot. It's the best way to compare a continuous variable (Sentiment_Score) across categorical-like ordered groups (Score).

Step: Use seaborn to create a boxplot of Sentiment_Score for each Score.

To compare the VADER sentiment scores with the star ratings, a boxplot is the most effective visualization. It allows us to see the distribution of sentiment scores for each rating level and verify if higher star ratings actually correspond to more positive sentiment.

Plot sentiment score distribution
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Score', y='Sentiment_Score', palette='coolwarm')
plt.title('Sentiment Score Distribution by Star Rating')
plt.xlabel('Star Rating (Score)')
plt.ylabel('VADER Sentiment Score')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_6771/663494595.py:2: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=df, x='Score', y='Sentiment_Score', palette='coolwarm')
Thinking

The user wants to plot sentiment score distribution and compare it with star ratings. Looking at the notebook state:

  • Cell 4 computed Sentiment_Score using VADER.
  • Cell 5 created a boxplot of Sentiment_Score grouped by Score (star rating).
  • Output 5 shows the figure.

The original goal was to "plot sentiment score distribution and compare with star ratings". The visual is present (boxplot), but the "Sufficiency heuristics" state: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

The user has a plot, but not the actual numeric summary (e.g., median sentiment per star rating) to support the visual conclusion with data.

Path B is required to provide the numeric summary.

The boxplot shows a clear positive correlation between star ratings and VADER sentiment scores, with the median sentiment increasing as the rating increases. To quantify this relationship and provide a precise comparison, I will calculate the median and mean sentiment scores for each rating level.

Summarize sentiment by score
df.groupby('Score')['Sentiment_Score'].agg(['mean', 'median', 'std']).round(3)
mean median std
Score
1 0.009 0.000 0.646
2 0.335 0.527 0.585
3 0.479 0.718 0.536
4 0.720 0.862 0.358
5 0.778 0.886 0.302
Thinking

The user wants to "plot sentiment score distribution and compare with star ratings".

Looking at the notebook state:

  1. cell-4: Computed Sentiment_Score using VADER.
  2. cell-5: Plotted a boxplot showing Sentiment_Score distribution across different Score (star ratings).
  3. cell-6: Calculated the mean, median, and std of Sentiment_Score grouped by Score.

Observations from output-6:

  • Score 1: Mean 0.009, Median 0.000
  • Score 2: Mean 0.335, Median 0.527
  • Score 3: Mean 0.479, Median 0.718
  • Score 4: Mean 0.720, Median 0.862
  • Score 5: Mean 0.778, Median 0.886

Insight: There is a strong positive correlation between star ratings and sentiment scores. As the star rating increases, both the mean and median sentiment scores increase consistently. The standard deviation decreases for higher ratings, suggesting that positive reviews are more consistent in their sentiment than lower-rated reviews.

The goal was to "plot sentiment score distribution and compare with star ratings".

  • Distribution plot provided? Yes (boxplot in cell-5).
  • Comparison with star ratings? Yes (boxplot and aggregation table in cell-6).

The outputs are sufficient to answer the goal. I should provide a final answer summarizing these findings.

The analysis shows a strong positive correlation between star ratings and VADER sentiment scores. The boxplot illustrates a clear upward trend in sentiment as ratings increase, and the summary statistics confirm this: the median sentiment score rises from 0.000 for 1-star reviews to 0.886 for 5-star reviews. Additionally, the sentiment becomes more consistent at higher ratings, as indicated by the decrease in standard deviation from 0.646 (1-star) to 0.302 (5-star).
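The claimed correlation can also be quantified with a single coefficient. A sketch using Spearman rank correlation, which suits ordinal star ratings; the small frame here is hypothetical data mirroring the notebook's `Score` and `Sentiment_Score` columns, not actual run output.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's Score / Sentiment_Score columns
df = pd.DataFrame({
    "Score":           [1, 1, 2, 3, 3, 4, 4, 5, 5, 5],
    "Sentiment_Score": [-0.6, 0.0, 0.3, 0.5, 0.7, 0.7, 0.9, 0.8, 0.9, 0.95],
})

# Spearman is preferred here: star ratings are ordinal, not interval-scaled
rho = df["Score"].corr(df["Sentiment_Score"], method="spearman")
print(f"Spearman correlation: {rho:.3f}")
```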

show the 3 most positive and 3 most negative reviews
Thinking

The user wants to see the 3 most positive and 3 most negative reviews from the dataset. Looking at the current state:

  • df is loaded.
  • Sentiment_Score has been calculated for each review.
  • I need to sort the dataframe by Sentiment_Score to find the top 3 and bottom 3.

Plan:

  1. Sort the dataframe by Sentiment_Score.
  2. Extract the top 3 (highest scores) and bottom 3 (lowest scores).
  3. Display the Text and Sentiment_Score for these reviews.

Wait, since I can only return one code block, I can do this in one go. I'll create two separate slices or use pd.concat to show them together, or just a simple selection. To make it clear, I'll use nlargest and nsmallest.

Columns to show: Text, Sentiment_Score, and maybe Score (star rating) to see if they align.

Step: Retrieve the top 3 and bottom 3 reviews based on Sentiment_Score.

To identify the most extreme reviews, I will extract the rows with the highest and lowest sentiment scores, including the text and the star rating for context.

Get extreme sentiment reviews
# Get the 3 most positive and 3 most negative reviews
top_positive = df.nlargest(3, 'Sentiment_Score')[['Text', 'Sentiment_Score', 'Score']]
top_negative = df.nsmallest(3, 'Sentiment_Score')[['Text', 'Sentiment_Score', 'Score']]

# Combine them for a clear display
extreme_reviews = pd.concat([top_positive, top_negative])
extreme_reviews
Text Sentiment_Score Score
1320 As another reviewer stated, many of the review... 0.9997 1
8756 And we find her here upon this grand amazon!<b... 0.9997 5
539 I was getting VERY worried, when I read some o... 0.9989 5
6504 After eating this product, I had a stomach ach... -0.9875 1
1113 Bought Uncle Ray's Wavy's 16 oz potato chips b... -0.9824 1
3044 Took me years to find the ultimate chewing gum... -0.9803 5
Thinking

The user wants to show the 3 most positive and 3 most negative reviews. Looking at the notebook state, cell-7 has already executed the logic to find these reviews:

  • top_positive (nlargest 3)
  • top_negative (nsmallest 3)
  • Combined into extreme_reviews.

The output of cell-7 shows:

  • Most positive: index 1320 (0.9997), 8756 (0.9997), 539 (0.9989).
  • Most negative: index 6504 (-0.9875), 1113 (-0.9824), 3044 (-0.9803).

The output contains the Text, Sentiment_Score, and Score (star rating). This directly answers the user's request.

Insights from the output:

  • The most positive reviews have sentiment scores very close to 1.0, though interestingly, one of them (index 1320) has a star rating of 1, suggesting a discrepancy between the sentiment of the text and the actual rating given.
  • The most negative reviews have scores close to -1.0, and most of them are associated with 1-star ratings, although one (index 3044) is associated with a 5-star rating, which might indicate sarcasm or a very specific type of negative feedback within a generally positive context.

Decision: Path A. The outputs are sufficient.

The 3 most positive reviews have sentiment scores nearly reaching 1.0, with texts describing great experiences or high praise. Interestingly, while most align with 5-star ratings, one highly positive sentiment score (0.9997) is associated with a 1-star rating, suggesting a potential mismatch between the review text and the score.

The 3 most negative reviews show sentiment scores close to -1.0, reflecting strong dissatisfaction. Similarly, while most of these align with 1-star ratings, one extremely negative review (index 3044) is associated with a 5-star rating, which could indicate sarcasm or a complex review where the overall rating doesn't match the specific sentiment of the text.
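The mismatches noted above (a near-1.0 compound score on a 1-star review, a near -1.0 score on a 5-star review) can be surfaced systematically rather than spotted by eye. A hedged sketch with hypothetical rows and an illustrative threshold of |compound| > 0.9, which is not taken from the run:

```python
import pandas as pd

# Hypothetical rows mirroring the run's output columns
df = pd.DataFrame({
    "Text": ["As another reviewer stated ...", "Great taffy ...", "Took me years to find ..."],
    "Sentiment_Score": [0.9997, 0.9468, -0.9803],
    "Score": [1, 5, 5],
})

# Flag reviews where text sentiment and star rating strongly disagree
positive_text_low_star = df[(df["Sentiment_Score"] > 0.9) & (df["Score"] <= 2)]
negative_text_high_star = df[(df["Sentiment_Score"] < -0.9) & (df["Score"] >= 4)]

mismatches = pd.concat([positive_text_low_star, negative_text_high_star])
print(mismatches)
```

Rows flagged this way are good candidates for manual review of sarcasm, mislabeled ratings, or reviews quoting other reviewers.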

Score cards for gemma4:31b for Sentiment Analysis of Amazon Reviews

Total score: 10/10

  • Task Completion: 2/2

    Excellent

    The workflow completed all major requested steps: loaded the dataset, showed shape and rating distribution, computed VADER sentiment scores, plotted sentiment vs. ratings, and identified the 3 most positive and 3 most negative reviews.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable: it loads the CSV, computes sentiment with NLTK VADER, and produces the requested plots and review excerpts without obvious syntax errors.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: dataset shape and rating counts are shown, sentiment scores are added, a distribution/comparison plot is produced, and six extreme review excerpts with scores are displayed.

  • Reasoning Quality: 2/2

    Excellent

    The reasoning is clear and aligned with the task, correctly identifying the dataset columns and using VADER appropriately for review sentiment analysis.

  • Reliability: 1/1

    Excellent

    The workflow is consistent and uses standard libraries and straightforward logic, with only minor non-blocking warnings from seaborn.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.