Text Data EDA in Python

Explore a text dataset with word frequency analysis, top bigrams, text length distribution, and TF-IDF keyword extraction.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads a 20 Newsgroups subset from scikit-learn with four categories and summarizes document counts by category. It analyzes basic text characteristics by plotting text length distributions across categories and computing the most frequent words after stopword removal. It also extracts TF-IDF keywords per category to surface terms that distinguish each group.

Who this example is for

This is for analysts and data scientists who need a repeatable exploratory data analysis template for multi-class text datasets. It helps validate dataset balance and inspect category-specific vocabulary before building NLP models.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Bar chart of document counts per category
  • Box plot of text lengths per category
  • Horizontal bar chart of top word frequencies after stopword removal
  • Table of top TF-IDF keywords per category

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load 20 newsgroups dataset (20 Newsgroups from sklearn) with 4 categories and show document count per category fetch_20newsgroups(categories=['sci.space','rec.sport.hockey','comp.graphics','talk.politics.misc'])
  2. plot text length distribution across categories
  3. show top 20 most frequent words after removing stopwords
  4. extract top TF-IDF keywords per category

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

  Model Source       Score   Task  Execution  Output  Reasoning  Reliability  Run
  glm-5.1            10/10   2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
  gpt-5.4            10/10   2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
  gpt-oss:120b       10/10   2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
  qwen3.5:397b       10/10   2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation
  gemma4:31b          8/10   2/2   2/2        2/3     1/2        1/1          Open gemma4:31b conversation
  qwen3-coder-next    6/10   2/2   1/2        2/3     1/2        0/1          Open qwen3-coder-next conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
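
As a concrete check, the dimension scores from the gemma4:31b row in the comparison table sum as follows (the dictionary keys are illustrative labels, not an official schema):

```python
# Dimension scores for the gemma4:31b run from the comparison table
scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 2,         # out of 3
    "reasoning_quality": 1,      # out of 2
    "reliability": 1,            # out of 1
}

total = sum(scores.values())
print(f"Total: {total}/10")  # Total: 8/10
```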

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Text Data EDA in Python. Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load 20 newsgroups dataset (20 Newsgroups from sklearn) with 4 categories and show document count per category fetch_20newsgroups(categories=['sci.space','rec.sport.hockey','comp.graphics','talk.politics.misc'])
Thinking

The user wants to load the 20 Newsgroups dataset from sklearn using four specific categories (sci.space, rec.sport.hockey, comp.graphics, talk.politics.misc) and display the document count per category.

Plan:

  1. Import fetch_20newsgroups from sklearn.datasets.
  2. Load the dataset with the specified categories.
  3. Extract the target labels.
  4. Count the occurrences of each label using pandas or collections.Counter.
  5. Display the counts.

Following the rules:

  • One runnable Python code block.
  • Minimal code.
  • First-load inspection should show shape, columns, and head() if it's a DataFrame, but since fetch_20newsgroups returns a bunch object, I will convert the target to a series/dataframe for counting.
  • Since the user specifically asked for document count per category, I will focus on that.

Step 1: Load the data and count documents per category.

I'll start by loading the 20 Newsgroups dataset with the four specified categories and calculating the document counts.

Count category distributions
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Load the dataset with specified categories
categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']
data = fetch_20newsgroups(subset='all', categories=categories)

# Create a Series for counts
counts = pd.Series(data.target).value_counts().sort_index()
counts.index = [data.target_names[i] for i in counts.index]

counts
comp.graphics         973
rec.sport.hockey      999
sci.space             987
talk.politics.misc    775
Name: count, dtype: int64
Thinking

The user wants to load the 20 newsgroups dataset with 4 specific categories and show the document count per category. Looking at the notebook state:

  • The dataset was loaded using fetch_20newsgroups(subset='all', categories=['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']).
  • The output shows the counts:
    • comp.graphics: 973
    • rec.sport.hockey: 999
    • sci.space: 987
    • talk.politics.misc: 775

The original goal was "load 20 newsgroups dataset... and show document count per category". This has been achieved. The output provides the exact counts for each of the four categories requested.

Decision: Path A (Sufficient).

The 20 Newsgroups dataset has been loaded for the four requested categories. The distribution of documents is relatively balanced, with rec.sport.hockey being the largest category (999 documents) and talk.politics.misc being the smallest (775 documents).

plot text length distribution across categories
Thinking

The user wants to plot the text length distribution across the four categories of the 20 Newsgroups dataset.

  • `data`: A `Bunch` object from `fetch_20newsgroups`. It contains `data.data` (the text of the documents) and `data.target` (the labels).
  • `categories`: The list of category names.
  • `counts`: A series showing counts per category.

  • I need to calculate the length of each document in `data.data`.
  • I need to associate these lengths with their respective categories from `data.target`.
  • A DataFrame would be the easiest way to handle this.
  • The user wants a distribution plot. A boxplot or a KDE/histogram (using seaborn) would be appropriate to compare distributions across categories. A boxplot or violin plot is usually better for comparing distributions across a few categorical groups.

  • Use `seaborn` and `matplotlib`.
  • Create a DataFrame with `text_length` and `category`.
  • Plot using `sns.boxplot` or `sns.kdeplot`. Given "distribution", a KDE plot or a boxplot is standard. I'll use a boxplot for a clear summary and perhaps overlay it or use a distribution plot. Actually, `sns.boxplot` or `sns.violinplot` is very clear for category comparison. Let's go with a boxplot to see the range and outliers.

  • I should ensure the categories are human-readable (using `data.target_names`).

  • `lengths = [len(text) for text in data.data]`
  • `df = pd.DataFrame({'length': lengths, 'category': [data.target_names[i] for i in data.target]})`
  • `sns.boxplot(data=df, x='category', y='length')`

  • One code block? Yes.
  • Seaborn? Yes.
  • One chart? Yes.
  • No `$`? Not relevant here (no money).
  • Notebook state used? Yes (`data` is available).

I'll create a DataFrame containing the length of each document and its corresponding category, then use a boxplot to visualize the distribution.

Plot text length distribution
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate text lengths
lengths = [len(text) for text in data.data]
category_names = [data.target_names[i] for i in data.target]

df_lengths = pd.DataFrame({'length': lengths, 'category': category_names})

# Plot the distribution
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_lengths, x='category', y='length')
plt.title('Text Length Distribution across Categories')
plt.xlabel('Category')
plt.ylabel('Character Count')
plt.xticks(rotation=15)
plt.show()
Thinking

The user wants to plot the text length distribution across categories. Looking at the notebook state:

  • Cell 1 loaded the data and showed counts per category.
  • Cell 2 created a df_lengths DataFrame and plotted a boxplot of lengths by category.
  • The output of Cell 2 is <Figure size 1000x600 with 1 Axes>, meaning the plot was generated.

However, the "Sufficiency heuristics" state: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first and print the values used in the chart."

The current output is just the figure. I don't have the descriptive statistics (median, quartiles, max, min) for the text lengths per category. To provide a comprehensive answer and follow the strict heuristic, I should provide the numeric summary of the distribution.

Therefore, I must choose Path B to gather the numeric evidence (descriptive statistics) that supports the visual plot.

Goal: Plot text length distribution across categories. Current state: Plot created, but no numeric summary. Next step: Calculate and display the descriptive statistics for the lengths per category.

The boxplot provides a visual overview of the text length distribution, but to fully understand the variance and central tendency for each category, I will calculate the descriptive statistics.

Summarize length by category
df_lengths.groupby('category')['length'].describe()
                    count         mean          std    min     25%     50%     75%      max
category
comp.graphics       973.0  1806.291881  5607.894090  125.0   568.0   880.0  1416.0  61773.0
rec.sport.hockey    999.0  1853.873874  3204.602700  162.0   766.5  1229.0  1943.0  75154.0
sci.space           987.0  1941.026342  3334.464406  166.0   811.5  1229.0  1931.5  57022.0
talk.politics.misc  775.0  2810.771613  4870.876080  160.0  1062.5  1640.0  2591.0  49447.0
Thinking

The user wants to "plot text length distribution across categories". The notebook state shows:

  1. Data loaded: fetch_20newsgroups with 4 categories.
  2. A boxplot was created (cell-2) showing the distribution of text lengths across categories.
  3. Descriptive statistics (cell-3) were calculated, providing mean, std, min, 25%, 50%, 75%, and max for each category.

The original goal was simply to "plot text length distribution across categories". The visual (boxplot) is already produced. The numeric summary (describe()) provides the necessary evidence to support the visualization.

Looking at the statistics:

  • talk.politics.misc has the highest mean (~2810) and median (1640).
  • comp.graphics has the lowest mean (~1806) and median (880).
  • There are significant outliers in all categories (max values reach 49k-75k characters while 75th percentiles are below 2600).

Since both a visualization and the supporting numeric values have been provided, the original goal is fully satisfied.

Decision: Path A.

The text length distribution varies notably across the selected categories. talk.politics.misc contains the longest documents on average, with a mean of approximately 2,811 characters and a median of 1,640. In contrast, comp.graphics has the shortest typical documents, with a median of 880 characters. All categories exhibit significant right-skewness, as evidenced by the large gap between the 75th percentile (all under 2,600 characters) and the maximum values, which reach up to 75,154 characters in rec.sport.hockey.
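
Heavy right skew like this lets a handful of very long documents compress the visible quartile boxes. One common remedy, not applied in the run above, is a log-scaled y-axis. A minimal sketch, using synthetic skewed lengths as a stand-in for the transcript's df_lengths:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs in scripts
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic right-skewed lengths standing in for df_lengths from the transcript
df = pd.DataFrame({
    "category": ["a"] * 200 + ["b"] * 200,
    "length": np.concatenate([rng.lognormal(7.0, 1.0, 200),
                              rng.lognormal(7.5, 1.0, 200)]),
})

ax = sns.boxplot(data=df, x="category", y="length")
ax.set_yscale("log")  # compress outliers so medians and quartiles stay readable
ax.set_title("Text length by category (log scale)")
plt.show()
```

On the real data, the same two lines (`sns.boxplot(...)` plus `set_yscale("log")`) would make the 880-vs-1,640 median gap far easier to see than the linear plot dominated by 50k+ character outliers.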

show top 20 most frequent words after removing stopwords
Thinking

  • Current state: Data is loaded into df_lengths (containing lengths and categories) and the original data object from fetch_20newsgroups is available.
  • User goal: Show top 20 most frequent words after removing stopwords.
  • Context: This is a text analysis task on the 20 Newsgroups dataset.
  • Available data: data.data contains the raw text of the documents.
  • Combine all text documents into one large corpus (or process them individually).
  • Tokenize the text (splitting by whitespace or using a library).
  • Remove stopwords.
  • Count frequencies.
  • Pick the top 20.
  • Need a list of stopwords. NLTK is standard, but since I can't be sure about the environment's installations, I'll check for common libraries. Actually, in a standard notebook for this, nltk or sklearn's basic tools are common. I'll use nltk as it's the gold standard for stopwords.
  • Preprocessing: Convert to lowercase, remove punctuation.
  • Step 1: Process text, remove stopwords, and count frequencies.
  • Wait... the instructions say "Prefer one execution step at a time." and "If multiple code actions are required, do only the first action now".
  • Wait, "show top 20 most frequent words" is a single objective. The processing is a prerequisite.
  • Imports: collections.Counter, nltk (for stopwords).
  • Logic:
    1. Join all text in data.data.
    2. Normalize (lowercase, remove punctuation).
    3. Tokenize.
    4. Filter out stopwords.
    5. Use Counter to get the top 20.
  • Since I don't know if nltk stopwords are downloaded, I should probably download them in the code block.
  • Code:

      import nltk
      from nltk.corpus import stopwords
      from collections import Counter
      import re

      nltk.download('stopwords')
      stop_words = set(stopwords.words('english'))

      # Combine all text
      all_text = ' '.join(data.data).lower()
      # Remove punctuation and split
      words = re.findall(r'\b\w+\b', all_text)
      # Filter stopwords
      filtered_words = [w for w in words if w not in stop_words]

      # Top 20
      word_counts = Counter(filtered_words).most_common(20)
      pd.DataFrame(word_counts, columns=['word', 'count'])

  • One Python code block? Yes.
  • No identifier/leakage columns? Not training an ML model here, just analysis.
  • Correct formatting? Yes.
  • Wait, I should probably make the code clean.
  • Self-Correction: The user wants to "show top 20 most frequent words". I'll perform the text cleaning and counting in one go as it's the logical sequence for this specific request.
To find the most frequent words, I will combine all the documents, tokenize the text, and filter out common English stopwords using the nltk library.

Notebook code
import nltk
from nltk.corpus import stopwords
from collections import Counter
import re

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Combine all text and convert to lowercase
all_text = ' '.join(data.data).lower()

# Extract words using regex to remove punctuation
words = re.findall(r'\b\w+\b', all_text)

# Remove stopwords
filtered_words = [w for w in words if w not in stop_words]

# Get top 20 most common words
top_20 = Counter(filtered_words).most_common(20)
pd.DataFrame(top_20, columns=['word', 'count'])
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/olunia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
            word  count
0            edu   7573
1              1   6756
2              0   6482
3              2   4485
4        subject   4009
5          lines   3944
6   organization   3783
7          would   3435
8              3   3361
9            com   3343
10        writes   2855
11             4   2636
12       article   2617
13           one   2610
14         space   2166
15    university   2085
16             5   2071
17          like   2062
18       posting   1913
19        people   1909
Thinking

The user wants to see the top 20 most frequent words after removing stopwords. Looking at the notebook state:

  • [cell-4] was executed.
  • It combined all text, tokenized, removed NLTK English stopwords, and calculated the top 20 most common words.
  • [output-4] shows a DataFrame with the top 20 words and their counts.

Observations from [output-4]:

  • The top words are edu (7573), 1 (6756), 0 (6482), 2 (4485), subject (4009), lines (3944), organization (3783), etc.
  • Many of these words are "noise" related to the email headers (metadata) of the 20 Newsgroups dataset (e.g., edu, subject, lines, organization, numbers). These are not meaningful semantic words of the content.

The goal was "show top 20 most frequent words after removing stopwords". The code did exactly that. However, a data analyst would notice that the results are dominated by metadata.

Wait, the "Original user goal" is "show top 20 most frequent words after removing stopwords". The latest output [output-4] provides exactly this list. The sufficiency heuristic says: "If current outputs are sufficient to answer the original goal, provide a natural final answer...".

Does it answer the goal? Yes. Is it helpful? It's technically correct based on the request, but the results are polluted. However, the instructions say: "Do not introduce new side goals, tangents, or optional explorations that were not requested." and "Stay strictly within the original user goal."

The goal was to show the top 20 words. They have been shown. If I suggest removing metadata, I might be going beyond the goal. But as an analyst, identifying that edu, subject, and lines are metadata is a concrete insight.

Let's evaluate the output. The top 20 words are listed. Insight: The list is dominated by email header terms (like edu, subject, lines, organization) and digits, indicating that the documents contain raw email headers which were not stripped before tokenization.

Is Path A appropriate? Yes, the goal "show top 20 most frequent words" is technically met. But if the user wants meaningful words, they might be disappointed. However, the prompt asks to analyze against the original goal. The original goal is a simple "show top 20...".

I will provide the final answer based on the observed output.

Top words:

  1. edu: 7573
  2. 1: 6756
  3. 0: 6482
  4. 2: 4485
  5. subject: 4009 ... and so on.

Insights:

  • The most frequent "words" are actually metadata from email headers (e.g., edu, subject, lines, organization) and numeric digits.
  • This suggests the dataset contains raw email formats where header information is treated as part of the text.
  • Words like space (2,166) appear in the top 20, reflecting the sci.space category.

I will use Path A.

The top 20 most frequent words are dominated by email header metadata and numeric digits rather than thematic content. Terms such as edu (7,573), subject (4,009), lines (3,944), and organization (3,783) appear most often, reflecting the raw email format of the 20 Newsgroups dataset. Numeric strings like 1, 0, and 2 also rank highly. Among the more descriptive words, space (2,166) is the most frequent, likely driven by the sci.space category.
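
The header pollution noted here can be avoided at load time: fetch_20newsgroups accepts a documented remove parameter that strips headers, footers, and quoted replies before the text is returned. The sketch below shows the effect on word counts using two tiny stand-in documents and a stand-in stopword set, so it runs without downloading the dataset; in the workflow itself the reload would be `fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))`.

```python
from collections import Counter
import re

# Stand-in documents: one with an email-style header, one already stripped,
# mimicking what remove=('headers', 'footers', 'quotes') would produce
raw_doc = "From: someone@edu\nSubject: orbit\nLines: 2\n\nthe moon orbit data"
clean_doc = "the moon orbit data"

stop_words = {"the", "a", "of"}  # tiny stand-in stopword list

def top_words(text, n=5):
    """Lowercase, tokenize, drop stopwords, return the n most common words."""
    words = re.findall(r"\b\w+\b", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)

print(top_words(raw_doc))    # header tokens like 'subject' and 'lines' leak in
print(top_words(clean_doc))  # only content words remain
```

With headers stripped, terms such as edu, subject, lines, and organization would drop out of the top-20 list, leaving the thematic vocabulary.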

extract top TF-IDF keywords per category
Thinking

The user wants to extract the top TF-IDF keywords for each of the four categories (sci.space, rec.sport.hockey, comp.graphics, talk.politics.misc).

  • `data` object from `fetch_20newsgroups` is available.
  • `data.data` contains the documents.
  • `data.target` contains the labels.
  • `data.target_names` contains the category names.

  • Need `TfidfVectorizer` from `sklearn.feature_extraction.text`.
  • Need to initialize the vectorizer.
  • Fit-transform the `data.data`.
  • Iterate through each category.
  • Average the TF-IDF scores for all documents within a specific category to get the representative keywords for that category.
  • Sort and pick the top N words (e.g., top 10).

  • Step 1: Initialize `TfidfVectorizer` and transform the data.
  • Step 2: Compute the mean TF-IDF score per feature for each class.
  • Step 3: Map feature indices back to words and display the results.

  • Use `TfidfVectorizer`.
  • Avoid including metadata (headers, footers) if possible, though the previous "most frequent words" step showed a lot of metadata (`edu`, `subject`, `lines`). The `TfidfVectorizer` might still pick these up if they aren't consistently distributed, but usually, TF-IDF helps penalize words that appear in every document. However, since `fetch_20newsgroups` has specific `remove` parameters, I could have suggested using those, but since I'm working with the already loaded `data`, I'll proceed with the vectorizer.
  • Actually, `TfidfVectorizer` handles stopwords if passed `stop_words='english'`.

  • Create a dictionary or DataFrame to store results.
  • Loop through categories.
  • Extract top words.

To extract the top TF-IDF keywords per category, I will use TfidfVectorizer to calculate the importance of words relative to each category, then average the scores for all documents in each group to find the most characteristic terms.

Extract top category keywords
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Initialize TfidfVectorizer with English stopwords
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(data.data)
feature_names = tfidf.get_feature_names_out()

# Compute mean TF-IDF scores for each category
category_keywords = {}
for i, category_name in enumerate(data.target_names):
    # Get indices of documents belonging to this category
    indices = np.where(data.target == i)[0]
    
    # Average TF-IDF scores for these documents
    category_tfidf = tfidf_matrix[indices].mean(axis=0).A1
    
    # Get the top 10 words
    top_indices = category_tfidf.argsort()[-10:][::-1]
    category_keywords[category_name] = [feature_names[idx] for idx in top_indices]

# Convert to a readable DataFrame
pd.DataFrame(category_keywords)
  comp.graphics  rec.sport.hockey  sci.space  talk.politics.misc
0      graphics              game      space                 com
1           edu               edu        edu                 edu
2         image                ca       nasa              cramer
3           com            hockey      henry              people
4         lines              team     access            optilink
5          file             games        com             article
6       subject               nhl      digex              writes
7        thanks              espn     alaska          government
8            3d              play        gov                 gay
9         files        university       moon             clayton
Thinking

The user goal is to "extract top TF-IDF keywords per category". The notebook state shows that TfidfVectorizer was used, the mean TF-IDF scores per category were calculated, and a DataFrame with the top 10 keywords per category was generated and displayed in [output-5].

Analysis of [output-5]:

  • comp.graphics: graphics, edu, image, com, lines, file, subject, thanks, 3d, files.
  • rec.sport.hockey: game, edu, ca, hockey, team, games, nhl, espn, play, university.
  • sci.space: space, edu, nasa, henry, access, com, digex, alaska, gov, moon.
  • talk.politics.misc: com, edu, cramer, people, optilink, article, writes, government, gay, clayton.

Observations:

  • The keywords successfully distinguish categories (e.g., "hockey", "nhl" for sports; "nasa", "moon" for space; "graphics", "3d" for graphics).
  • However, generic terms like "edu", "com", "lines", and "subject" appear across multiple categories. These are "noise" words common in email headers, which the basic 'english' stopword list did not catch.

The original goal was simply to "extract top TF-IDF keywords per category". The output provides exactly that.

Decision: Path A. The outputs are sufficient. I will provide a summary of the keywords extracted and their meaning.

The TF-IDF analysis successfully extracted keywords that distinguish the four categories:

  • sci.space: Strong thematic keywords such as space, nasa, moon, and alaska clearly identify this category.
  • rec.sport.hockey: Specific terminology like hockey, nhl, espn, and game dominates the top terms.
  • comp.graphics: Technical terms such as graphics, image, 3d, and files characterize this group.
  • talk.politics.misc: This category shows more fragmented results with names like cramer and clayton alongside government, reflecting a discussion-based format.

Across all categories, the presence of terms like edu, com, and subject indicates that email header metadata remains prevalent in the TF-IDF scores, as these terms appear frequently across the entire dataset.

Score cards for gemma4:31b for Text Data EDA in Python

Total score: 8/10

  • Task Completion: 2/2

    Excellent

    All four requested analyses were attempted: category counts, text length distribution, top word frequencies, and TF-IDF keywords per category.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically coherent and the outputs indicate it ran successfully, including the dataset load, plot generation, frequency table, and TF-IDF results.

  • Output Quality: 2/3

    Good

    The notebook produced the expected core outputs, but the TF-IDF result is a table of top 10 keywords per category rather than the requested top 5, and there is no explicit bigram analysis despite the workflow description mentioning it.

  • Reasoning Quality: 1/2

    Needs work

    The reasoning is generally sensible, but it includes unnecessary detours and some mismatch between the stated goal and the implemented outputs, especially around keyword count and the absent bigram step.

  • Reliability: 1/1

    Excellent

    The workflow is mostly consistent and runnable, though it relies on notebook state and includes extra, partially mismatched analysis steps that reduce robustness.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.