Text Data EDA in Python

Explore a text dataset with word frequency analysis, top bigrams, text length distribution, and TF-IDF keyword extraction.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads a 20 Newsgroups subset from scikit-learn with four categories and summarizes document counts by category. It analyzes basic text characteristics by plotting text length distributions across categories and computing the most frequent words after stopword removal. It also extracts TF-IDF keywords per category to surface terms that distinguish each group.

Who this example is for

This is for analysts and data scientists who need a repeatable exploratory data analysis template for multi-class text datasets. It helps validate dataset balance and inspect category-specific vocabulary before building NLP models.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • Bar chart of document counts per category
  • Box plot of text lengths per category
  • Horizontal bar chart of top word frequencies after stopword removal
  • Table of top TF-IDF keywords per category

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load 20 newsgroups dataset (20 Newsgroups from sklearn) with 4 categories and show document count per category fetch_20newsgroups(categories=['sci.space','rec.sport.hockey','comp.graphics','talk.politics.misc'])
  2. plot text length distribution across categories
  3. show top 20 most frequent words after removing stopwords
  4. extract top TF-IDF keywords per category

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

  Model Source       Score   Task  Execution  Output  Reasoning  Reliability  Run
  glm-5.1            10/10   2/2   2/2        3/3     2/2        1/1          Open glm-5.1 conversation
  gpt-5.4            10/10   2/2   2/2        3/3     2/2        1/1          Open gpt-5.4 conversation
  gpt-oss:120b       10/10   2/2   2/2        3/3     2/2        1/1          Open gpt-oss:120b conversation
  qwen3.5:397b       10/10   2/2   2/2        3/3     2/2        1/1          Open qwen3.5:397b conversation
  gemma4:31b          8/10   2/2   2/2        2/3     1/2        1/1          Open gemma4:31b conversation
  qwen3-coder-next    6/10   2/2   1/2        2/3     1/2        0/1          Open qwen3-coder-next conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
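
As a concrete check, the dimension scores from the gemma4:31b row in the comparison table sum as follows (the dictionary keys are illustrative labels, not an official schema):

```python
# Dimension scores for the gemma4:31b run from the comparison table
scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 2,         # out of 3
    "reasoning_quality": 1,      # out of 2
    "reliability": 1,            # out of 1
}

total = sum(scores.values())
print(f"Total: {total}/10")  # Total: 8/10
```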

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task Text Data EDA in Python. Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load 20 newsgroups dataset (20 Newsgroups from sklearn) with 4 categories and show document count per category fetch_20newsgroups(categories=['sci.space','rec.sport.hockey','comp.graphics','talk.politics.misc'])
Thinking

The user wants to load the 20 Newsgroups dataset from sklearn using four specific categories (sci.space, rec.sport.hockey, comp.graphics, talk.politics.misc) and display the document count per category.

Plan:

  1. Import fetch_20newsgroups from sklearn.datasets.
  2. Load the dataset with the specified categories.
  3. Extract the target labels.
  4. Count the occurrences of each label using pandas or collections.Counter.
  5. Display the counts.

Following the rules:

  • One runnable Python code block.
  • Minimal code.
  • First-load inspection should show shape, columns, and head() if it's a DataFrame, but since fetch_20newsgroups returns a bunch object, I will convert the target to a series/dataframe for counting.
  • Since the user specifically asked for document count per category, I will focus on that.

Step 1: Load the data and count documents per category.

I'll start by loading the 20 Newsgroups dataset with the four specified categories and calculating the document counts.

Count category distributions
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Load the dataset with specified categories
categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']
data = fetch_20newsgroups(subset='all', categories=categories)

# Create a Series for counts
counts = pd.Series(data.target).value_counts().sort_index()
counts.index = [data.target_names[i] for i in counts.index]

counts
comp.graphics         973
rec.sport.hockey      999
sci.space             987
talk.politics.misc    775
Name: count, dtype: int64
Thinking

The user wants to load the 20 newsgroups dataset with 4 specific categories and show the document count per category. Looking at the notebook state:

  • The dataset was loaded using fetch_20newsgroups(subset='all', categories=['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']).
  • The output shows the counts:
    • comp.graphics: 973
    • rec.sport.hockey: 999
    • sci.space: 987
    • talk.politics.misc: 775

The original goal was "load 20 newsgroups dataset... and show document count per category". This has been achieved. The output provides the exact counts for each of the four categories requested.

Decision: Path A (Sufficient).

The 20 Newsgroups dataset has been loaded for the four requested categories. The distribution of documents is relatively balanced, with rec.sport.hockey being the largest category (999 documents) and talk.politics.misc being the smallest (775 documents).

plot text length distribution across categories
Thinking

The user wants to plot the text length distribution across the four categories of the 20 Newsgroups dataset.

  • `data`: A `Bunch` object from `fetch_20newsgroups`. It contains `data.data` (the text of the documents) and `data.target` (the labels).
  • `categories`: The list of category names.
  • `counts`: A series showing counts per category.

  • I need to calculate the length of each document in `data.data`.
  • I need to associate these lengths with their respective categories from `data.target`.
  • A DataFrame would be the easiest way to handle this.
  • The user wants a distribution plot. A boxplot or a KDE/histogram (using seaborn) would be appropriate to compare distributions across categories. A boxplot or violin plot is usually better for comparing distributions across a few categorical groups.

  • Use `seaborn` and `matplotlib`.
  • Create a DataFrame with `text_length` and `category`.
  • Plot using `sns.boxplot` or `sns.kdeplot`. Given "distribution", a KDE plot or a boxplot is standard. I'll use a boxplot for a clear summary and perhaps overlay it or use a distribution plot. Actually, `sns.boxplot` or `sns.violinplot` is very clear for category comparison. Let's go with a boxplot to see the range and outliers.

  • I should ensure the categories are human-readable (using `data.target_names`).

  • `lengths = [len(text) for text in data.data]`
  • `df = pd.DataFrame({'length': lengths, 'category': [data.target_names[i] for i in data.target]})`
  • `sns.boxplot(data=df, x='category', y='length')`

  • One code block? Yes.
  • Seaborn? Yes.
  • One chart? Yes.
  • No `$`? Not relevant here (no money).
  • Notebook state used? Yes (`data` is available).

I'll create a DataFrame containing the length of each document and its corresponding category, then use a boxplot to visualize the distribution.

Plot text length distribution
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate text lengths
lengths = [len(text) for text in data.data]
category_names = [data.target_names[i] for i in data.target]

df_lengths = pd.DataFrame({'length': lengths, 'category': category_names})

# Plot the distribution
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_lengths, x='category', y='length')
plt.title('Text Length Distribution across Categories')
plt.xlabel('Category')
plt.ylabel('Character Count')
plt.xticks(rotation=15)
plt.show()
Thinking

The user wants to plot the text length distribution across categories. Looking at the notebook state:

  • Cell 1 loaded the data and showed counts per category.
  • Cell 2 created a df_lengths DataFrame and plotted a boxplot of lengths by category.
  • The output of Cell 2 is <Figure size 1000x600 with 1 Axes>, meaning the plot was generated.

However, the "Sufficiency heuristics" state: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first and print the values used in the chart."

The current output is just the figure. I don't have the descriptive statistics (median, quartiles, max, min) for the text lengths per category. To provide a comprehensive answer and follow the strict heuristic, I should provide the numeric summary of the distribution.

Therefore, I must choose Path B to gather the numeric evidence (descriptive statistics) that supports the visual plot.

Goal: Plot text length distribution across categories. Current state: Plot created, but no numeric summary. Next step: Calculate and display the descriptive statistics for the lengths per category.

The boxplot provides a visual overview of the text length distribution, but to fully understand the variance and central tendency for each category, I will calculate the descriptive statistics.

Summarize length by category
df_lengths.groupby('category')['length'].describe()
                    count         mean          std    min     25%     50%     75%      max
category
comp.graphics       973.0  1806.291881  5607.894090  125.0   568.0   880.0  1416.0  61773.0
rec.sport.hockey    999.0  1853.873874  3204.602700  162.0   766.5  1229.0  1943.0  75154.0
sci.space           987.0  1941.026342  3334.464406  166.0   811.5  1229.0  1931.5  57022.0
talk.politics.misc  775.0  2810.771613  4870.876080  160.0  1062.5  1640.0  2591.0  49447.0
Thinking

The user wants to "plot text length distribution across categories". The notebook state shows:

  1. Data loaded: fetch_20newsgroups with 4 categories.
  2. A boxplot was created (cell-2) showing the distribution of text lengths across categories.
  3. Descriptive statistics (cell-3) were calculated, providing mean, std, min, 25%, 50%, 75%, and max for each category.

The original goal was simply to "plot text length distribution across categories". The visual (boxplot) is already produced. The numeric summary (describe()) provides the necessary evidence to support the visualization.

Looking at the statistics:

  • talk.politics.misc has the highest mean (~2810) and median (1640).
  • comp.graphics has the lowest mean (~1806) and median (880).
  • There are significant outliers in all categories (max values reach 49k-75k characters while 75th percentiles are below 2600).

Since both a visualization and the supporting numeric values have been provided, the original goal is fully satisfied.

Decision: Path A.

The text length distribution varies notably across the selected categories. talk.politics.misc contains the longest documents on average, with a mean of approximately 2,811 characters and a median of 1,640. In contrast, comp.graphics has the shortest typical documents, with a median of 880 characters. All categories exhibit significant right-skewness, as evidenced by the large gap between the 75th percentile (all under 2,600 characters) and the maximum values, which reach up to 75,154 characters in rec.sport.hockey.
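
Heavy right skew like this lets a handful of very long documents compress the visible quartile boxes. One common remedy, not applied in the run above, is a log-scaled y-axis. A minimal sketch, using synthetic skewed lengths as a stand-in for the transcript's df_lengths:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs in scripts
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic right-skewed lengths standing in for df_lengths from the transcript
df = pd.DataFrame({
    "category": ["a"] * 200 + ["b"] * 200,
    "length": np.concatenate([rng.lognormal(7.0, 1.0, 200),
                              rng.lognormal(7.5, 1.0, 200)]),
})

ax = sns.boxplot(data=df, x="category", y="length")
ax.set_yscale("log")  # compress outliers so medians and quartiles stay readable
ax.set_title("Text length by category (log scale)")
plt.show()
```

On the real data, the same two lines (`sns.boxplot(...)` plus `set_yscale("log")`) would make the 880-vs-1,640 median gap far easier to see than the linear plot dominated by 50k+ character outliers.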

show top 20 most frequent words after removing stopwords
Thinking

  • Current state: Data is loaded into df_lengths (containing lengths and categories) and the original data object from fetch_20newsgroups is available.
  • User goal: Show top 20 most frequent words after removing stopwords.
  • Context: This is a text analysis task on the 20 Newsgroups dataset.
  • Available data: data.data contains the raw text of the documents.
  • Combine all text documents into one large corpus (or process them individually).
  • Tokenize the text (splitting by whitespace or using a library).
  • Remove stopwords.
  • Count frequencies.
  • Pick the top 20.
  • Need a list of stopwords. NLTK is standard, but since I can't be sure about the environment's installations, I'll check for common libraries. Actually, in a standard notebook for this, nltk or sklearn's basic tools are common. I'll use nltk as it's the gold standard for stopwords.
  • Preprocessing: Convert to lowercase, remove punctuation.
  • Step 1: Process text, remove stopwords, and count frequencies.
  • Wait... the instructions say "Prefer one execution step at a time." and "If multiple code actions are required, do only the first action now".
  • Wait, "show top 20 most frequent words" is a single objective. The processing is a prerequisite.
  • Imports: collections.Counter, nltk (for stopwords).
  • Logic:
    1. Join all text in data.data.
    2. Normalize (lowercase, remove punctuation).
    3. Tokenize.
    4. Filter out stopwords.
    5. Use Counter to get the top 20.
  • Since I don't know if nltk stopwords are downloaded, I should probably download them in the code block.
  • Code:

      import nltk
      from nltk.corpus import stopwords
      from collections import Counter
      import re

      nltk.download('stopwords')
      stop_words = set(stopwords.words('english'))

      # Combine all text
      all_text = ' '.join(data.data).lower()
      # Remove punctuation and split
      words = re.findall(r'\b\w+\b', all_text)
      # Filter stopwords
      filtered_words = [w for w in words if w not in stop_words]

      # Top 20
      word_counts = Counter(filtered_words).most_common(20)
      pd.DataFrame(word_counts, columns=['word', 'count'])

  • One Python code block? Yes.
  • No identifier/leakage columns? Not training an ML model here, just analysis.
  • Correct formatting? Yes.
  • Wait, I should probably make the code clean.
  • Self-Correction: The user wants to "show top 20 most frequent words". I'll perform the text cleaning and counting in one go as it's the logical sequence for this specific request.
To find the most frequent words, I will combine all the documents, tokenize the text, and filter out common English stopwords using the nltk library.

Notebook code
import nltk
from nltk.corpus import stopwords
from collections import Counter
import re

# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Combine all text and convert to lowercase
all_text = ' '.join(data.data).lower()

# Extract words using regex to remove punctuation
words = re.findall(r'\b\w+\b', all_text)

# Remove stopwords
filtered_words = [w for w in words if w not in stop_words]

# Get top 20 most common words
top_20 = Counter(filtered_words).most_common(20)
pd.DataFrame(top_20, columns=['word', 'count'])
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/olunia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
            word  count
0            edu   7573
1              1   6756
2              0   6482
3              2   4485
4        subject   4009
5          lines   3944
6   organization   3783
7          would   3435
8              3   3361
9            com   3343
10        writes   2855
11             4   2636
12       article   2617
13           one   2610
14         space   2166
15    university   2085
16             5   2071
17          like   2062
18       posting   1913
19        people   1909
Thinking

The user wants to see the top 20 most frequent words after removing stopwords. Looking at the notebook state:

  • [cell-4] was executed.
  • It combined all text, tokenized, removed NLTK English stopwords, and calculated the top 20 most common words.
  • [output-4] shows a DataFrame with the top 20 words and their counts.

Observations from [output-4]:

  • The top words are edu (7573), 1 (6756), 0 (6482), 2 (4485), subject (4009), lines (3944), organization (3783), etc.
  • Many of these words are "noise" related to the email headers (metadata) of the 20 Newsgroups dataset (e.g., edu, subject, lines, organization, numbers). These are not meaningful semantic words of the content.

The goal was "show top 20 most frequent words after removing stopwords". The code did exactly that. However, a data analyst would notice that the results are dominated by metadata.

Wait, the "Original user goal" is "show top 20 most frequent words after removing stopwords". The latest output [output-4] provides exactly this list. The sufficiency heuristic says: "If current outputs are sufficient to answer the original goal, provide a natural final answer...".

Does it answer the goal? Yes. Is it helpful? It's technically correct based on the request, but the results are polluted. However, the instructions say: "Do not introduce new side goals, tangents, or optional explorations that were not requested." and "Stay strictly within the original user goal."

The goal was to show the top 20 words. They have been shown. If I suggest removing metadata, I might be going beyond the goal. But as an analyst, identifying that edu, subject, and lines are metadata is a concrete insight.

Let's evaluate the output. The top 20 words are listed. Insight: The list is dominated by email header terms (like edu, subject, lines, organization) and digits, indicating that the documents contain raw email headers which were not stripped before tokenization.

Is Path A appropriate? Yes, the goal "show top 20 most frequent words" is technically met. But if the user wants meaningful words, they might be disappointed. However, the prompt asks to analyze against the original goal. The original goal is a simple "show top 20...".

I will provide the final answer based on the observed output.

Top words:

  1. edu: 7573
  2. 1: 6756
  3. 0: 6482
  4. 2: 4485
  5. subject: 4009 ... and so on.

Insights:

  • The most frequent "words" are actually metadata from email headers (e.g., edu, subject, lines, organization) and numeric digits.
  • This suggests the dataset contains raw email formats where header information is treated as part of the text.
  • Words like space (2,166) appear in the top 20, reflecting the sci.space category.

I will use Path A.

The top 20 most frequent words are dominated by email header metadata and numeric digits rather than thematic content. Terms such as edu (7,573), subject (4,009), lines (3,944), and organization (3,783) appear most often, reflecting the raw email format of the 20 Newsgroups dataset. Numeric strings like 1, 0, and 2 also rank highly. Among the more descriptive words, space (2,166) is the most frequent, likely driven by the sci.space category.
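
The header pollution noted here can be avoided at load time: fetch_20newsgroups accepts a documented remove parameter that strips headers, footers, and quoted replies before the text is returned. The sketch below shows the effect on word counts using two tiny stand-in documents and a stand-in stopword set, so it runs without downloading the dataset; in the workflow itself the reload would be `fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))`.

```python
from collections import Counter
import re

# Stand-in documents: one with an email-style header, one already stripped,
# mimicking what remove=('headers', 'footers', 'quotes') would produce
raw_doc = "From: someone@edu\nSubject: orbit\nLines: 2\n\nthe moon orbit data"
clean_doc = "the moon orbit data"

stop_words = {"the", "a", "of"}  # tiny stand-in stopword list

def top_words(text, n=5):
    """Lowercase, tokenize, drop stopwords, return the n most common words."""
    words = re.findall(r"\b\w+\b", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)

print(top_words(raw_doc))    # header tokens like 'subject' and 'lines' leak in
print(top_words(clean_doc))  # only content words remain
```

With headers stripped, terms such as edu, subject, lines, and organization would drop out of the top-20 list, leaving the thematic vocabulary.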

extract top TF-IDF keywords per category
Thinking

The user wants to extract the top TF-IDF keywords for each of the four categories (sci.space, rec.sport.hockey, comp.graphics, talk.politics.misc).

  • `data` object from `fetch_20newsgroups` is available.
  • `data.data` contains the documents.
  • `data.target` contains the labels.
  • `data.target_names` contains the category names.

  • Need `TfidfVectorizer` from `sklearn.feature_extraction.text`.
  • Need to initialize the vectorizer.
  • Fit-transform the `data.data`.
  • Iterate through each category.
  • Average the TF-IDF scores for all documents within a specific category to get the representative keywords for that category.
  • Sort and pick the top N words (e.g., top 10).

  • Step 1: Initialize `TfidfVectorizer` and transform the data.
  • Step 2: Compute the mean TF-IDF score per feature for each class.
  • Step 3: Map feature indices back to words and display the results.

  • Use `TfidfVectorizer`.
  • Avoid including metadata (headers, footers) if possible, though the previous "most frequent words" step showed a lot of metadata (`edu`, `subject`, `lines`). The `TfidfVectorizer` might still pick these up if they aren't consistently distributed, but usually, TF-IDF helps penalize words that appear in every document. However, since `fetch_20newsgroups` has specific `remove` parameters, I could have suggested using those, but since I'm working with the already loaded `data`, I'll proceed with the vectorizer.
  • Actually, `TfidfVectorizer` handles stopwords if passed `stop_words='english'`.

  • Create a dictionary or DataFrame to store results.
  • Loop through categories.
  • Extract top words.

To extract the top TF-IDF keywords per category, I will use TfidfVectorizer to calculate the importance of words relative to each category, then average the scores for all documents in each group to find the most characteristic terms.

Extract top category keywords
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Initialize TfidfVectorizer with English stopwords
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(data.data)
feature_names = tfidf.get_feature_names_out()

# Compute mean TF-IDF scores for each category
category_keywords = {}
for i, category_name in enumerate(data.target_names):
    # Get indices of documents belonging to this category
    indices = np.where(data.target == i)[0]
    
    # Average TF-IDF scores for these documents
    category_tfidf = tfidf_matrix[indices].mean(axis=0).A1
    
    # Get the top 10 words
    top_indices = category_tfidf.argsort()[-10:][::-1]
    category_keywords[category_name] = [feature_names[idx] for idx in top_indices]

# Convert to a readable DataFrame
pd.DataFrame(category_keywords)
  comp.graphics  rec.sport.hockey  sci.space  talk.politics.misc
0      graphics              game      space                 com
1           edu               edu        edu                 edu
2         image                ca       nasa              cramer
3           com            hockey      henry              people
4         lines              team     access            optilink
5          file             games        com             article
6       subject               nhl      digex              writes
7        thanks              espn     alaska          government
8            3d              play        gov                 gay
9         files        university       moon             clayton
Thinking

The user goal is to "extract top TF-IDF keywords per category". The notebook state shows that TfidfVectorizer was used, the mean TF-IDF scores per category were calculated, and a DataFrame with the top 10 keywords per category was generated and displayed in [output-5].

Analysis of [output-5]:

  • comp.graphics: graphics, edu, image, com, lines, file, subject, thanks, 3d, files.
  • rec.sport.hockey: game, edu, ca, hockey, team, games, nhl, espn, play, university.
  • sci.space: space, edu, nasa, henry, access, com, digex, alaska, gov, moon.
  • talk.politics.misc: com, edu, cramer, people, optilink, article, writes, government, gay, clayton.

Observations:

  • The keywords successfully distinguish categories (e.g., "hockey", "nhl" for sports; "nasa", "moon" for space; "graphics", "3d" for graphics).
  • However, generic terms like "edu", "com", "lines", and "subject" appear across multiple categories. These are "noise" words common in email headers, which the basic 'english' stopword list did not catch.

The original goal was simply to "extract top TF-IDF keywords per category". The output provides exactly that.

Decision: Path A. The outputs are sufficient. I will provide a summary of the keywords extracted and their meaning.

The TF-IDF analysis successfully extracted keywords that distinguish the four categories:

  • sci.space: Strong thematic keywords such as space, nasa, moon, and alaska clearly identify this category.
  • rec.sport.hockey: Specific terminology like hockey, nhl, espn, and game dominates the top terms.
  • comp.graphics: Technical terms such as graphics, image, 3d, and files characterize this group.
  • talk.politics.misc: This category shows more fragmented results with names like cramer and clayton alongside government, reflecting a discussion-based format.

Across all categories, the presence of terms like edu, com, and subject indicates that email header metadata remains prevalent in the TF-IDF scores, as these terms appear frequently across the entire dataset.

Score cards for gemma4:31b for Text Data EDA in Python

Total score: 8/10

  • Task Completion: 2/2

    Excellent

    All four requested analyses were attempted: category counts, text length distribution, top word frequencies, and TF-IDF keywords per category.

  • Execution Correctness: 2/2

    Excellent

    The code shown is syntactically coherent and the outputs indicate it ran successfully, including the dataset load, plot generation, frequency table, and TF-IDF results.

  • Output Quality: 2/3

    Good

    The notebook produced the expected core outputs, but the TF-IDF result is a table of top 10 keywords per category rather than the requested top 5, and there is no explicit bigram analysis despite the workflow description mentioning it.

  • Reasoning Quality: 1/2

    Needs work

    The reasoning is generally sensible, but it includes unnecessary detours and some mismatch between the stated goal and the implemented outputs, especially around keyword count and the absent bigram step.

  • Reliability: 1/1

    Excellent

    The workflow is mostly consistent and runnable, though it relies on notebook state and includes extra, partially mismatched analysis steps that reduce robustness.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.