NLP

Text Data EDA in Python

Explore a text dataset with word frequency analysis, top bigrams, text length distribution, and TF-IDF keyword extraction.

What

This AI Data Analyst workflow loads a 20 Newsgroups subset from scikit-learn with four categories and summarizes document counts by category. It analyzes basic text characteristics by plotting text length distributions across categories and computing the most frequent words after stopword removal. It also extracts TF-IDF keywords per category to surface terms that distinguish each group.

Who

This is for analysts and data scientists who need a repeatable exploratory data analysis template for multi-class text datasets. It helps validate dataset balance and inspect category-specific vocabulary before building NLP models.

Tools

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • Bar chart of document counts per category
  • Box plot of text lengths per category
  • Horizontal bar chart of top word frequencies after stopword removal
  • Table of top TF-IDF keywords per category

Quality Score

8/10

gpt-5.2 (mljar-ai)

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All requested EDA steps were performed: loaded 4-category 20 Newsgroups subset, produced document counts, plotted text length distribution by category, computed top-20 word frequencies with stopword removal, and extracted TF-IDF keywords per category.

  • Execution Correctness: 2/2

    Excellent

    Code shown is coherent and likely runnable end-to-end (fetch_20newsgroups, pandas transforms, seaborn/matplotlib plots, CountVectorizer, TfidfVectorizer, per-class averaging). No obvious syntax or logic errors in the provided blocks.

  • Output Quality: 2/3

    Good

    Outputs include the counts table, a rendered boxplot image, a top-20 word frequency table, and a TF-IDF keyword table per category; however, the expected 'bar chart of document counts per category' is not shown (counts are tabular only). Also, TF-IDF shows top 10 keywords per category rather than the expected top 5.

  • Reasoning Quality: 1/2

    Needs work

    Reasoning is generally correct (e.g., log scale justification, metadata-token caveat), but includes some speculative commentary about notebook state/visuals and does not clearly align outputs to the exact expected artifacts (bar chart, top-5 keywords).

  • Reliability: 1/1

    Excellent

    Workflow is reasonably robust using standard sklearn vectorizers and clear transformations; minor fragility remains (e.g., leaving headers/metadata in text leads to noisy frequent terms/keywords, and reliance on `display()` in the last loop).

load 20 Newsgroups dataset from sklearn with 4 categories and show document count per category: fetch_20newsgroups(categories=['sci.space','rec.sport.hockey','comp.graphics','talk.politics.misc'])
Thinking

Evaluating dataset loading steps

I need to approach this task step by step since the user wants to load a dataset with four categories and see the document count for each. It looks like I'll first need to load the dataset and then create a counts table. I believe I'll use pandas to construct this table, which makes sense. Although the user mentioned default subsets, it might be smart to use "all" to get the complete picture unless they specify otherwise!

Crafting dataset loading code

I think it’s best to stick to the user’s requested function call, using fetch_20newsgroups(categories=...) with the default subset being 'train'. After that, I'll count the documents by target_names, which is essential. It would be helpful to show the total number of documents as well. I aim to create a DataFrame for the counts, keeping the order of target_names. In the end, I'll ensure there's just one clear code block to display the counts.

Let’s start by loading that 4-category subset and showing the document counts by category.

Count documents by category
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']

news = fetch_20newsgroups(categories=categories)

counts = (
    pd.Series(news.target)
    .map(dict(enumerate(news.target_names)))
    .value_counts()
    .rename_axis('category')
    .reset_index(name='document_count')
    .sort_values('category')
    .reset_index(drop=True)
)

counts
category document_count
0 comp.graphics 584
1 rec.sport.hockey 600
2 sci.space 593
3 talk.politics.misc 465

Here is the document count for the requested 4-category 20 Newsgroups subset:

  • rec.sport.hockey: 600
  • sci.space: 593
  • comp.graphics: 584
  • talk.politics.misc: 465

A quick read of this distribution:

  • The first three categories are fairly balanced, all close to about 600 documents.
  • talk.politics.misc has noticeably fewer documents than the others.
  • Total documents in this subset: 2242.
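The Outcomes list promises a bar chart of document counts per category, but the step above only shows a table. A minimal sketch of that chart is below; it hard-codes the counts from the table so it runs standalone, but in the notebook you would plot the `counts` DataFrame directly. The `Agg` backend line is only needed for headless execution.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative counts copied from the table above; in the notebook,
# reuse the `counts` DataFrame instead of rebuilding it.
counts = pd.DataFrame({
    "category": ["comp.graphics", "rec.sport.hockey", "sci.space", "talk.politics.misc"],
    "document_count": [584, 600, 593, 465],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(counts["category"], counts["document_count"])
ax.set_title("Document Count per Category")
ax.set_ylabel("Documents")
ax.tick_params(axis="x", rotation=20)
fig.tight_layout()
```

This directly fills the gap the quality review notes under Output Quality (counts were tabular only).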
plot text length distribution across categories
Thinking

Creating a boxplot

I'm thinking about using a boxplot to clearly represent data across categories. Maybe I'll incorporate hue for more detail, but I want to keep it simple with just one chart. I need to create a DataFrame and use seaborn to generate the boxplot, ensuring that I include appropriate titles and labels for clarity. Keeping my code minimal will help maintain focus on the essential elements! Let's get started on that!

A simple way to compare text length across categories is a boxplot of word counts per document.

Plot document length by category
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df_len = pd.DataFrame({
    'category': pd.Series(news.target).map(dict(enumerate(news.target_names))),
    'word_count': pd.Series(news.data).str.split().str.len()
})

plt.figure(figsize=(10, 5))
sns.boxplot(data=df_len, x='category', y='word_count')
plt.title('Document Length Distribution by Category')
plt.xlabel('Category')
plt.ylabel('Word count')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()

The boxplot was created, so you already have the requested visual comparison of text length across the four categories. What it shows is the distribution of document word counts within each category, not just the average length.

To make that plot interpretable with concrete values, the missing piece is a compact summary table of the same underlying word_count data by category.

Summarize word counts by category
df_len.groupby('category')['word_count'].describe().round(1)
count mean std min 25% 50% 75% max
category
comp.graphics 584.0 225.2 674.1 16.0 79.0 126.5 194.2 9168.0
rec.sport.hockey 600.0 305.4 605.5 17.0 117.8 184.5 311.0 11278.0
sci.space 593.0 300.0 468.0 20.0 111.0 181.0 301.0 6141.0
talk.politics.misc 465.0 447.0 821.2 19.0 156.0 251.0 421.0 7989.0

The text length distribution across categories is now clear from the boxplot and the summary statistics.

A few concrete takeaways:

  • talk.politics.misc has the longest documents overall. Its median length is about 251 words, with an average of 447 words, which is higher than every other category.
  • comp.graphics has the shortest documents overall. Its median is about 126.5 words and its average is about 225 words.
  • rec.sport.hockey and sci.space are fairly similar in typical length:
    • rec.sport.hockey: median about 184.5 words, mean about 305.4
    • sci.space: median about 181 words, mean about 300
  • All categories show strong right-skew, meaning a smaller number of very long posts pull the averages upward. You can see that because the means are much larger than the medians in every group.
  • There are also large outliers in every category, especially:
    • rec.sport.hockey: max 11278 words
    • comp.graphics: max 9168 words
    • talk.politics.misc: max 7989 words
    • sci.space: max 6141 words

So, in direct answer to your goal: the distribution plot shows that politics posts tend to be longest, graphics posts shortest, and hockey/space fall in the middle, with all categories containing some unusually long documents.
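Because every category is strongly right-skewed with large outliers, a linear y-axis squashes the boxes against the bottom of the plot. One common remedy, sketched below on synthetic skewed data standing in for `df_len` (in the notebook you would pass the real `df_len`), is to put the y-axis on a log scale:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic right-skewed word counts standing in for df_len;
# in the notebook, plot the real df_len instead.
rng = np.random.default_rng(0)
df_len = pd.DataFrame({
    "category": np.repeat(["a", "b"], 200),
    "word_count": rng.lognormal(mean=5, sigma=1, size=400).astype(int) + 16,
})

fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(data=df_len, x="category", y="word_count", ax=ax)
ax.set_yscale("log")  # compresses the long right tail so the boxes stay readable
ax.set_title("Document Length by Category (log scale)")
fig.tight_layout()
```

On a log scale the medians and interquartile ranges of all four categories become comparable at a glance, while the multi-thousand-word outliers no longer dominate the axis.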

show top 20 most frequent words after removing stopwords
Thinking

Preparing code for word frequency

I need to provide a simple code solution using notebook state. The user wants to show the top 20 most frequent words after removing stopwords. I should consider using either sklearn’s CountVectorizer with stop_words='english' or nltk.

The goal is to compute frequencies from the news.data, filtering out stopwords and potentially any artifacts. I'll create a single code block that returns a table of those top 20 words. I’ll keep it concise for clarity!

Let’s compute the top 20 words across the selected categories after removing English stopwords. I’ll use CountVectorizer, which is a clean way to tokenize and count words.

Compute top word counts
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(news.data)

word_counts = pd.DataFrame({
    'word': vec.get_feature_names_out(),
    'count': X.sum(axis=0).A1
}).sort_values('count', ascending=False).head(20).reset_index(drop=True)

word_counts
word count
0 edu 4379
1 subject 2391
2 lines 2367
3 organization 2280
4 com 2007
5 writes 1727
6 article 1622
7 space 1496
8 university 1290
9 don 1271
10 just 1239
11 like 1181
12 ca 1158
13 posting 1147
14 think 1118
15 people 1097
16 host 1045
17 nntp 1021
18 new 997
19 know 984

After removing English stopwords, the 20 most frequent words in these four newsgroup categories are:

  1. edu — 4379
  2. subject — 2391
  3. lines — 2367
  4. organization — 2280
  5. com — 2007
  6. writes — 1727
  7. article — 1622
  8. space — 1496
  9. university — 1290
  10. don — 1271
  11. just — 1239
  12. like — 1181
  13. ca — 1158
  14. posting — 1147
  15. think — 1118
  16. people — 1097
  17. host — 1045
  18. nntp — 1021
  19. new — 997
  20. know — 984

A couple of useful observations from this result:

  • Many top terms are email/news header artifacts rather than topic words, such as edu, subject, lines, organization, com, host, and nntp.
  • The clearest content-specific high-frequency word is space, which fits one of the selected categories directly.
  • Tokens like don and ca are likely fragments from contractions such as “don’t” and “can’t”, which can happen with the default tokenizer.

So this does answer the request, but the current top-20 list is influenced quite a bit by metadata-style text, not just the message body content.

extract top TF-IDF keywords per category
Thinking

Figuring out TF-IDF extraction

I’m working on TF-IDF per category by concatenating documents, which seems straightforward for keyword extraction. I’m thinking about refining this by removing headers later. I want to write a single code block using existing news categories. I’m considering the max_df parameter to downweight common words across categories, though it may behave unusually since I only have four documents.

It might be better not to use max_df at all. I’ll aim to create a tidy table with the top keywords per category. Let's make that happen!

Since the raw text includes header-style terms like subject and organization, a good next step is to compute category-level TF-IDF by combining all documents within each category and then taking the top terms per category.

Extract top TF IDF keywords
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

cat_map = dict(enumerate(news.target_names))

category_docs = (
    pd.DataFrame({
        'category': pd.Series(news.target).map(cat_map),
        'text': news.data
    })
    .groupby('category', sort=True)['text']
    .apply(' '.join)
)

vec = TfidfVectorizer(stop_words='english')
X_tfidf = vec.fit_transform(category_docs)

terms = vec.get_feature_names_out()
top_n = 10

rows = []
for i, category in enumerate(category_docs.index):
    scores = X_tfidf[i].toarray().ravel()
    top_idx = scores.argsort()[::-1][:top_n]
    for idx in top_idx:
        rows.append({
            'category': category,
            'keyword': terms[idx],
            'tfidf': scores[idx]
        })

top_keywords = pd.DataFrame(rows)
top_keywords
category keyword tfidf
0 comp.graphics edu 0.354754
1 comp.graphics graphics 0.252734
2 comp.graphics image 0.222606
3 comp.graphics lines 0.220652
4 comp.graphics subject 0.213810
5 comp.graphics organization 0.196705
6 comp.graphics com 0.145733
7 comp.graphics jpeg 0.144717
8 comp.graphics university 0.122813
9 comp.graphics posting 0.105024
10 rec.sport.hockey hockey 0.297385
11 rec.sport.hockey edu 0.259204
12 rec.sport.hockey nhl 0.191994
13 rec.sport.hockey team 0.169296
14 rec.sport.hockey ca 0.159731
15 rec.sport.hockey subject 0.151840
16 rec.sport.hockey lines 0.148731
17 rec.sport.hockey organization 0.145862
18 rec.sport.hockey game 0.142753
19 rec.sport.hockey season 0.142339
20 sci.space space 0.388450
21 sci.space edu 0.324079
22 sci.space nasa 0.204491
23 sci.space subject 0.175635
24 sci.space organization 0.170363
25 sci.space lines 0.170086
26 sci.space com 0.140952
27 sci.space writes 0.125414
28 sci.space orbit 0.124083
29 sci.space article 0.121807
30 talk.politics.misc edu 0.312622
31 talk.politics.misc com 0.228587
32 talk.politics.misc people 0.195317
33 talk.politics.misc stephanopoulos 0.191264
34 talk.politics.misc writes 0.175814
35 talk.politics.misc article 0.169791
36 talk.politics.misc don 0.148280
37 talk.politics.misc subject 0.142831
38 talk.politics.misc lines 0.139676
39 talk.politics.misc organization 0.137955
Thinking

Considering output structure

It seems the goal is to present the top keywords grouped by category. I’m thinking about whether to display them as sequential tables or a single pivoted table. A pivoted table could be clearer, with categories as rows and ranks 1-10 as columns. I'll aim for minimal code based on existing top_keywords. Truncating the output and possibly addressing tfidf order would be important too. Let’s figure out the best way to construct this DataFrame.

Refining the code output

To simplify things, I’ll create a clean version of top_keywords. The code snippet I've crafted allows me to assign ranks within categories and round the tfidf values. This should generate a full table displaying all relevant data—40 rows might show up in a notebook by default, but to be safe, I might set the pandas option to display up to 50 rows. I need to mention that if there are truncations, patterns may still be visible. I'll just provide insights and code without asking for any execution or sharing.

The visible TF-IDF output is already pointing to sensible category-specific terms:

  • comp.graphics is led by content words like graphics, image, and jpeg, which match the category well.
  • rec.sport.hockey is led by hockey, nhl, and team, also a strong match.
  • But metadata-style tokens such as edu, subject, lines, and organization are still showing up, so the extracted keywords are partly influenced by headers rather than just topic vocabulary.

The current table is truncated, so here is a compact full ranking for all categories from the existing top_keywords result:

Rank and format keywords
top_keywords_full = top_keywords.copy()
top_keywords_full["rank"] = top_keywords_full.groupby("category").cumcount() + 1
top_keywords_full["tfidf"] = top_keywords_full["tfidf"].round(3)
top_keywords_full = top_keywords_full[["category", "rank", "keyword", "tfidf"]].sort_values(["category", "rank"]).reset_index(drop=True)

pd.set_option("display.max_rows", 50)
top_keywords_full
category rank keyword tfidf
0 comp.graphics 1 edu 0.355
1 comp.graphics 2 graphics 0.253
2 comp.graphics 3 image 0.223
3 comp.graphics 4 lines 0.221
4 comp.graphics 5 subject 0.214
5 comp.graphics 6 organization 0.197
6 comp.graphics 7 com 0.146
7 comp.graphics 8 jpeg 0.145
8 comp.graphics 9 university 0.123
9 comp.graphics 10 posting 0.105
10 rec.sport.hockey 1 hockey 0.297
11 rec.sport.hockey 2 edu 0.259
12 rec.sport.hockey 3 nhl 0.192
13 rec.sport.hockey 4 team 0.169
14 rec.sport.hockey 5 ca 0.160
15 rec.sport.hockey 6 subject 0.152
16 rec.sport.hockey 7 lines 0.149
17 rec.sport.hockey 8 organization 0.146
18 rec.sport.hockey 9 game 0.143
19 rec.sport.hockey 10 season 0.142
20 sci.space 1 space 0.388
21 sci.space 2 edu 0.324
22 sci.space 3 nasa 0.204
23 sci.space 4 subject 0.176
24 sci.space 5 organization 0.170
25 sci.space 6 lines 0.170
26 sci.space 7 com 0.141
27 sci.space 8 writes 0.125
28 sci.space 9 orbit 0.124
29 sci.space 10 article 0.122
30 talk.politics.misc 1 edu 0.313
31 talk.politics.misc 2 com 0.229
32 talk.politics.misc 3 people 0.195
33 talk.politics.misc 4 stephanopoulos 0.191
34 talk.politics.misc 5 writes 0.176
35 talk.politics.misc 6 article 0.170
36 talk.politics.misc 7 don 0.148
37 talk.politics.misc 8 subject 0.143
38 talk.politics.misc 9 lines 0.140
39 talk.politics.misc 10 organization 0.138
Thinking

Deciding on keyword extraction

I need to choose between Path A or B to extract the top TF-IDF keywords per category. It seems like the recent outputs, particularly output-5 and output-6, are truncated for categories beyond the first. I think we need a full table to truly rank values. Path B might be better, as it could allow us to display the complete table with the necessary insights. I'll ensure to include a single code block to keep the output concise and focused.

Creating full rankings display

I need to code a display for full rankings, ensuring it's notably different from the last step, which merely assigned and showed the top keywords. Here, I think using the display function for each category in sequence makes sense, or maybe I should convert them to a pivot format. I plan to write minimal code and leverage a loop for sequential display without needing a final expression. The current outputs suggest a good signal-to-noise ratio, so I'll focus on presenting that clearly.

The extraction is already working: the visible results show strong topic terms like graphics, image, jpeg for comp.graphics and hockey, nhl, team for rec.sport.hockey. At the same time, header-like terms such as edu, subject, organization, and lines are still ranking highly, so the current keywords are a mix of true topic words and message metadata.

The only missing piece is the full per-category ranking, since the current notebook display is truncated. This shows the complete top TF-IDF keywords for each category:

Notebook code
# Note: display() is provided by IPython/Jupyter; in a plain Python
# script, replace display(...) with print(...).
for category in top_keywords["category"].drop_duplicates():
    display(
        top_keywords.loc[top_keywords["category"] == category]
        .assign(
            rank=lambda d: range(1, len(d) + 1),
            tfidf=lambda d: d["tfidf"].round(3)
        )[["category", "rank", "keyword", "tfidf"]]
        .reset_index(drop=True)
    )
category rank keyword tfidf
0 comp.graphics 1 edu 0.355
1 comp.graphics 2 graphics 0.253
2 comp.graphics 3 image 0.223
3 comp.graphics 4 lines 0.221
4 comp.graphics 5 subject 0.214
5 comp.graphics 6 organization 0.197
6 comp.graphics 7 com 0.146
7 comp.graphics 8 jpeg 0.145
8 comp.graphics 9 university 0.123
9 comp.graphics 10 posting 0.105
category rank keyword tfidf
0 rec.sport.hockey 1 hockey 0.297
1 rec.sport.hockey 2 edu 0.259
2 rec.sport.hockey 3 nhl 0.192
3 rec.sport.hockey 4 team 0.169
4 rec.sport.hockey 5 ca 0.160
5 rec.sport.hockey 6 subject 0.152
6 rec.sport.hockey 7 lines 0.149
7 rec.sport.hockey 8 organization 0.146
8 rec.sport.hockey 9 game 0.143
9 rec.sport.hockey 10 season 0.142
category rank keyword tfidf
0 sci.space 1 space 0.388
1 sci.space 2 edu 0.324
2 sci.space 3 nasa 0.204
3 sci.space 4 subject 0.176
4 sci.space 5 organization 0.170
5 sci.space 6 lines 0.170
6 sci.space 7 com 0.141
7 sci.space 8 writes 0.125
8 sci.space 9 orbit 0.124
9 sci.space 10 article 0.122
category rank keyword tfidf
0 talk.politics.misc 1 edu 0.313
1 talk.politics.misc 2 com 0.229
2 talk.politics.misc 3 people 0.195
3 talk.politics.misc 4 stephanopoulos 0.191
4 talk.politics.misc 5 writes 0.176
5 talk.politics.misc 6 article 0.170
6 talk.politics.misc 7 don 0.148
7 talk.politics.misc 8 subject 0.143
8 talk.politics.misc 9 lines 0.140
9 talk.politics.misc 10 organization 0.138
Thinking

Crafting concise insights

I need to ensure that I provide an answer that aligns with path A and connects to the goal. I can include concrete insights without using code or excessive headings. Instead of restating full tables, I can summarize the top keywords per category from the output. I should mention that the headers leaked due to the fetch_20newsgroups default retaining unnecessary information. Additionally, I’ll note that some of the extracted keywords are artifacts. Finally, I’ll add a final answer marker to wrap it up.

From the observed TF-IDF rankings, the category-specific keywords are extracted and they largely align with the topics:

  • comp.graphics: graphics, image, jpeg are the most meaningful topic words. This category also contains non-topic terms like edu, lines, subject, organization, com, and posting.
  • rec.sport.hockey: hockey, nhl, team, game, season clearly capture the category. Some metadata-style terms still appear, such as edu, ca, subject, lines, and organization.
  • sci.space: space, nasa, and orbit are strong topic keywords. The rest again includes message-header vocabulary like edu, subject, organization, lines, com, writes, and article.
  • talk.politics.misc: the most distinctive content terms are weaker here, but people and stephanopoulos stand out as topical. Many top terms are still generic or header-related: edu, com, writes, article, subject, lines, organization.

So the extraction worked, but the results are partly contaminated by email/news headers and other boilerplate. The strongest topic keywords visible from the current outputs are:

  • comp.graphics: graphics, image, jpeg
  • rec.sport.hockey: hockey, nhl, team, game, season
  • sci.space: space, nasa, orbit
  • talk.politics.misc: people, stephanopoulos

This means the notebook has successfully extracted top TF-IDF keywords per category, with the main limitation being that header terms are still included in the vocabulary.
