Nlp
Text Data EDA in Python
Explore a text dataset with word frequency analysis, top bigrams, text length distribution, and TF-IDF keyword extraction.
What
This AI Data Analyst workflow loads a 20 Newsgroups subset from scikit-learn with four categories and summarizes document counts by category. It analyzes basic text characteristics by plotting text length distributions across categories and computing the most frequent words after stopword removal. It also extracts TF-IDF keywords per category to surface terms that distinguish each group.
Who
This is for analysts and data scientists who need a repeatable exploratory data analysis template for multi-class text datasets. It helps validate dataset balance and inspect category-specific vocabulary before building NLP models.
Tools
- scikit-learn
- pandas
- numpy
- matplotlib
- seaborn
Outcomes
- Bar chart of document counts per category
- Box plot of text lengths per category
- Horizontal bar chart of top word frequencies after stopword removal
- Table of top TF-IDF keywords per category
Quality Score
8/10
Last scored: Apr 7, 2026
Task Completion: 2/2
ExcellentAll requested EDA steps were performed: loaded 4-category 20 Newsgroups subset, produced document counts, plotted text length distribution by category, computed top-20 word frequencies with stopword removal, and extracted TF-IDF keywords per category.
Execution Correctness: 2/2
ExcellentCode shown is coherent and likely runnable end-to-end (fetch_20newsgroups, pandas transforms, seaborn/matplotlib plots, CountVectorizer, TfidfVectorizer, per-class averaging). No obvious syntax or logic errors in the provided blocks.
Output Quality: 2/3
GoodOutputs include the counts table, a rendered boxplot image, a top-20 word frequency table, and a TF-IDF keyword table per category; however, the expected 'bar chart of document counts per category' is not shown (counts are tabular only). Also, TF-IDF shows top 10 keywords per category rather than the expected top 5.
Reasoning Quality: 1/2
Needs workReasoning is generally correct (e.g., log scale justification, metadata-token caveat), but includes some speculative commentary about notebook state/visuals and does not clearly align outputs to the exact expected artifacts (bar chart, top-5 keywords).
Reliability: 1/1
ExcellentWorkflow is reasonably robust using standard sklearn vectorizers and clear transformations; minor fragility remains (e.g., leaving headers/metadata in text leads to noisy frequent terms/keywords, and reliance on `display()` in the last loop).
Thinking
Evaluating dataset loading steps
I need to approach this task step by step since the user wants to load a dataset with four categories and see the document count for each. It looks like I'll first need to load the dataset and then create a counts table. I believe I'll use pandas to construct this table, which makes sense. Although the user mentioned default subsets, it might be smart to use "all" to get the complete picture unless they specify otherwise!Crafting dataset loading code
I think it’s best to stick to the user’s requested function call, using fetch_20newsgroups(categories=...) with the default subset being 'train'. After that, I'll count the documents by target_names, which is essential. It would be helpful to show the total number of documents as well. I aim to create a DataFrame for the counts, keeping the order of target_names. In the end, I'll ensure there's just one clear code block to display the counts.
Let’s start by loading that 4-category subset and showing the document counts by category.
Count documents by category
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']
news = fetch_20newsgroups(categories=categories)
counts = (
pd.Series(news.target)
.map(dict(enumerate(news.target_names)))
.value_counts()
.rename_axis('category')
.reset_index(name='document_count')
.sort_values('category')
.reset_index(drop=True)
)
counts| category | document_count | |
|---|---|---|
| 0 | comp.graphics | 584 |
| 1 | rec.sport.hockey | 600 |
| 2 | sci.space | 593 |
| 3 | talk.politics.misc | 465 |
Here is the document count for the requested 4-category 20 Newsgroups subset:
rec.sport.hockey: 600sci.space: 593comp.graphics: 584talk.politics.misc: 465
A quick read of this distribution:
- The first three categories are fairly balanced, all close to about 600 documents.
talk.politics.mischas noticeably fewer documents than the others.- Total documents in this subset: 2242.
Thinking
Creating a boxplot
I'm thinking about using a boxplot to clearly represent data across categories. Maybe I'll incorporate hue for more detail, but I want to keep it simple with just one chart. I need to create a DataFrame and use seaborn to generate the boxplot, ensuring that I include appropriate titles and labels for clarity. Keeping my code minimal will help maintain focus on the essential elements! Let's get started on that!
A simple way to compare text length across categories is a boxplot of word counts per document.
Plot document length by category
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df_len = pd.DataFrame({
'category': pd.Series(news.target).map(dict(enumerate(news.target_names))),
'word_count': pd.Series(news.data).str.split().str.len()
})
plt.figure(figsize=(10, 5))
sns.boxplot(data=df_len, x='category', y='word_count')
plt.title('Document Length Distribution by Category')
plt.xlabel('Category')
plt.ylabel('Word count')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()The boxplot was created, so you already have the requested visual comparison of text length across the four categories. What it shows is the distribution of document word counts within each category, not just the average length.
To make that plot interpretable with concrete values, the missing piece is a compact summary table of the same underlying word_count data by category.
Summarize word counts by category
df_len.groupby('category')['word_count'].describe().round(1)| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| category | ||||||||
| comp.graphics | 584.0 | 225.2 | 674.1 | 16.0 | 79.0 | 126.5 | 194.2 | 9168.0 |
| rec.sport.hockey | 600.0 | 305.4 | 605.5 | 17.0 | 117.8 | 184.5 | 311.0 | 11278.0 |
| sci.space | 593.0 | 300.0 | 468.0 | 20.0 | 111.0 | 181.0 | 301.0 | 6141.0 |
| talk.politics.misc | 465.0 | 447.0 | 821.2 | 19.0 | 156.0 | 251.0 | 421.0 | 7989.0 |
The text length distribution across categories is now clear from the boxplot and the summary statistics.
A few concrete takeaways:
talk.politics.mischas the longest documents overall. Its median length is about 251 words, with an average of 447 words, which is higher than every other category.comp.graphicshas the shortest documents overall. Its median is about 126.5 words and its average is about 225 words.rec.sport.hockeyandsci.spaceare fairly similar in typical length:rec.sport.hockey: median about 184.5 words, mean about 305.4sci.space: median about 181 words, mean about 300
- All categories show strong right-skew, meaning a smaller number of very long posts pull the averages upward. You can see that because the means are much larger than the medians in every group.
- There are also large outliers in every category, especially:
rec.sport.hockey: max 11278 wordscomp.graphics: max 9168 wordstalk.politics.misc: max 7989 wordssci.space: max 6141 words
So, in direct answer to your goal: the distribution plot shows that politics posts tend to be longest, graphics posts shortest, and hockey/space fall in the middle, with all categories containing some unusually long documents.
Thinking
Preparing code for word frequency
I need to provide a simple code solution using notebook state. The user wants to show the top 20 most frequent words after removing stopwords. I should consider using either sklearn’s CountVectorizer with stop_words='english' or nltk.
The goal is to compute frequencies from the news.data, filtering out stopwords and potentially any artifacts. I'll create a single code block that returns a table of those top 20 words. I’ll keep it concise for clarity!
Let’s compute the top 20 words across the selected categories after removing English stopwords. I’ll use CountVectorizer, which is a clean way to tokenize and count words.
Compute top word counts
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(news.data)
word_counts = pd.DataFrame({
'word': vec.get_feature_names_out(),
'count': X.sum(axis=0).A1
}).sort_values('count', ascending=False).head(20).reset_index(drop=True)
word_counts| word | count | |
|---|---|---|
| 0 | edu | 4379 |
| 1 | subject | 2391 |
| 2 | lines | 2367 |
| 3 | organization | 2280 |
| 4 | com | 2007 |
| 5 | writes | 1727 |
| 6 | article | 1622 |
| 7 | space | 1496 |
| 8 | university | 1290 |
| 9 | don | 1271 |
| 10 | just | 1239 |
| 11 | like | 1181 |
| 12 | ca | 1158 |
| 13 | posting | 1147 |
| 14 | think | 1118 |
| 15 | people | 1097 |
| 16 | host | 1045 |
| 17 | nntp | 1021 |
| 18 | new | 997 |
| 19 | know | 984 |
After removing English stopwords, the 20 most frequent words in these four newsgroup categories are:
edu— 4379subject— 2391lines— 2367organization— 2280com— 2007writes— 1727article— 1622space— 1496university— 1290don— 1271just— 1239like— 1181ca— 1158posting— 1147think— 1118people— 1097host— 1045nntp— 1021new— 997know— 984
A couple of useful observations from this result:
- Many top terms are email/news header artifacts rather than topic words, such as
edu,subject,lines,organization,com,host, andnntp. - The clearest content-specific high-frequency word is
space, which fits one of the selected categories directly. - Tokens like
donandcaare likely fragments from contractions such as “don’t” and “can’t”, which can happen with the default tokenizer.
So this does answer the request, but the current top-20 list is influenced quite a bit by metadata-style text, not just the message body content.
Thinking
Figuring out TF-IDF extraction
I’m working on TF-IDF per category by concatenating documents, which seems straightforward for keyword extraction. I’m thinking about refining this by removing headers later. I want to write a single code block using existing news categories. I’m considering the max_df parameter to downweight common words across categories, though it may behave unusually since I only have four documents.
It might be better not to use max_df at all. I’ll aim to create a tidy table with the top keywords per category. Let's make that happen!
Since the raw text includes header-style terms like subject and organization, a good next step is to compute category-level TF-IDF by combining all documents within each category and then taking the top terms per category.
Extract top TF IDF keywords
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
cat_map = dict(enumerate(news.target_names))
category_docs = (
pd.DataFrame({
'category': pd.Series(news.target).map(cat_map),
'text': news.data
})
.groupby('category', sort=True)['text']
.apply(' '.join)
)
vec = TfidfVectorizer(stop_words='english')
X_tfidf = vec.fit_transform(category_docs)
terms = vec.get_feature_names_out()
top_n = 10
rows = []
for i, category in enumerate(category_docs.index):
scores = X_tfidf[i].toarray().ravel()
top_idx = scores.argsort()[::-1][:top_n]
for idx in top_idx:
rows.append({
'category': category,
'keyword': terms[idx],
'tfidf': scores[idx]
})
top_keywords = pd.DataFrame(rows)
top_keywords| category | keyword | tfidf | |
|---|---|---|---|
| 0 | comp.graphics | edu | 0.354754 |
| 1 | comp.graphics | graphics | 0.252734 |
| 2 | comp.graphics | image | 0.222606 |
| 3 | comp.graphics | lines | 0.220652 |
| 4 | comp.graphics | subject | 0.213810 |
| 5 | comp.graphics | organization | 0.196705 |
| 6 | comp.graphics | com | 0.145733 |
| 7 | comp.graphics | jpeg | 0.144717 |
| 8 | comp.graphics | university | 0.122813 |
| 9 | comp.graphics | posting | 0.105024 |
| 10 | rec.sport.hockey | hockey | 0.297385 |
| 11 | rec.sport.hockey | edu | 0.259204 |
| 12 | rec.sport.hockey | nhl | 0.191994 |
| 13 | rec.sport.hockey | team | 0.169296 |
| 14 | rec.sport.hockey | ca | 0.159731 |
| 15 | rec.sport.hockey | subject | 0.151840 |
| 16 | rec.sport.hockey | lines | 0.148731 |
| 17 | rec.sport.hockey | organization | 0.145862 |
| 18 | rec.sport.hockey | game | 0.142753 |
| 19 | rec.sport.hockey | season | 0.142339 |
| 20 | sci.space | space | 0.388450 |
| 21 | sci.space | edu | 0.324079 |
| 22 | sci.space | nasa | 0.204491 |
| 23 | sci.space | subject | 0.175635 |
| 24 | sci.space | organization | 0.170363 |
| 25 | sci.space | lines | 0.170086 |
| 26 | sci.space | com | 0.140952 |
| 27 | sci.space | writes | 0.125414 |
| 28 | sci.space | orbit | 0.124083 |
| 29 | sci.space | article | 0.121807 |
| 30 | talk.politics.misc | edu | 0.312622 |
| 31 | talk.politics.misc | com | 0.228587 |
| 32 | talk.politics.misc | people | 0.195317 |
| 33 | talk.politics.misc | stephanopoulos | 0.191264 |
| 34 | talk.politics.misc | writes | 0.175814 |
| 35 | talk.politics.misc | article | 0.169791 |
| 36 | talk.politics.misc | don | 0.148280 |
| 37 | talk.politics.misc | subject | 0.142831 |
| 38 | talk.politics.misc | lines | 0.139676 |
| 39 | talk.politics.misc | organization | 0.137955 |
Thinking
Considering output structure
It seems the goal is to present the top keywords grouped by category. I’m thinking about whether to display them as sequential tables or a single pivoted table. A pivoted table could be clearer, with categories as rows and ranks 1-10 as columns. I'll aim for minimal code based on existing top_keywords. Truncating the output and possibly addressing tfidf order would be important too. Let’s figure out the best way to construct this DataFrame.Refining the code output
To simplify things, I’ll create a clean version of top_keywords. The code snippet I've crafted allows me to assign ranks within categories and round the tfidf values. This should generate a full table displaying all relevant data—40 rows might show up in a notebook by default, but to be safe, I might set the pandas option to display up to 50 rows. I need to mention that if there are truncations, patterns may still be visible. I'll just provide insights and code without asking for any execution or sharing.
The visible TF-IDF output is already pointing to sensible category-specific terms:
comp.graphicsis led by content words likegraphics,image, andjpeg, which match the category well.rec.sport.hockeyis led byhockey,nhl, andteam, also a strong match.- But metadata-style tokens such as
edu,subject,lines, andorganizationare still showing up, so the extracted keywords are partly influenced by headers rather than just topic vocabulary.
The current table is truncated, so here is a compact full ranking for all categories from the existing top_keywords result:
Rank and format keywords
top_keywords_full = top_keywords.copy()
top_keywords_full["rank"] = top_keywords_full.groupby("category").cumcount() + 1
top_keywords_full["tfidf"] = top_keywords_full["tfidf"].round(3)
top_keywords_full = top_keywords_full[["category", "rank", "keyword", "tfidf"]].sort_values(["category", "rank"]).reset_index(drop=True)
pd.set_option("display.max_rows", 50)
top_keywords_full| category | rank | keyword | tfidf | |
|---|---|---|---|---|
| 0 | comp.graphics | 1 | edu | 0.355 |
| 1 | comp.graphics | 2 | graphics | 0.253 |
| 2 | comp.graphics | 3 | image | 0.223 |
| 3 | comp.graphics | 4 | lines | 0.221 |
| 4 | comp.graphics | 5 | subject | 0.214 |
| 5 | comp.graphics | 6 | organization | 0.197 |
| 6 | comp.graphics | 7 | com | 0.146 |
| 7 | comp.graphics | 8 | jpeg | 0.145 |
| 8 | comp.graphics | 9 | university | 0.123 |
| 9 | comp.graphics | 10 | posting | 0.105 |
| 10 | rec.sport.hockey | 1 | hockey | 0.297 |
| 11 | rec.sport.hockey | 2 | edu | 0.259 |
| 12 | rec.sport.hockey | 3 | nhl | 0.192 |
| 13 | rec.sport.hockey | 4 | team | 0.169 |
| 14 | rec.sport.hockey | 5 | ca | 0.160 |
| 15 | rec.sport.hockey | 6 | subject | 0.152 |
| 16 | rec.sport.hockey | 7 | lines | 0.149 |
| 17 | rec.sport.hockey | 8 | organization | 0.146 |
| 18 | rec.sport.hockey | 9 | game | 0.143 |
| 19 | rec.sport.hockey | 10 | season | 0.142 |
| 20 | sci.space | 1 | space | 0.388 |
| 21 | sci.space | 2 | edu | 0.324 |
| 22 | sci.space | 3 | nasa | 0.204 |
| 23 | sci.space | 4 | subject | 0.176 |
| 24 | sci.space | 5 | organization | 0.170 |
| 25 | sci.space | 6 | lines | 0.170 |
| 26 | sci.space | 7 | com | 0.141 |
| 27 | sci.space | 8 | writes | 0.125 |
| 28 | sci.space | 9 | orbit | 0.124 |
| 29 | sci.space | 10 | article | 0.122 |
| 30 | talk.politics.misc | 1 | edu | 0.313 |
| 31 | talk.politics.misc | 2 | com | 0.229 |
| 32 | talk.politics.misc | 3 | people | 0.195 |
| 33 | talk.politics.misc | 4 | stephanopoulos | 0.191 |
| 34 | talk.politics.misc | 5 | writes | 0.176 |
| 35 | talk.politics.misc | 6 | article | 0.170 |
| 36 | talk.politics.misc | 7 | don | 0.148 |
| 37 | talk.politics.misc | 8 | subject | 0.143 |
| 38 | talk.politics.misc | 9 | lines | 0.140 |
| 39 | talk.politics.misc | 10 | organization | 0.138 |
Thinking
Deciding on keyword extraction
I need to choose between Path A or B to extract the top TF-IDF keywords per category. It seems like the recent outputs, particularly output-5 and output-6, are truncated for categories beyond the first. I think we need a full table to truly rank values. Path B might be better, as it could allow us to display the complete table with the necessary insights. I'll ensure to include a single code block to keep the output concise and focused.Creating full rankings display
I need to code a display for full rankings, ensuring it's notably different from the last step, which merely assigned and showed the top keywords. Here, I think using the display function for each category in sequence makes sense, or maybe I should convert them to a pivot format. I plan to write minimal code and leverage a loop for sequential display without needing a final expression. The current outputs suggest a good signal-to-noise ratio, so I'll focus on presenting that clearly.
The extraction is already working: the visible results show strong topic terms like graphics, image, jpeg for comp.graphics and hockey, nhl, team for rec.sport.hockey. At the same time, header-like terms such as edu, subject, organization, and lines are still ranking highly, so the current keywords are a mix of true topic words and message metadata.
The only missing piece is the full per-category ranking, since the current notebook display is truncated. This shows the complete top TF-IDF keywords for each category:
Notebook code
for category in top_keywords["category"].drop_duplicates():
display(
top_keywords.loc[top_keywords["category"] == category]
.assign(
rank=lambda d: range(1, len(d) + 1),
tfidf=lambda d: d["tfidf"].round(3)
)[["category", "rank", "keyword", "tfidf"]]
.reset_index(drop=True)
)| category | rank | keyword | tfidf | |
|---|---|---|---|---|
| 0 | comp.graphics | 1 | edu | 0.355 |
| 1 | comp.graphics | 2 | graphics | 0.253 |
| 2 | comp.graphics | 3 | image | 0.223 |
| 3 | comp.graphics | 4 | lines | 0.221 |
| 4 | comp.graphics | 5 | subject | 0.214 |
| 5 | comp.graphics | 6 | organization | 0.197 |
| 6 | comp.graphics | 7 | com | 0.146 |
| 7 | comp.graphics | 8 | jpeg | 0.145 |
| 8 | comp.graphics | 9 | university | 0.123 |
| 9 | comp.graphics | 10 | posting | 0.105 |
| category | rank | keyword | tfidf | |
|---|---|---|---|---|
| 0 | rec.sport.hockey | 1 | hockey | 0.297 |
| 1 | rec.sport.hockey | 2 | edu | 0.259 |
| 2 | rec.sport.hockey | 3 | nhl | 0.192 |
| 3 | rec.sport.hockey | 4 | team | 0.169 |
| 4 | rec.sport.hockey | 5 | ca | 0.160 |
| 5 | rec.sport.hockey | 6 | subject | 0.152 |
| 6 | rec.sport.hockey | 7 | lines | 0.149 |
| 7 | rec.sport.hockey | 8 | organization | 0.146 |
| 8 | rec.sport.hockey | 9 | game | 0.143 |
| 9 | rec.sport.hockey | 10 | season | 0.142 |
| category | rank | keyword | tfidf | |
|---|---|---|---|---|
| 0 | sci.space | 1 | space | 0.388 |
| 1 | sci.space | 2 | edu | 0.324 |
| 2 | sci.space | 3 | nasa | 0.204 |
| 3 | sci.space | 4 | subject | 0.176 |
| 4 | sci.space | 5 | organization | 0.170 |
| 5 | sci.space | 6 | lines | 0.170 |
| 6 | sci.space | 7 | com | 0.141 |
| 7 | sci.space | 8 | writes | 0.125 |
| 8 | sci.space | 9 | orbit | 0.124 |
| 9 | sci.space | 10 | article | 0.122 |
| category | rank | keyword | tfidf | |
|---|---|---|---|---|
| 0 | talk.politics.misc | 1 | edu | 0.313 |
| 1 | talk.politics.misc | 2 | com | 0.229 |
| 2 | talk.politics.misc | 3 | people | 0.195 |
| 3 | talk.politics.misc | 4 | stephanopoulos | 0.191 |
| 4 | talk.politics.misc | 5 | writes | 0.176 |
| 5 | talk.politics.misc | 6 | article | 0.170 |
| 6 | talk.politics.misc | 7 | don | 0.148 |
| 7 | talk.politics.misc | 8 | subject | 0.143 |
| 8 | talk.politics.misc | 9 | lines | 0.140 |
| 9 | talk.politics.misc | 10 | organization | 0.138 |
Thinking
Crafting concise insights
I need to ensure that I provide an answer that aligns with path A and connects to the goal. I can include concrete insights without using code or excessive headings. Instead of restating full tables, I can summarize the top keywords per category from the output. I should mention that the headers leaked due to the fetch_20newsgroups default retaining unnecessary information. Additionally, I’ll note that some of the extracted keywords are artifacts. Finally, I’ll add a final answer marker to wrap it up.
From the observed TF-IDF rankings, the category-specific keywords are extracted and they largely align with the topics:
- comp.graphics:
graphics,image,jpegare the most meaningful topic words. This category also contains non-topic terms likeedu,lines,subject,organization,com, andposting. - rec.sport.hockey:
hockey,nhl,team,game,seasonclearly capture the category. Some metadata-style terms still appear, such asedu,ca,subject,lines, andorganization. - sci.space:
space,nasa, andorbitare strong topic keywords. The rest again includes message-header vocabulary likeedu,subject,organization,lines,com,writes, andarticle. - talk.politics.misc: the most distinctive content terms are weaker here, but
peopleandstephanopoulosstand out as topical. Many top terms are still generic or header-related:edu,com,writes,article,subject,lines,organization.
So the extraction worked, but the results are partly contaminated by email/news headers and other boilerplate. The strongest topic keywords visible from the current outputs are:
- comp.graphics →
graphics,image,jpeg - rec.sport.hockey →
hockey,nhl,team,game,season - sci.space →
space,nasa,orbit - talk.politics.misc →
people,stephanopoulos
This means the notebook has successfully extracted top TF-IDF keywords per category, with the main limitation being that header terms are still included in the vocabulary.