NLP

Text Data EDA in Python

Explore a text dataset with word frequency analysis, top bigrams, text length distribution, and TF-IDF keyword extraction.

What

This AI Data Analyst workflow loads a 20 Newsgroups subset from scikit-learn with four categories and summarizes document counts by category. It analyzes basic text characteristics by plotting text length distributions across categories and computing the most frequent words after stopword removal. It also extracts TF-IDF keywords per category to surface terms that distinguish each group.

Who

This is for analysts and data scientists who need a repeatable exploratory data analysis template for multi-class text datasets. It helps validate dataset balance and inspect category-specific vocabulary before building NLP models.

Tools

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • Bar chart of document counts per category
  • Box plot of text lengths per category
  • Horizontal bar chart of top word frequencies after stopword removal
  • Table of top TF-IDF keywords per category

Quality Score

8/10

gpt-5.2 (mljar-ai)

Last scored: Apr 7, 2026

  • Task Completion: 2/2

    Excellent

    All requested EDA steps were performed: loaded 4-category 20 Newsgroups subset, produced document counts, plotted text length distribution by category, computed top-20 word frequencies with stopword removal, and extracted TF-IDF keywords per category.

  • Execution Correctness: 2/2

    Excellent

    Code shown is coherent and likely runnable end-to-end (fetch_20newsgroups, pandas transforms, seaborn/matplotlib plots, CountVectorizer, TfidfVectorizer, per-class averaging). No obvious syntax or logic errors in the provided blocks.

  • Output Quality: 2/3

    Good

    Outputs include the counts table, a rendered boxplot image, a top-20 word frequency table, and a TF-IDF keyword table per category; however, the expected 'bar chart of document counts per category' is not shown (counts are tabular only). Also, TF-IDF shows top 10 keywords per category rather than the expected top 5.

  • Reasoning Quality: 1/2

    Needs work

    Reasoning is generally correct (e.g., log scale justification, metadata-token caveat), but includes some speculative commentary about notebook state/visuals and does not clearly align outputs to the exact expected artifacts (bar chart, top-5 keywords).

  • Reliability: 1/1

    Excellent

    Workflow is reasonably robust using standard sklearn vectorizers and clear transformations; minor fragility remains (e.g., leaving headers/metadata in text leads to noisy frequent terms/keywords, and reliance on `display()` in the last loop).

load 20 Newsgroups dataset from sklearn with 4 categories and show document count per category: fetch_20newsgroups(categories=['sci.space','rec.sport.hockey','comp.graphics','talk.politics.misc'])
Thinking

Evaluating dataset loading steps

I need to approach this task step by step since the user wants to load a dataset with four categories and see the document count for each. It looks like I'll first need to load the dataset and then create a counts table. I believe I'll use pandas to construct this table, which makes sense. Although the user mentioned default subsets, it might be smart to use "all" to get the complete picture unless they specify otherwise!

Crafting dataset loading code

I think it’s best to stick to the user’s requested function call, using fetch_20newsgroups(categories=...) with the default subset being 'train'. After that, I'll count the documents by target_names, which is essential. It would be helpful to show the total number of documents as well. I aim to create a DataFrame for the counts, keeping the order of target_names. In the end, I'll ensure there's just one clear code block to display the counts.

Let’s start by loading that 4-category subset and showing the document counts by category.

Count documents by category
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']

news = fetch_20newsgroups(categories=categories)

counts = (
    pd.Series(news.target)
    .map(dict(enumerate(news.target_names)))
    .value_counts()
    .rename_axis('category')
    .reset_index(name='document_count')
    .sort_values('category')
    .reset_index(drop=True)
)

counts
category document_count
0 comp.graphics 584
1 rec.sport.hockey 600
2 sci.space 593
3 talk.politics.misc 465

Here is the document count for the requested 4-category 20 Newsgroups subset:

  • rec.sport.hockey: 600
  • sci.space: 593
  • comp.graphics: 584
  • talk.politics.misc: 465

A quick read of this distribution:

  • The first three categories are fairly balanced, all close to about 600 documents.
  • talk.politics.misc has noticeably fewer documents than the others.
  • Total documents in this subset: 2242.
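The Outcomes list promises a bar chart of document counts per category, but the step above only shows a table. A minimal sketch of that chart is below; it hard-codes the counts from the table so it runs standalone, but in the notebook you would plot the `counts` DataFrame directly. The `Agg` backend line is only needed for headless execution.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative counts copied from the table above; in the notebook,
# reuse the `counts` DataFrame instead of rebuilding it.
counts = pd.DataFrame({
    "category": ["comp.graphics", "rec.sport.hockey", "sci.space", "talk.politics.misc"],
    "document_count": [584, 600, 593, 465],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(counts["category"], counts["document_count"])
ax.set_title("Document Count per Category")
ax.set_ylabel("Documents")
ax.tick_params(axis="x", rotation=20)
fig.tight_layout()
```

This directly fills the gap the quality review notes under Output Quality (counts were tabular only).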
plot text length distribution across categories
Thinking

Creating a boxplot

I'm thinking about using a boxplot to clearly represent data across categories. Maybe I'll incorporate hue for more detail, but I want to keep it simple with just one chart. I need to create a DataFrame and use seaborn to generate the boxplot, ensuring that I include appropriate titles and labels for clarity. Keeping my code minimal will help maintain focus on the essential elements! Let's get started on that!

A simple way to compare text length across categories is a boxplot of word counts per document.

Plot document length by category
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df_len = pd.DataFrame({
    'category': pd.Series(news.target).map(dict(enumerate(news.target_names))),
    'word_count': pd.Series(news.data).str.split().str.len()
})

plt.figure(figsize=(10, 5))
sns.boxplot(data=df_len, x='category', y='word_count')
plt.title('Document Length Distribution by Category')
plt.xlabel('Category')
plt.ylabel('Word count')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()

The boxplot was created, so you already have the requested visual comparison of text length across the four categories. What it shows is the distribution of document word counts within each category, not just the average length.

To make that plot interpretable with concrete values, the missing piece is a compact summary table of the same underlying word_count data by category.

Summarize word counts by category
df_len.groupby('category')['word_count'].describe().round(1)
count mean std min 25% 50% 75% max
category
comp.graphics 584.0 225.2 674.1 16.0 79.0 126.5 194.2 9168.0
rec.sport.hockey 600.0 305.4 605.5 17.0 117.8 184.5 311.0 11278.0
sci.space 593.0 300.0 468.0 20.0 111.0 181.0 301.0 6141.0
talk.politics.misc 465.0 447.0 821.2 19.0 156.0 251.0 421.0 7989.0

The text length distribution across categories is now clear from the boxplot and the summary statistics.

A few concrete takeaways:

  • talk.politics.misc has the longest documents overall. Its median length is about 251 words, with an average of 447 words, which is higher than every other category.
  • comp.graphics has the shortest documents overall. Its median is about 126.5 words and its average is about 225 words.
  • rec.sport.hockey and sci.space are fairly similar in typical length:
    • rec.sport.hockey: median about 184.5 words, mean about 305.4
    • sci.space: median about 181 words, mean about 300
  • All categories show strong right-skew, meaning a smaller number of very long posts pull the averages upward. You can see that because the means are much larger than the medians in every group.
  • There are also large outliers in every category, especially:
    • rec.sport.hockey: max 11278 words
    • comp.graphics: max 9168 words
    • talk.politics.misc: max 7989 words
    • sci.space: max 6141 words

So, in direct answer to your goal: the distribution plot shows that politics posts tend to be longest, graphics posts shortest, and hockey/space fall in the middle, with all categories containing some unusually long documents.
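Because every category is strongly right-skewed with large outliers, a linear y-axis squashes the boxes against the bottom of the plot. One common remedy, sketched below on synthetic skewed data standing in for `df_len` (in the notebook you would pass the real `df_len`), is to put the y-axis on a log scale:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic right-skewed word counts standing in for df_len;
# in the notebook, plot the real df_len instead.
rng = np.random.default_rng(0)
df_len = pd.DataFrame({
    "category": np.repeat(["a", "b"], 200),
    "word_count": rng.lognormal(mean=5, sigma=1, size=400).astype(int) + 16,
})

fig, ax = plt.subplots(figsize=(8, 4))
sns.boxplot(data=df_len, x="category", y="word_count", ax=ax)
ax.set_yscale("log")  # compresses the long right tail so the boxes stay readable
ax.set_title("Document Length by Category (log scale)")
fig.tight_layout()
```

On a log scale the medians and interquartile ranges of all four categories become comparable at a glance, while the multi-thousand-word outliers no longer dominate the axis.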

show top 20 most frequent words after removing stopwords
Thinking

Preparing code for word frequency

I need to provide a simple code solution using notebook state. The user wants to show the top 20 most frequent words after removing stopwords. I should consider using either sklearn’s CountVectorizer with stop_words='english' or nltk.

The goal is to compute frequencies from the news.data, filtering out stopwords and potentially any artifacts. I'll create a single code block that returns a table of those top 20 words. I’ll keep it concise for clarity!

Let’s compute the top 20 words across the selected categories after removing English stopwords. I’ll use CountVectorizer, which is a clean way to tokenize and count words.

Compute top word counts
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(news.data)

word_counts = pd.DataFrame({
    'word': vec.get_feature_names_out(),
    'count': X.sum(axis=0).A1
}).sort_values('count', ascending=False).head(20).reset_index(drop=True)

word_counts
word count
0 edu 4379
1 subject 2391
2 lines 2367
3 organization 2280
4 com 2007
5 writes 1727
6 article 1622
7 space 1496
8 university 1290
9 don 1271
10 just 1239
11 like 1181
12 ca 1158
13 posting 1147
14 think 1118
15 people 1097
16 host 1045
17 nntp 1021
18 new 997
19 know 984

After removing English stopwords, the 20 most frequent words in these four newsgroup categories are:

  1. edu — 4379
  2. subject — 2391
  3. lines — 2367
  4. organization — 2280
  5. com — 2007
  6. writes — 1727
  7. article — 1622
  8. space — 1496
  9. university — 1290
  10. don — 1271
  11. just — 1239
  12. like — 1181
  13. ca — 1158
  14. posting — 1147
  15. think — 1118
  16. people — 1097
  17. host — 1045
  18. nntp — 1021
  19. new — 997
  20. know — 984

A couple of useful observations from this result:

  • Many top terms are email/news header artifacts rather than topic words, such as edu, subject, lines, organization, com, host, and nntp.
  • The clearest content-specific high-frequency word is space, which fits one of the selected categories directly.
  • Tokens like don and ca are likely fragments from contractions such as “don’t” and “can’t”, which can happen with the default tokenizer.

So this does answer the request, but the current top-20 list is influenced quite a bit by metadata-style text, not just the message body content.

extract top TF-IDF keywords per category
Thinking

Figuring out TF-IDF extraction

I’m working on TF-IDF per category by concatenating documents, which seems straightforward for keyword extraction. I’m thinking about refining this by removing headers later. I want to write a single code block using existing news categories. I’m considering the max_df parameter to downweight common words across categories, though it may behave unusually since I only have four documents.

It might be better not to use max_df at all. I’ll aim to create a tidy table with the top keywords per category. Let's make that happen!

Since the raw text includes header-style terms like subject and organization, a good next step is to compute category-level TF-IDF by combining all documents within each category and then taking the top terms per category.

Extract top TF IDF keywords
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

cat_map = dict(enumerate(news.target_names))

category_docs = (
    pd.DataFrame({
        'category': pd.Series(news.target).map(cat_map),
        'text': news.data
    })
    .groupby('category', sort=True)['text']
    .apply(' '.join)
)

vec = TfidfVectorizer(stop_words='english')
X_tfidf = vec.fit_transform(category_docs)

terms = vec.get_feature_names_out()
top_n = 10

rows = []
for i, category in enumerate(category_docs.index):
    scores = X_tfidf[i].toarray().ravel()
    top_idx = scores.argsort()[::-1][:top_n]
    for idx in top_idx:
        rows.append({
            'category': category,
            'keyword': terms[idx],
            'tfidf': scores[idx]
        })

top_keywords = pd.DataFrame(rows)
top_keywords
category keyword tfidf
0 comp.graphics edu 0.354754
1 comp.graphics graphics 0.252734
2 comp.graphics image 0.222606
3 comp.graphics lines 0.220652
4 comp.graphics subject 0.213810
5 comp.graphics organization 0.196705
6 comp.graphics com 0.145733
7 comp.graphics jpeg 0.144717
8 comp.graphics university 0.122813
9 comp.graphics posting 0.105024
10 rec.sport.hockey hockey 0.297385
11 rec.sport.hockey edu 0.259204
12 rec.sport.hockey nhl 0.191994
13 rec.sport.hockey team 0.169296
14 rec.sport.hockey ca 0.159731
15 rec.sport.hockey subject 0.151840
16 rec.sport.hockey lines 0.148731
17 rec.sport.hockey organization 0.145862
18 rec.sport.hockey game 0.142753
19 rec.sport.hockey season 0.142339
20 sci.space space 0.388450
21 sci.space edu 0.324079
22 sci.space nasa 0.204491
23 sci.space subject 0.175635
24 sci.space organization 0.170363
25 sci.space lines 0.170086
26 sci.space com 0.140952
27 sci.space writes 0.125414
28 sci.space orbit 0.124083
29 sci.space article 0.121807
30 talk.politics.misc edu 0.312622
31 talk.politics.misc com 0.228587
32 talk.politics.misc people 0.195317
33 talk.politics.misc stephanopoulos 0.191264
34 talk.politics.misc writes 0.175814
35 talk.politics.misc article 0.169791
36 talk.politics.misc don 0.148280
37 talk.politics.misc subject 0.142831
38 talk.politics.misc lines 0.139676
39 talk.politics.misc organization 0.137955
Thinking

Considering output structure

It seems the goal is to present the top keywords grouped by category. I’m thinking about whether to display them as sequential tables or a single pivoted table. A pivoted table could be clearer, with categories as rows and ranks 1-10 as columns. I'll aim for minimal code based on existing top_keywords. Truncating the output and possibly addressing tfidf order would be important too. Let’s figure out the best way to construct this DataFrame.

Refining the code output

To simplify things, I’ll create a clean version of top_keywords. The code snippet I've crafted allows me to assign ranks within categories and round the tfidf values. This should generate a full table displaying all relevant data—40 rows might show up in a notebook by default, but to be safe, I might set the pandas option to display up to 50 rows. I need to mention that if there are truncations, patterns may still be visible. I'll just provide insights and code without asking for any execution or sharing.

The visible TF-IDF output is already pointing to sensible category-specific terms:

  • comp.graphics is led by content words like graphics, image, and jpeg, which match the category well.
  • rec.sport.hockey is led by hockey, nhl, and team, also a strong match.
  • But metadata-style tokens such as edu, subject, lines, and organization are still showing up, so the extracted keywords are partly influenced by headers rather than just topic vocabulary.

The current table is truncated, so here is a compact full ranking for all categories from the existing top_keywords result:

Rank and format keywords
top_keywords_full = top_keywords.copy()
top_keywords_full["rank"] = top_keywords_full.groupby("category").cumcount() + 1
top_keywords_full["tfidf"] = top_keywords_full["tfidf"].round(3)
top_keywords_full = top_keywords_full[["category", "rank", "keyword", "tfidf"]].sort_values(["category", "rank"]).reset_index(drop=True)

pd.set_option("display.max_rows", 50)
top_keywords_full
category rank keyword tfidf
0 comp.graphics 1 edu 0.355
1 comp.graphics 2 graphics 0.253
2 comp.graphics 3 image 0.223
3 comp.graphics 4 lines 0.221
4 comp.graphics 5 subject 0.214
5 comp.graphics 6 organization 0.197
6 comp.graphics 7 com 0.146
7 comp.graphics 8 jpeg 0.145
8 comp.graphics 9 university 0.123
9 comp.graphics 10 posting 0.105
10 rec.sport.hockey 1 hockey 0.297
11 rec.sport.hockey 2 edu 0.259
12 rec.sport.hockey 3 nhl 0.192
13 rec.sport.hockey 4 team 0.169
14 rec.sport.hockey 5 ca 0.160
15 rec.sport.hockey 6 subject 0.152
16 rec.sport.hockey 7 lines 0.149
17 rec.sport.hockey 8 organization 0.146
18 rec.sport.hockey 9 game 0.143
19 rec.sport.hockey 10 season 0.142
20 sci.space 1 space 0.388
21 sci.space 2 edu 0.324
22 sci.space 3 nasa 0.204
23 sci.space 4 subject 0.176
24 sci.space 5 organization 0.170
25 sci.space 6 lines 0.170
26 sci.space 7 com 0.141
27 sci.space 8 writes 0.125
28 sci.space 9 orbit 0.124
29 sci.space 10 article 0.122
30 talk.politics.misc 1 edu 0.313
31 talk.politics.misc 2 com 0.229
32 talk.politics.misc 3 people 0.195
33 talk.politics.misc 4 stephanopoulos 0.191
34 talk.politics.misc 5 writes 0.176
35 talk.politics.misc 6 article 0.170
36 talk.politics.misc 7 don 0.148
37 talk.politics.misc 8 subject 0.143
38 talk.politics.misc 9 lines 0.140
39 talk.politics.misc 10 organization 0.138
Thinking

Deciding on keyword extraction

I need to choose between Path A or B to extract the top TF-IDF keywords per category. It seems like the recent outputs, particularly output-5 and output-6, are truncated for categories beyond the first. I think we need a full table to truly rank values. Path B might be better, as it could allow us to display the complete table with the necessary insights. I'll ensure to include a single code block to keep the output concise and focused.

Creating full rankings display

I need to code a display for full rankings, ensuring it's notably different from the last step, which merely assigned and showed the top keywords. Here, I think using the display function for each category in sequence makes sense, or maybe I should convert them to a pivot format. I plan to write minimal code and leverage a loop for sequential display without needing a final expression. The current outputs suggest a good signal-to-noise ratio, so I'll focus on presenting that clearly.

The extraction is already working: the visible results show strong topic terms like graphics, image, jpeg for comp.graphics and hockey, nhl, team for rec.sport.hockey. At the same time, header-like terms such as edu, subject, organization, and lines are still ranking highly, so the current keywords are a mix of true topic words and message metadata.

The only missing piece is the full per-category ranking, since the current notebook display is truncated. This shows the complete top TF-IDF keywords for each category:

Notebook code
# Note: display() is provided by IPython/Jupyter; in a plain Python
# script, replace display(...) with print(...).
for category in top_keywords["category"].drop_duplicates():
    display(
        top_keywords.loc[top_keywords["category"] == category]
        .assign(
            rank=lambda d: range(1, len(d) + 1),
            tfidf=lambda d: d["tfidf"].round(3)
        )[["category", "rank", "keyword", "tfidf"]]
        .reset_index(drop=True)
    )
category rank keyword tfidf
0 comp.graphics 1 edu 0.355
1 comp.graphics 2 graphics 0.253
2 comp.graphics 3 image 0.223
3 comp.graphics 4 lines 0.221
4 comp.graphics 5 subject 0.214
5 comp.graphics 6 organization 0.197
6 comp.graphics 7 com 0.146
7 comp.graphics 8 jpeg 0.145
8 comp.graphics 9 university 0.123
9 comp.graphics 10 posting 0.105
category rank keyword tfidf
0 rec.sport.hockey 1 hockey 0.297
1 rec.sport.hockey 2 edu 0.259
2 rec.sport.hockey 3 nhl 0.192
3 rec.sport.hockey 4 team 0.169
4 rec.sport.hockey 5 ca 0.160
5 rec.sport.hockey 6 subject 0.152
6 rec.sport.hockey 7 lines 0.149
7 rec.sport.hockey 8 organization 0.146
8 rec.sport.hockey 9 game 0.143
9 rec.sport.hockey 10 season 0.142
category rank keyword tfidf
0 sci.space 1 space 0.388
1 sci.space 2 edu 0.324
2 sci.space 3 nasa 0.204
3 sci.space 4 subject 0.176
4 sci.space 5 organization 0.170
5 sci.space 6 lines 0.170
6 sci.space 7 com 0.141
7 sci.space 8 writes 0.125
8 sci.space 9 orbit 0.124
9 sci.space 10 article 0.122
category rank keyword tfidf
0 talk.politics.misc 1 edu 0.313
1 talk.politics.misc 2 com 0.229
2 talk.politics.misc 3 people 0.195
3 talk.politics.misc 4 stephanopoulos 0.191
4 talk.politics.misc 5 writes 0.176
5 talk.politics.misc 6 article 0.170
6 talk.politics.misc 7 don 0.148
7 talk.politics.misc 8 subject 0.143
8 talk.politics.misc 9 lines 0.140
9 talk.politics.misc 10 organization 0.138
Thinking

Crafting concise insights

I need to ensure that I provide an answer that aligns with path A and connects to the goal. I can include concrete insights without using code or excessive headings. Instead of restating full tables, I can summarize the top keywords per category from the output. I should mention that the headers leaked due to the fetch_20newsgroups default retaining unnecessary information. Additionally, I’ll note that some of the extracted keywords are artifacts. Finally, I’ll add a final answer marker to wrap it up.

From the observed TF-IDF rankings, the category-specific keywords are extracted and they largely align with the topics:

  • comp.graphics: graphics, image, jpeg are the most meaningful topic words. This category also contains non-topic terms like edu, lines, subject, organization, com, and posting.
  • rec.sport.hockey: hockey, nhl, team, game, season clearly capture the category. Some metadata-style terms still appear, such as edu, ca, subject, lines, and organization.
  • sci.space: space, nasa, and orbit are strong topic keywords. The rest again includes message-header vocabulary like edu, subject, organization, lines, com, writes, and article.
  • talk.politics.misc: the most distinctive content terms are weaker here, but people and stephanopoulos stand out as topical. Many top terms are still generic or header-related: edu, com, writes, article, subject, lines, organization.

So the extraction worked, but the results are partly contaminated by email/news headers and other boilerplate. The strongest topic keywords visible from the current outputs are:

  • comp.graphics: graphics, image, jpeg
  • rec.sport.hockey: hockey, nhl, team, game, season
  • sci.space: space, nasa, orbit
  • talk.politics.misc: people, stephanopoulos

This means the notebook has successfully extracted top TF-IDF keywords per category, with the main limitation being that header terms are still included in the vocabulary.
