NLP

Topic Modeling with LDA in Python

Apply LDA topic modeling to a news article dataset, extract coherent topics, and visualize topic-word distributions.

What

This AI Data Analyst workflow loads the 20 Newsgroups training split from scikit-learn and converts the text into a TF-IDF document-term matrix while reporting the vocabulary size. It fits a 5-topic Latent Dirichlet Allocation (LDA) model and prints the top words for each topic to support interpretation. It visualizes topic prevalence across the corpus and identifies the dominant topic for a small sample of documents.

Who

This is for data analysts and NLP practitioners who want a reproducible, code-generating example of LDA topic modeling on a standard benchmark dataset. It helps users validate preprocessing choices, inspect topic-word distributions, and connect topics back to representative documents.

Tools

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • TF-IDF vocabulary size report after filtering
  • Top 10 words per topic for 5 LDA topics
  • Bar chart of topic prevalence across the dataset
  • Table of 5 sample documents with dominant topic labels

Quality Score

8/10

gpt-5.2 · mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 1/2

    Needs work

    All prompted steps were executed (load 20NG, vectorize, fit 5-topic LDA, plot topic prevalence, show dominant topic for 5 docs). However, the expected outcome of vocabulary size ~50k after filtering was not met (reported ~130k), and topics are not clearly coherent/interpretable as requested.

  • Execution Correctness: 2/2

    Excellent

    Code uses standard sklearn APIs correctly (fetch_20newsgroups, TfidfVectorizer/CountVectorizer, LatentDirichletAllocation, transform, plotting) and outputs are consistent with the code shown.

  • Output Quality: 2/3

    Good

    Outputs include vocabulary size, top-10 word lists for 5 topics, a topic-count table and bar chart, and a 5-row table with document preview + dominant topic/probability. But the vocabulary size deviates substantially from the expected ~50k and several topics are dominated by header/noise tokens, reducing semantic match to 'coherent topics'.

  • Reasoning Quality: 2/2

    Excellent

    Reasoning appropriately notes that LDA works better with counts than TF-IDF and correctly interprets noisy topics and skewed topic prevalence based on the printed top words and counts.

  • Reliability: 1/1

    Excellent

    Workflow is reasonably robust and reproducible (fixed random_state/seed, uses common libraries); no hallucinated functions or unsafe steps, though preprocessing choices are minimal and lead to noisy topics.
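The vocabulary-size gap flagged in the rubric (≈130k observed vs the expected ≈50k) can be closed directly with `TfidfVectorizer`'s `max_features` parameter, which keeps only the N most frequent terms. A minimal sketch on a toy corpus; the cap of 4 is purely illustrative (on 20 Newsgroups one would use something like `max_features=50000`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "news article about space launch",
    "another news article about graphics software",
    "space shuttle launch news",
]

# max_features caps the vocabulary at the N highest-frequency terms,
# a direct way to hit a target size such as ~50k on a large corpus.
tfidf = TfidfVectorizer(max_features=4)
X = tfidf.fit_transform(docs)

print(sorted(tfidf.vocabulary_))
print(X.shape)
```

The cap is applied by corpus-wide term frequency, so the most common terms survive regardless of how rare or noisy the tail is.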

load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
Thinking

Loading and inspecting data

I need to load a dataset using fetch_20newsgroups, focusing on the training subset and possibly removing some features. I’ll create a DataFrame with text and target, then show its shape and columns. While I could access target_names, it seems reasonable to skip displaying that in this case. It sounds like I'll mostly just display the DataFrame's head, along with its shape and columns, using plain Python variables, since that fits the instruction to keep it simple.

Let’s start by loading the 20 Newsgroups training split and inspecting its basic structure before vectorizing.

Load news dataset
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

news = fetch_20newsgroups(subset="train")

df = pd.DataFrame({
    "text": news.data,
    "target": news.target
})

display(pd.Series({
    "shape": df.shape,
    "columns": df.columns.tolist(),
    "n_classes": len(news.target_names)
}))

df.head()
shape            (11314, 2)
columns      [text, target]
n_classes                20
dtype: object
text target
0 From: lerxst@wam.umd.edu (where's my thing)\nS... 7
1 From: guykuo@carson.u.washington.edu (Guy Kuo)... 4
2 From: twillis@ec.ecn.purdue.edu (Thomas E Will... 4
3 From: jgreen@amber (Joe Green)\nSubject: Re: W... 1
4 From: jcm@head-cfa.harvard.edu (Jonathan McDow... 14

The data is in place and looks ready for text vectorization: you have 11,314 training documents, a text column with the raw posts, and 20 target classes. The missing piece for your original request is the TF-IDF transformation and the vocabulary size.

Compute TF IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["text"])

result = {
    "tfidf_shape": X.shape,
    "vocabulary_size": len(tfidf.vocabulary_)
}

result
{'tfidf_shape': (11314, 130107), 'vocabulary_size': 130107}

You loaded the 20 Newsgroups training data and converted the raw text into a TF-IDF feature matrix.

From the observed outputs:

  • The dataset contains 11,314 documents
  • The TF-IDF matrix has shape (11,314, 130,107)
  • The vocabulary size is 130,107

That means the vectorizer found 130,107 unique terms across the training corpus, and each document is represented by TF-IDF weights over that vocabulary.
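A 130,107-term vocabulary can be shrunk substantially by filtering during vectorization. A minimal sketch on a toy corpus, with illustrative thresholds (on the real corpus, `stop_words="english"`, `min_df`, and `max_df` would prune stop words, rare typo-like terms, and near-ubiquitous terms, respectively):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play on the mat",
    "xyzzy appears only once",
]

# Unfiltered: every token that appears anywhere enters the vocabulary.
raw = TfidfVectorizer()
raw.fit(docs)

# Filtered: drop English stop words, terms in fewer than 2 documents,
# and terms appearing in more than 95% of documents.
filtered = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.95)
filtered.fit(docs)

print(len(raw.vocabulary_), len(filtered.vocabulary_))
```

On a tiny corpus the effect is drastic; on 20 Newsgroups the same parameters typically cut the vocabulary by roughly half or more, moving it toward the expected ~50k.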

fit an LDA model with 5 topics and print top 10 words per topic

LDA is usually more meaningful on term counts than on TF-IDF, so I’d switch to a CountVectorizer for this topic-modeling step.

Extract LDA topics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd

cv = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
X_counts = cv.fit_transform(df["text"])

lda = LatentDirichletAllocation(
    n_components=5,
    random_state=42,
    learning_method="batch"
)
lda.fit(X_counts)

terms = cv.get_feature_names_out()

topics = []
for i, topic in enumerate(lda.components_):
    top_idx = topic.argsort()[-10:][::-1]
    topics.append({
        "topic": i,
        "top_words": ", ".join(terms[top_idx])
    })

pd.DataFrame(topics)
topic top_words
0 0 people, edu, god, writes, think, don, com, article, just, say
1 1 ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
2 2 edu, university, writes, article, ca, posting, host, nntp, cs, com
3 3 com, edu, article, writes, posting, drive, host, nntp, disk, know
4 4 edu, com, use, file, posting, windows, space, host, nntp, program

The 5-topic LDA model has been fit, and the output already shows the top 10 words for each topic.

Top words per topic from the observed result:

  • Topic 0: people, edu, god, writes, think, don, com, article, just, say
  • Topic 1: ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
  • Topic 2: edu, university, writes, article, ca, posting, host, nntp, cs, com
  • Topic 3: com, edu, article, writes, posting, drive, host, nntp, disk, know
  • Topic 4: edu, com, use, file, posting, windows, space, host, nntp, program

What this means:

  • The model did produce 5 distinct latent topics and surfaced 10 highest-weighted terms for each.
  • Several topics are dominated by metadata-like tokens such as edu, com, writes, posting, host, and nntp, which suggests the model is picking up email/header structure in the 20 Newsgroups text.
  • Topic 1 is especially noisy, with tokens like ax, g9v, b8f, and 34u, which look like encoded or artifact-like text rather than interpretable semantic words.

So the requested LDA result is available, but the topics are only partly interpretable because the raw text still contains a lot of header and message-format noise.
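The header noise has a direct remedy: scikit-learn's `fetch_20newsgroups` accepts `remove=("headers", "footers", "quotes")`, which strips that boilerplate before vectorization. The effect can be sketched without re-downloading the dataset, since each raw post is RFC-822-style headers, a blank line, then the body:

```python
# A raw 20 Newsgroups post: header lines, a blank line, then the body.
# This sample post is made up for illustration.
sample = (
    "From: user@example.edu (Some User)\n"
    "Subject: Re: example\n"
    "Nntp-Posting-Host: host.example.edu\n"
    "\n"
    "This is the actual message body that topic modeling should see.\n"
)

# Splitting on the first blank line discards the header block.
body = sample.split("\n\n", 1)[1]
print(body)
```

Reloading with `fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))` and refitting would likely demote the edu/com/nntp tokens that currently dominate several topics.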

plot topic distribution across the dataset
Thinking

Planning topic distribution plotting

I need to respond with a single step of code to plot the topic distribution across the dataset using the notebook state. My goal is to compute the dominant topic per document from lda.transform(X_counts) and create just one chart, likely using seaborn or matplotlib. I should include a brief explanation but ensure not to assume outputs. There will be no mutations or multiple tables. I'll craft concise code for a countplot or barplot. This should keep it straightforward!

A simple way to see topic distribution is to assign each document to its most probable topic from the LDA model, then count documents per topic.

Plot topic distribution
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

doc_topic = lda.transform(X_counts)
dominant_topic = doc_topic.argmax(axis=1)

topic_counts = pd.Series(dominant_topic).value_counts().sort_index()
plot_df = topic_counts.rename_axis("topic").reset_index(name="document_count")

plt.figure(figsize=(8, 4))
sns.barplot(data=plot_df, x="topic", y="document_count", color="steelblue")
plt.title("LDA Topic Distribution Across Documents")
plt.xlabel("Dominant topic")
plt.ylabel("Number of documents")
plt.tight_layout()
plt.show()
Thinking

Clarifying chart requirements

I need to create an answer based on option A/B with just a chart, excluding numeric values. I'll go with B first to ensure I print the values used in the chart. Then, I need to gather brief insights from current outputs and provide one minimal runnable code block to collect the missing data. I should analyze the latest output and the notebook state against the goal, likely displaying plot_df or topic_counts in a compact way.

The plot was generated successfully, so the topic distribution is available visually. Based on the earlier topic-word output, this chart is showing how many documents are assigned to each of the 5 LDA topics as their dominant topic. To interpret that distribution clearly, the missing piece is the actual document counts behind the bars.

Display plot data
plot_df
topic document_count
0 0 2510
1 1 24
2 2 2429
3 3 2870
4 4 3481

The topic distribution is quite uneven across the 11,314 documents.

  • Topic 4 is the largest with 3,481 documents.
  • Topic 3 follows with 2,870 documents.
  • Topic 0 has 2,510 documents.
  • Topic 2 has 2,429 documents.
  • Topic 1 is extremely rare with only 24 documents.

This means most documents are concentrated in Topics 0, 2, 3, and 4, while Topic 1 is almost absent in comparison. That very small Topic 1 count matches the unusual top words seen earlier (ax, max, g9v, b8f, ...), which suggests it may be capturing noisy or artifact-like text rather than a broad semantic theme. Overall, the plot shows a strongly imbalanced topic distribution rather than five similarly sized topics.

show which topic dominates for a sample of 5 documents
Thinking

Creating code for topic dominance

I need to craft a single code block to respond to the user's request about showing the dominant topic for a sample of five documents. I'll use the existing notebook state, creating a sample from the DataFrame to display snippets alongside the dominant topic and possibly the top topic probabilities. The goal is to keep it concise while preserving the target variable. I'll make sure to include a table format for display. Let’s get to it!

Here’s a small sample of 5 documents with their dominant LDA topic. I’m also showing the topic confidence and a short text preview so it’s easy to inspect.

Create topic sample table
import pandas as pd

sample_idx = df.sample(5, random_state=42).index

sample_topics = pd.DataFrame(doc_topic[sample_idx]).idxmax(axis=1).to_numpy()
sample_conf = pd.DataFrame(doc_topic[sample_idx]).max(axis=1).round(3).to_numpy()

sample_df = df.loc[sample_idx, ["target", "text"]].copy()
sample_df["dominant_topic"] = sample_topics
sample_df["topic_confidence"] = sample_conf
sample_df["text_preview"] = sample_df["text"].str.replace("\n", " ", regex=False).str.slice(0, 200)

sample_df = sample_df[["target", "dominant_topic", "topic_confidence", "text_preview"]].reset_index(names="doc_index")
sample_df
doc_index target dominant_topic topic_confidence text_preview
0 7492 4 2 0.984 From: rrn@po.CWRU.Edu (Robert R. Novitskey) Su...
1 3546 2 2 0.601 From: ardie@ux1.cso.uiuc.edu (Ardie Mack) Subj...
2 5582 6 3 0.403 From: tsa@cellar.org (The Silent Assassin) Sub...
3 4793 16 3 0.835 From: guy@idacom.hp.com (Guy M. Trotter) Subje...
4 3813 10 2 0.993 From: jwodzia@fadel.uucp (john wodziak) Subjec...
Thinking

Analyzing topic dominance

I'm determining whether to follow path A or B for presenting topic dominance across a sample of 5 documents. My goal is to convey insights based on the observed outputs, rather than just repeating what's already visible. Among the 5 samples, topic 2 dominates in 3 documents while topic 3 leads in 2. The confidence levels also vary, being high for some documents in topic 2 and showing interesting variations in target labels across topics. Let's wrap this up!

In this 5-document sample, Topic 2 is the most common dominant topic: it appears in 3 of the 5 documents. Topic 3 dominates the other 2 documents.

A few concrete takeaways from the sample:

  • Very strong topic assignments appear for:
    • document 3813 → Topic 2 with confidence 0.993
    • document 7492 → Topic 2 with confidence 0.984
  • Moderate assignments:
    • document 4793 → Topic 3 with confidence 0.835
    • document 3546 → Topic 2 with confidence 0.601
  • Most ambiguous document in the sample:
    • document 5582 → Topic 3 with confidence 0.403, so its dominant topic is weaker than the others

So for this sample, the dominant-topic pattern is:

  • Topic 2: documents 7492, 3546, 3813
  • Topic 3: documents 5582, 4793

The target labels in this sample are all different, which suggests these learned topics are not mapping one-to-one to the dataset’s class labels; instead, they are grouping documents by broader word-usage patterns.
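The topic-vs-label relationship can be checked corpus-wide, not just on 5 documents, with a contingency table. A sketch on hypothetical assignments (in the notebook these would be `dominant_topic` and `df["target"]`):

```python
import pandas as pd

# Hypothetical dominant-topic assignments and class labels for 6 documents.
dominant_topic = [2, 2, 3, 3, 2, 4]
target = [4, 2, 6, 16, 10, 0]

# Rows: learned LDA topic; columns: true newsgroup label.
xtab = pd.crosstab(
    pd.Series(dominant_topic, name="dominant_topic"),
    pd.Series(target, name="target"),
)
print(xtab)
```

A near-diagonal table would indicate topics tracking newsgroups; a spread-out row, like Topic 2 here covering three different labels, indicates a topic grouping documents by broader word-usage patterns instead.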
