NLP

Topic Modeling with LDA in Python

Apply LDA topic modeling to a news article dataset, extract coherent topics, and visualize topic-word distributions.

What

This AI Data Analyst workflow loads the 20 Newsgroups training split from scikit-learn and converts the text into a TF-IDF document-term matrix while reporting the vocabulary size. It fits a 5-topic Latent Dirichlet Allocation (LDA) model and prints the top words for each topic to support interpretation. It visualizes topic prevalence across the corpus and identifies the dominant topic for a small sample of documents.

Who

This is for data analysts and NLP practitioners who want a reproducible, code-generating example of LDA topic modeling on a standard benchmark dataset. It helps users validate preprocessing choices, inspect topic-word distributions, and connect topics back to representative documents.

Tools

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Outcomes

  • TF-IDF vocabulary size report after filtering
  • Top 10 words per topic for 5 LDA topics
  • Bar chart of topic prevalence across the dataset
  • Table of 5 sample documents with dominant topic labels

Quality Score

8/10

gpt-5.2 · mljar-ai

Last scored: Apr 7, 2026

  • Task Completion: 1/2

    Needs work

    All prompted steps were executed (load 20NG, vectorize, fit 5-topic LDA, plot topic prevalence, show dominant topic for 5 docs). However, the expected outcome of vocabulary size ~50k after filtering was not met (reported ~130k), and topics are not clearly coherent/interpretable as requested.

  • Execution Correctness: 2/2

    Excellent

    Code uses standard sklearn APIs correctly (fetch_20newsgroups, TfidfVectorizer/CountVectorizer, LatentDirichletAllocation, transform, plotting) and outputs are consistent with the code shown.

  • Output Quality: 2/3

    Good

    Outputs include vocabulary size, top-10 word lists for 5 topics, a topic-count table and bar chart, and a 5-row table with document preview + dominant topic/probability. But the vocabulary size deviates substantially from the expected ~50k and several topics are dominated by header/noise tokens, reducing semantic match to 'coherent topics'.

  • Reasoning Quality: 2/2

    Excellent

    Reasoning appropriately notes that LDA works better with counts than TF-IDF and correctly interprets noisy topics and skewed topic prevalence based on the printed top words and counts.

  • Reliability: 1/1

    Excellent

    Workflow is reasonably robust and reproducible (fixed random_state/seed, uses common libraries); no hallucinated functions or unsafe steps, though preprocessing choices are minimal and lead to noisy topics.
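The vocabulary-size gap flagged in the rubric (≈130k observed vs the expected ≈50k) can be closed directly with `TfidfVectorizer`'s `max_features` parameter, which keeps only the N most frequent terms. A minimal sketch on a toy corpus; the cap of 4 is purely illustrative (on 20 Newsgroups one would use something like `max_features=50000`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "news article about space launch",
    "another news article about graphics software",
    "space shuttle launch news",
]

# max_features caps the vocabulary at the N highest-frequency terms,
# a direct way to hit a target size such as ~50k on a large corpus.
tfidf = TfidfVectorizer(max_features=4)
X = tfidf.fit_transform(docs)

print(sorted(tfidf.vocabulary_))
print(X.shape)
```

The cap is applied by corpus-wide term frequency, so the most common terms survive regardless of how rare or noisy the tail is.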

load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
Thinking

Loading and inspecting data

I need to load a dataset using fetch_20newsgroups, focusing on the training subset and possibly removing some features. I’ll create a DataFrame with text and target, then show its shape and columns. While I could access target_names, it seems reasonable to skip displaying that in this case. It sounds like I'll mostly just display the DataFrame's head, along with its shape and columns, using plain Python variables, since that fits the instruction to keep it simple.

Let’s start by loading the 20 Newsgroups training split and inspecting its basic structure before vectorizing.

Load news dataset
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

news = fetch_20newsgroups(subset="train")

df = pd.DataFrame({
    "text": news.data,
    "target": news.target
})

display(pd.Series({
    "shape": df.shape,
    "columns": df.columns.tolist(),
    "n_classes": len(news.target_names)
}))

df.head()
shape            (11314, 2)
columns      [text, target]
n_classes                20
dtype: object
text target
0 From: lerxst@wam.umd.edu (where's my thing)\nS... 7
1 From: guykuo@carson.u.washington.edu (Guy Kuo)... 4
2 From: twillis@ec.ecn.purdue.edu (Thomas E Will... 4
3 From: jgreen@amber (Joe Green)\nSubject: Re: W... 1
4 From: jcm@head-cfa.harvard.edu (Jonathan McDow... 14

The data is in place and looks ready for text vectorization: you have 11,314 training documents, a text column with the raw posts, and 20 target classes. The missing piece for your original request is the TF-IDF transformation and the vocabulary size.

Compute TF IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["text"])

result = {
    "tfidf_shape": X.shape,
    "vocabulary_size": len(tfidf.vocabulary_)
}

result
{'tfidf_shape': (11314, 130107), 'vocabulary_size': 130107}

You loaded the 20 Newsgroups training data and converted the raw text into a TF-IDF feature matrix.

From the observed outputs:

  • The dataset contains 11,314 documents
  • The TF-IDF matrix has shape (11,314, 130,107)
  • The vocabulary size is 130,107

That means the vectorizer found 130,107 unique terms across the training corpus, and each document is represented by TF-IDF weights over that vocabulary.
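A 130,107-term vocabulary can be shrunk substantially by filtering during vectorization. A minimal sketch on a toy corpus, with illustrative thresholds (on the real corpus, `stop_words="english"`, `min_df`, and `max_df` would prune stop words, rare typo-like terms, and near-ubiquitous terms, respectively):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play on the mat",
    "xyzzy appears only once",
]

# Unfiltered: every token that appears anywhere enters the vocabulary.
raw = TfidfVectorizer()
raw.fit(docs)

# Filtered: drop English stop words, terms in fewer than 2 documents,
# and terms appearing in more than 95% of documents.
filtered = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.95)
filtered.fit(docs)

print(len(raw.vocabulary_), len(filtered.vocabulary_))
```

On a tiny corpus the effect is drastic; on 20 Newsgroups the same parameters typically cut the vocabulary by roughly half or more, moving it toward the expected ~50k.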

fit an LDA model with 5 topics and print top 10 words per topic

LDA is usually more meaningful on term counts than on TF-IDF, so I’d switch to a CountVectorizer for this topic-modeling step.

Extract LDA topics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd

cv = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
X_counts = cv.fit_transform(df["text"])

lda = LatentDirichletAllocation(
    n_components=5,
    random_state=42,
    learning_method="batch"
)
lda.fit(X_counts)

terms = cv.get_feature_names_out()

topics = []
for i, topic in enumerate(lda.components_):
    top_idx = topic.argsort()[-10:][::-1]
    topics.append({
        "topic": i,
        "top_words": ", ".join(terms[top_idx])
    })

pd.DataFrame(topics)
topic top_words
0 0 people, edu, god, writes, think, don, com, article, just, say
1 1 ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
2 2 edu, university, writes, article, ca, posting, host, nntp, cs, com
3 3 com, edu, article, writes, posting, drive, host, nntp, disk, know
4 4 edu, com, use, file, posting, windows, space, host, nntp, program

The 5-topic LDA model has been fit, and the output already shows the top 10 words for each topic.

Top words per topic from the observed result:

  • Topic 0: people, edu, god, writes, think, don, com, article, just, say
  • Topic 1: ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
  • Topic 2: edu, university, writes, article, ca, posting, host, nntp, cs, com
  • Topic 3: com, edu, article, writes, posting, drive, host, nntp, disk, know
  • Topic 4: edu, com, use, file, posting, windows, space, host, nntp, program

What this means:

  • The model did produce 5 distinct latent topics and surfaced 10 highest-weighted terms for each.
  • Several topics are dominated by metadata-like tokens such as edu, com, writes, posting, host, and nntp, which suggests the model is picking up email/header structure in the 20 Newsgroups text.
  • Topic 1 is especially noisy, with tokens like ax, g9v, b8f, and 34u, which look like encoded or artifact-like text rather than interpretable semantic words.

So the requested LDA result is available, but the topics are only partly interpretable because the raw text still contains a lot of header and message-format noise.
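The header noise has a direct remedy: scikit-learn's `fetch_20newsgroups` accepts `remove=("headers", "footers", "quotes")`, which strips that boilerplate before vectorization. The effect can be sketched without re-downloading the dataset, since each raw post is RFC-822-style headers, a blank line, then the body:

```python
# A raw 20 Newsgroups post: header lines, a blank line, then the body.
# This sample post is made up for illustration.
sample = (
    "From: user@example.edu (Some User)\n"
    "Subject: Re: example\n"
    "Nntp-Posting-Host: host.example.edu\n"
    "\n"
    "This is the actual message body that topic modeling should see.\n"
)

# Splitting on the first blank line discards the header block.
body = sample.split("\n\n", 1)[1]
print(body)
```

Reloading with `fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))` and refitting would likely demote the edu/com/nntp tokens that currently dominate several topics.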

plot topic distribution across the dataset
Thinking

Planning topic distribution plotting

I need to respond with a single step of code to plot the topic distribution across the dataset using the notebook state. My goal is to compute the dominant topic per document from lda.transform(X_counts) and create just one chart, likely using seaborn or matplotlib. I should include a brief explanation but ensure not to assume outputs. There will be no mutations or multiple tables. I'll craft concise code for a countplot or barplot. This should keep it straightforward!

A simple way to see topic distribution is to assign each document to its most probable topic from the LDA model, then count documents per topic.

Plot topic distribution
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

doc_topic = lda.transform(X_counts)
dominant_topic = doc_topic.argmax(axis=1)

topic_counts = pd.Series(dominant_topic).value_counts().sort_index()
plot_df = topic_counts.rename_axis("topic").reset_index(name="document_count")

plt.figure(figsize=(8, 4))
sns.barplot(data=plot_df, x="topic", y="document_count", color="steelblue")
plt.title("LDA Topic Distribution Across Documents")
plt.xlabel("Dominant topic")
plt.ylabel("Number of documents")
plt.tight_layout()
plt.show()
Thinking

Clarifying chart requirements

I need to create an answer based on option A/B with just a chart, excluding numeric values. I'll go with B first to ensure I print the values used in the chart. Then, I need to gather brief insights from current outputs and provide one minimal runnable code block to collect the missing data. I should analyze the latest output and the notebook state against the goal, likely displaying plot_df or topic_counts in a compact way.

The plot was generated successfully, so the topic distribution is available visually. Based on the earlier topic-word output, this chart is showing how many documents are assigned to each of the 5 LDA topics as their dominant topic. To interpret that distribution clearly, the missing piece is the actual document counts behind the bars.

Display plot data
plot_df
topic document_count
0 0 2510
1 1 24
2 2 2429
3 3 2870
4 4 3481

The topic distribution is quite uneven across the 11,314 documents.

  • Topic 4 is the largest with 3,481 documents.
  • Topic 3 follows with 2,870 documents.
  • Topic 0 has 2,510 documents.
  • Topic 2 has 2,429 documents.
  • Topic 1 is extremely rare with only 24 documents.

This means most documents are concentrated in Topics 0, 2, 3, and 4, while Topic 1 is almost absent in comparison. That very small Topic 1 count matches the unusual top words seen earlier (ax, max, g9v, b8f, ...), which suggests it may be capturing noisy or artifact-like text rather than a broad semantic theme. Overall, the plot shows a strongly imbalanced topic distribution rather than five similarly sized topics.

show which topic dominates for a sample of 5 documents
Thinking

Creating code for topic dominance

I need to craft a single code block to respond to the user's request about showing the dominant topic for a sample of five documents. I'll use the existing notebook state, creating a sample from the DataFrame to display snippets alongside the dominant topic and possibly the top topic probabilities. The goal is to keep it concise while preserving the target variable. I'll make sure to include a table format for display. Let’s get to it!

Here’s a small sample of 5 documents with their dominant LDA topic. I’m also showing the topic confidence and a short text preview so it’s easy to inspect.

Create topic sample table
import pandas as pd

sample_idx = df.sample(5, random_state=42).index

sample_topics = pd.DataFrame(doc_topic[sample_idx]).idxmax(axis=1).to_numpy()
sample_conf = pd.DataFrame(doc_topic[sample_idx]).max(axis=1).round(3).to_numpy()

sample_df = df.loc[sample_idx, ["target", "text"]].copy()
sample_df["dominant_topic"] = sample_topics
sample_df["topic_confidence"] = sample_conf
sample_df["text_preview"] = sample_df["text"].str.replace("\n", " ", regex=False).str.slice(0, 200)

sample_df = sample_df[["target", "dominant_topic", "topic_confidence", "text_preview"]].reset_index(names="doc_index")
sample_df
doc_index target dominant_topic topic_confidence text_preview
0 7492 4 2 0.984 From: rrn@po.CWRU.Edu (Robert R. Novitskey) Su...
1 3546 2 2 0.601 From: ardie@ux1.cso.uiuc.edu (Ardie Mack) Subj...
2 5582 6 3 0.403 From: tsa@cellar.org (The Silent Assassin) Sub...
3 4793 16 3 0.835 From: guy@idacom.hp.com (Guy M. Trotter) Subje...
4 3813 10 2 0.993 From: jwodzia@fadel.uucp (john wodziak) Subjec...
Thinking

Analyzing topic dominance

I'm determining whether to follow path A or B for presenting topic dominance across a sample of 5 documents. My goal is to convey insights based on the observed outputs, rather than just repeating what's already visible. Among the 5 samples, topic 2 dominates in 3 documents while topic 3 leads in 2. The confidence levels also vary, being high for some documents in topic 2 and showing interesting variations in target labels across topics. Let's wrap this up!

In this 5-document sample, Topic 2 is the most common dominant topic: it appears in 3 of the 5 documents. Topic 3 dominates the other 2 documents.

A few concrete takeaways from the sample:

  • Very strong topic assignments appear for:
    • document 3813 → Topic 2 with confidence 0.993
    • document 7492 → Topic 2 with confidence 0.984
  • Moderate assignments:
    • document 4793 → Topic 3 with confidence 0.835
    • document 3546 → Topic 2 with confidence 0.601
  • Most ambiguous document in the sample:
    • document 5582 → Topic 3 with confidence 0.403, so its dominant topic is weaker than the others

So for this sample, the dominant-topic pattern is:

  • Topic 2: documents 7492, 3546, 3813
  • Topic 3: documents 5582, 4793

The target labels in this sample are all different, which suggests these learned topics are not mapping one-to-one to the dataset’s class labels; instead, they are grouping documents by broader word-usage patterns.
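The topic-vs-label relationship can be checked corpus-wide, not just on 5 documents, with a contingency table. A sketch on hypothetical assignments (in the notebook these would be `dominant_topic` and `df["target"]`):

```python
import pandas as pd

# Hypothetical dominant-topic assignments and class labels for 6 documents.
dominant_topic = [2, 2, 3, 3, 2, 4]
target = [4, 2, 6, 16, 10, 0]

# Rows: learned LDA topic; columns: true newsgroup label.
xtab = pd.crosstab(
    pd.Series(dominant_topic, name="dominant_topic"),
    pd.Series(target, name="target"),
)
print(xtab)
```

A near-diagonal table would indicate topics tracking newsgroups; a spread-out row, like Topic 2 here covering three different labels, indicates a topic grouping documents by broader word-usage patterns instead.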
