Topic Modeling with LDA in Python

Apply LDA topic modeling to a news article dataset, extract coherent topics, and visualize topic-word distributions.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the 20 Newsgroups training split from scikit-learn and converts the text into a TF-IDF document-term matrix while reporting the vocabulary size. It fits a 5-topic Latent Dirichlet Allocation (LDA) model and prints the top words for each topic to support interpretation. It visualizes topic prevalence across the corpus and identifies the dominant topic for a small sample of documents.

Who this example is for

This is for data analysts and NLP practitioners who want a reproducible, code-generating example of LDA topic modeling on a standard benchmark dataset. It helps users validate preprocessing choices, inspect topic-word distributions, and connect topics back to representative documents.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

TF-IDF vocabulary size report after filtering
Top 10 words per topic for 5 LDA topics
Bar chart of topic prevalence across the dataset
Table of 5 sample documents with dominant topic labels

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

scikit-learn
pandas
numpy
matplotlib
seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

1. load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
2. fit an LDA model with 5 topics and print top 10 words per topic
3. plot topic distribution across the dataset
4. show which topic dominates for a sample of 5 documents

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model Source	Score	Task	Execution	Output	Reasoning	Reliability	Run
gemma4:31b	10/10	2/2	2/2	3/3	2/2	1/1	Open gemma4:31b conversation
gpt-oss:120b	10/10	2/2	2/2	3/3	2/2	1/1	Open gpt-oss:120b conversation
qwen3.5:397b	10/10	2/2	2/2	3/3	2/2	1/1	Open qwen3.5:397b conversation
gpt-5.4	8/10	2/2	2/2	2/3	1/2	1/1	Open gpt-5.4 conversation
qwen3-coder-next	8/10	2/2	2/2	2/3	1/2	1/1	Open qwen3-coder-next conversation
glm-5.1	7/10	2/2	2/2	2/3	1/2	0/1	Open glm-5.1 conversation

How scoring works

Each run is graded on five dimensions:

Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
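
As a quick check of the rubric, the sketch below sums the five dimension scores for one run. The names `MAX_POINTS` and `total_score` are illustrative and not part of any grading tool; the values are the gpt-5.4 row from the comparison table.

```python
# Maximum points per dimension, as listed in the rubric above.
MAX_POINTS = {
    "task": 2,
    "execution": 2,
    "output": 3,
    "reasoning": 2,
    "reliability": 1,
}


def total_score(scores):
    """Sum the five dimension scores after checking each is within its range."""
    for name, limit in MAX_POINTS.items():
        value = scores[name]
        if not 0 <= value <= limit:
            raise ValueError(f"{name} must be between 0 and {limit}, got {value}")
    return sum(scores[name] for name in MAX_POINTS)


# Dimension scores for the gpt-5.4 run, copied from the comparison table.
gpt_54 = {"task": 2, "execution": 2, "output": 2, "reasoning": 1, "reliability": 1}
print(total_score(gpt_54), "/", sum(MAX_POINTS.values()))
```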

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Topic Modeling with LDA in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size

Thinking

Planning dataset handling

I need to take this one step at a time since the user is asking me to load a dataset, vectorize it with TF-IDF, and show the vocabulary size. The first step should be to load and inspect the dataset. I’ll show the shape, columns, and head of the data. It seems the dataset might not be a DataFrame necessarily, but I should adapt by creating one with target and text. I think I'll load the training split first for initial inspection.

Clarifying the vectorization process

The user wants to vectorize, but I need to focus on the first step only. First, I’ll load the dataset into a DataFrame so we can inspect it before vectorizing. My code will fetch the 20 newsgroups for the training subset without removing headers, footers, or quotes, since I shouldn't make assumptions. I’ll show the shape, columns, and head of the DataFrame. While I want to avoid multiple tables, it seems acceptable to display the shape and columns as text, then show the head as the final output.

Organizing dataset displays

I’m thinking of using a plain dictionary to create a summary and then include the DataFrame's head. In a notebook, dictionaries display as a representation rather than a table—still, that’s okay. I’ll create a summary dictionary with the shape, column list, and class count. For the final output, I’ll display the dictionary summary followed by the DataFrame’s head. This way, I can share both the summary and the table in one go, which seems efficient.

Let’s start by loading the 20 Newsgroups training set into a simple DataFrame and inspecting it before vectorizing.

Load newsgroups dataset

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset="train")

df = pd.DataFrame({
    "text": news.data,
    "target": news.target,
    "target_name": [news.target_names[i] for i in news.target]
})

summary = {
    "shape": df.shape,
    "columns": df.columns.tolist(),
    "n_classes": len(news.target_names)
}
display(summary)

df.head()

{'shape': (11314, 3),
 'columns': ['text', 'target', 'target_name'],
 'n_classes': 20}

	text	target	target_name
0	From: lerxst@wam.umd.edu (where's my thing)\nS...	7	rec.autos
1	From: guykuo@carson.u.washington.edu (Guy Kuo)...	4	comp.sys.mac.hardware
2	From: twillis@ec.ecn.purdue.edu (Thomas E Will...	4	comp.sys.mac.hardware
3	From: jgreen@amber (Joe Green)\nSubject: Re: W...	1	comp.graphics
4	From: jcm@head-cfa.harvard.edu (Jonathan McDow...	14	sci.space

The data is in place for the text column you need: there are 11,314 training documents, 20 classes, and the raw text is in df["text"]. The missing piece for your original goal is the TF-IDF vectorization and the vocabulary size.

Compute TF-IDF features

from sklearn.feature_extraction.text import TfidfVectorizer

# Default settings: no stop-word removal and no document-frequency filtering
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df["text"])

result = {
    "tfidf_matrix_shape": X_tfidf.shape,
    "vocabulary_size": len(tfidf.vocabulary_)
}
result

{'tfidf_matrix_shape': (11314, 130107), 'vocabulary_size': 130107}

The 20 Newsgroups training split was loaded from sklearn and contains 11,314 documents across 20 categories.

A TF-IDF vectorizer was then fit on the raw text in df["text"], producing a sparse document-term matrix with shape (11314, 130107). That means:

11,314 rows = documents
130,107 columns = unique terms in the learned TF-IDF vocabulary

So the vocabulary size is 130,107.
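
The expected outcomes mention a vocabulary size "after filtering", while the run above used the default `TfidfVectorizer()`. The sketch below shows how the reported size relates to the matrix shape and how filtering shrinks it. It uses a tiny made-up corpus, and the settings `stop_words="english"` and `min_df=2` are illustrative assumptions, not values taken from the run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df["text"].
docs = [
    "the rocket launch was delayed by weather",
    "the team won the hockey game",
    "nasa plans a new rocket launch",
    "the hockey team lost the game",
]

# Default vectorizer: every token of two or more characters is kept.
full = TfidfVectorizer()
X_full = full.fit_transform(docs)
full_size = len(full.vocabulary_)

# Filtered vectorizer: drop English stop words and terms seen in fewer than 2 documents.
filtered = TfidfVectorizer(stop_words="english", min_df=2)
X_filtered = filtered.fit_transform(docs)
filtered_size = len(filtered.vocabulary_)

# The vocabulary size always equals the number of matrix columns.
print("unfiltered:", X_full.shape, full_size)
print("filtered:  ", X_filtered.shape, filtered_size)
```

On the real training split the same comparison would show how much of the 130,107-term vocabulary comes from rare tokens and stop words.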

fit an LDA model with 5 topics and print top 10 words per topic

Thinking

Considering LDA implementation

I need to produce a single code block for the user’s request to fit an LDA model with 5 topics and display the top 10 words for each topic. While LDA usually benefits from count data rather than TF-IDF, the notebook state already has TF-IDF set up. I should mention that counts are best practice for LDA, so I’ll work with that. I’ll craft a concise explanation for clarity as well.

Preparing LDA code

I'm working on the code for the LDA model, and I'll be importing CountVectorizer and LDA. The user didn’t specify some parameters, so I shouldn't assume anything. I’m considering using max features or the full vocabulary, though that might be slow with my documents. I want to include settings like stop_words='english' and min_df/max_df settings to improve topic quality. Ultimately, I’ll display topics_df, ensuring it’s in one code block since they asked for the top 10 words per topic.

For topic modeling, LDA usually works better on term counts than on TF-IDF weights, so I’d fit it on a fresh CountVectorizer matrix here.
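
To make the difference between the two inputs concrete, the sketch below vectorizes a made-up corpus both ways. One commonly given reason for preferring counts is that LDA models documents as collections of word occurrences, whereas TF-IDF produces reweighted, length-normalized fractions. The corpus is hypothetical and this is an illustration, not a benchmark of topic quality.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus.
docs = [
    "space nasa launch space orbit",
    "hockey team game team win",
    "nasa orbit mission launch",
]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# Count features are whole-number occurrence counts per term.
print(counts.dtype)
print(counts.toarray()[0])

# TF-IDF features are floats; with the default norm="l2" each row has unit length.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(tfidf.dtype)
print(row_norms)
```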

Fit LDA topics

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd

# Count features for LDA
cv = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
X_count = cv.fit_transform(df["text"])

# Fit 5-topic LDA
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X_count)

# Top 10 words per topic
feature_names = cv.get_feature_names_out()
rows = []

for topic_idx, topic in enumerate(lda.components_):
    top_idx = topic.argsort()[-10:][::-1]
    top_words = [feature_names[i] for i in top_idx]
    rows.append({
        "topic": f"Topic {topic_idx + 1}",
        "top_10_words": ", ".join(top_words)
    })

topics_df = pd.DataFrame(rows)
topics_df

	topic	top_10_words
0	Topic 1	edu, god, jesus, university, team, writes, gam...
1	Topic 2	ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
2	Topic 3	edu, com, posting, host, nntp, university, art...
3	Topic 4	space, file, edu, use, com, nasa, program, inf...
4	Topic 5	edu, com, people, writes, don, article, just, ...

The 5-topic LDA model did run and produced top words for each topic.

From the observed output, the topics are:

Topic 1: edu, god, jesus, university, team, writes, gam...
Topic 2: ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
Topic 3: edu, com, posting, host, nntp, university, art...
Topic 4: space, file, edu, use, com, nasa, program, inf...
Topic 5: edu, com, people, writes, don, article, just, ...

A couple of concrete takeaways from these results:

Topic 4 is clearly the most interpretable, centered on space/NASA/computing-related language.
Topic 1 appears mixed, with religion terms like god and jesus, but also more general forum words.
Topic 2 looks noisy and dominated by token fragments such as ax, g9v, and b8f, which suggests the raw text still contains artifacts that the vectorizer treated as words.
Topics 3 and 5 are influenced by general Usenet/posting metadata terms like edu, com, posting, host, nntp, writes, and article, so they are less semantically clean.

So the requested LDA model with 5 topics was fit, and the top words per topic were printed. The output indicates that some topics are meaningful, but several are still dominated by email/forum header vocabulary and text artifacts.

plot topic distribution across the dataset

The next useful step is to get each document’s topic proportions from the fitted LDA model, assign the dominant topic, and plot how many documents fall into each topic.

Plot topic distribution

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Document-topic probabilities
doc_topic = lda.transform(X_count)

# Dominant topic per document
dominant_topic = doc_topic.argmax(axis=1) + 1

topic_counts = (
    pd.Series(dominant_topic, name="topic")
    .value_counts()
    .sort_index()
    .rename_axis("topic")
    .reset_index(name="document_count")
)

plt.figure(figsize=(8, 4))
sns.barplot(data=topic_counts, x="topic", y="document_count", color="steelblue")
plt.title("LDA Topic Distribution Across Documents")
plt.xlabel("Topic")
plt.ylabel("Number of Documents")
plt.tight_layout()

topic_counts

The topic distribution is quite uneven across the 11,314 documents.

Document counts by dominant topic:

Topic 1: 1,736
Topic 2: 24
Topic 3: 4,907
Topic 4: 1,272
Topic 5: 3,375

What this means from the plot:

Topic 3 is the largest cluster, covering about 43% of the dataset.
Topic 5 is the second largest, at about 30%.
Topic 1 and Topic 4 are mid-sized groups.
Topic 2 is almost absent, with only 24 documents, which matches the earlier top-word output showing noisy token fragments rather than a coherent theme.

So the plot shows that most documents are being assigned to Topics 3 and 5, while Topic 2 is a very small outlier topic.
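
The percentages quoted above follow directly from the reported counts. A minimal sketch of that arithmetic, using only the numbers listed in this run (the variable names are illustrative):

```python
# Dominant-topic document counts as reported in the run above.
topic_counts = {1: 1736, 2: 24, 3: 4907, 4: 1272, 5: 3375}

# Total should match the 11,314 training documents.
total = sum(topic_counts.values())

# Share of documents whose dominant topic is each topic.
shares = {topic: count / total for topic, count in topic_counts.items()}

for topic, share in shares.items():
    print(f"Topic {topic}: {share:.1%}")
```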

show which topic dominates for a sample of 5 documents

Thinking

Creating a sample table

I need to avoid using multiple tables, so I’ll create a copy using sample indices from df.sample(5, random_state=42). I’ll make sure to include the first 300 characters of text, and check that the dominant_topic exists from the previous cell. It would be wise to recompute it from doc_topic if it’s available. For my table, I’ll include index, target_name, dominant_topic, dominant_topic_prob, and text_preview, and then sort it by the original index. I can mention that if needed, we can inspect the full topic mixture, but I don't think it’s necessary.

Here’s a compact view of 5 sample documents with their dominant LDA topic and that topic’s probability.

Create topic sample preview

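# Draw 5 documents reproducibly. df has a default RangeIndex, so these labels
# double as row positions in the dominant_topic and doc_topic arrays.
sample_idx = df.sample(5, random_state=42).index

sample_docs = df.loc[sample_idx, ["target_name", "text"]].copy()
sample_docs["dominant_topic"] = dominant_topic[sample_idx]
# Weight of the dominant topic within each document's topic mixture
sample_docs["dominant_topic_prob"] = doc_topic[sample_idx].max(axis=1).round(3)
# Collapse whitespace and keep the first 180 characters as a preview
sample_docs["text_preview"] = sample_docs["text"].str.replace(r"\s+", " ", regex=True).str.slice(0, 180)

sample_docs = sample_docs.drop(columns="text").reset_index(names="doc_id")
sample_docs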

	doc_id	target_name	dominant_topic	dominant_topic_prob	text_preview
0	7492	comp.sys.mac.hardware	3	0.852	From: rrn@po.CWRU.Edu (Robert R. Novitskey) Su...
1	3546	comp.os.ms-windows.misc	3	0.934	From: ardie@ux1.cso.uiuc.edu (Ardie Mack) Subj...
2	5582	misc.forsale	3	0.965	From: tsa@cellar.org (The Silent Assassin) Sub...
3	4793	talk.politics.guns	5	0.795	From: guy@idacom.hp.com (Guy M. Trotter) Subje...
4	3813	rec.sport.hockey	1	0.723	From: jwodzia@fadel.uucp (john wodziak) Subjec...

Thinking

Crafting concise insights

I need to follow path A since the latest output aligns with the goal of showing which topic dominates within a sample of five documents. I should provide a clear final answer based on the observed outputs, including a summary instead of a verbatim table. I might mention the dominant topic probability for each document and connect it to the target name, summarizing the observed outputs and providing previews of the document IDs. Let’s make it concise!

For the 5 sampled documents, the dominant topics are:

Doc 7492 (comp.sys.mac.hardware) → Topic 3 with probability 0.852
Doc 3546 (comp.os.ms-windows.misc) → Topic 3 with probability 0.934
Doc 5582 (misc.forsale) → Topic 3 with probability 0.965
Doc 4793 (talk.politics.guns) → Topic 5 with probability 0.795
Doc 3813 (rec.sport.hockey) → Topic 1 with probability 0.723

What this shows:

Topic 3 dominates 3 of the 5 sampled documents, and with fairly strong confidence in each case.
Topic 5 appears for the politics-related sample.
Topic 1 appears for the hockey-related sample.
The highest-confidence assignment in this sample is the misc.forsale document to Topic 3 at 0.965, suggesting that document is very strongly associated with that topic.

So, in this small sample, Topic 3 is the most common dominant topic.
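
The "dominant topic" and its "probability" in the table are simply the position and value of the largest entry in each document's topic mixture. The sketch below shows that step on a made-up 3 x 5 matrix; the numbers are hypothetical and assume each row sums to 1, as the rows returned by `lda.transform` do.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents x 5 topics, rows sum to 1.
doc_topic = np.array([
    [0.05, 0.02, 0.85, 0.03, 0.05],
    [0.10, 0.01, 0.09, 0.05, 0.75],
    [0.70, 0.05, 0.10, 0.10, 0.05],
])

# Column index of the largest weight per row, shifted to 1-based topic labels.
dominant_topic = doc_topic.argmax(axis=1) + 1

# The largest weight itself: the share of the document assigned to that topic.
dominant_prob = doc_topic.max(axis=1)

for doc_id, (topic, prob) in enumerate(zip(dominant_topic, dominant_prob)):
    print(f"Doc {doc_id}: Topic {topic} with weight {prob:.2f}")
```

Read this way, a value such as 0.965 describes how much of one document's mixture falls on a topic; it does not by itself say the topic is semantically coherent.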

Score cards for gpt-5.4 for Topic Modeling with LDA in Python

Total score: 8/10

Task Completion: 2/2
Excellent
All major requested steps were completed: loading 20 Newsgroups, TF-IDF vectorization with vocabulary size, 5-topic LDA with top words, topic distribution plot, and a 5-document dominant-topic table.
Execution Correctness: 2/2
Excellent
The provided code is coherent and likely runnable as written. It uses valid sklearn and pandas operations and produces the shown outputs.
Output Quality: 2/3
Good
Outputs match the requested workflow semantically, including the vocabulary size, topic word lists, bar chart, and sample document-topic assignments. The only limitation is that the LDA topics are somewhat noisy, but the required outputs are present.
Reasoning Quality: 1/2
Needs work
The notebook correctly notes that LDA works better on counts than TF-IDF and explains the topic outputs, but the workflow still mixes TF-IDF for the initial step and count vectors for LDA without clearly reconciling that choice.
Reliability: 1/1
Excellent
The workflow is mostly consistent and robust, though topic quality is affected by raw headers and artifacts, and the analysis relies on a separate count vectorization rather than the TF-IDF matrix mentioned in the prompt.
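
One way the TF-IDF step and the count-based LDA step could be reconciled is to build the vocabulary once with `CountVectorizer` and derive TF-IDF weights from the same counts with `TfidfTransformer`, so both matrices share identical columns. This is a sketch under that assumption on a made-up corpus; it is not what the graded run did.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for df["text"].
docs = [
    "nasa launched a new space mission",
    "the hockey team won the game",
    "space mission data from nasa",
]

# Single vocabulary, built once from raw term counts.
cv = CountVectorizer(stop_words="english")
X_count = cv.fit_transform(docs)

# TF-IDF weights derived from those same counts, so columns line up exactly.
X_tfidf = TfidfTransformer().fit_transform(X_count)

vocab_size = len(cv.vocabulary_)
print("vocabulary size:", vocab_size)
print("count matrix:", X_count.shape, "tf-idf matrix:", X_tfidf.shape)
```

With this layout the reported vocabulary size describes the same features that the LDA model is later fit on.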
