Topic Modeling with LDA in Python

Apply LDA topic modeling to a news article dataset, extract coherent topics, and visualize topic-word distributions.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the 20 Newsgroups training split from scikit-learn and converts the text into a TF-IDF document-term matrix while reporting the vocabulary size. It fits a 5-topic Latent Dirichlet Allocation (LDA) model and prints the top words for each topic to support interpretation. It visualizes topic prevalence across the corpus and identifies the dominant topic for a small sample of documents.

Who this example is for

This is for data analysts and NLP practitioners who want a reproducible, code-generating example of LDA topic modeling on a standard benchmark dataset. It helps users validate preprocessing choices, inspect topic-word distributions, and connect topics back to representative documents.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • TF-IDF vocabulary size report after filtering
  • Top 10 words per topic for 5 LDA topics
  • Bar chart of topic prevalence across the dataset
  • Table of 5 sample documents with dominant topic labels

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
  2. fit an LDA model with 5 topics and print top 10 words per topic
  3. plot topic distribution across the dataset
  4. show which topic dominates for a sample of 5 documents

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

Model            | Score | Task | Execution | Output | Reasoning | Reliability | Run
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
qwen3.5:397b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3.5:397b conversation
gemma4:31b       | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open gemma4:31b conversation
gpt-5.4          | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open gpt-5.4 conversation
qwen3-coder-next | 7/10  | 2/2  | 2/2       | 2/3    | 1/2       | 0/1         | Open qwen3-coder-next conversation
glm-5.1          | 6/10  | 2/2  | 1/2       | 2/3    | 1/2       | 0/1         | Open glm-5.1 conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these five dimensions, so the total ranges from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.

AI Analysis with glm-5.1

This run uses AI with a step-by-step list of prompts to perform the analysis task Topic Modeling with LDA in Python. Below, you can review the full conversation generated with glm-5.1. After the conversation, we show detailed score cards for this run.

load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
Thinking

The user wants to load the 20 Newsgroups dataset from sklearn, vectorize it with TF-IDF, and show the vocabulary size. I'll do this in one step since it's a straightforward task.

I'll load the 20 Newsgroups dataset, apply TF-IDF vectorization, and report the vocabulary size.

Vectorize newsgroups with TF-IDF
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the dataset (train subset)
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Vectorize with TF-IDF
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(newsgroups.data)

# Show key info
X_tfidf.shape
(11314, 101631)
Count vocabulary size
len(vectorizer.vocabulary_)
101631
Thinking

The user's goal was to load 20 newsgroups data, vectorize with TF-IDF, and show vocabulary size. Looking at the notebook outputs:

  • The dataset was loaded: 11,314 documents
  • TF-IDF vectorization was applied
  • Vocabulary size: 101,631

This fully answers the original goal.

The 20 Newsgroups training set has been loaded (11,314 documents) and vectorized with TF-IDF. The resulting vocabulary size is 101,631 unique terms. This is quite large for a text dataset, reflecting the diversity of topics across the 20 newsgroup categories. In practice, you might reduce this by setting max_features, min_df, or max_df in TfidfVectorizer to focus on the most informative terms and reduce noise.
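The vocabulary-reduction options mentioned above (stop_words, min_df, max_df) can be sketched on a small toy corpus so the effect is easy to verify without downloading the full dataset. The documents below are hypothetical stand-ins for the newsgroups posts; the exact filtered size on the real data depends on the thresholds chosen.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical) standing in for the newsgroups documents
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices rose on the market today",
    "the market closed higher as stocks rallied",
]

# Unfiltered baseline: every token becomes a feature
baseline = TfidfVectorizer().fit(docs)

# Filtered: drop English stop words and terms appearing in fewer than 2 documents
filtered = TfidfVectorizer(stop_words='english', min_df=2).fit(docs)

print(len(baseline.vocabulary_), len(filtered.vocabulary_))
```

On the full 20 Newsgroups corpus the same parameters would cut the 101,631-term vocabulary substantially, keeping only terms frequent enough to carry signal across documents.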

fit an LDA model with 5 topics and print top 10 words per topic
Thinking

The user wants to fit an LDA model with 5 topics on the 20 Newsgroups TF-IDF data and print the top 10 words per topic. I'll use LatentDirichletAllocation from sklearn. Note that LDA works with count data, not TF-IDF, but the user asked to use what we have. Actually, let me think — LDA is traditionally designed for count (bag-of-words) features, not TF-IDF. However, the user specifically asked to fit an LDA model, and we have the TF-IDF vectorizer. I could either:

  1. Use the TF-IDF matrix directly (common in practice even if not theoretically ideal)
  2. Create a CountVectorizer for LDA

The user said "vectorize with TF-IDF" in the original goal, so I'll use the existing TF-IDF matrix. Let me fit LDA on X_tfidf and print top 10 words per topic.

Actually, I should use the existing vectorizer's feature names to map topic-word distributions back to words.

I'll fit an LDA model with 5 topics on the TF-IDF matrix and print the top 10 words per topic.

Fit LDA and print top words
from sklearn.decomposition import LatentDirichletAllocation

# Fit LDA with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=42, n_jobs=-1)
lda.fit(X_tfidf)

# Print top 10 words per topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_indices = topic.argsort()[-10:][::-1]
    top_words = [feature_names[i] for i in top_indices]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
Topic 0: rtrace, jh, smd, whirrr, eridan, er1, inguiry, chuvashia, _perijoves_, speedisk
Topic 1: the, to, of, and, is, that, in, it, you, for
Topic 2: xman, lssu, maine, astemizole, _incredibly_, mywinobj, f550i, narrative, anaheim, melittin
Topic 3: geb, shameful, cadre, chastity, n3jxp, dsl, intellect, skepticism, pitt, surrender
Topic 4: ax, ditto, libxmu, irwin, xmu, keywords, 6551, kendall, curran, woof
Thinking

The user's goal was to fit an LDA model with 5 topics and print the top 10 words per topic. This has been done successfully — the output shows 5 topics with their top 10 words each. The goal is fully satisfied.

However, looking at the output, the topics are quite poor — Topic 1 shows common English words (the, to, of, and, is...) and other topics show unusual/rare terms. This is because TF-IDF weighting with LDA can produce less interpretable topics, and without any preprocessing (removing stopwords, min_df filtering), the results are noisy.

But the user's original goal was just to fit LDA with 5 topics and print top 10 words per topic. That's been accomplished. I should provide a natural final answer noting the results and briefly commenting on their quality.

The LDA model with 5 topics has been fitted and the top 10 words per topic are printed above. A few observations about the results:

  • Topic 1 stands out as the most coherent, containing common English words (the, to, of, and, is, that, in, it, you, for). This suggests it captures general language rather than a specific theme.
  • Topics 0, 2, 3, and 4 are dominated by rare, domain-specific, or garbled terms (e.g., "chuvashia", "mywinobj", "melittin", "perijoves"). These likely reflect overfitting to infrequent tokens that happen to have high TF-IDF scores.

This is a known issue: running LDA directly on a full TF-IDF matrix with 101,631 features and no stopword filtering tends to surface noisy, low-frequency terms. The results would likely improve by filtering stopwords (stop_words='english'), setting min_df and max_df thresholds in the vectorizer, or using raw term counts (CountVectorizer) instead of TF-IDF, since LDA is a probabilistic model that expects count data rather than weighted frequencies.

plot topic distribution across the dataset
Thinking

The user wants to plot the topic distribution across the dataset. I need to transform the data using the fitted LDA model to get topic proportions for each document, then visualize the distribution.

I'll use lda.transform(X_tfidf) to get the topic proportions for each document, then create a visualization showing the distribution.

A good approach: compute the mean topic proportion across all documents, or show the distribution as a bar chart of average proportions, or show a histogram/violin plot of topic proportions.

Let me create a clear visualization - I'll compute the average topic proportion per topic across all documents and show it as a bar chart, plus maybe show the distribution with a box plot or violin plot.

I think a clean approach would be:

  1. Transform documents to get topic proportions
  2. Show a bar chart of mean topic proportions
  3. Or show a stacked bar / violin plot

Let me keep it simple and informative - I'll do a bar chart of mean proportions with error bars, or a violin/box plot showing the distribution of each topic's proportion across documents.

I'll use seaborn for this.

I'll transform the documents into topic proportions and visualize how the 5 topics are distributed across the dataset.

Plot topic proportions
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Get topic proportions for each document
doc_topics = lda.transform(X_tfidf)

# Build a DataFrame for plotting
topic_labels = [f"Topic {i}" for i in range(5)]
df_topics = pd.DataFrame(doc_topics, columns=topic_labels)

# Plot: box plot of topic proportions across all documents
df_melted = df_topics.melt(var_name="Topic", value_name="Proportion")

sns.set_style("whitegrid")
plt.figure(figsize=(9, 6))
sns.boxplot(data=df_melted, x="Topic", y="Proportion", hue="Topic", palette="Set2", legend=False)
plt.title("Topic Proportion Distribution Across 20 Newsgroups Documents")
plt.tight_layout()
plt.show()
Thinking

The user asked to "plot topic distribution across the dataset" and I can see the box plot has been generated (output-4 shows <Figure size 900x600 with 1 Axes>). The chart was successfully created.

However, the sufficiency heuristic says: "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."

The box plot was generated but we don't have the actual numeric summary statistics displayed. Let me provide the numeric summary of topic distributions to complement the visual output.

Actually, looking more carefully - the goal was to "plot topic distribution across the dataset" and a plot has been produced. The chart itself answers the visual goal. But per the heuristic, since the output is chart-only without underlying numeric values, I should choose Path B to print the values.

Let me provide the numeric summary statistics for each topic's proportion distribution.

The box plot has been generated, but since the output is visual-only, let me print the underlying numeric summary to give concrete insight into the topic distributions.

Summarize topic proportions
# Print summary statistics for topic proportions
df_topics.describe().round(4)
Topic 0 Topic 1 Topic 2 Topic 3 Topic 4
count 11314.0000 11314.0000 11314.0000 11314.0000 11314.0000
mean 0.0369 0.8512 0.0362 0.0392 0.0365
std 0.0493 0.1436 0.0419 0.0523 0.0479
min 0.0066 0.0677 0.0032 0.0032 0.0032
25% 0.0210 0.8581 0.0209 0.0210 0.0209
50% 0.0264 0.8927 0.0264 0.0266 0.0264
75% 0.0339 0.9159 0.0339 0.0344 0.0338
max 0.9068 0.9734 0.6976 0.8386 0.7889
Thinking

The box plot and summary statistics reveal a clear problem: the LDA model with 5 topics is heavily dominated by Topic 1, which has a mean proportion of 0.8512 (85%). The other four topics have mean proportions around 0.036-0.039 (3.6-3.9%). This is a degenerate topic model - Topic 1 is essentially a "catch-all" topic that absorbs almost all document content, likely because the top words for Topic 1 are generic stop words ("the, to, of, and, is, that, in, it, you, for").

The top words for each topic also look problematic - Topics 0, 2, 3, and 4 have very obscure/specific words that don't look like meaningful topics.

This means the topic distribution plot shows that the model isn't producing meaningful topics. I should provide this analysis as a final answer since the user asked to "plot topic distribution across the dataset" and we have both the chart and the numeric summary. The plot has been produced and the statistics are available.

Actually, let me reconsider. The user's goal was to "plot topic distribution across the dataset." We have produced the plot and also shown the numeric summary. The output is sufficient to answer the original goal. I should provide Path A - a natural final answer based on observed outputs.

The key insight is that the topic distribution is extremely skewed, with Topic 1 dominating at ~85% mean proportion. This likely stems from not removing stop words before TF-IDF vectorization, causing one topic to capture common English words.

The topic distribution plot and summary statistics reveal a heavily skewed model. Topic 1 dominates with a mean proportion of 85.12% across all 11,314 documents, while Topics 0, 2, 3, and 4 each average only ~3.6–3.9%. The median document assigns ~89.3% of its content to Topic 1.

This is a strong signal that the LDA model is not producing meaningful, distinguishable topics. The root cause is visible in the top words:

  • Topic 1 top words: the, to, of, and, is, that, in, it, you, for — these are generic English stop words, meaning one topic is simply capturing common language rather than a coherent theme.
  • Topics 0, 2, 3, 4 have obscure, low-frequency words (e.g., rtrace, chuvashia, xman, melittin), suggesting they latched onto rare tokens instead of meaningful themes.

The fix would be to re-vectorize with stop word removal (e.g., TfidfVectorizer(stop_words='english')) and potentially adjust the number of topics, which would produce a more balanced and interpretable distribution.
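A quick numeric check for this kind of collapse is to compare each topic's mean proportion against the uniform baseline 1/k. The sketch below uses a small hypothetical document-topic matrix shaped like the output of lda.transform (rows are documents, columns are topics, rows sum to 1); on the real data, doc_topics would come from the fitted model.

```python
import numpy as np

# Hypothetical document-topic proportions mimicking the degenerate model above
doc_topics = np.array([
    [0.03, 0.86, 0.04, 0.04, 0.03],
    [0.02, 0.90, 0.03, 0.02, 0.03],
    [0.04, 0.84, 0.04, 0.04, 0.04],
])

n_topics = doc_topics.shape[1]
mean_props = doc_topics.mean(axis=0)

# A balanced model keeps every topic reasonably near the uniform 1/k;
# one topic far above it signals a degenerate, catch-all topic
uniform = 1.0 / n_topics
for i, p in enumerate(mean_props):
    flag = "catch-all?" if p > 3 * uniform else "ok"
    print(f"Topic {i}: mean={p:.3f} (uniform={uniform:.2f}) {flag}")
```

After re-fitting with stop-word removal, re-running this check should show the mean proportions drifting back toward 1/k rather than one topic absorbing nearly everything.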

show which topic dominates for a sample of 5 documents
Thinking

The user wants to see which topic dominates for a sample of 5 documents. I can use the df_topics DataFrame that's already in the notebook state and sample 5 documents, showing their topic proportions and which topic is dominant for each.

I'll sample 5 documents and show their topic proportions along with the dominant topic for each.

Sample documents and show topics
# Sample 5 documents and show topic proportions with dominant topic
sample = df_topics.sample(5, random_state=42)
sample["Dominant Topic"] = sample.values.argmax(axis=1)
sample
Topic 0 Topic 1 Topic 2 Topic 3 Topic 4 Dominant Topic
7492 0.032837 0.868530 0.032838 0.032957 0.032838 1
3546 0.031865 0.872540 0.031865 0.031865 0.031865 1
5582 0.033896 0.864415 0.033896 0.033896 0.033896 1
4793 0.024202 0.903192 0.024202 0.024202 0.024202 1
3813 0.026652 0.893389 0.026653 0.026653 0.026653 1
Thinking

The user's goal was to show which topic dominates for a sample of 5 documents. The latest output (cell-6) already shows exactly this - 5 sampled documents with their topic proportions and dominant topic. All 5 documents have Topic 1 as their dominant topic, with proportions ranging from ~86% to ~90%.

This is sufficient to answer the goal. I should choose Path A.

All 5 sampled documents are dominated by Topic 1, with proportions ranging from ~86% to ~90%. The remaining topics each capture only ~2–3% per document, confirming the earlier observation that the current LDA model collapses nearly all documents into a single topic.

Doc Index | Dominant Topic | Topic 1 Prop
7492      | 1              | 86.9%
3546      | 1              | 87.3%
5582      | 1              | 86.4%
4793      | 1              | 90.3%
3813      | 1              | 89.3%

This uniform dominance reinforces that re-fitting the model with stop word removal (e.g., TfidfVectorizer(stop_words='english')) would be necessary to produce meaningful, distinguishable topics.

Score cards for glm-5.1 for Topic Modeling with LDA in Python

Total score: 6/10

  • Task Completion: 2/2

    Excellent

    All major requested steps were attempted: TF-IDF vectorization, 5-topic LDA, topic distribution plot, and sample document topic dominance. The notebook also shows vocabulary size and topic-word lists.

  • Execution Correctness: 1/2

    Needs work

    The code appears runnable and produced outputs, but it uses LDA directly on TF-IDF, which is methodologically weak and led to degenerate topic behavior. The sample document table is also incomplete in the provided output.

  • Output Quality: 2/3

    Good

    The expected outputs are mostly present: vocabulary size, five topic word lists, a topic distribution plot, and a sample of five documents with dominant topic labels. However, the topic quality is poor and the final sample table is only partially shown.

  • Reasoning Quality: 1/2

    Needs work

    The assistant correctly noted that LDA is better suited to count data and that the topics were noisy, but the workflow still proceeded with TF-IDF without addressing the issue. The reasoning is mostly descriptive rather than corrective.

  • Reliability: 0/1

    Needs work

    The workflow is fragile because it applies LDA to TF-IDF and produces highly degenerate topics dominated by stopwords and rare tokens. This reduces confidence in the robustness of the analysis.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.