Topic Modeling with LDA in Python

Apply LDA topic modeling to a news article dataset, extract coherent topics, and visualize topic-word distributions.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the 20 Newsgroups training split from scikit-learn and converts the text into a TF-IDF document-term matrix while reporting the vocabulary size. It fits a 5-topic Latent Dirichlet Allocation (LDA) model and prints the top words for each topic to support interpretation. It visualizes topic prevalence across the corpus and identifies the dominant topic for a small sample of documents.

Who this example is for

This is for data analysts and NLP practitioners who want a reproducible, code-generating example of LDA topic modeling on a standard benchmark dataset. It helps users validate preprocessing choices, inspect topic-word distributions, and connect topics back to representative documents.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • TF-IDF vocabulary size report after filtering
  • Top 10 words per topic for 5 LDA topics
  • Bar chart of topic prevalence across the dataset
  • Table of 5 sample documents with dominant topic labels

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
  2. fit an LDA model with 5 topics and print top 10 words per topic
  3. plot topic distribution across the dataset
  4. show which topic dominates for a sample of 5 documents

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to review the full conversation and notebook results.

Model            | Score | Task | Execution | Output | Reasoning | Reliability | Run
gpt-oss:120b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open gpt-oss:120b conversation
qwen3.5:397b     | 10/10 | 2/2  | 2/2       | 3/3    | 2/2       | 1/1         | Open qwen3.5:397b conversation
gemma4:31b       | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open gemma4:31b conversation
gpt-5.4          | 9/10  | 2/2  | 2/2       | 3/3    | 1/2       | 1/1         | Open gpt-5.4 conversation
qwen3-coder-next | 7/10  | 2/2  | 2/2       | 2/3    | 1/2       | 0/1         | Open qwen3-coder-next conversation
glm-5.1          | 6/10  | 2/2  | 1/2       | 2/3    | 1/2       | 0/1         | Open glm-5.1 conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, so total score = 0-10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
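As a concrete illustration, the total is simply the sum of the five dimension scores; the values below are taken from the gpt-5.4 row of the comparison table:

```python
# Per-dimension scores for one run (gpt-5.4 row from the table above)
scores = {
    "task_completion": 2,        # out of 2
    "execution_correctness": 2,  # out of 2
    "output_quality": 3,         # out of 3
    "reasoning_quality": 1,      # out of 2
    "reliability": 1,            # out of 1
}

total = sum(scores.values())
print(f"Total score: {total}/10")  # Total score: 9/10
```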

AI Analysis with gpt-5.4

This run uses AI with a step-by-step list of prompts to perform the analysis task Topic Modeling with LDA in Python. Below, you can review the full conversation generated with gpt-5.4. After the conversation, we show detailed score cards for this run.

load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
Thinking

Planning dataset handling

I need to take this one step at a time since the user is asking me to load a dataset, vectorize it with TF-IDF, and show the vocabulary size. The first step should be to load and inspect the dataset. I'll show the shape, columns, and head of the data. The dataset isn't necessarily a DataFrame, so I should adapt by creating one with target and text columns. I think I'll load the training split first for initial inspection.

Clarifying the vectorization process

The user wants to vectorize, but I need to focus on the first step only. First, I'll load the dataset into a DataFrame so we can inspect it before vectorizing. My code will fetch the 20 newsgroups training subset without removing headers, footers, or quotes, since I shouldn't make assumptions. I'll show the shape, columns, and head of the DataFrame. While I want to avoid multiple tables, it seems acceptable to display the shape and columns as text, then show the head as the final output.

Organizing dataset displays

I'm thinking of using a plain dictionary to create a summary and then include the DataFrame's head. In a notebook, dictionaries display as a representation rather than a table, but that's okay. I'll create a summary dictionary with the shape, column list, and class count. For the final output, I'll display the dictionary summary followed by the DataFrame's head. This way, I can share both the summary and the table in one go, which seems efficient.

Let’s start by loading the 20 Newsgroups training set into a simple DataFrame and inspecting it before vectorizing.

Load newsgroups dataset
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset="train")

df = pd.DataFrame({
    "text": news.data,
    "target": news.target,
    "target_name": [news.target_names[i] for i in news.target]
})

summary = {
    "shape": df.shape,
    "columns": df.columns.tolist(),
    "n_classes": len(news.target_names)
}
display(summary)

df.head()
{'shape': (11314, 3),
 'columns': ['text', 'target', 'target_name'],
 'n_classes': 20}
                                                text  target            target_name
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7              rec.autos
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4  comp.sys.mac.hardware
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4  comp.sys.mac.hardware
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1          comp.graphics
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14              sci.space

The data is in place: there are 11,314 training documents across 20 classes, and the raw text lives in df["text"]. The missing piece for the original goal is the TF-IDF vectorization and the vocabulary size.

Compute TF IDF features
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(df["text"])

result = {
    "tfidf_matrix_shape": X_tfidf.shape,
    "vocabulary_size": len(tfidf.vocabulary_)
}
result
{'tfidf_matrix_shape': (11314, 130107), 'vocabulary_size': 130107}

The 20 Newsgroups training split was loaded from sklearn and contains 11,314 documents across 20 categories.

A TF-IDF vectorizer was then fit on the raw text in df["text"], producing a sparse document-term matrix with shape (11314, 130107). That means:

  • 11,314 rows = documents
  • 130,107 columns = unique terms in the learned TF-IDF vocabulary

So the vocabulary size is 130,107.

fit an LDA model with 5 topics and print top 10 words per topic
Thinking

Considering LDA implementation

I need to produce a single code block for the user's request to fit an LDA model with 5 topics and display the top 10 words for each topic. While LDA usually benefits from count data rather than TF-IDF, the notebook state already has TF-IDF set up. I should mention that counts are best practice for LDA, so I'll work with that. I'll craft a concise explanation for clarity as well.

Preparing LDA code

I'm working on the code for the LDA model, and I'll be importing CountVectorizer and LDA. The user didn't specify some parameters, so I shouldn't assume anything. I'm considering using max features or the full vocabulary, though the latter might be slow on this many documents. I want to include settings like stop_words='english' and min_df/max_df to improve topic quality. Ultimately, I'll display topics_df in a single code block, since they asked for the top 10 words per topic.

For topic modeling, LDA usually works better on term counts than on TF-IDF weights, so I’d fit it on a fresh CountVectorizer matrix here.

Fit LDA topics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd

# Count features for LDA
cv = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
X_count = cv.fit_transform(df["text"])

# Fit 5-topic LDA
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X_count)

# Top 10 words per topic
feature_names = cv.get_feature_names_out()
rows = []

for topic_idx, topic in enumerate(lda.components_):
    top_idx = topic.argsort()[-10:][::-1]
    top_words = [feature_names[i] for i in top_idx]
    rows.append({
        "topic": f"Topic {topic_idx + 1}",
        "top_10_words": ", ".join(top_words)
    })

topics_df = pd.DataFrame(rows)
topics_df
     topic                                        top_10_words
0  Topic 1  edu, god, jesus, university, team, writes, gam...
1  Topic 2      ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
2  Topic 3  edu, com, posting, host, nntp, university, art...
3  Topic 4  space, file, edu, use, com, nasa, program, inf...
4  Topic 5  edu, com, people, writes, don, article, just, ...

The 5-topic LDA model ran successfully and produced top words for each topic.

From the observed output, the topics are:

  • Topic 1: edu, god, jesus, university, team, writes, gam...
  • Topic 2: ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, 34u
  • Topic 3: edu, com, posting, host, nntp, university, art...
  • Topic 4: space, file, edu, use, com, nasa, program, inf...
  • Topic 5: edu, com, people, writes, don, article, just, ...

A couple of concrete takeaways from these results:

  • Topic 4 is clearly the most interpretable, centered on space/NASA/computing-related language.
  • Topic 1 appears mixed, with religion terms like god and jesus, but also more general forum words.
  • Topic 2 looks noisy and dominated by token fragments such as ax, g9v, and b8f, which suggests the raw text still contains artifacts that the vectorizer treated as words.
  • Topics 3 and 5 are influenced by general Usenet/posting metadata terms like edu, com, posting, host, nntp, writes, and article, so they are less semantically clean.

So the requested LDA model with 5 topics was fit, and the top words per topic were printed. The output indicates that some topics are meaningful, but several are still dominated by email/forum header vocabulary and text artifacts.
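One way to reduce that header noise is to strip the Usenet header block before vectorizing. The `strip_header` helper below is an illustrative sketch (it simply drops everything before the first blank line); scikit-learn's `fetch_20newsgroups` also accepts `remove=("headers", "footers", "quotes")` to do this at load time.

```python
def strip_header(post: str) -> str:
    # Usenet headers are separated from the body by the first blank line,
    # so keep only what follows it (fall back to the full post if none).
    _, sep, body = post.partition("\n\n")
    return body if sep else post

# Toy post shaped like a 20 Newsgroups document
raw = (
    "From: lerxst@wam.umd.edu (where's my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Nntp-Posting-Host: rac3.wam.umd.edu\n"
    "\n"
    "I was wondering if anyone could enlighten me on this car."
)

print(strip_header(raw))  # I was wondering if anyone could enlighten me on this car.
```

Refitting LDA on cleaned text like this would likely shrink or eliminate the metadata-dominated topics.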

plot topic distribution across the dataset

The next useful step is to get each document’s topic proportions from the fitted LDA model, assign the dominant topic, and plot how many documents fall into each topic.

Plot topic distribution
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Document-topic probabilities
doc_topic = lda.transform(X_count)

# Dominant topic per document
dominant_topic = doc_topic.argmax(axis=1) + 1

topic_counts = (
    pd.Series(dominant_topic, name="topic")
    .value_counts()
    .sort_index()
    .rename_axis("topic")
    .reset_index(name="document_count")
)

plt.figure(figsize=(8, 4))
sns.barplot(data=topic_counts, x="topic", y="document_count", color="steelblue")
plt.title("LDA Topic Distribution Across Documents")
plt.xlabel("Topic")
plt.ylabel("Number of Documents")
plt.tight_layout()

topic_counts
   topic  document_count
0      1            1736
1      2              24
2      3            4907
3      4            1272
4      5            3375

The topic distribution is quite uneven across the 11,314 documents.

Document counts by dominant topic:

  • Topic 1: 1,736
  • Topic 2: 24
  • Topic 3: 4,907
  • Topic 4: 1,272
  • Topic 5: 3,375

What this means from the plot:

  • Topic 3 is the largest cluster, covering about 43% of the dataset.
  • Topic 5 is the second largest, at about 30%.
  • Topic 1 and Topic 4 are mid-sized groups.
  • Topic 2 is almost absent, with only 24 documents, which matches the earlier top-word output showing noisy token fragments rather than a coherent theme.

So the plot shows that most documents are being assigned to Topics 3 and 5, while Topic 2 is a very small outlier topic.
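The percentages above follow directly from the dominant-topic counts; here is a quick check, with the counts hard-coded from the table rather than recomputed from the model:

```python
# Dominant-topic document counts from the distribution table above
topic_counts = {1: 1736, 2: 24, 3: 4907, 4: 1272, 5: 3375}

total_docs = sum(topic_counts.values())
shares = {t: round(100 * n / total_docs, 1) for t, n in topic_counts.items()}

print(total_docs)  # 11314
print(shares)      # {1: 15.3, 2: 0.2, 3: 43.4, 4: 11.2, 5: 29.8}
```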

show which topic dominates for a sample of 5 documents
Thinking

Creating a sample table

I need to avoid using multiple tables, so I’ll create a copy using sample indices from df.sample(5, random_state=42). I’ll make sure to include the first 300 characters of text, and check that the dominant_topic exists from the previous cell. It would be wise to recompute it from doc_topic if it’s available. For my table, I’ll include index, target_name, dominant_topic, dominant_topic_prob, and text_preview, and then sort it by the original index. I can mention that if needed, we can inspect the full topic mixture, but I don't think it’s necessary.

Here’s a compact view of 5 sample documents with their dominant LDA topic and that topic’s probability.

Create topic sample preview
sample_idx = df.sample(5, random_state=42).index

sample_docs = df.loc[sample_idx, ["target_name", "text"]].copy()
sample_docs["dominant_topic"] = dominant_topic[sample_idx]
sample_docs["dominant_topic_prob"] = doc_topic[sample_idx].max(axis=1).round(3)
sample_docs["text_preview"] = sample_docs["text"].str.replace(r"\s+", " ", regex=True).str.slice(0, 180)

sample_docs = sample_docs.drop(columns="text").reset_index(names="doc_id")
sample_docs
   doc_id              target_name  dominant_topic  dominant_topic_prob                                       text_preview
0    7492    comp.sys.mac.hardware               3                0.852  From: rrn@po.CWRU.Edu (Robert R. Novitskey) Su...
1    3546  comp.os.ms-windows.misc               3                0.934  From: ardie@ux1.cso.uiuc.edu (Ardie Mack) Subj...
2    5582             misc.forsale               3                0.965  From: tsa@cellar.org (The Silent Assassin) Sub...
3    4793       talk.politics.guns               5                0.795  From: guy@idacom.hp.com (Guy M. Trotter) Subje...
4    3813         rec.sport.hockey               1                0.723  From: jwodzia@fadel.uucp (john wodziak) Subjec...
Thinking

Crafting concise insights

I need to follow path A since the latest output aligns with the goal of showing which topic dominates within a sample of five documents. I should provide a clear final answer based on the observed outputs, including a summary instead of a verbatim table. I might mention the dominant topic probability for each document and connect it to the target name, summarizing the observed outputs and providing previews of the document IDs. Let’s make it concise!

For the 5 sampled documents, the dominant topics are:

  • Doc 7492 (comp.sys.mac.hardware) → Topic 3 with probability 0.852
  • Doc 3546 (comp.os.ms-windows.misc) → Topic 3 with probability 0.934
  • Doc 5582 (misc.forsale) → Topic 3 with probability 0.965
  • Doc 4793 (talk.politics.guns) → Topic 5 with probability 0.795
  • Doc 3813 (rec.sport.hockey) → Topic 1 with probability 0.723

What this shows:

  • Topic 3 dominates 3 of the 5 sampled documents, and with fairly strong confidence in each case.
  • Topic 5 appears for the politics-related sample.
  • Topic 1 appears for the hockey-related sample.
  • The highest-confidence assignment in this sample is the misc.forsale document to Topic 3 at 0.965, suggesting that document is very strongly associated with that topic.

So, in this small sample, Topic 3 is the most common dominant topic.

Score cards for gpt-5.4 for Topic Modeling with LDA in Python

Total score: 9/10

  • Task Completion: 2/2

    Excellent

    All major requested steps were completed: data loading, TF-IDF vocabulary size, 5-topic LDA with top words, topic distribution plot, and a 5-document dominant-topic table.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable as shown. It uses valid sklearn, pandas, and plotting calls without obvious syntax or logic errors.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: vocabulary size is shown, five topic word lists are printed, a bar chart is produced, and five sample documents are labeled with dominant topics.

  • Reasoning Quality: 1/2

    Needs work

    The workflow includes some correct reasoning, such as noting that LDA works better on counts than TF-IDF, but it also mixes in unnecessary commentary and does not fully align with the prompt's TF-IDF instruction.

  • Reliability: 1/1

    Excellent

    The workflow is reasonably consistent and produces interpretable results, though it deviates from the prompt by fitting LDA on CountVectorizer rather than the TF-IDF representation.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.