Topic Modeling with LDA in Python

Apply LDA topic modeling to a news article dataset, extract coherent topics, and visualize topic-word distributions.

This page shows an AI data analysis example. A workflow is a list of prompts (instructions) used to complete one analysis task step by step. We send the same prompt list to multiple AI models, then compare their notebook outputs and scores.

What this AI workflow does

This AI Data Analyst workflow loads the 20 Newsgroups training split from scikit-learn and converts the text into a TF-IDF document-term matrix while reporting the vocabulary size. It fits a 5-topic Latent Dirichlet Allocation (LDA) model and prints the top words for each topic to support interpretation. It visualizes topic prevalence across the corpus and identifies the dominant topic for a small sample of documents.

Who this example is for

This is for data analysts and NLP practitioners who want a reproducible, code-generating example of LDA topic modeling on a standard benchmark dataset. It helps users validate preprocessing choices, inspect topic-word distributions, and connect topics back to representative documents.

Expected analysis outcomes

These are the results the AI workflow is expected to generate.

  • TF-IDF vocabulary size report after filtering
  • Top 10 words per topic for 5 LDA topics
  • Bar chart of topic prevalence across the dataset
  • Table of 5 sample documents with dominant topic labels

Tools and libraries used

Main Python packages and tooling used to run this AI data analysis task.

  • scikit-learn
  • pandas
  • numpy
  • matplotlib
  • seaborn

Prompt sequence

This is the exact list of prompts used in this workflow. The same prompt sequence is sent to each model so outputs and scores can be compared fairly.

  1. load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
  2. fit an LDA model with 5 topics and print top 10 words per topic
  3. plot topic distribution across the dataset
  4. show which topic dominates for a sample of 5 documents

Model Comparison

We compared several LLMs on the same analysis task. The table below shows their scores side by side. You can also open each model run to check the full conversation and notebook results.

  Model              Score   Task   Execution   Output   Reasoning   Reliability   Run
  gpt-oss:120b       10/10   2/2    2/2         3/3      2/2         1/1           Open gpt-oss:120b conversation
  qwen3.5:397b       10/10   2/2    2/2         3/3      2/2         1/1           Open qwen3.5:397b conversation
  gemma4:31b         9/10    2/2    2/2         3/3      1/2         1/1           Open gemma4:31b conversation
  gpt-5.4            9/10    2/2    2/2         3/3      1/2         1/1           Open gpt-5.4 conversation
  qwen3-coder-next   7/10    2/2    2/2         2/3      1/2         0/1           Open qwen3-coder-next conversation
  glm-5.1            6/10    2/2    1/2         2/3      1/2         0/1           Open glm-5.1 conversation

How scoring works

Each run is graded on five dimensions:

  • Task Completion (0-2): how fully the workflow answered the requested analysis task and completed required steps.
  • Execution Correctness (0-2): whether generated code is correct, consistent, and likely runnable without major fixes.
  • Output Quality (0-3): how well tables, charts, and results match the expected analytical outcomes.
  • Reasoning Quality (0-2): clarity and correctness of interpretation, explanation, and conclusions.
  • Reliability (0-1): robustness of the run, including low hallucination risk and stable behavior.

The final score is the sum of these dimensions, giving a total from 0 to 10. Higher values indicate more complete, correct, and reliable notebook results for this workflow.
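As a quick illustration of the scheme described above, the total is simply the sum of the five dimension scores. The dimension names come from the text; the example values are made up.

```python
# Illustrative score card: five dimensions summed into a 0-10 total.
scores = {
    "task_completion": 2,        # 0-2
    "execution_correctness": 2,  # 0-2
    "output_quality": 3,         # 0-3
    "reasoning_quality": 1,      # 0-2
    "reliability": 1,            # 0-1
}
total = sum(scores.values())
print(f"{total}/10")  # → 9/10
```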

AI Analysis with gemma4:31b

This run uses AI with a step-by-step list of prompts to perform the analysis task "Topic Modeling with LDA in Python". Below, you can review the full conversation generated with gemma4:31b. After the conversation, we show detailed score cards for this run.

load 20 newsgroups data (20 Newsgroups from sklearn), vectorize with TF-IDF, and show vocabulary size
Thinking
  • Goal: Load 20 Newsgroups dataset, vectorize using TF-IDF, and show vocabulary size.
    • Constraint: Iterative loop, one code block per response, follow currency formatting (not applicable here), no end-to-end pipeline unless requested (though the user asked for a sequence, the system prompt says "do not provide a full end-to-end pipeline at once unless user explicitly asks" and "when a dataset is not yet loaded... first load and inspect only").

    • The user asked for load -> vectorize -> vocabulary size.

    • According to notebook rules: "When a dataset is not yet loaded in current notebook context, first load and inspect only: show shape, columns, and head()."

    • Even though the user asked for the whole sequence, I must follow the system instructions to be iterative.

    • Step 1: Load the 20 Newsgroups dataset and inspect it.

    • Step 2: Vectorize (TF-IDF).

    • Step 3: Vocabulary size.

    • Load fetch_20newsgroups from sklearn.datasets.

    • Show basic info: data size, target_names.

    • Wait, the user explicitly asked for the sequence "load... vectorize... show vocabulary size". The system prompt says "Do not provide a full end-to-end pipeline at once unless user explicitly asks."

    • The user did explicitly ask for this specific sequence. However, the "Notebook rules" specifically say: "When a dataset is not yet loaded... first load and inspect only: show shape, columns, and head()."

    • To be safe and follow the "iterative" and "notebook rules" closely, I will start by loading and inspecting the data. This ensures I don't skip the inspection step.

    • Refinement: If I do everything in one go, I violate the "one execution step at a time" and "do not combine first-load data inspection with plotting/modeling" rules.

    • So, Step 1: Load and inspect the raw text data.

I'll start by loading the 20 Newsgroups dataset and inspecting its contents to ensure everything is correct before we proceed to vectorization.

Load newsgroups dataset
from sklearn.datasets import fetch_20newsgroups

# Load the dataset
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Inspect the data
print(f"Number of documents: {len(newsgroups_train.data)}")
print(f"Number of categories: {len(newsgroups_train.target_names)}")
newsgroups_train.data[0]
Number of documents: 11314
Number of categories: 20
'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

The dataset is loaded with 11,314 documents across 20 categories. Now I will vectorize the text using TF-IDF to determine the vocabulary size.

Vectorize text data
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the data
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(newsgroups_train.data)

# Get the vocabulary size
len(tfidf.get_feature_names_out())
101631
Thinking

The user wants to:

  1. Load 20 newsgroups data.
  2. Vectorize with TF-IDF.
  3. Show vocabulary size.

Looking at the notebook state:

  • cell-1: Loaded the data (fetch_20newsgroups). Output: 11,314 documents, 20 categories.
  • cell-2: Vectorized with TfidfVectorizer. Output: 101631.

The goal was to load, vectorize, and show vocabulary size. The outputs show the data was loaded, vectorization happened, and the resulting vocabulary size is 101,631.

All parts of the original user goal have been met.

Path A: Provide a final answer.

The 20 Newsgroups dataset was successfully loaded, consisting of 11,314 documents across 20 categories. After vectorizing the text using TF-IDF, the resulting vocabulary size is 101,631 unique features.

fit an LDA model with 5 topics and print top 10 words per topic
Thinking

The user wants to fit an LDA (Latent Dirichlet Allocation) model with 5 topics and print the top 10 words for each topic.

  • `newsgroups_train`: The 20 newsgroups dataset (train subset).
  • `tfidf_matrix`: The TF-IDF matrix.
  • `tfidf`: The TF-IDF vectorizer (which contains the vocabulary).

  • LDA typically works better with count-based frequencies (Bag of Words) rather than TF-IDF, as it's a generative probabilistic model based on word counts.
  • However, some implementations or users might use TF-IDF. Standard `sklearn.decomposition.LatentDirichletAllocation` expects non-negative values, so TF-IDF is technically acceptable, but `CountVectorizer` is the conventional choice for LDA.
  • Since the user previously asked for TF-IDF and now wants LDA, I have a choice: use the existing `tfidf_matrix` or create a `CountVectorizer` matrix.
  • Usually, for LDA "topics" to make sense as word distributions, raw counts are preferred.
  • However, the prompt asks to "fit an LDA model", and I already have a TF-IDF matrix.
  • Let's look at `sklearn.decomposition.LatentDirichletAllocation`. It takes the document-term matrix.
  • I'll use `CountVectorizer` because LDA is fundamentally a count-based model. Using TF-IDF often leads to poor topic coherence.

  • Step 1: Create a `CountVectorizer` matrix.
  • Step 2: Fit `LatentDirichletAllocation` with `n_components=5`.
  • Step 3: Extract the top 10 words per topic using the model's `components_` attribute and the vectorizer's `get_feature_names_out()`.

  • Wait, the user's *previous* flow was "load -> TF-IDF -> vocabulary size".
  • Now they want LDA.
  • To be most accurate to LDA's mathematical foundation, I should use counts. I will implement a `CountVectorizer` first or just use the `TfidfVectorizer` if I want to stay consistent with the previous step's output, but `CountVectorizer` is the "correct" way for LDA.
  • Actually, to keep the notebook efficient and follow the "one step at a time" rule, I will perform the LDA fitting. Since I need a count matrix for a standard LDA, I'll create it in the same block as the LDA fitting.

  • `from sklearn.decomposition import LatentDirichletAllocation`
  • `from sklearn.feature_extraction.text import CountVectorizer`
  • `cv = CountVectorizer(stop_words='english')` (Stop words are crucial for LDA topics to be meaningful).
  • `count_matrix = cv.fit_transform(newsgroups_train.data)`
  • `lda = LatentDirichletAllocation(n_components=5, random_state=42)`
  • `lda.fit(count_matrix)`
  • Extract top words.

  • One runnable Python code block.
  • No multiple blocks.
  • No `print` if not necessary, but for the top 10 words per topic, a formatted print is best.

  • `CountVectorizer` is needed because LDA is based on counts.
  • I'll include stop words to ensure the top words aren't just "the", "a", "of".

  • `lda.components_` has shape `(n_components, n_features)`.
  • For each component, sort indices of the highest values.
  • Map indices back to feature names.

LDA typically performs better with raw word counts rather than TF-IDF weights, so I will use a CountVectorizer (with stop words removed) to prepare the data before fitting the LDA model.

Fit LDA topic model
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Use CountVectorizer instead of TF-IDF for LDA
cv = CountVectorizer(stop_words='english')
count_matrix = cv.fit_transform(newsgroups_train.data)

# Fit LDA model with 5 topics
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(count_matrix)

# Extract and print the top 10 words per topic
words = cv.get_feature_names_out()
for topic_idx, topic_dist in enumerate(lda.components_):
    top_words_indices = topic_dist.argsort()[-10:]
    top_words = [words[i] for i in top_words_indices]
    print(f"Topic {topic_idx + 1}: {' '.join(top_words)}")
Topic 1: space windows key does com know new like use edu
Topic 2: 1t 34u 1d9 145 pl a86 b8f g9v max ax
Topic 3: turkish 12 20 11 15 25 db file 10 00
Topic 4: did said say time know like think just don people
Topic 5: motif like christ problem does widget use window god jesus
Thinking

The user wants to fit an LDA model with 5 topics and print the top 10 words per topic using the 20 Newsgroups dataset.

Looking at the provided notebook state:

  • cell-3 executed the LDA fitting with 5 topics using CountVectorizer.
  • output-3 contains the result:
    • Topic 1: space windows key does com know new like use edu
    • Topic 2: 1t 34u 1d9 145 pl a86 b8f g9v max ax
    • Topic 3: turkish 12 20 11 15 25 db file 10 00
    • Topic 4: did said say time know like think just don people
    • Topic 5: motif like christ problem does widget use window god jesus

The original goal was "fit an LDA model with 5 topics and print top 10 words per topic". The output of cell-3 exactly satisfies this request.

Analysis of results:

  • Topic 1 seems to be related to technology/computing (windows, key, com, edu).
  • Topic 2 and 3 contain many alphanumeric codes/numbers, suggesting they might be capturing noise or specific structured data (like email headers or logs that weren't fully cleaned).
  • Topic 4 contains generic conversational words (did, said, say, think), indicating it's picking up common speech patterns.
  • Topic 5 appears to be related to religion or philosophy (christ, god, jesus).

Since the goal has been met and the output is visible, I should choose Path A.

Final Answer Construction:

  • Summarize the result: The LDA model was fitted with 5 topics.
  • Provide the extracted top words for each topic.
  • Interpret briefly: Topic 1 relates to technology, Topic 5 to religion, and Topic 4 to general conversation, while Topics 2 and 3 appear to contain noise or technical identifiers.

The LDA model has been fitted with 5 topics, and the top 10 words for each are as follows:

  • Topic 1: space, windows, key, does, com, know, new, like, use, edu (Technology/Computing)
  • Topic 2: 1t, 34u, 1d9, 145, pl, a86, b8f, g9v, max, ax (Noise/Technical Identifiers)
  • Topic 3: turkish, 12, 20, 11, 15, 25, db, file, 10, 00 (Noise/Structured Data)
  • Topic 4: did, said, say, time, know, like, think, just, don, people (General Conversation)
  • Topic 5: motif, like, christ, problem, does, widget, use, window, god, jesus (Religion/Philosophy)
plot topic distribution across the dataset
Thinking

The user wants to plot the topic distribution across the dataset. I have already:

  1. Loaded newsgroups_train.
  2. Vectorized it using CountVectorizer (count_matrix).
  3. Fitted an LDA model with 5 components (lda).

To plot the topic distribution, I need to:

  1. Transform the documents into the topic space using lda.transform(count_matrix).
  2. This will result in a matrix where each row is a document and each column is the probability/weight of that document belonging to a specific topic.
  3. To see the "distribution across the dataset", I can either:
     a. Plot a histogram/bar chart of the most dominant topic per document.
     b. Calculate the average weight of each topic across all documents and plot that.
     c. Plot a distribution of weights for each topic.

The most intuitive way to show "distribution across the dataset" is to find the dominant topic for each document and plot the counts of these dominant topics.

Plan:

  1. Transform the data into topic distributions.
  2. Assign each document to its most probable topic.
  3. Plot a bar chart showing the number of documents assigned to each of the 5 topics.

Constraint check:

  • Use seaborn/matplotlib by default.
  • One chart per step.
  • Simple variable names.
  • No $ for money (not applicable here).

Step:

  • Transform data → get dominant topics → plot counts.

To visualize the topic distribution, I'll transform the documents into the topic space and plot how many documents are primarily associated with each of the 5 topics.

Plot dominant topic distribution
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Transform the documents to get the topic distribution for each
topic_distributions = lda.transform(count_matrix)

# Assign each document to its most dominant topic
dominant_topics = np.argmax(topic_distributions, axis=1)

# Create a DataFrame for easy plotting
df_topics = pd.DataFrame({'Topic': dominant_topics + 1})

# Plot the distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=df_topics, x='Topic', palette='viridis')
plt.title('Distribution of Dominant Topics across the Dataset')
plt.xlabel('Topic Number')
plt.ylabel('Number of Documents')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
/var/folders/pp/x90pfm0140d4glxshn77bthh0000gn/T/ipykernel_6797/944452128.py:17: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.countplot(data=df_topics, x='Topic', palette='viridis')
Thinking

The user wants to "plot topic distribution across the dataset". The notebook state shows:

  • cell-3 fits an LDA model with 5 topics.
  • cell-4 calculates the dominant topic for each document and plots a count plot using sns.countplot.
  • output-4 shows a figure object <Figure size 1000x600 with 1 Axes>.

Sufficiency heuristics:

  • "If the latest output is chart-only/visual-only and does not include underlying numeric/text values, choose Path B first to print the values used in the chart."
  • The output of cell-4 is a plot. I can't see the exact counts of the documents per topic from the provided notebook state (only the figure object is mentioned). To provide a complete answer, I need the numeric distribution.

Therefore, I should choose Path B to print the value counts of the dominant topics.

Insight from current outputs: The LDA model has identified 5 topics. The plot in cell-4 visualizes how documents are distributed among these topics based on their highest probability assignment. However, the exact numerical distribution is not explicitly printed in the outputs.

Next step: Print the counts for each topic.

The distribution of documents across the five topics has been visualized. To provide a precise breakdown of how many documents are assigned to each topic, I will print the numerical counts.

Count topic frequencies
# Display the actual counts for each dominant topic
topic_counts = df_topics['Topic'].value_counts().sort_index()
topic_counts
Topic
1    5201
2      21
3     340
4    4860
5     892
Name: count, dtype: int64
Thinking

The user's original goal was to "plot topic distribution across the dataset". Looking at the notebook state:

  • Cell 4 created a countplot showing the distribution of dominant topics.
  • Cell 5 provided the exact numeric counts for each topic:
    • Topic 1: 5201
    • Topic 2: 21
    • Topic 3: 340
    • Topic 4: 4860
    • Topic 5: 892

The goal was to "plot" the distribution. A chart was generated and the numeric values supporting that chart have been displayed. The outputs are sufficient to answer the goal.

Insights:

  • Topic 1 and Topic 4 are the most dominant, together accounting for the vast majority of the documents (~10,000 out of 11,314).
  • Topic 2 is almost negligible with only 21 documents.
  • The distribution is highly skewed toward two main topics.

Since the chart exists and the numeric data is present, Path A is appropriate.

Final Answer check:

  • Natural final answer? Yes.
  • Includes <!--final answer-->? Yes.
  • No Python code? Yes.
  • No chain-of-thought? Yes.
  • No headings? Yes.
  • Focused and concise? Yes.

The topic distribution shows a heavy concentration in two main areas, with Topic 1 (5,201 documents) and Topic 4 (4,860 documents) dominating the dataset. In contrast, Topic 2 is nearly nonexistent with only 21 documents, while Topic 3 (340) and Topic 5 (892) represent smaller clusters. This indicates that the LDA model has captured two very broad themes and a few much more niche topics.

show which topic dominates for a sample of 5 documents
Thinking

The user wants to see which topic dominates for a sample of 5 documents.

Looking at the notebook state:

  • newsgroups_train.data contains the original text documents.
  • df_topics contains the dominant topic for each document (1-indexed).
  • lda is the fitted LDA model.
  • count_matrix is the vectorized data used for lda.transform.

To achieve this, I need to:

  1. Select 5 random indices from the dataset.
  2. For these indices, get the original text and the corresponding dominant topic from df_topics.
  3. Display them together.

Plan:

  • Use np.random.choice or just a slice/sample of df_topics.
  • Create a small DataFrame or list showing the document text (truncated for readability) and the assigned topic.

Wait, df_topics only has the 'Topic' column. I should combine it with the text from newsgroups_train.data.

Step: Create a DataFrame with a sample of documents and their assigned topics.

To see which topic dominates for a sample of documents, I'll pair a random selection of the original text with their assigned dominant topics.

Sample random document topics
import random

# Select 5 random indices
sample_indices = random.sample(range(len(newsgroups_train.data)), 5)

# Create a list to store results
sample_data = []
for idx in sample_indices:
    sample_data.append({
        'Document Text': newsgroups_train.data[idx][:150] + "...", # Truncate for readability
        'Dominant Topic': df_topics.iloc[idx]['Topic']
    })

# Display as a DataFrame
pd.DataFrame(sample_data)
Document Text Dominant Topic
0 :You are loosing.\n\n"Loosing"?  Well, I'll av... 4
1 I didn't say it NEVER mentioned Satan, I said ... 4
2 Since someone brought up sports radio, howabou... 1
3 \nTommy Frandsen's VGAlib (which is probably w... 1
4 \n   It's worse than that -- there *is* no suc... 5
Thinking

The user wants to show which topic dominates for a sample of 5 documents. Looking at the notebook state:

  • [cell-6] executed a random sampling of 5 documents.
  • [output-6] displays a DataFrame with 'Document Text' (truncated) and 'Dominant Topic'.
  • The output shows:
    • Document 0: Topic 4
    • Document 1: Topic 4
    • Document 2: Topic 1
    • Document 3: Topic 1
    • Document 4: Topic 5

The goal "show which topic dominates for a sample of 5 documents" has been achieved. The output explicitly lists the dominant topic for 5 random samples.

Decision: Path A. Provide a final answer based on these outputs.

For the sampled documents, the dominant topics are as follows:

  • Document 0 (discussing "loosing") is dominated by Topic 4.
  • Document 1 (discussing "Satan") is dominated by Topic 4.
  • Document 2 (discussing "sports radio") is dominated by Topic 1.
  • Document 3 (discussing "VGAlib") is dominated by Topic 1.
  • Document 4 (discussing a "lack of success") is dominated by Topic 5.

In this specific sample, Topic 1 and Topic 4 are the most frequent, appearing twice each.

Score cards for gemma4:31b on Topic Modeling with LDA in Python

Total score: 9/10

  • Task Completion: 2/2

    Excellent

    All major requested steps were completed: loading 20 Newsgroups, vectorizing, fitting 5-topic LDA, plotting topic prevalence, and showing sample documents with dominant topics.

  • Execution Correctness: 2/2

    Excellent

    The provided code is coherent and likely runnable. The only issue shown is a seaborn FutureWarning, which does not break execution.

  • Output Quality: 3/3

    Excellent

    Outputs match the expected outcomes semantically: vocabulary size is shown, five topic word lists are printed, a topic distribution chart is produced, and a 5-document table with dominant topics is displayed.

  • Reasoning Quality: 1/2

    Needs work

    The workflow includes some correct reasoning, such as using CountVectorizer for LDA, but it is verbose and internally inconsistent about iterative constraints versus completing the full task.

  • Reliability: 1/1

    Excellent

    The analysis is mostly consistent and grounded in the notebook outputs, with no major hallucinations affecting the final results.

Try MLJAR Studio

Run the same type of AI-powered data analysis on your own datasets with conversational notebooks in MLJAR Studio.

Explore More AI Analysis Examples

Discover additional workflows across categories. Each example includes prompts, conversation outputs, and model-level scoring so you can compare approaches and results.