NLP
Topic Modeling with LDA in Python
Apply LDA topic modeling to a news article dataset, extract coherent topics, and visualize topic-word distributions.
What
This AI Data Analyst workflow loads the 20 Newsgroups training split from scikit-learn and converts the text into a TF-IDF document-term matrix while reporting the vocabulary size. It fits a 5-topic Latent Dirichlet Allocation (LDA) model and prints the top words for each topic to support interpretation. It visualizes topic prevalence across the corpus and identifies the dominant topic for a small sample of documents.
Who
This is for data analysts and NLP practitioners who want a reproducible, code-generating example of LDA topic modeling on a standard benchmark dataset. It helps users validate preprocessing choices, inspect topic-word distributions, and connect topics back to representative documents.
Tools
- scikit-learn
- pandas
- numpy
- matplotlib
- seaborn
Outcomes
- TF-IDF vocabulary size report after filtering
- Top 10 words per topic for 5 LDA topics
- Bar chart of topic prevalence across the dataset
- Table of 5 sample documents with dominant topic labels
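The last two outcomes (topic prevalence and dominant-topic labels) come from the document-topic matrix returned by `transform`. A hedged sketch, again on a toy corpus rather than the actual 20 Newsgroups data; column names in the sample table are illustrative, not the workflow's exact output:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the spacecraft launch was delayed by engine trouble",
    "the rocket engine passed its final launch test",
    "the team won the hockey game in overtime",
    "the goalie saved every shot in the playoff game",
    "new graphics card drivers improve game performance",
    "the gpu renders graphics faster with updated drivers",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=42).fit(X)

# Document-topic distribution: each row sums to 1.
doc_topic = lda.transform(X)

# Dominant topic for a 5-document sample, with its probability.
sample = pd.DataFrame({
    "preview": [d[:40] for d in docs[:5]],
    "dominant_topic": doc_topic[:5].argmax(axis=1),
    "probability": doc_topic[:5].max(axis=1).round(3),
})
print(sample)

# Topic prevalence: how often each topic is dominant across the corpus;
# prevalence.plot.bar() would render the bar chart via matplotlib.
prevalence = pd.Series(doc_topic.argmax(axis=1)).value_counts().sort_index()
print(prevalence)
```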
Quality Score
8/10
Last scored: Apr 7, 2026
Task Completion: 1/2
Needs work: All prompted steps were executed (load 20NG, vectorize, fit 5-topic LDA, plot topic prevalence, show dominant topic for 5 docs). However, the expected outcome of a vocabulary size of ~50k after filtering was not met (reported ~130k), and topics are not clearly coherent/interpretable as requested.
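The vocabulary-size gap (~130k vs. the expected ~50k) points to under-filtering. One way to tighten it is shown below; the specific thresholds are illustrative assumptions, not the workflow's original settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tighter filtering than the defaults. Passing
# remove=('headers', 'footers', 'quotes') to fetch_20newsgroups would
# additionally strip the header/noise tokens that dominate several topics.
vectorizer = TfidfVectorizer(
    max_df=0.5,           # drop terms appearing in more than half the documents
    min_df=5,             # drop terms seen in fewer than 5 documents
    stop_words="english",
    max_features=50_000,  # hard cap matching the expected ~50k vocabulary
)
```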
Execution Correctness: 2/2
Excellent: Code uses standard sklearn APIs correctly (fetch_20newsgroups, TfidfVectorizer/CountVectorizer, LatentDirichletAllocation, transform, plotting) and outputs are consistent with the code shown.
Output Quality: 2/3
Good: Outputs include the vocabulary size, top-10 word lists for 5 topics, a topic-count table and bar chart, and a 5-row table with document preview plus dominant topic/probability. But the vocabulary size deviates substantially from the expected ~50k, and several topics are dominated by header/noise tokens, reducing the semantic match to 'coherent topics'.
Reasoning Quality: 2/2
Excellent: Reasoning appropriately notes that LDA works better with raw counts than with TF-IDF, and correctly interprets the noisy topics and skewed topic prevalence based on the printed top words and counts.
Reliability: 1/1
Excellent: The workflow is reasonably robust and reproducible (fixed random_state/seed, common libraries); no hallucinated functions or unsafe steps, though preprocessing choices are minimal and lead to noisy topics.