Overview
While text classification requires labeled data, text clustering and topic modeling discover patterns and themes in unlabeled document collections. This chapter explores how to group similar documents together and extract meaningful topics using modern embedding-based approaches. You’ll learn how to build a complete clustering pipeline using embeddings, dimensionality reduction, and clustering algorithms, then extend it to topic modeling with BERTopic - a modular framework that combines the best of classical and modern NLP techniques.
What You’ll Learn
- Document Embeddings: Convert text into high-quality vector representations using sentence transformers
- Dimensionality Reduction: Use UMAP to reduce embedding dimensions while preserving semantic structure
Use Cases
Text clustering and topic modeling power numerous applications:
- Research Analysis: Discover themes in academic papers, patents, or scientific literature
- Customer Feedback: Identify common themes in product reviews or support tickets
- Content Organization: Automatically categorize news articles, blog posts, or documents
- Social Media Monitoring: Track trending topics in tweets, forums, or discussions
- Document Discovery: Enable exploratory search in large text collections
- Knowledge Management: Organize and navigate corporate knowledge bases
Dataset: ArXiv NLP Papers
We’ll work with abstracts from ArXiv papers in the Computation and Language (cs.CL) category - real research papers from the NLP community.
A Common Pipeline for Text Clustering
Text clustering typically follows a three-step pipeline:
Step 1: Embedding Documents
First, we convert each abstract into a numerical vector that captures its semantic meaning. We use thenlper/gte-small, a compact but powerful embedding model that balances quality and speed, making it suitable for large document collections.
Step 2: Reducing the Dimensionality of Embeddings
High-dimensional embeddings (here, 384 dimensions) are difficult for many clustering algorithms to handle, since distance measures become less informative as dimensionality grows. We use UMAP (Uniform Manifold Approximation and Projection) to reduce the number of dimensions while preserving semantic relationships.
Step 3: Cluster the Reduced Embeddings
Now we apply HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to discover clusters.
Why HDBSCAN?
- Doesn’t require specifying the number of clusters upfront
- Can identify clusters of varying densities
- Automatically identifies outliers (labeled as -1)
- Works well with the output of UMAP
Inspecting the Clusters
Let’s manually examine the first three documents in cluster 0.
Visualizing Clusters
To visualize our clusters, we reduce the embeddings to 2 dimensions. Outliers (cluster -1) represent papers that don’t fit well into any cluster. This is normal and often indicates either very unique papers or papers that bridge multiple topics.
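As a sketch, assuming texts, embeddings, and labels from the pipeline above (matplotlib and umap-learn assumed for the plot):

```python
import numpy as np


def cluster_members(texts, labels, cluster_id, n=3):
    """Return the first n documents assigned to a cluster."""
    idx = np.flatnonzero(np.asarray(labels) == cluster_id)
    return [texts[i] for i in idx[:n]]


def plot_clusters(embeddings, labels):
    """Project embeddings to 2-D with UMAP and colour points by cluster."""
    import matplotlib.pyplot as plt
    from umap import UMAP

    xy = UMAP(n_components=2, min_dist=0.0, metric="cosine",
              random_state=42).fit_transform(embeddings)
    labels = np.asarray(labels)
    outliers = labels == -1
    plt.scatter(xy[outliers, 0], xy[outliers, 1], c="grey", s=2, alpha=0.3)
    plt.scatter(xy[~outliers, 0], xy[~outliers, 1], c=labels[~outliers],
                s=2, cmap="tab20")
    plt.axis("off")
    plt.show()


if __name__ == "__main__":
    print(cluster_members(["a", "b", "c", "d"], [0, 1, 0, 0], cluster_id=0))
```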
From Text Clustering to Topic Modeling
While clustering groups similar documents, topic modeling goes further by:
- Identifying what makes each cluster unique
- Extracting representative keywords
- Providing interpretable topic descriptions
BERTopic: A Modular Topic Modeling Framework
BERTopic combines our clustering pipeline with topic representation techniques to create interpretable topics.
Exploring Topics
View all topics with their counts and representations:
| Topic | Count | Name | Representation |
|---|---|---|---|
| -1 | 14520 | -1_the_of_and_to | [the, of, and, to, in, we, that, language…] |
| 0 | 2290 | 0_speech_asr_recognition_end | [speech, asr, recognition, end, acoustic…] |
| 1 | 1403 | 1_medical_clinical_biomedical_patient | [medical, clinical, biomedical, patient…] |
| 2 | 1156 | 2_sentiment_aspect_analysis_reviews | [sentiment, aspect, analysis, reviews…] |
| 3 | 986 | 3_translation_nmt_machine_neural | [translation, nmt, machine, neural…] |
Examining Topic Keywords
Get the top 10 keywords for a specific topic with their c-TF-IDF weights.
c-TF-IDF (class-based TF-IDF) is like TF-IDF but treats each cluster as a single document. It identifies words that are frequent within a topic but rare across other topics, making topic representations more distinctive.
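To make this concrete, here is a minimal NumPy sketch of the idea (a simplification of BERTopic’s actual implementation, which also handles sparse matrices and optional BM25-style weighting):

```python
import numpy as np


def c_tf_idf(tf):
    """c-TF-IDF for a (n_classes x n_words) count matrix.

    tf[c, w] counts word w across all documents of cluster c,
    i.e. each cluster is treated as one long document.
    """
    tf = np.asarray(tf, dtype=float)
    tf_norm = tf / tf.sum(axis=1, keepdims=True)  # term frequency within each class
    avg_words = tf.sum() / tf.shape[0]            # average word count per class
    freq = tf.sum(axis=0)                         # word frequency across all classes
    idf = np.log(1 + avg_words / freq)            # rare-across-classes words score high
    return tf_norm * idf
```

Words with a high weight in a row are frequent inside that topic but rare elsewhere, which is exactly what the Representation column above shows.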
Searching for Topics
Find topics related to a search term.
Visualizations
BERTopic provides rich interactive visualizations to explore topics.
Visualize Documents
Create an interactive plot showing all documents and their topics:
- Hover over points to see document titles
- Zoom into specific regions
- Filter by topic
- Explore the relationship between different topics
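As a sketch, these views map onto one-line calls on a fitted model (topic_model and the paper titles are assumed from earlier; each call returns an interactive Plotly figure):

```python
def build_visualizations(topic_model, titles):
    """Collect BERTopic's interactive Plotly figures in one dict."""
    return {
        # 2-D document map, coloured by topic; hovering shows each title
        "documents": topic_model.visualize_documents(titles),
        # Top keywords per topic as bar charts
        "barchart": topic_model.visualize_barchart(),
        # Topic-to-topic similarity heatmap
        "heatmap": topic_model.visualize_heatmap(),
        # Hierarchical clustering of the topics themselves
        "hierarchy": topic_model.visualize_hierarchy(),
    }
```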
Additional Visualizations
Representation Models
BERTopic’s modularity shines when using representation models to improve topic descriptions beyond c-TF-IDF.
Available Representation Models
- KeyBERT-Inspired: Uses embeddings to select the most representative keywords for each topic.
- MaximalMarginalRelevance: Selects diverse keywords that are relevant but not redundant.
- OpenAI / Cohere: Uses large language models to generate descriptive topic labels.
- LangChain: Integrates with LangChain for custom LLM-based representations.
Updating Topics After Training
You can update topic representations after training, allowing quick iteration without re-clustering.
Practical Applications
Research Paper Organization
Scenario: A research lab has thousands of papers and wants to organize them by theme.
Solution:
- Extract paper abstracts
- Generate embeddings and cluster using BERTopic
- Use OpenAI representation model for human-readable topic names
- Create interactive visualizations for exploration
- Build a search interface using topic assignments
Customer Feedback Analysis
Scenario: An e-commerce company receives thousands of product reviews daily.
Solution:
- Collect review text
- Apply BERTopic to discover common themes
- Track topic prevalence over time
- Alert teams when new issues emerge (new topics)
- Generate automated reports on customer concerns
Content Recommendation
Scenario: A news website wants to recommend related articles.
Solution:
- Model topics across all articles
- Assign new articles to topics in real-time
- Recommend articles from the same or related topics
- Use topic hierarchy for broader recommendations
Key Differences: Clustering vs Topic Modeling
| Aspect | Text Clustering | Topic Modeling |
|---|---|---|
| Output | Document groups | Document groups + topic descriptions |
| Interpretability | Requires manual inspection | Automatic keyword extraction |
| Use Case | Organization, deduplication | Analysis, exploration, search |
| Examples | HDBSCAN, K-Means | LDA, NMF, BERTopic |
| Flexibility | Group assignment | Group + representation |
Performance Considerations
Scalability Tips:
- Large datasets (>100K documents):
  - Use approximate nearest neighbors for UMAP
  - Consider batching embeddings
  - Use low_memory=True in UMAP
- Speed optimization:
  - Use smaller embedding models (e.g., all-MiniLM-L6-v2)
  - Pre-compute and save embeddings
  - Reduce UMAP dimensions to 3-5 instead of 5-10
- Quality improvement:
  - Use larger embedding models (e.g., all-mpnet-base-v2)
  - Increase min_cluster_size for more coherent topics
  - Experiment with different UMAP parameters
Key Takeaways
- Embeddings are foundational: High-quality embeddings are crucial for both clustering and topic modeling
- Pipeline matters: The combination of embedding → dimensionality reduction → clustering works well across domains
- Representation models enhance topics: Use LLMs or specialized models to generate better topic descriptions
Next Steps
Now that you understand text clustering and topic modeling, you can:
- Apply these techniques to your own document collections
- Experiment with different embedding models and parameters
- Build search and recommendation systems using topics
- Track topic evolution over time in dynamic corpora
Try the notebook yourself:
