
Overview

While text classification requires labeled data, text clustering and topic modeling discover patterns and themes in unlabeled document collections. This chapter explores how to group similar documents together and extract meaningful topics using modern embedding-based approaches. You’ll learn how to build a complete clustering pipeline using embeddings, dimensionality reduction, and clustering algorithms, then extend it to topic modeling with BERTopic - a modular framework that combines the best of classical and modern NLP techniques.

What You’ll Learn

  1. Document Embeddings: Convert text into high-quality vector representations using sentence transformers
  2. Dimensionality Reduction: Use UMAP to reduce embedding dimensions while preserving semantic structure
  3. Clustering Algorithms: Apply HDBSCAN to discover document clusters of varying densities
  4. Topic Modeling with BERTopic: Extract interpretable topics using c-TF-IDF and representation models
  5. Visualization & Exploration: Visualize document clusters and topic relationships interactively

Use Cases

Text clustering and topic modeling power numerous applications:
  • Research Analysis: Discover themes in academic papers, patents, or scientific literature
  • Customer Feedback: Identify common themes in product reviews or support tickets
  • Content Organization: Automatically categorize news articles, blog posts, or documents
  • Social Media Monitoring: Track trending topics in tweets, forums, or discussions
  • Document Discovery: Enable exploratory search in large text collections
  • Knowledge Management: Organize and navigate corporate knowledge bases

Dataset: ArXiv NLP Papers

We’ll work with abstracts from ArXiv papers in the Computation and Language (cs.CL) category - real research papers from the NLP community.
# Load data from huggingface
from datasets import load_dataset
dataset = load_dataset("maartengr/arxiv_nlp")["train"]

# Extract metadata
abstracts = list(dataset["Abstracts"])
titles = list(dataset["Titles"])
This dataset contains 44,949 abstracts from NLP research papers, making it ideal for discovering research themes and trends.

A Common Pipeline for Text Clustering

Text clustering typically follows a three-step pipeline:
  1. Embed documents: convert text into dense vector representations
  2. Reduce dimensionality: reduce the high-dimensional embeddings to fewer dimensions suitable for clustering
  3. Cluster documents: group similar documents using a clustering algorithm

Step 1: Embedding Documents

First, we convert each abstract into a numerical vector that captures its semantic meaning.
from sentence_transformers import SentenceTransformer

# Create an embedding for each abstract
embedding_model = SentenceTransformer('thenlper/gte-small')
embeddings = embedding_model.encode(abstracts, show_progress_bar=True)
Check the dimensions:
embeddings.shape
Output:
(44949, 384)
Each of the 44,949 abstracts is now represented as a 384-dimensional vector.
We use thenlper/gte-small, a compact but powerful embedding model. It balances quality and speed, making it suitable for large document collections.
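What makes these vectors useful is that semantically similar texts land close together. Below is a minimal numpy sketch of cosine similarity (the comparison the rest of the pipeline builds on), using tiny made-up vectors rather than real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for document embeddings (real ones would be 384-dimensional)
doc_a = np.array([0.9, 0.1, 0.0])   # e.g., a speech-recognition abstract
doc_b = np.array([0.8, 0.2, 0.1])   # a similar abstract
doc_c = np.array([0.0, 0.1, 0.9])   # an unrelated abstract

print(cosine_similarity(doc_a, doc_b))  # high: similar documents
print(cosine_similarity(doc_a, doc_c))  # low: dissimilar documents
```

Clustering in embedding space works precisely because these pairwise similarities reflect meaning rather than surface word overlap.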

Step 2: Reducing the Dimensionality of Embeddings

High-dimensional embeddings (here, 384 dimensions) are hard for clustering algorithms to work with: as dimensionality grows, distances between points become less and less informative (the curse of dimensionality). We use UMAP (Uniform Manifold Approximation and Projection) to reduce the number of dimensions while preserving semantic relationships.
from umap import UMAP

# We reduce the input embeddings from 384 dimensions to 5 dimensions
umap_model = UMAP(
    n_components=5,
    min_dist=0.0,
    metric='cosine',
    random_state=42
)
reduced_embeddings = umap_model.fit_transform(embeddings)
Why 5 dimensions? This strikes a balance:
  • Using more than 2-3 dimensions preserves more semantic information
  • Staying well under the original dimensionality keeps clustering effective
  • We can reduce to 2 dimensions later for visualization
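The motivation for reducing dimensionality can be demonstrated directly: with random points, the gap between the nearest and farthest neighbor shrinks as dimensions grow, so distance-based clustering has less signal to work with. A small illustrative sketch (the point count and seed are arbitrary):

```python
import numpy as np

def distance_contrast(dim, n_points=500, seed=42):
    """Relative contrast between the farthest and nearest neighbor of a query point.
    Low contrast means distances are nearly indistinguishable."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))          # uniform points in the unit cube
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

print(distance_contrast(dim=5))    # distances still spread out
print(distance_contrast(dim=384))  # distances concentrate around the same value
```

In 5 dimensions the nearest and farthest points differ substantially; in 384 dimensions nearly every point is about equally far away, which is why we reduce before clustering.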

Step 3: Cluster the Reduced Embeddings

Now we apply HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to discover clusters.
from hdbscan import HDBSCAN

# We fit the model and extract the clusters
hdbscan_model = HDBSCAN(
    min_cluster_size=50,
    metric='euclidean',
    cluster_selection_method='eom'
).fit(reduced_embeddings)
clusters = hdbscan_model.labels_

# How many clusters did we generate?
len(set(clusters))
Output:
156
That is 156 distinct labels, one of which (-1) marks outliers, so we discovered 155 clusters in the NLP research literature!
Why HDBSCAN?
  • Doesn’t require specifying the number of clusters upfront
  • Can identify clusters of varying densities
  • Automatically identifies outliers (labeled as -1)
  • Works well with the output of UMAP
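Because HDBSCAN reserves the label -1 for outliers, counting clusters means excluding that label. A quick numpy sketch with made-up labels (in the real pipeline, `clusters` holds `hdbscan_model.labels_`):

```python
import numpy as np

# Toy stand-in for hdbscan_model.labels_: -1 marks outliers
clusters = np.array([-1, 0, 0, 1, 2, 2, 2, -1, 1, 0])

labels, counts = np.unique(clusters, return_counts=True)
n_clusters = len(labels[labels != -1])           # cluster count, excluding outliers
n_outliers = int(counts[labels == -1].sum())     # documents labeled -1

print(f"{n_clusters} clusters, {n_outliers} outliers")  # → 3 clusters, 2 outliers
for label, count in zip(labels, counts):
    if label != -1:
        print(f"cluster {label}: {count} documents")
```

The same pattern applied to our real labels explains the output above: 156 unique labels minus the outlier label.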

Inspecting the Clusters

Let’s manually examine the first three documents in cluster 0:
import numpy as np

# Print first three documents in cluster 0
cluster = 0
for index in np.where(clusters==cluster)[0][:3]:
    print(abstracts[index][:300] + "... \n")
Output:
This works aims to design a statistical machine translation from English text
to American Sign Language (ASL). The system is based on Moses tool with some
modifications and the results are synthesized through a 3D avatar for
interpretation. First, we translate the input text to gloss, a written fo...

Researches on signed languages still strongly dissociate linguistic issues
related on phonological and phonetic aspects, and gesture studies for
recognition and synthesis purposes. This paper focuses on the imbrication of
motion and meaning for the analysis, synthesis and evaluation of sign lang...

Modern computational linguistic software cannot produce important aspects of
sign language translation. Using some researches we deduce that the majority of
automatic sign language translation systems ignore many aspects when they
generate animation; therefore the interpretation lost the truth inf...
All three abstracts are about sign language translation - the clustering worked!

Visualizing Clusters

To visualize our clusters, we reduce the embeddings to 2 dimensions:
import pandas as pd

# Reduce 384-dimensional embeddings to 2 dimensions for easier visualization
reduced_embeddings = UMAP(
    n_components=2,
    min_dist=0.0,
    metric='cosine',
    random_state=42
).fit_transform(embeddings)

# Create dataframe
df = pd.DataFrame(reduced_embeddings, columns=["x", "y"])
df["title"] = titles
df["cluster"] = [str(c) for c in clusters]

# Select outliers and non-outliers (clusters)
clusters_df = df.loc[df.cluster != "-1", :]
outliers_df = df.loc[df.cluster == "-1", :]
Create a static plot:
import matplotlib.pyplot as plt

# Plot outliers and non-outliers separately
plt.scatter(outliers_df.x, outliers_df.y, alpha=0.05, s=2, c="grey")
plt.scatter(
    clusters_df.x, clusters_df.y, c=clusters_df.cluster.astype(int),
    alpha=0.6, s=2, cmap='tab20b'
)
plt.axis('off')
The visualization shows clear clusters of related papers, with outliers in grey!
Outliers (cluster -1) represent papers that don’t fit well into any cluster. This is normal and often represents either very unique papers or papers that bridge multiple topics.

From Text Clustering to Topic Modeling

While clustering groups similar documents, topic modeling goes further by:
  1. Identifying what makes each cluster unique
  2. Extracting representative keywords
  3. Providing interpretable topic descriptions

BERTopic: A Modular Topic Modeling Framework

BERTopic combines our clustering pipeline with topic representation techniques to create interpretable topics.
from bertopic import BERTopic

# Train our model with our previously defined models
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True
).fit(abstracts, embeddings)
Training output:
2024-04-24 10:38:31 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-24 10:39:22 - BERTopic - Dimensionality - Completed ✓
2024-04-24 10:39:22 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-24 10:39:24 - BERTopic - Cluster - Completed ✓
2024-04-24 10:39:24 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-24 10:39:34 - BERTopic - Representation - Completed ✓
Modularity is key: BERTopic allows you to swap out any component:
  • Use different embedding models (OpenAI, Cohere, local models)
  • Try different dimensionality reduction techniques
  • Experiment with clustering algorithms
  • Apply various representation models

Exploring Topics

View all topics with their counts and representations:
topic_model.get_topic_info()
Sample output:
Topic  Count  Name                                   Representation
-1     14520  -1_the_of_and_to                       [the, of, and, to, in, we, that, language…]
0       2290  0_speech_asr_recognition_end           [speech, asr, recognition, end, acoustic…]
1       1403  1_medical_clinical_biomedical_patient  [medical, clinical, biomedical, patient…]
2       1156  2_sentiment_aspect_analysis_reviews    [sentiment, aspect, analysis, reviews…]
3        986  3_translation_nmt_machine_neural       [translation, nmt, machine, neural…]
The model discovered 155 topics (plus the -1 outlier topic) covering different areas of NLP research!

Examining Topic Keywords

Get the top 10 keywords for a specific topic with their c-TF-IDF weights:
topic_model.get_topic(0)
Output:
[
    ('speech', 0.0282),
    ('asr', 0.0190),
    ('recognition', 0.0135),
    ('end', 0.0098),
    ('acoustic', 0.0095),
    ('speaker', 0.0069),
    ('audio', 0.0068),
    ('the', 0.0063),
    ('error', 0.0063),
    ('automatic', 0.0063)
]
Topic 0 is clearly about speech recognition and ASR (Automatic Speech Recognition)!
c-TF-IDF (class-based TF-IDF) is like TF-IDF but treats each cluster as a single document. It identifies words that are frequent within a topic but rare across other topics, making topic representations more distinctive.
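The idea behind c-TF-IDF can be sketched with a toy example: treat each cluster's concatenated documents as one "class document", then score each term by its within-class frequency times a measure of how rare the term is across all classes. This is a simplified version of BERTopic's weighting, with made-up counts:

```python
import numpy as np

# Term counts per class (rows = clusters, cols = terms), made-up numbers
terms = ["speech", "translation", "the"]
counts = np.array([
    [40,  1, 30],   # cluster 0: speech recognition papers
    [ 1, 50, 35],   # cluster 1: machine translation papers
])

tf = counts / counts.sum(axis=1, keepdims=True)      # term frequency within each class
avg_words = counts.sum() / counts.shape[0]           # A: average words per class
freq_across = counts.sum(axis=0)                     # f_t: term frequency across classes
ctfidf = tf * np.log(1 + avg_words / freq_across)    # simplified c-TF-IDF weights

top_cluster0 = terms[int(np.argmax(ctfidf[0]))]
top_cluster1 = terms[int(np.argmax(ctfidf[1]))]
print(top_cluster0, top_cluster1)  # → speech translation
```

Note how "the" is frequent in both clusters but gets down-weighted by the log term, while each cluster's distinctive word rises to the top, exactly the behavior we saw in the topic keyword lists above.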

Searching for Topics

Find topics related to a search term:
topic_model.find_topics("topic modeling")
Output:
([22, -1, 1, 47, 32], [0.955, 0.912, 0.907, 0.907, 0.905])
Topic 22 has 95.5% similarity to “topic modeling”! Let’s inspect it:
topic_model.get_topic(22)
Output:
[
    ('topic', 0.0663),
    ('topics', 0.0353),
    ('lda', 0.0164),
    ('latent', 0.0134),
    ('document', 0.0130),
    ('documents', 0.0124),
    ('modeling', 0.0120),
    ('dirichlet', 0.0101),
    ('word', 0.0085),
    ('allocation', 0.0079)
]
This topic includes classic topic modeling terms like LDA (Latent Dirichlet Allocation)! Verify the BERTopic paper is in this topic:
topic_model.topics_[titles.index('BERTopic: Neural topic modeling with a class-based TF-IDF procedure')]
Output:
22
Perfect! The BERTopic paper itself was correctly assigned to the topic modeling topic.

Visualizations

BERTopic provides rich interactive visualizations to explore topics.

Visualize Documents

Create an interactive plot showing all documents and their topics:
# Visualize topics and documents
fig = topic_model.visualize_documents(
    titles,
    reduced_embeddings=reduced_embeddings,
    width=1200,
    hide_annotations=True
)

# Update fonts of legend for easier visualization
fig.update_layout(font=dict(size=16))
This creates an interactive scatter plot where you can:
  • Hover over points to see document titles
  • Zoom into specific regions
  • Filter by topic
  • Explore the relationship between different topics

Additional Visualizations

# Visualize a bar chart of ranked keywords per topic
topic_model.visualize_barchart()

# Visualize the hierarchical structure of topics
topic_model.visualize_hierarchy()
The hierarchy visualization is particularly useful: it shows how topics can be merged into higher-level themes, revealing the hierarchical structure of your document collection.

Representation Models

BERTopic’s modularity shines when using representation models to improve topic descriptions beyond c-TF-IDF.

Available Representation Models

KeyBERTInspired uses embeddings to select the most representative keywords for each topic.
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
topic_model.update_topics(abstracts, representation_model=representation_model)
MaximalMarginalRelevance selects diverse keywords that are relevant but not redundant.
from bertopic.representation import MaximalMarginalRelevance

representation_model = MaximalMarginalRelevance(diversity=0.3)
topic_model.update_topics(abstracts, representation_model=representation_model)
OpenAI uses large language models to generate descriptive topic labels.
import openai
from bertopic.representation import OpenAI

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
representation_model = OpenAI(
    client,
    model="gpt-4",
    prompt="Generate a concise topic label for the following keywords: [KEYWORDS]"
)
topic_model.update_topics(abstracts, representation_model=representation_model)
LangChain integrates with LangChain for custom LLM-based representations.
from bertopic.representation import LangChain

# `chain` is a LangChain chain you have built beforehand, wrapped around an LLM
representation_model = LangChain(chain)
topic_model.update_topics(abstracts, representation_model=representation_model)

Updating Topics After Training

You can update topic representations after training, allowing quick iteration:
from copy import deepcopy

# Save original representations
original_topics = deepcopy(topic_model.topic_representations_)

# Update with a new representation model
from bertopic.representation import KeyBERTInspired
representation_model = KeyBERTInspired()
topic_model.update_topics(
    abstracts,
    representation_model=representation_model
)

# Compare before and after
def topic_differences(model, original_topics, nr_topics=5):
    """Show the differences in topic representations between two models"""
    for topic in range(nr_topics):
        print(f"\n--- Topic {topic} ---")
        print("Original:", [word for word, _ in original_topics[topic][:5]])
        print("Updated:", [word for word, _ in model.get_topic(topic)[:5]])

topic_differences(topic_model, original_topics)
Combine multiple representation models for richer topic descriptions:
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
representation_models = [
    KeyBERTInspired(),
    MaximalMarginalRelevance(diversity=0.3),
    OpenAI(client, model="gpt-3.5-turbo")
]

topic_model = BERTopic(representation_model=representation_models)
A list of representation models is applied as a chain, with each model refining the output of the previous one. If you instead want several independent representations per topic, pass a dictionary of named representation models.

Practical Applications

Scenario: A research lab has thousands of papers and wants to organize them by theme.

Solution:
  1. Extract paper abstracts
  2. Generate embeddings and cluster using BERTopic
  3. Use OpenAI representation model for human-readable topic names
  4. Create interactive visualizations for exploration
  5. Build a search interface using topic assignments
Benefits: Researchers can quickly find related papers, identify research gaps, and track field evolution.
Scenario: An e-commerce company receives thousands of product reviews daily.

Solution:
  1. Collect review text
  2. Apply BERTopic to discover common themes
  3. Track topic prevalence over time
  4. Alert teams when new issues emerge (new topics)
  5. Generate automated reports on customer concerns
Benefits: Product teams can prioritize improvements, customer service can proactively address issues, and executives get data-driven insights.
Scenario: A news website wants to recommend related articles.

Solution:
  1. Model topics across all articles
  2. Assign new articles to topics in real-time
  3. Recommend articles from the same or related topics
  4. Use topic hierarchy for broader recommendations
Benefits: Increased engagement, longer session times, and better content discovery.

Key Differences: Clustering vs Topic Modeling

Aspect            Text Clustering              Topic Modeling
Output            Document groups              Document groups + topic descriptions
Interpretability  Requires manual inspection   Automatic keyword extraction
Use Case          Organization, deduplication  Analysis, exploration, search
Examples          HDBSCAN, K-Means             LDA, NMF, BERTopic
Flexibility       Group assignment             Group + representation

Performance Considerations

Scalability Tips:
  1. Large datasets (>100K documents):
    • Use approximate nearest neighbors for UMAP
    • Consider batching embeddings
    • Use low_memory=True in UMAP
  2. Speed optimization:
    • Use smaller embedding models (e.g., all-MiniLM-L6-v2)
    • Pre-compute and save embeddings
    • Reduce UMAP dimensions to 3-5 instead of 5-10
  3. Quality improvement:
    • Use larger embedding models (e.g., all-mpnet-base-v2)
    • Increase min_cluster_size for more coherent topics
    • Experiment with different UMAP parameters
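Tip 2 (pre-computing embeddings) is easy to apply because embeddings are just a numpy array: encode the documents once, save the array, and reload it when experimenting with different UMAP or HDBSCAN settings. A minimal sketch with a random stand-in array (the file name is illustrative):

```python
import os
import tempfile

import numpy as np

# Stand-in for embeddings computed by embedding_model.encode(abstracts)
embeddings = np.random.default_rng(0).random((100, 384)).astype(np.float32)

path = os.path.join(tempfile.gettempdir(), "arxiv_embeddings.npy")
np.save(path, embeddings)   # compute once, save to disk

loaded = np.load(path)      # later runs: skip the expensive encoding step
assert np.array_equal(loaded, embeddings)
print(loaded.shape)  # → (100, 384)
```

This is especially worthwhile here: encoding ~45K abstracts is the slowest step of the pipeline, while UMAP and HDBSCAN runs on saved embeddings take only minutes.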

Key Takeaways

  1. Embeddings are foundational: high-quality embeddings are crucial for both clustering and topic modeling
  2. Pipeline matters: the combination of embedding → dimensionality reduction → clustering works well across domains
  3. BERTopic is modular: swap components to customize for your specific use case
  4. Representation models enhance topics: use LLMs or specialized models to generate better topic descriptions
  5. Visualization aids understanding: interactive plots help explore and validate discovered topics

Next Steps

Now that you understand text clustering and topic modeling, you can:
  • Apply these techniques to your own document collections
  • Experiment with different embedding models and parameters
  • Build search and recommendation systems using topics
  • Track topic evolution over time in dynamic corpora

Try the notebook yourself: Open In Colab
