K-means
K-means is one of the most commonly used clustering algorithms. It partitions data into K predefined clusters.
Overview
MLlib implements a parallelized variant of k-means++ called k-means||. The algorithm:
- Initializes cluster centers (using the k-means|| procedure by default)
- Iteratively assigns points to nearest centers
- Updates centers based on assigned points
- Repeats until convergence
Input and Output
Input Columns:
featuresCol: Feature vector (default: "features")
Output Columns:
predictionCol: Predicted cluster index (default: "prediction")
Example
Parameters
k: Number of clusters to create
initMode: Initialization algorithm, either "k-means||" or "random"
initSteps: Number of steps for the k-means|| initialization
maxIter: Maximum number of iterations
tol: Convergence tolerance
Latent Dirichlet Allocation (LDA)
LDA is a topic modeling algorithm that discovers abstract topics in a collection of documents.
Overview
LDA models documents as mixtures of topics, where each topic is a probability distribution over words. MLlib supports:
- EMLDAOptimizer: Expectation-maximization algorithm
- OnlineLDAOptimizer: Online variational Bayes algorithm
Example
Parameters
k: Number of topics to infer
optimizer: Optimizer to use, either "online" or "em"
maxIter: Maximum number of iterations
Bisecting K-means
Bisecting k-means is a hierarchical clustering algorithm that uses a divisive (top-down) approach.
Overview
The algorithm:
- Starts with all observations in one cluster
- Recursively splits clusters using k-means (k=2)
- Continues until reaching desired number of clusters
Example
Gaussian Mixture Model (GMM)
GMM represents a composite distribution where points are drawn from one of K Gaussian sub-distributions.
Overview
GMM uses the expectation-maximization algorithm to fit a maximum-likelihood model. Unlike k-means, GMM:
- Models each cluster as a Gaussian distribution
- Provides soft cluster assignments (probabilities)
- Can represent overlapping clusters
Input and Output
Input Columns:
featuresCol: Feature vector (default: "features")
Output Columns:
predictionCol: Predicted cluster index (default: "prediction")
probabilityCol: Probability of each cluster (default: "probability")
Example
Parameters
k: Number of Gaussian sub-distributions
maxIter: Maximum number of iterations
tol: Convergence tolerance
Power Iteration Clustering (PIC)
PIC is a scalable graph clustering algorithm developed by Lin and Cohen.
Overview
PIC finds a low-dimensional embedding of a dataset using truncated power iteration on a normalized pairwise similarity matrix. It is particularly useful for clustering graph data.
Parameters
k: Number of clusters to create
maxIter: Maximum number of iterations
srcCol: Name of the input column for source vertex IDs
dstCol: Name of the input column for destination vertex IDs
weightCol: Name of the edge weight column (optional)
Example
Choosing a Clustering Algorithm
K-means
Use when:
- You know the number of clusters
- Clusters are roughly spherical
- You need fast, scalable clustering
Pros:
- Fast and scalable
- Simple to understand and implement
- Works well for many applications
Cons:
- Must specify K in advance
- Sensitive to initialization
- Assumes spherical clusters
GMM
Use when:
- Clusters have different sizes/shapes
- You need soft cluster assignments
- Data follows Gaussian distributions
Pros:
- Provides probability estimates
- Handles overlapping clusters
- More flexible than k-means
Cons:
- Slower than k-means
- More complex to tune
- May converge to local optima
Bisecting K-means
Use when:
- You want hierarchical clustering
- Regular k-means is too slow
- You need a dendrogram structure
Pros:
- Often faster than k-means
- Produces a hierarchy of clusters
- Good for large datasets
Cons:
- Different results than k-means
- Still requires specifying K
- Top-down only (not bottom-up)
LDA
Use when:
- Working with text documents
- You need topic modeling
- Documents contain multiple topics
Pros:
- Discovers latent topics
- Soft assignments (documents can have multiple topics)
- Interpretable results
Cons:
- Designed for text data
- Requires careful tuning
- Computationally intensive
PIC
Use when:
- Working with graph data
- You have similarity matrices
- Scalability is critical
Pros:
- Highly scalable
- Works on graph structures
- Fast convergence
Cons:
- Requires graph input
- Less common than other methods
- Needs similarity/weight data
Evaluation Metrics
Silhouette Score
The Silhouette score measures how similar an object is to its own cluster compared to other clusters:
- 1: Perfect clustering
- 0: Overlapping clusters
- -1: Points assigned to the wrong clusters
Best Practices
Scale your features
Normalize features to similar ranges before clustering. Different scales can bias distance calculations:
Choose K wisely
Use the elbow method or silhouette analysis to determine optimal K:
Handle outliers
Remove or handle outliers before clustering, as they can significantly affect cluster centers and assignments.
Try multiple initializations
Run clustering multiple times with different seeds and choose the best result based on evaluation metrics.
Next Steps
Feature Engineering
Learn how to prepare and transform features
Classification
Explore supervised learning algorithms
Collaborative Filtering
Build recommendation systems
ML Pipelines
Create end-to-end ML workflows
