Clustering algorithms group similar data points together without requiring labeled training data. Spark MLlib provides several clustering algorithms for different use cases.

K-means

K-means is one of the most commonly used clustering algorithms; it partitions data into K predefined clusters.

Overview

MLlib implements a parallelized variant of k-means++ called k-means||. The algorithm:
  1. Initializes cluster centers using k-means++
  2. Iteratively assigns points to nearest centers
  3. Updates centers based on assigned points
  4. Repeats until convergence
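
The loop above can be sketched in a few lines of NumPy. This is an illustration only: a simple deterministic farthest-point rule stands in for the k-means|| initialization, and the toy data is invented.

```python
import numpy as np

def lloyd_kmeans(points, k, max_iter=20, tol=1e-4):
    # Step 1 (simplified): farthest-point init stands in for k-means||
    centers = [points[0]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers barely move
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers
    return centers, labels

points = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                   [5., 5.], [5.1, 5.], [5., 5.1]])
centers, labels = lloyd_kmeans(points, k=2)
```

On the two toy blobs above, the loop converges in a couple of iterations and assigns each blob to its own cluster.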

Input and Output

Input Columns:
  • featuresCol: Feature vector (default: “features”)
Output Columns:
  • predictionCol: Predicted cluster index (default: “prediction”)

Example

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train k-means model with 2 clusters
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering using Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

# Show cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for center in centers:
    print(center)

Parameters

  • k (int, default: 2): Number of clusters to create
  • initMode (string, default: “k-means||”): Initialization algorithm, “k-means||” or “random”
  • initSteps (int, default: 2): Number of steps for k-means|| initialization
  • maxIter (int, default: 20): Maximum number of iterations
  • tol (double, default: 1e-4): Convergence tolerance

Latent Dirichlet Allocation (LDA)

LDA is a topic modeling algorithm that discovers abstract topics in a collection of documents.

Overview

LDA models documents as mixtures of topics, where each topic is a probability distribution over words. MLlib supports:
  • EMLDAOptimizer: Expectation-Maximization algorithm
  • OnlineLDAOptimizer: Online variational Bayes algorithm
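
The generative idea behind LDA — each word is drawn by first picking a topic from the document's topic mixture, then a word from that topic's word distribution — can be illustrated with a toy vocabulary. The topics, words, and probabilities below are invented for the sketch.

```python
import numpy as np

vocab = ["spark", "cluster", "goal", "match"]
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],   # a "computing" topic (distribution over words)
    [0.05, 0.05, 0.50, 0.40],   # a "sports" topic
])
doc_topics = np.array([0.8, 0.2])  # this document leans heavily on topic 0

rng = np.random.default_rng(0)
words = []
for _ in range(10):
    z = rng.choice(2, p=doc_topics)                       # draw a topic for this position
    words.append(vocab[rng.choice(4, p=topic_word[z])])   # draw a word from that topic
```

LDA inference runs this process in reverse: given only the words, it recovers the topic-word distributions and each document's topic mixture.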

Example

from pyspark.ml.clustering import LDA

# Load and parse data
data = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Train LDA model
lda = LDA(k=10, maxIter=10)
model = lda.fit(data)

# Get log likelihood and perplexity
ll = model.logLikelihood(data)
lp = model.logPerplexity(data)
print(f"The lower bound on the log likelihood: {ll}")
print(f"The upper bound on perplexity: {lp}")

# Describe topics
topics = model.describeTopics(3)
print("Topics:")
topics.show(truncate=False)

# Transform documents
transformed = model.transform(data)
transformed.show(truncate=False)

Parameters

  • k (int, default: 10): Number of topics to infer
  • optimizer (string, default: “online”): Optimizer, “online” or “em”
  • maxIter (int, default: 20): Maximum number of iterations

Bisecting K-means

Bisecting k-means is a hierarchical clustering algorithm using a divisive (top-down) approach.

Overview

The algorithm:
  1. Starts with all observations in one cluster
  2. Recursively splits clusters using k-means (k=2)
  3. Continues until reaching desired number of clusters
Bisecting k-means can be much faster than regular k-means, but produces a different clustering structure.
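
A minimal sketch of this divisive loop, using a plain two-means split and a largest-cluster-first heuristic (an illustration of the idea, not MLlib's exact splitting rule):

```python
import numpy as np

def two_means(points, iters=20):
    """Split one cluster in two with plain k-means (k=2, farthest-point init)."""
    d2 = np.sum((points - points[0]) ** 2, axis=1)
    centers = np.array([points[0], points[d2.argmax()]])
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])
    return labels

def bisecting_kmeans(points, k):
    clusters = [np.arange(len(points))]          # step 1: everything in one cluster
    while len(clusters) < k:                     # step 3: repeat until k clusters
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(big)              # step 2: split the largest cluster
        labels = two_means(points[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

points = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                   [5., 5.], [5.1, 5.], [5., 5.1]])
clusters = bisecting_kmeans(points, k=2)
```

Because each split is a local two-means run, the final partition can differ from what a single flat k-means run would produce on the same data.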

Example

from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train bisecting k-means model
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

# Show cluster centers
print("Cluster Centers:")
centers = model.clusterCenters()
for center in centers:
    print(center)

Gaussian Mixture Model (GMM)

GMM represents a composite distribution where points are drawn from one of K Gaussian sub-distributions.

Overview

GMM uses the expectation-maximization algorithm to induce a maximum-likelihood model. Unlike k-means, GMM:
  • Models each cluster as a Gaussian distribution
  • Provides soft cluster assignments (probabilities)
  • Can represent overlapping clusters
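
The soft assignment is just Bayes' rule over the component densities (the E-step of EM). A one-dimensional sketch with toy weights, means, and variances:

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Soft assignment: P(component k | x) by Bayes' rule."""
    likelihoods = [w * gaussian_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

# A point exactly halfway between two equal components is split 50/50
probs = responsibilities(2.5, weights=[0.5, 0.5], means=[0.0, 5.0], variances=[1.0, 1.0])
```

These per-component probabilities are what GMM exposes in its probability column, in contrast to k-means' hard assignments.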

Input and Output

Input Columns:
  • featuresCol: Feature vector (default: “features”)
Output Columns:
  • predictionCol: Predicted cluster index (default: “prediction”)
  • probabilityCol: Probability of each cluster (default: “probability”)

Example

from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train Gaussian Mixture model
gmm = GaussianMixture().setK(2).setSeed(1)
model = gmm.fit(dataset)

# Make predictions
predictions = model.transform(dataset)
predictions.select("features", "prediction", "probability").show(5, truncate=False)

# Output the parameters of each Gaussian component
print("Gaussians:")
for i, row in enumerate(model.gaussiansDF.collect()):
    print(f"Cluster {i}:")
    print(f"  Mean: {row['mean']}")
    print(f"  Covariance:\n{row['cov']}")

# Evaluate
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

Parameters

  • k (int, default: 2): Number of Gaussian sub-distributions
  • maxIter (int, default: 100): Maximum number of iterations
  • tol (double, default: 0.01): Convergence tolerance

Power Iteration Clustering (PIC)

PIC is a scalable graph clustering algorithm developed by Lin and Cohen.

Overview

PIC finds a low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix. It’s particularly useful for clustering graph data.
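
The core idea can be sketched with NumPy: row-normalize the affinity matrix and apply it repeatedly to a random vector, stopping early ("truncated"). Nodes in the same community converge to near-identical embedding values much faster than the communities merge. The graph below is invented for illustration.

```python
import numpy as np

# Toy affinity graph: two triangles of strong edges joined by one weak edge
A = np.zeros((6, 6))
edges = [(0, 1, 0.9), (1, 2, 0.9), (0, 2, 0.9),
         (3, 4, 0.9), (4, 5, 0.9), (3, 5, 0.9),
         (2, 3, 0.1)]
for i, j, w in edges:
    A[i, j] = A[j, i] = w

W = A / A.sum(axis=1, keepdims=True)       # normalized pair-wise similarity matrix
v = np.random.default_rng(0).random(6)     # random starting vector
for _ in range(15):                        # truncated power iteration
    v = W @ v
    v /= np.abs(v).sum()                   # rescale to keep values bounded
```

After a few iterations the two communities occupy distinct, nearly constant values in `v`; a final 1-D k-means over that embedding recovers the clusters.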

Parameters

  • k (int, required): Number of clusters to create
  • maxIter (int, default: 20): Maximum number of iterations
  • srcCol (string, required): Name of the input column for source vertex IDs
  • dstCol (string, required): Name of the input column for destination vertex IDs
  • weightCol (string, optional): Weight column name

Example

from pyspark.ml.clustering import PowerIterationClustering

# Create graph data (source, destination, similarity)
data = spark.createDataFrame([
    (0, 1, 0.9),
    (1, 2, 0.9),
    (2, 3, 0.9),
    (3, 4, 0.1),
    (4, 5, 0.9),
], ["src", "dst", "weight"])

# Create PIC model
pic = PowerIterationClustering(
    k=2, 
    maxIter=10, 
    initMode="degree",
    srcCol="src",
    dstCol="dst",
    weightCol="weight"
)

# Run PIC; assignClusters returns a DataFrame of assignments, not a fitted model
assignments = pic.assignClusters(data)

# Show cluster assignments
assignments.select("id", "cluster").show()

Choosing a Clustering Algorithm

K-means

Use when:
  • You know the number of clusters
  • Clusters are roughly spherical
  • You need fast, scalable clustering
Advantages:
  • Fast and scalable
  • Simple to understand and implement
  • Works well for many applications
Limitations:
  • Must specify K in advance
  • Sensitive to initialization
  • Assumes spherical clusters
Gaussian Mixture Model

Use when:
  • Clusters have different sizes/shapes
  • You need soft cluster assignments
  • Data follows Gaussian distributions
Advantages:
  • Provides probability estimates
  • Handles overlapping clusters
  • More flexible than k-means
Limitations:
  • Slower than k-means
  • More complex to tune
  • May converge to local optima
Bisecting K-means

Use when:
  • You want hierarchical clustering
  • Regular k-means is too slow
  • You need a dendrogram structure
Advantages:
  • Often faster than k-means
  • Produces hierarchy of clusters
  • Good for large datasets
Limitations:
  • Different results than k-means
  • Still requires specifying K
  • Top-down only (not bottom-up)
Latent Dirichlet Allocation

Use when:
  • Working with text documents
  • You need topic modeling
  • Documents contain multiple topics
Advantages:
  • Discovers latent topics
  • Soft assignments (documents can have multiple topics)
  • Interpretable results
Limitations:
  • Designed for text data
  • Requires careful tuning
  • Computationally intensive
Power Iteration Clustering

Use when:
  • Working with graph data
  • You have similarity matrices
  • Scalability is critical
Advantages:
  • Highly scalable
  • Works on graph structures
  • Fast convergence
Limitations:
  • Requires graph input
  • Less common than other methods
  • Needs similarity/weight data

Evaluation Metrics

Silhouette Score

The Silhouette score measures how similar an object is to its own cluster compared to other clusters:
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(
    predictionCol="prediction",
    featuresCol="features",
    metricName="silhouette",
    distanceMeasure="squaredEuclidean"
)

silhouette = evaluator.evaluate(predictions)
print(f"Silhouette Score = {silhouette}")
The Silhouette score ranges from -1 to 1:
  • Near 1: points are well matched to their own cluster and far from others
  • Near 0: clusters overlap
  • Negative: many points are likely assigned to the wrong cluster
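
For a single point the score is s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster and b is the mean distance to the nearest other cluster. A toy sketch with two hand-picked clusters:

```python
def silhouette_point(point, own_members, other_members, dist):
    """s = (b - a) / max(a, b); own_members excludes the point itself."""
    a = sum(dist(point, p) for p in own_members) / len(own_members)
    b = sum(dist(point, p) for p in other_members) / len(other_members)
    return (b - a) / max(a, b)

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# A point sitting tightly inside its own cluster scores close to 1
s = silhouette_point((0.0, 0.0),
                     [(0.1, 0.0), (0.0, 0.1)],
                     [(5.0, 5.0), (5.1, 5.0)],
                     euclid)
```

The dataset-level score reported by ClusteringEvaluator aggregates these per-point values.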

Best Practices

Normalize features to similar ranges before clustering. Different scales can bias distance calculations:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(data)
scaledData = scalerModel.transform(data)
Use the elbow method or silhouette analysis to determine optimal K:
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator()
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans().setK(k).setSeed(1)
    model = kmeans.fit(data)
    predictions = model.transform(data)
    score = evaluator.evaluate(predictions)
    silhouette_scores.append(score)
Remove or handle outliers before clustering, as they can significantly affect cluster centers and assignments.
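
One simple approach is a median/MAD rule on a single feature. The toy values below are invented; a real pipeline would apply a rule like this per column, e.g. via DataFrame aggregations.

```python
import statistics

def drop_outliers(values, k=3.0):
    """Keep values within k median-absolute-deviations of the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= k * mad]

data = [1.0, 1.2, 0.9, 1.1, 1.0, 50.0]   # 50.0 is an obvious outlier
clean = drop_outliers(data)
```

Median and MAD are used rather than mean and standard deviation because extreme outliers inflate the latter enough to mask themselves.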
Run clustering multiple times with different seeds and choose the best result based on evaluation metrics.
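
The seed-sweep idea can be sketched with a tiny self-contained k-means, using within-cluster SSE as the selection metric (illustrative only; with MLlib you would vary setSeed across runs and compare evaluator scores instead).

```python
import numpy as np

def kmeans_run(points, k, seed, iters=20):
    """One k-means run with seeded random init; returns (centers, within-cluster SSE)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():               # guard against empty clusters
                centers[j] = points[labels == j].mean(axis=0)
    dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
    return centers, float((dists.min(axis=1) ** 2).sum())

points = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                   [5., 5.], [5.1, 5.], [5., 5.1]])
# Try several seeds and keep the run with the lowest cost
best_centers, best_cost = min((kmeans_run(points, k=2, seed=s) for s in range(5)),
                              key=lambda run: run[1])
```

Different seeds give different initializations, so the cheapest of several runs is a practical hedge against k-means' sensitivity to initialization.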

Next Steps

Feature Engineering

Learn how to prepare and transform features

Classification

Explore supervised learning algorithms

Collaborative Filtering

Build recommendation systems

ML Pipelines

Create end-to-end ML workflows
