Clustering algorithms group similar data points together without requiring labeled training data. Spark MLlib provides several clustering algorithms for different use cases.

K-means

K-means is one of the most commonly used clustering algorithms; it partitions data into K predefined clusters.

Overview

MLlib implements a parallelized variant of k-means++ called k-means||. The algorithm:
  1. Initializes cluster centers using k-means++
  2. Iteratively assigns points to nearest centers
  3. Updates centers based on assigned points
  4. Repeats until convergence
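
The loop above can be sketched in a few lines of NumPy. This is an illustration only: a simple deterministic farthest-point rule stands in for the k-means|| initialization, and the toy data is invented.

```python
import numpy as np

def lloyd_kmeans(points, k, max_iter=20, tol=1e-4):
    # Step 1 (simplified): farthest-point init stands in for k-means||
    centers = [points[0]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers barely move
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, labels
        centers = new_centers
    return centers, labels

points = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                   [5., 5.], [5.1, 5.], [5., 5.1]])
centers, labels = lloyd_kmeans(points, k=2)
```

On the two toy blobs above, the loop converges in a couple of iterations and assigns each blob to its own cluster.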

Input and Output

Input Columns:
  • featuresCol: Feature vector (default: “features”)
Output Columns:
  • predictionCol: Predicted cluster index (default: “prediction”)

Example

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train k-means model with 2 clusters
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering using Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

# Show cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for center in centers:
    print(center)

Parameters

  • k (int, default: 2): Number of clusters to create
  • initMode (string, default: “k-means||”): Initialization algorithm, “k-means||” or “random”
  • initSteps (int, default: 2): Number of steps for k-means|| initialization
  • maxIter (int, default: 20): Maximum number of iterations
  • tol (double, default: 1e-4): Convergence tolerance

Latent Dirichlet Allocation (LDA)

LDA is a topic modeling algorithm that discovers abstract topics in a collection of documents.

Overview

LDA models documents as mixtures of topics, where each topic is a probability distribution over words. MLlib supports:
  • EMLDAOptimizer: Expectation-Maximization algorithm
  • OnlineLDAOptimizer: Online variational Bayes algorithm
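
The generative idea behind LDA — each word is drawn by first picking a topic from the document's topic mixture, then a word from that topic's word distribution — can be illustrated with a toy vocabulary. The topics, words, and probabilities below are invented for the sketch.

```python
import numpy as np

vocab = ["spark", "cluster", "goal", "match"]
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],   # a "computing" topic (distribution over words)
    [0.05, 0.05, 0.50, 0.40],   # a "sports" topic
])
doc_topics = np.array([0.8, 0.2])  # this document leans heavily on topic 0

rng = np.random.default_rng(0)
words = []
for _ in range(10):
    z = rng.choice(2, p=doc_topics)                       # draw a topic for this position
    words.append(vocab[rng.choice(4, p=topic_word[z])])   # draw a word from that topic
```

LDA inference runs this process in reverse: given only the words, it recovers the topic-word distributions and each document's topic mixture.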

Example

from pyspark.ml.clustering import LDA

# Load and parse data
data = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Train LDA model
lda = LDA(k=10, maxIter=10)
model = lda.fit(data)

# Get log likelihood and perplexity
ll = model.logLikelihood(data)
lp = model.logPerplexity(data)
print(f"The lower bound on the log likelihood: {ll}")
print(f"The upper bound on perplexity: {lp}")

# Describe topics
topics = model.describeTopics(3)
print("Topics:")
topics.show(truncate=False)

# Transform documents
transformed = model.transform(data)
transformed.show(truncate=False)

Parameters

  • k (int, default: 10): Number of topics to infer
  • optimizer (string, default: “online”): Optimizer, “online” or “em”
  • maxIter (int, default: 20): Maximum number of iterations

Bisecting K-means

Bisecting k-means is a hierarchical clustering algorithm using a divisive (top-down) approach.

Overview

The algorithm:
  1. Starts with all observations in one cluster
  2. Recursively splits clusters using k-means (k=2)
  3. Continues until reaching desired number of clusters
Bisecting k-means can be much faster than regular k-means, but produces a different clustering structure.
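
A minimal sketch of this divisive loop, using a plain two-means split and a largest-cluster-first heuristic (an illustration of the idea, not MLlib's exact splitting rule):

```python
import numpy as np

def two_means(points, iters=20):
    """Split one cluster in two with plain k-means (k=2, farthest-point init)."""
    d2 = np.sum((points - points[0]) ** 2, axis=1)
    centers = np.array([points[0], points[d2.argmax()]])
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])
    return labels

def bisecting_kmeans(points, k):
    clusters = [np.arange(len(points))]          # step 1: everything in one cluster
    while len(clusters) < k:                     # step 3: repeat until k clusters
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(big)              # step 2: split the largest cluster
        labels = two_means(points[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

points = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                   [5., 5.], [5.1, 5.], [5., 5.1]])
clusters = bisecting_kmeans(points, k=2)
```

Because each split is a local two-means run, the final partition can differ from what a single flat k-means run would produce on the same data.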

Example

from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train bisecting k-means model
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

# Show cluster centers
print("Cluster Centers:")
centers = model.clusterCenters()
for center in centers:
    print(center)

Gaussian Mixture Model (GMM)

GMM represents a composite distribution where points are drawn from one of K Gaussian sub-distributions.

Overview

GMM uses the expectation-maximization algorithm to induce a maximum-likelihood model. Unlike k-means, GMM:
  • Models each cluster as a Gaussian distribution
  • Provides soft cluster assignments (probabilities)
  • Can represent overlapping clusters
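
The soft assignment is just Bayes' rule over the component densities (the E-step of EM). A one-dimensional sketch with toy weights, means, and variances:

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(x, weights, means, variances):
    """Soft assignment: P(component k | x) by Bayes' rule."""
    likelihoods = [w * gaussian_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]

# A point exactly halfway between two equal components is split 50/50
probs = responsibilities(2.5, weights=[0.5, 0.5], means=[0.0, 5.0], variances=[1.0, 1.0])
```

These per-component probabilities are what GMM exposes in its probability column, in contrast to k-means' hard assignments.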

Input and Output

Input Columns:
  • featuresCol: Feature vector (default: “features”)
Output Columns:
  • predictionCol: Predicted cluster index (default: “prediction”)
  • probabilityCol: Probability of each cluster (default: “probability”)

Example

from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.evaluation import ClusteringEvaluator

# Load data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Train Gaussian Mixture model
gmm = GaussianMixture().setK(2).setSeed(1)
model = gmm.fit(dataset)

# Make predictions
predictions = model.transform(dataset)
predictions.select("features", "prediction", "probability").show(5, truncate=False)

# Output the parameters of each Gaussian component
print("Gaussians:")
for i, row in enumerate(model.gaussiansDF.collect()):
    print(f"Cluster {i}:")
    print(f"  Mean: {row['mean']}")
    print(f"  Covariance:\n{row['cov']}")

# Evaluate
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette with squared euclidean distance = {silhouette}")

Parameters

  • k (int, default: 2): Number of Gaussian sub-distributions
  • maxIter (int, default: 100): Maximum number of iterations
  • tol (double, default: 0.01): Convergence tolerance

Power Iteration Clustering (PIC)

PIC is a scalable graph clustering algorithm developed by Lin and Cohen.

Overview

PIC finds a low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix. It’s particularly useful for clustering graph data.
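
The core idea can be sketched with NumPy: row-normalize the affinity matrix and apply it repeatedly to a random vector, stopping early ("truncated"). Nodes in the same community converge to near-identical embedding values much faster than the communities merge. The graph below is invented for illustration.

```python
import numpy as np

# Toy affinity graph: two triangles of strong edges joined by one weak edge
A = np.zeros((6, 6))
edges = [(0, 1, 0.9), (1, 2, 0.9), (0, 2, 0.9),
         (3, 4, 0.9), (4, 5, 0.9), (3, 5, 0.9),
         (2, 3, 0.1)]
for i, j, w in edges:
    A[i, j] = A[j, i] = w

W = A / A.sum(axis=1, keepdims=True)       # normalized pair-wise similarity matrix
v = np.random.default_rng(0).random(6)     # random starting vector
for _ in range(15):                        # truncated power iteration
    v = W @ v
    v /= np.abs(v).sum()                   # rescale to keep values bounded
```

After a few iterations the two communities occupy distinct, nearly constant values in `v`; a final 1-D k-means over that embedding recovers the clusters.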

Parameters

  • k (int, required): Number of clusters to create
  • maxIter (int, default: 20): Maximum number of iterations
  • srcCol (string, required): Name of the input column for source vertex IDs
  • dstCol (string, required): Name of the input column for destination vertex IDs
  • weightCol (string, optional): Weight column name

Example

from pyspark.ml.clustering import PowerIterationClustering

# Create graph data (source, destination, similarity)
data = spark.createDataFrame([
    (0, 1, 0.9),
    (1, 2, 0.9),
    (2, 3, 0.9),
    (3, 4, 0.1),
    (4, 5, 0.9),
], ["src", "dst", "weight"])

# Create PIC model
pic = PowerIterationClustering(
    k=2, 
    maxIter=10, 
    initMode="degree",
    srcCol="src",
    dstCol="dst",
    weightCol="weight"
)

# Run PIC; assignClusters returns a DataFrame of assignments, not a fitted model
assignments = pic.assignClusters(data)

# Show cluster assignments
assignments.select("id", "cluster").show()

Choosing a Clustering Algorithm

K-means

Use when:
  • You know the number of clusters
  • Clusters are roughly spherical
  • You need fast, scalable clustering
Advantages:
  • Fast and scalable
  • Simple to understand and implement
  • Works well for many applications
Limitations:
  • Must specify K in advance
  • Sensitive to initialization
  • Assumes spherical clusters
Gaussian Mixture Model

Use when:
  • Clusters have different sizes/shapes
  • You need soft cluster assignments
  • Data follows Gaussian distributions
Advantages:
  • Provides probability estimates
  • Handles overlapping clusters
  • More flexible than k-means
Limitations:
  • Slower than k-means
  • More complex to tune
  • May converge to local optima
Bisecting K-means

Use when:
  • You want hierarchical clustering
  • Regular k-means is too slow
  • You need a dendrogram structure
Advantages:
  • Often faster than k-means
  • Produces hierarchy of clusters
  • Good for large datasets
Limitations:
  • Different results than k-means
  • Still requires specifying K
  • Top-down only (not bottom-up)
Latent Dirichlet Allocation

Use when:
  • Working with text documents
  • You need topic modeling
  • Documents contain multiple topics
Advantages:
  • Discovers latent topics
  • Soft assignments (documents can have multiple topics)
  • Interpretable results
Limitations:
  • Designed for text data
  • Requires careful tuning
  • Computationally intensive
Power Iteration Clustering

Use when:
  • Working with graph data
  • You have similarity matrices
  • Scalability is critical
Advantages:
  • Highly scalable
  • Works on graph structures
  • Fast convergence
Limitations:
  • Requires graph input
  • Less common than other methods
  • Needs similarity/weight data

Evaluation Metrics

Silhouette Score

The Silhouette score measures how similar an object is to its own cluster compared to other clusters:
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(
    predictionCol="prediction",
    featuresCol="features",
    metricName="silhouette",
    distanceMeasure="squaredEuclidean"
)

silhouette = evaluator.evaluate(predictions)
print(f"Silhouette Score = {silhouette}")
The Silhouette score ranges from -1 to 1:
  • Near 1: points are well matched to their own cluster and far from others
  • Near 0: clusters overlap
  • Negative: many points are likely assigned to the wrong cluster
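
For a single point the score is s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster and b is the mean distance to the nearest other cluster. A toy sketch with two hand-picked clusters:

```python
def silhouette_point(point, own_members, other_members, dist):
    """s = (b - a) / max(a, b); own_members excludes the point itself."""
    a = sum(dist(point, p) for p in own_members) / len(own_members)
    b = sum(dist(point, p) for p in other_members) / len(other_members)
    return (b - a) / max(a, b)

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# A point sitting tightly inside its own cluster scores close to 1
s = silhouette_point((0.0, 0.0),
                     [(0.1, 0.0), (0.0, 0.1)],
                     [(5.0, 5.0), (5.1, 5.0)],
                     euclid)
```

The dataset-level score reported by ClusteringEvaluator aggregates these per-point values.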

Best Practices

Normalize features to similar ranges before clustering. Different scales can bias distance calculations:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scalerModel = scaler.fit(data)
scaledData = scalerModel.transform(data)
Use the elbow method or silhouette analysis to determine optimal K:
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator()
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans().setK(k).setSeed(1)
    model = kmeans.fit(data)
    predictions = model.transform(data)
    score = evaluator.evaluate(predictions)
    silhouette_scores.append(score)
Remove or handle outliers before clustering, as they can significantly affect cluster centers and assignments.
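
One simple approach is a median/MAD rule on a single feature. The toy values below are invented; a real pipeline would apply a rule like this per column, e.g. via DataFrame aggregations.

```python
import statistics

def drop_outliers(values, k=3.0):
    """Keep values within k median-absolute-deviations of the median."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= k * mad]

data = [1.0, 1.2, 0.9, 1.1, 1.0, 50.0]   # 50.0 is an obvious outlier
clean = drop_outliers(data)
```

Median and MAD are used rather than mean and standard deviation because extreme outliers inflate the latter enough to mask themselves.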
Run clustering multiple times with different seeds and choose the best result based on evaluation metrics.
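
The seed-sweep idea can be sketched with a tiny self-contained k-means, using within-cluster SSE as the selection metric (illustrative only; with MLlib you would vary setSeed across runs and compare evaluator scores instead).

```python
import numpy as np

def kmeans_run(points, k, seed, iters=20):
    """One k-means run with seeded random init; returns (centers, within-cluster SSE)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(points[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():               # guard against empty clusters
                centers[j] = points[labels == j].mean(axis=0)
    dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
    return centers, float((dists.min(axis=1) ** 2).sum())

points = np.array([[0., 0.], [0.1, 0.], [0., 0.1],
                   [5., 5.], [5.1, 5.], [5., 5.1]])
# Try several seeds and keep the run with the lowest cost
best_centers, best_cost = min((kmeans_run(points, k=2, seed=s) for s in range(5)),
                              key=lambda run: run[1])
```

Different seeds give different initializations, so the cheapest of several runs is a practical hedge against k-means' sensitivity to initialization.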

Next Steps

Feature Engineering

Learn how to prepare and transform features

Classification

Explore supervised learning algorithms

Collaborative Filtering

Build recommendation systems

ML Pipelines

Create end-to-end ML workflows
