Chapter 9 explores unsupervised learning: finding structure in data without labels. You will apply K-Means clustering to group similar instances, use DBSCAN to identify clusters of arbitrary shape, and fit Gaussian Mixture Models (GMMs) that provide soft cluster assignments and can model data generation. The chapter also covers anomaly detection, image segmentation, and semi-supervised learning.
What you’ll learn
- The difference between classification (supervised) and clustering (unsupervised)
- K-Means clustering with KMeans: centroids, labels, and inertia
- K-Means initialisation with K-Means++ and the n_init parameter
- Mini-batch K-Means with MiniBatchKMeans for large datasets
- Evaluating cluster quality with inertia and the silhouette score
- Finding the optimal k using the elbow method and silhouette diagrams
- DBSCAN for density-based clustering of arbitrary-shaped clusters
- Gaussian Mixture Models (GMM) with GaussianMixture
- Model selection for GMMs using BIC and AIC
- Anomaly and novelty detection with GMMs
Key concepts
K-Means. K-Means partitions data into k clusters by alternating between two steps: assigning each instance to its nearest centroid, and moving each centroid to the mean of the instances assigned to it. The algorithm is fast but assumes roughly spherical clusters of similar size. Inertia, the sum of squared distances from each instance to its assigned centroid, measures within-cluster compactness. Lower is better, but inertia always decreases as k increases, so it cannot be used alone to choose k.
Silhouette score. The silhouette coefficient for an instance is (b − a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances in the nearest other cluster. A coefficient near +1 means the instance is well inside its own cluster; near 0 means it sits on a cluster boundary; near −1 means it has probably been assigned to the wrong cluster. The silhouette score, the mean coefficient over all instances, is a more principled way to choose k than inertia alone.
DBSCAN. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) identifies clusters as dense regions separated by sparse regions. It can find arbitrarily shaped clusters and automatically labels outliers as noise. The key hyperparameters are eps (the neighbourhood radius) and min_samples (the minimum number of instances within that radius for an instance to count as a core instance).
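The K-Means, silhouette, and DBSCAN ideas above can be sketched together on synthetic data. This is a minimal illustration rather than the notebook's own code: the blob centres, the range of k, and the DBSCAN settings (eps=0.2, min_samples=5) are assumptions chosen to make the behaviour easy to see.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import silhouette_score

# Assumed toy data: four well-separated blobs at hand-picked centres.
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.6,
                  random_state=42)

# Fit K-Means for several values of k and record both quality measures.
inertias, silhouettes = {}, {}
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = kmeans.inertia_                          # sum of squared distances
    silhouettes[k] = silhouette_score(X, kmeans.labels_)   # mean coefficient

# Inertia keeps shrinking as k grows, so pick k by the silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("inertia by k:   ", {k: round(v, 1) for k, v in inertias.items()})
print("silhouette by k:", {k: round(v, 3) for k, v in silhouettes.items()})
print("k with the highest silhouette score:", best_k)

# DBSCAN on two interleaved half-moons, a shape K-Means handles poorly.
# eps and min_samples are assumed values, not tuned recommendations.
X_moons, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)

# DBSCAN marks noise instances with the label -1.
n_clusters = len(set(dbscan.labels_) - {-1})
n_noise = int(np.sum(dbscan.labels_ == -1))
print("DBSCAN clusters:", n_clusters, "| noise instances:", n_noise)
```

Plotting inertias against k would give the elbow curve; the silhouettes dictionary gives the same comparison as a single number per k.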
Gaussian Mixture Models. A GMM assumes the data was generated from a mixture of Gaussian distributions. The EM algorithm fits the parameters (means, covariances, mixing weights) by iterating expectation (soft assignments) and maximisation (parameter updates). GMMs support both hard and soft cluster assignments and can estimate the density of each point—useful for anomaly detection.
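As a rough sketch of how these pieces fit together in Scikit-Learn, the example below fits GaussianMixture models with different numbers of components, keeps the one with the lowest BIC, and then uses the estimated density to flag anomalies. The toy data, the candidate range of 1 to 6 components, and the 2% density threshold are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumed toy data: three well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=900, centers=[[-6, 0], [0, 6], [6, 0]],
                  cluster_std=1.0, random_state=42)

# Model selection: fit one GMM per candidate component count (EM runs
# inside fit) and keep the model with the lowest BIC.
candidates = [GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X)
              for k in range(1, 7)]
bics = [gm.bic(X) for gm in candidates]
best = candidates[int(np.argmin(bics))]
print("components chosen by BIC:", best.n_components)

# Hard assignments (one label per instance) and soft assignments
# (one probability per component; each row sums to 1).
hard_labels = best.predict(X)
soft_probs = best.predict_proba(X)

# Anomaly detection: score_samples returns the log density at each point.
# Flag the instances in the lowest 2% (an assumed threshold).
log_density = best.score_samples(X)
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]
print("instances flagged as anomalies:", len(anomalies))

# A fitted GMM is generative: sample() draws new instances from it.
X_new, y_new = best.sample(5)
```

gm.aic(X) can be swapped in for gm.bic(X) to select by AIC instead; the threshold percentile would normally be set from the expected anomaly rate in the application.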
Code examples
K-Means clustering:
Running this notebook
No external data download needed for early sections
The clustering examples use synthetic data (make_blobs) and the iris dataset (built into Scikit-Learn). The MNIST-based sections in the middle of the notebook call fetch_openml to download the dataset.