
Overview

Clustering algorithms group similar data points together without labeled training data. bun-scikit provides implementations of popular clustering methods for various use cases.

• K-Means: partition-based clustering
• DBSCAN: density-based spatial clustering
• Hierarchical: agglomerative clustering
• Spectral: graph-based clustering

K-Means

K-Means partitions data into K clusters by minimizing within-cluster variance.

Basic Usage

import { KMeans } from "bun-scikit";

const X = [
  [0, 0], [0.1, -0.1], [-0.2, 0.1],    // Cluster 1
  [10, 10], [10.2, 9.9], [9.8, 10.1],  // Cluster 2
];

const kmeans = new KMeans({
  nClusters: 2,
  randomState: 42,
  nInit: 10,
  maxIter: 300,
});

kmeans.fit(X);

console.log("Cluster centers:", kmeans.clusterCenters_);
console.log("Labels:", kmeans.labels_);
console.log("Inertia:", kmeans.inertia_);
console.log("Iterations:", kmeans.nIter_);
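
Under the hood, each K-Means iteration alternates an assignment step (attach every point to its nearest centroid) and an update step (move each centroid to the mean of its points). For intuition, here is one such iteration sketched in plain TypeScript; this is an illustration only, not bun-scikit's implementation:

```typescript
type Point = number[];

function squaredDistance(a: Point, b: Point): number {
  return a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0);
}

// One Lloyd iteration: assign each point to its nearest centroid,
// then recompute each centroid as the mean of its assigned points.
function kmeansStep(X: Point[], centroids: Point[]): { labels: number[]; centroids: Point[] } {
  // Assignment step
  const labels = X.map((x) => {
    let best = 0;
    for (let c = 1; c < centroids.length; c++) {
      if (squaredDistance(x, centroids[c]) < squaredDistance(x, centroids[best])) best = c;
    }
    return best;
  });
  // Update step
  const next = centroids.map((_, c) => {
    const members = X.filter((_, i) => labels[i] === c);
    if (members.length === 0) return centroids[c]; // keep empty clusters in place
    return X[0].map((_, d) => members.reduce((s, p) => s + p[d], 0) / members.length);
  });
  return { labels, centroids: next };
}

const X: Point[] = [[0, 0], [0.2, 0], [10, 10], [9.8, 10]];
const { labels, centroids } = kmeansStep(X, [[0, 0], [10, 10]]);
console.log(labels);    // [0, 0, 1, 1]
console.log(centroids); // ≈ [[0.1, 0], [9.9, 10]]
```

Repeating this step until the centroids stop moving (within tolerance) is the whole algorithm; nInit restarts it from different random seeds and keeps the run with the lowest inertia.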

Configuration Options

• nClusters (number, default: 8): number of clusters to form
• nInit (number, default: 10): number of times the algorithm runs with different centroid seeds; the best result is kept
• maxIter (number, default: 300): maximum number of iterations for a single run
• tolerance (number, default: 1e-4): convergence tolerance for centroid movement
• randomState (number, default: undefined): random seed for reproducibility

Predicting New Samples

// Assign new points to existing clusters
const newPoints = [[0.5, 0.5], [10.5, 10.5]];
const labels = kmeans.predict(newPoints);
console.log("Assigned clusters:", labels);

// Get distances to all cluster centers
const distances = kmeans.transform(newPoints);
console.log("Distances to centers:", distances);

Attributes

After fitting, KMeans exposes:
  • clusterCenters_: Coordinates of cluster centers
  • labels_: Cluster label for each training sample
  • inertia_: Sum of squared distances to nearest cluster center
  • nIter_: Number of iterations run
  • nFeaturesIn_: Number of features in the input

Finding Optimal K

Use the elbow method to find the optimal number of clusters:
const inertias = [];
const kRange = [2, 3, 4, 5, 6, 7, 8];

for (const k of kRange) {
  const model = new KMeans({ nClusters: k, randomState: 42 });
  model.fit(X);
  inertias.push(model.inertia_!);
}

console.log("K values:", kRange);
console.log("Inertias:", inertias);
// Plot and look for the "elbow" point
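
The elbow can also be picked programmatically. One common heuristic, sketched here in plain TypeScript with hypothetical inertia values, chooses the k with the largest second difference of the inertia curve, i.e. the sharpest bend:

```typescript
// Pick the "elbow" as the k where the improvement drops off most sharply:
// the maximum second difference of the inertia curve.
function elbowK(kRange: number[], inertias: number[]): number {
  let bestK = kRange[1];
  let bestCurvature = -Infinity;
  for (let i = 1; i < kRange.length - 1; i++) {
    const curvature = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1];
    if (curvature > bestCurvature) {
      bestCurvature = curvature;
      bestK = kRange[i];
    }
  }
  return bestK;
}

// Hypothetical inertia values with a clear elbow at k = 3
const kRange = [2, 3, 4, 5, 6];
const inertias = [400, 100, 80, 65, 55];
console.log(elbowK(kRange, inertias)); // 3
```

This is a rough heuristic; for noisy inertia curves, inspecting the plot or using the silhouette score (covered below) is more reliable.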

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density and can identify outliers.

Basic Usage

import { DBSCAN } from "bun-scikit";

const X = [
  [0, 0], [0, 1], [1, 0],          // Dense cluster 1
  [10, 10], [10, 11], [11, 10],    // Dense cluster 2
  [5, 5],                           // Outlier
];

const dbscan = new DBSCAN({
  eps: 1.5,         // Maximum distance between neighbors
  minSamples: 2,    // Minimum points to form a dense region
});

dbscan.fit(X);

console.log("Labels:", dbscan.labels_);
// -1 indicates noise/outliers

console.log("Core sample indices:", dbscan.coreSampleIndices_);
console.log("Core samples:", dbscan.components_);

Configuration

• eps (number, default: 0.5): maximum distance between two samples for them to be considered neighbors
• minSamples (number, default: 5): number of samples in a neighborhood for a point to be considered a core point

Understanding Labels

dbscan.fit(X);

const labels = dbscan.labels_!;
const uniqueLabels = [...new Set(labels)];

console.log(`Found ${uniqueLabels.filter(l => l !== -1).length} clusters`);

const noisePoints = labels.filter(l => l === -1).length;
console.log(`${noisePoints} noise points detected`);

// Get points in each cluster
for (const label of uniqueLabels) {
  if (label === -1) continue;
  const clusterPoints = X.filter((_, i) => labels[i] === label);
  console.log(`Cluster ${label}: ${clusterPoints.length} points`);
}
DBSCAN is excellent for finding clusters of arbitrary shape and identifying outliers, but requires careful tuning of eps and minSamples parameters.
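
A common heuristic for choosing eps is the k-distance plot: compute each point's distance to its minSamples-th nearest neighbor and look for the knee in the sorted values. A plain-TypeScript sketch, independent of bun-scikit:

```typescript
// Distance from each point to its k-th nearest neighbor, sorted descending.
// A good eps candidate sits just below the "knee" of this curve.
function kDistances(X: number[][], k: number): number[] {
  const dist = (a: number[], b: number[]) =>
    Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));
  return X
    .map((p, i) => {
      const ds = X.filter((_, j) => j !== i)
        .map((q) => dist(p, q))
        .sort((a, b) => a - b);
      return ds[k - 1];
    })
    .sort((a, b) => b - a);
}

const X = [
  [0, 0], [0, 1], [1, 0],
  [10, 10], [10, 11], [11, 10],
  [5, 5],
];
console.log(kDistances(X, 2));
// The outlier at [5, 5] produces the largest k-distance (≈ 6.4);
// points inside the dense clusters sit at or below sqrt(2) ≈ 1.41.
```

Any eps between the outlier's k-distance and the in-cluster values (here, roughly 1.5 to 6) separates the two clusters and flags [5, 5] as noise.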

Hierarchical Clustering

Agglomerative clustering builds a hierarchy of clusters using a bottom-up approach.

Basic Usage

import { AgglomerativeClustering } from "bun-scikit";

const X = [
  [0, 0], [0, 1], [1, 0],
  [5, 5], [5, 6], [6, 5],
  [10, 10], [10, 11], [11, 10],
];

const agg = new AgglomerativeClustering({
  nClusters: 3,
  linkage: "ward",
  metric: "euclidean",
});

agg.fit(X);

console.log("Labels:", agg.labels_);
console.log("Number of leaves:", agg.nLeaves_);

Configuration

• nClusters (number, default: 2): number of clusters to find
• linkage ('ward' | 'complete' | 'average' | 'single', default: 'ward'): linkage criterion
  • ward: merges the pair of clusters that minimizes the increase in within-cluster variance
  • complete: uses the maximum distance between points in two clusters
  • average: uses the average distance between points in two clusters
  • single: uses the minimum distance between points in two clusters
• metric (string, default: 'euclidean'): distance metric to use (ward linkage is only defined for 'euclidean')
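
To see what the bottom-up merging actually does, here is a naive single-linkage implementation in plain TypeScript: every point starts in its own cluster, and the two closest clusters (closest meaning minimum pairwise point distance) are merged until nClusters remain. This is for intuition only; real implementations avoid this sketch's O(n³) cost:

```typescript
// Naive single-linkage agglomerative clustering over point indices.
function singleLinkage(X: number[][], nClusters: number): number[][] {
  const dist = (a: number[], b: number[]) =>
    Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));
  // Each cluster is a list of point indices; start with singletons.
  const clusters: number[][] = X.map((_, i) => [i]);
  while (clusters.length > nClusters) {
    // Find the pair of clusters with the smallest minimum point distance.
    let best: [number, number] = [0, 1];
    let bestD = Infinity;
    for (let a = 0; a < clusters.length; a++) {
      for (let b = a + 1; b < clusters.length; b++) {
        for (const i of clusters[a]) {
          for (const j of clusters[b]) {
            const d = dist(X[i], X[j]);
            if (d < bestD) { bestD = d; best = [a, b]; }
          }
        }
      }
    }
    // Merge the closest pair.
    const [a, b] = best;
    clusters[a] = clusters[a].concat(clusters[b]);
    clusters.splice(b, 1);
  }
  return clusters;
}

const X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 10]];
console.log(singleLinkage(X, 3)); // → [[0, 1], [2, 3], [4]]
```

The other linkage criteria only change how the distance between two clusters is scored (maximum, average, or variance increase instead of minimum).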

Distance Threshold

Instead of specifying number of clusters, use distance threshold:
const aggDist = new AgglomerativeClustering({
  nClusters: null,
  distanceThreshold: 5.0,
  linkage: "average",
});

aggDist.fit(X);

// Number of clusters is determined by distance threshold
const nClusters = new Set(aggDist.labels_).size;
console.log(`Found ${nClusters} clusters`);

Spectral Clustering

Spectral clustering uses graph-based methods and works well for non-convex clusters.

Basic Usage

import { SpectralClustering } from "bun-scikit";

const X = [
  [0, 0], [0, 1], [1, 0],
  [10, 10], [10, 11], [11, 10],
];

const spectral = new SpectralClustering({
  nClusters: 2,
  affinity: "rbf",
  gamma: 1.0,
  randomState: 42,
});

spectral.fit(X);

console.log("Labels:", spectral.labels_);

Configuration

• nClusters (number, default: 8): number of clusters
• affinity ('rbf' | 'nearest_neighbors' | 'precomputed', default: 'rbf'): how to construct the affinity matrix
• gamma (number, default: 1.0): kernel coefficient for 'rbf' affinity

BIRCH

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is memory-efficient for large datasets.

Basic Usage

import { Birch } from "bun-scikit";

const birch = new Birch({
  nClusters: 3,
  threshold: 0.5,
  branchingFactor: 50,
});

birch.fit(X);

console.log("Labels:", birch.labels_);
console.log("Subcluster centers:", birch.subclusterCenters_);

OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is related to DBSCAN but, instead of committing to a single eps, orders points by density and records each point's reachability distance, producing a reachability plot from which clusters at multiple density scales can be extracted.

Basic Usage

import { OPTICS } from "bun-scikit";

const optics = new OPTICS({
  minSamples: 5,
  maxEps: 2.0,
  metric: "euclidean",
});

optics.fit(X);

console.log("Labels:", optics.labels_);
console.log("Reachability:", optics.reachability_);
console.log("Ordering:", optics.ordering_);
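
One simple way to turn a reachability plot into flat clusters is to cut it at a fixed threshold: walking the points in OPTICS order, a reachability value above the cut starts a new cluster. The sketch below is plain TypeScript with hypothetical reachability values, not a bun-scikit API:

```typescript
// Cut the reachability plot at a fixed threshold. Points are visited in
// OPTICS order; a reachability spike above the cut opens a new cluster.
function clustersFromReachability(
  reachability: number[],
  ordering: number[],
  cut: number
): number[] {
  const labels: number[] = new Array(reachability.length).fill(-1);
  let current = -1;
  for (const idx of ordering) {
    if (reachability[idx] > cut) {
      current++; // spike above the cut: start of a new cluster
    }
    labels[idx] = current;
  }
  return labels;
}

// Hypothetical reachability values in OPTICS order: two valleys,
// separated by a spike (the first point's reachability is Infinity).
const ordering = [0, 1, 2, 3, 4, 5];
const reachability = [Infinity, 0.3, 0.4, 5.0, 0.2, 0.3];
console.log(clustersFromReachability(reachability, ordering, 2.0));
// → [0, 0, 0, 1, 1, 1]
```

This is equivalent to running DBSCAN at eps = cut; the advantage of OPTICS is that one fit supports any cut value after the fact.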

Clustering Evaluation

Silhouette Score

Measure clustering quality:
import { silhouetteScore } from "bun-scikit";

kmeans.fit(X);
const score = silhouetteScore(X, kmeans.labels_!);
console.log("Silhouette score:", score); // -1 to 1, higher is better
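
For intuition, the silhouette coefficient can be computed by hand: for each sample, a is its mean distance to points in its own cluster and b its mean distance to the nearest other cluster, giving a per-sample score of (b - a) / max(a, b). A plain-TypeScript sketch, which assumes every cluster has at least two points:

```typescript
// Mean silhouette coefficient over all samples.
function silhouette(X: number[][], labels: number[]): number {
  const dist = (a: number[], b: number[]) =>
    Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));
  const clusters = [...new Set(labels)];
  const scores = X.map((x, i) => {
    // Mean distance from x to the members of cluster c (excluding x itself)
    const meanTo = (c: number) => {
      const members = X.filter((_, j) => labels[j] === c && j !== i);
      return members.reduce((s, p) => s + dist(x, p), 0) / members.length;
    };
    const a = meanTo(labels[i]);
    const b = Math.min(...clusters.filter((c) => c !== labels[i]).map(meanTo));
    return (b - a) / Math.max(a, b);
  });
  return scores.reduce((s, v) => s + v, 0) / scores.length;
}

const X = [[0, 0], [0, 1], [10, 10], [10, 11]];
console.log(silhouette(X, [0, 0, 1, 1])); // close to 1: tight, well-separated clusters
```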

Calinski-Harabasz Index

import { calinskiHarabaszScore } from "bun-scikit";

const chScore = calinskiHarabaszScore(X, kmeans.labels_!);
console.log("Calinski-Harabasz score:", chScore); // Higher is better

Davies-Bouldin Index

import { daviesBouldinScore } from "bun-scikit";

const dbScore = daviesBouldinScore(X, kmeans.labels_!);
console.log("Davies-Bouldin score:", dbScore); // Lower is better

Common Patterns

Clustering Pipeline

import { Pipeline, StandardScaler, KMeans } from "bun-scikit";

const pipe = new Pipeline([
  ["scaler", new StandardScaler()],
  ["kmeans", new KMeans({ nClusters: 3 })],
]);

pipe.fit(X);

// Access the fitted KMeans
const kmeans = pipe.namedSteps_.get("kmeans") as KMeans;
console.log("Centers:", kmeans.clusterCenters_);

Feature Scaling

import { StandardScaler } from "bun-scikit";

// Always scale features before clustering
const scaler = new StandardScaler();
const X_scaled = scaler.fitTransform(X);

const kmeans = new KMeans({ nClusters: 3 });
kmeans.fit(X_scaled);

Comparing Multiple Algorithms

const algorithms: [string, KMeans | DBSCAN | AgglomerativeClustering][] = [
  ["K-Means", new KMeans({ nClusters: 3, randomState: 42 })],
  ["DBSCAN", new DBSCAN({ eps: 0.5, minSamples: 5 })],
  ["Agglomerative", new AgglomerativeClustering({ nClusters: 3 })],
];

for (const [name, algo] of algorithms) {
  algo.fit(X);
  const score = silhouetteScore(X, algo.labels_!);
  console.log(`${name}: silhouette = ${score.toFixed(4)}`);
}

Performance Tips

  • K-Means: Fast, works well with spherical clusters, requires K to be specified
  • DBSCAN: Handles arbitrary shapes, finds outliers, sensitive to parameters
  • Hierarchical: No need to specify K upfront, computationally expensive for large datasets
  • Spectral: Works with non-convex clusters, slower than K-Means
  • BIRCH: Memory-efficient for very large datasets
Always standardize features before clustering:
const scaler = new StandardScaler();
const X_scaled = scaler.fitTransform(X);
This ensures all features contribute equally to distance calculations.
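
A quick demonstration of why this matters, using a hand-rolled standardizer rather than the library's StandardScaler: when one feature spans a much larger numeric range, it dominates the Euclidean distance.

```typescript
// Income in dollars dwarfs age in years: the unscaled Euclidean distance
// is driven almost entirely by the income column.
const points = [
  [25, 30_000], // [age, income]
  [60, 31_000],
];

const dist = (a: number[], b: number[]) =>
  Math.sqrt(a.reduce((s, ai, i) => s + (ai - b[i]) ** 2, 0));

console.log(dist(points[0], points[1])); // ≈ 1000.6, age barely registers

// Hand-rolled standardization: zero mean, unit variance per column
function standardize(X: number[][]): number[][] {
  const dims = X[0].length;
  const mean = Array.from({ length: dims }, (_, d) =>
    X.reduce((s, p) => s + p[d], 0) / X.length
  );
  const std = Array.from({ length: dims }, (_, d) =>
    Math.sqrt(X.reduce((s, p) => s + (p[d] - mean[d]) ** 2, 0) / X.length)
  );
  return X.map((p) => p.map((v, d) => (v - mean[d]) / std[d]));
}

const scaled = standardize(points);
console.log(dist(scaled[0], scaled[1])); // both features now contribute equally
```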
For high-dimensional data, reduce dimensions first:
import { PCA } from "bun-scikit";

const pca = new PCA({ nComponents: 10 });
const X_reduced = pca.fitTransform(X);

kmeans.fit(X_reduced);

Use Cases

Customer Segmentation

Group customers by behavior patterns using K-Means

Anomaly Detection

Find outliers with DBSCAN’s noise detection

Image Segmentation

Partition images using spectral clustering

Document Clustering

Group similar documents with hierarchical clustering

Next Steps

Dimensionality Reduction

Use PCA before clustering

Model Selection

Evaluate clustering quality
