
Overview

The KMeansClustering class divides the training data into clusters before model training. This approach allows the system to train specialized models for each data segment, improving overall prediction accuracy.

Class: KMeansClustering

Location: source/data_preprocessing/clustering.py
Version: 1.0

Constructor

KMeansClustering(file_object, logger_object)
  • file_object (File, required): File object for logging operations
  • logger_object (Logger, required): Logger instance for tracking clustering operations

Methods

elbow_plot()

Determines the optimal number of clusters using the Elbow Method and KneeLocator algorithm.
elbow_plot(data)
  • data (pandas.DataFrame, required): Preprocessed feature data for clustering
  • Returns (int): Optimal number of clusters determined by the elbow method
How It Works:
  1. Tests cluster counts from 1 to 10
  2. Calculates WCSS (Within-Cluster Sum of Squares) for each
  3. Creates an elbow plot visualization
  4. Uses KneeLocator to programmatically identify the optimal point
Example Usage:
from data_preprocessing.clustering import KMeansClustering

kmeans_clustering = KMeansClustering(file_object, logger_object)
optimal_clusters = kmeans_clustering.elbow_plot(X_train)

print(f"Optimal number of clusters: {optimal_clusters}")
Implementation:
wcss = []
try:
    for i in range(1, 11):
        kmeans = KMeans(
            n_clusters=i, 
            init='k-means++', 
            random_state=42
        )
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    
    plt.plot(range(1, 11), wcss)
    plt.title('The Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.savefig('preprocessing_data/K-Means_Elbow.PNG')
    
    # Find optimal clusters programmatically
    self.kn = KneeLocator(
        range(1, 11), 
        wcss, 
        curve='convex', 
        direction='decreasing'
    )
    
    return self.kn.knee
except Exception as e:
    # Re-raise with context instead of discarding the original error
    raise Exception(f'elbow_plot failed: {e}') from e
Output Files:
  • preprocessing_data/K-Means_Elbow.PNG - Elbow plot visualization
The elbow plot helps visualize the point where adding more clusters provides diminishing returns.

create_clusters()

Creates clusters using K-Means algorithm and adds cluster labels to the dataset.
create_clusters(data, number_of_clusters)
  • data (pandas.DataFrame, required): Feature data to be clustered
  • number_of_clusters (int, required): Number of clusters to create (typically from elbow_plot())
  • Returns (pandas.DataFrame): Original DataFrame with an additional 'Cluster' column containing cluster assignments
Example Usage:
# Step 1: Find optimal clusters
optimal_k = kmeans_clustering.elbow_plot(X_train)

# Step 2: Create clusters
data_with_clusters = kmeans_clustering.create_clusters(X_train, optimal_k)

# Access cluster information
print(data_with_clusters['Cluster'].value_counts())
Implementation:
self.data = data
try:
    self.kmeans = KMeans(
        n_clusters=number_of_clusters, 
        init='k-means++', 
        random_state=42
    )
    
    # Divide data into clusters
    self.y_kmeans = self.kmeans.fit_predict(data)
    
    # Save KMeans model
    self.file_op = file_methods.File_Operation(
        self.file_object, 
        self.logger_object
    )
    self.save_model = self.file_op.save_model(self.kmeans, 'KMeans')
    
    # Add cluster column to dataset
    self.data['Cluster'] = self.y_kmeans
    
    return self.data
except Exception as e:
    # Re-raise with context instead of discarding the original error
    raise Exception(f'create_clusters failed: {e}') from e
Model Persistence: The trained KMeans model is automatically saved to models/KMeans/ for use during prediction.
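The project's file_methods.File_Operation handles this persistence internally. As a hedged illustration of the underlying idea, the sketch below saves a fitted KMeans model with the standard library's pickle module and reloads it to assign a new record to a cluster (the file path here is illustrative, not the project's actual models/KMeans/ layout):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Fit a small model on toy data with two well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(X)

# Save and reload the model (illustrative temp path)
model_path = os.path.join(tempfile.mkdtemp(), 'KMeans.sav')
with open(model_path, 'wb') as f:
    pickle.dump(kmeans, f)
with open(model_path, 'rb') as f:
    loaded = pickle.load(f)

# The reloaded model assigns new records to the same clusters as the original
new_record = np.array([[7.5, 8.2]])
cluster_id = loaded.predict(new_record)[0]
```

During prediction, this lets each incoming record be routed to the cluster-specific model that was trained for its segment.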

Complete Clustering Workflow

Here’s the typical clustering pipeline:
from data_preprocessing.clustering import KMeansClustering
import pandas as pd

# Initialize clustering object
kmeans = KMeansClustering(file_object, logger_object)

# Step 1: Determine optimal number of clusters
print("Finding optimal clusters...")
optimal_clusters = kmeans.elbow_plot(X_preprocessed)
print(f"Optimal clusters: {optimal_clusters}")

# Step 2: Create clusters and add to dataset
print("Creating clusters...")
data_clustered = kmeans.create_clusters(X_preprocessed, optimal_clusters)

# Step 3: Train separate models for each cluster
for cluster_num in data_clustered['Cluster'].unique():
    cluster_data = data_clustered[data_clustered['Cluster'] == cluster_num]
    print(f"Cluster {cluster_num}: {len(cluster_data)} samples")
    # Train model for this cluster
    # ...

Algorithm Details

K-Means++ Initialization

The implementation uses k-means++ initialization, which:
  • Selects initial cluster centers intelligently
  • Reduces the chance of poor clustering
  • Typically converges faster than random initialization
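A minimal pure-Python sketch of the k-means++ seeding idea (not scikit-learn's exact implementation): the first center is chosen at random, and each subsequent center is drawn with probability proportional to its squared distance from the nearest already-chosen center, so far-apart points are favored:

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centers, favoring points far from existing centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points]
        # Sample the next center with probability proportional to d2
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Two well-separated groups: the second seed almost always lands in the
# group the first seed did not come from
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = kmeans_pp_seeds(points, 2, random.Random(42))
```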

Random State

Fixed random_state=42 ensures:
  • Reproducible results across runs
  • Consistent cluster assignments
  • Easier debugging and comparison
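For instance, two independent fits with the same random_state produce identical cluster assignments (a small sanity-check sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Same data, same seed: the two fits are fully reproducible
labels_a = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit_predict(X)

identical = (labels_a == labels_b).all()
```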

WCSS (Within-Cluster Sum of Squares)

The elbow method uses WCSS (accessed via kmeans.inertia_) which measures:
  • How compact the clusters are
  • Total within-cluster variance
  • Lower WCSS indicates tighter clusters
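Concretely, WCSS is the sum of squared distances from each point to its assigned cluster center, which is what kmeans.inertia_ reports. A minimal pure-Python sketch:

```python
def wcss(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centers[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Two tight clusters: every point sits at distance 1 from its center
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 1.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]

print(wcss(points, centers, labels))  # 4.0
```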

Use Cases

Fraud Type Segmentation

Group similar fraud patterns together for specialized detection models

Customer Segmentation

Cluster customers by behavior for targeted risk assessment

Model Specialization

Train separate models optimized for each cluster’s characteristics

Anomaly Detection

Identify outliers that don’t fit well into any cluster
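One way to operationalize this (a sketch, not part of the KMeansClustering class): after fitting, flag records whose distance to the nearest cluster center exceeds a threshold, such as a high percentile of all such distances:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 1, (100, 2)),   # normal cluster
    rng.normal(8, 1, (100, 2)),   # normal cluster
    [[30.0, 30.0]],               # obvious outlier at index 200
])

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(X)

# Distance from each record to its nearest cluster center
dist_to_center = kmeans.transform(X).min(axis=1)

# Flag records beyond the 99th percentile as potential anomalies
threshold = np.percentile(dist_to_center, 99)
outliers = np.where(dist_to_center > threshold)[0]
```

The percentile cutoff is an arbitrary illustrative choice; a production threshold would be tuned to the fraud domain's tolerance for false positives.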

Dependencies

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from kneed import KneeLocator
from file_operations import file_methods

Performance Considerations

K-Means complexity is O(n * k * i * d) where:
  • n = number of samples
  • k = number of clusters
  • i = number of iterations
  • d = number of dimensions
For large datasets, consider:
  • Using Mini-Batch K-Means for faster processing
  • Reducing dimensionality with PCA before clustering
  • Limiting the maximum number of iterations
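For example, scikit-learn's MiniBatchKMeans is a near drop-in replacement that updates centers from small random batches instead of the full dataset on every iteration (a sketch; batch_size here is an arbitrary illustrative value):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 8))

# Each update step uses a random batch of 256 samples, trading a small
# amount of cluster quality for a large speedup on big datasets
mbk = MiniBatchKMeans(n_clusters=4, init='k-means++', batch_size=256,
                      n_init=3, random_state=42)
labels = mbk.fit_predict(X)
```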
