
Overview

The KMeansClustering class divides the training data into clusters before model training. This approach allows the system to train specialized models for each data segment, improving overall prediction accuracy.

Class: KMeansClustering

Location: source/data_preprocessing/clustering.py
Version: 1.0

Constructor

KMeansClustering(file_object, logger_object)
  • file_object (File, required): File object for logging operations
  • logger_object (Logger, required): Logger instance for tracking clustering operations

Methods

elbow_plot()

Determines the optimal number of clusters using the Elbow Method and KneeLocator algorithm.
elbow_plot(data)
  • data (pandas.DataFrame, required): Preprocessed feature data for clustering
  • Returns (int): Optimal number of clusters determined by the elbow method
How It Works:
  1. Tests cluster counts from 1 to 10
  2. Calculates WCSS (Within-Cluster Sum of Squares) for each
  3. Creates an elbow plot visualization
  4. Uses KneeLocator to programmatically identify the optimal point
Example Usage:
from data_preprocessing.clustering import KMeansClustering

kmeans_clustering = KMeansClustering(file_object, logger_object)
optimal_clusters = kmeans_clustering.elbow_plot(X_train)

print(f"Optimal number of clusters: {optimal_clusters}")
Implementation:
wcss = []
try:
    for i in range(1, 11):
        kmeans = KMeans(
            n_clusters=i, 
            init='k-means++', 
            random_state=42
        )
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    
    plt.plot(range(1, 11), wcss)
    plt.title('The Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.savefig('preprocessing_data/K-Means_Elbow.PNG')
    
    # Find optimal clusters programmatically
    self.kn = KneeLocator(
        range(1, 11), 
        wcss, 
        curve='convex', 
        direction='decreasing'
    )
    
    return self.kn.knee
except Exception as e:
    # Re-raise with context instead of discarding the original error
    raise Exception(f'elbow_plot failed: {e}') from e
Output Files:
  • preprocessing_data/K-Means_Elbow.PNG - Elbow plot visualization
The elbow plot helps visualize the point where adding more clusters provides diminishing returns.

create_clusters()

Creates clusters using K-Means algorithm and adds cluster labels to the dataset.
create_clusters(data, number_of_clusters)
  • data (pandas.DataFrame, required): Feature data to be clustered
  • number_of_clusters (int, required): Number of clusters to create (typically from elbow_plot())
  • Returns (pandas.DataFrame): Original DataFrame with an additional 'Cluster' column containing cluster assignments
Example Usage:
# Step 1: Find optimal clusters
optimal_k = kmeans_clustering.elbow_plot(X_train)

# Step 2: Create clusters
data_with_clusters = kmeans_clustering.create_clusters(X_train, optimal_k)

# Access cluster information
print(data_with_clusters['Cluster'].value_counts())
Implementation:
self.data = data
try:
    self.kmeans = KMeans(
        n_clusters=number_of_clusters, 
        init='k-means++', 
        random_state=42
    )
    
    # Divide data into clusters
    self.y_kmeans = self.kmeans.fit_predict(data)
    
    # Save KMeans model
    self.file_op = file_methods.File_Operation(
        self.file_object, 
        self.logger_object
    )
    self.save_model = self.file_op.save_model(self.kmeans, 'KMeans')
    
    # Add cluster column to dataset
    self.data['Cluster'] = self.y_kmeans
    
    return self.data
except Exception as e:
    # Re-raise with context instead of discarding the original error
    raise Exception(f'create_clusters failed: {e}') from e
Model Persistence: The trained KMeans model is automatically saved to models/KMeans/ for use during prediction.
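The project's file_methods.File_Operation handles this persistence internally. As a hedged illustration of the underlying idea, the sketch below saves a fitted KMeans model with the standard library's pickle module and reloads it to assign a new record to a cluster (the file path here is illustrative, not the project's actual models/KMeans/ layout):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Fit a small model on toy data with two well-separated blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(X)

# Save and reload the model (illustrative temp path)
model_path = os.path.join(tempfile.mkdtemp(), 'KMeans.sav')
with open(model_path, 'wb') as f:
    pickle.dump(kmeans, f)
with open(model_path, 'rb') as f:
    loaded = pickle.load(f)

# The reloaded model assigns new records to the same clusters as the original
new_record = np.array([[7.5, 8.2]])
cluster_id = loaded.predict(new_record)[0]
```

During prediction, this lets each incoming record be routed to the cluster-specific model that was trained for its segment.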

Complete Clustering Workflow

Here’s the typical clustering pipeline:
from data_preprocessing.clustering import KMeansClustering
import pandas as pd

# Initialize clustering object
kmeans = KMeansClustering(file_object, logger_object)

# Step 1: Determine optimal number of clusters
print("Finding optimal clusters...")
optimal_clusters = kmeans.elbow_plot(X_preprocessed)
print(f"Optimal clusters: {optimal_clusters}")

# Step 2: Create clusters and add to dataset
print("Creating clusters...")
data_clustered = kmeans.create_clusters(X_preprocessed, optimal_clusters)

# Step 3: Train separate models for each cluster
for cluster_num in data_clustered['Cluster'].unique():
    cluster_data = data_clustered[data_clustered['Cluster'] == cluster_num]
    print(f"Cluster {cluster_num}: {len(cluster_data)} samples")
    # Train model for this cluster
    # ...

Algorithm Details

K-Means++ Initialization

The implementation uses k-means++ initialization, which:
  • Selects initial cluster centers intelligently
  • Reduces the chance of poor clustering
  • Typically converges faster than random initialization
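A minimal pure-Python sketch of the k-means++ seeding idea (not scikit-learn's exact implementation): the first center is chosen at random, and each subsequent center is drawn with probability proportional to its squared distance from the nearest already-chosen center, so far-apart points are favored:

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centers, favoring points far from existing centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points]
        # Sample the next center with probability proportional to d2
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

# Two well-separated groups: the second seed almost always lands in the
# group the first seed did not come from
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = kmeans_pp_seeds(points, 2, random.Random(42))
```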

Random State

Fixed random_state=42 ensures:
  • Reproducible results across runs
  • Consistent cluster assignments
  • Easier debugging and comparison
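For instance, two independent fits with the same random_state produce identical cluster assignments (a small sanity-check sketch):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Same data, same seed: the two fits are fully reproducible
labels_a = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit_predict(X)

identical = (labels_a == labels_b).all()
```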

WCSS (Within-Cluster Sum of Squares)

The elbow method uses WCSS (accessed via kmeans.inertia_) which measures:
  • How compact the clusters are
  • Total within-cluster variance
  • Lower WCSS indicates tighter clusters
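Concretely, WCSS is the sum of squared distances from each point to its assigned cluster center, which is what kmeans.inertia_ reports. A minimal pure-Python sketch:

```python
def wcss(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    total = 0.0
    for (x, y), label in zip(points, labels):
        cx, cy = centers[label]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# Two tight clusters: every point sits at distance 1 from its center
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 1.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]

print(wcss(points, centers, labels))  # 4.0
```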

Use Cases

Fraud Type Segmentation

Group similar fraud patterns together for specialized detection models

Customer Segmentation

Cluster customers by behavior for targeted risk assessment

Model Specialization

Train separate models optimized for each cluster’s characteristics

Anomaly Detection

Identify outliers that don’t fit well into any cluster
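One way to operationalize this (a sketch, not part of the KMeansClustering class): after fitting, flag records whose distance to the nearest cluster center exceeds a threshold, such as a high percentile of all such distances:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 1, (100, 2)),   # normal cluster
    rng.normal(8, 1, (100, 2)),   # normal cluster
    [[30.0, 30.0]],               # obvious outlier at index 200
])

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42).fit(X)

# Distance from each record to its nearest cluster center
dist_to_center = kmeans.transform(X).min(axis=1)

# Flag records beyond the 99th percentile as potential anomalies
threshold = np.percentile(dist_to_center, 99)
outliers = np.where(dist_to_center > threshold)[0]
```

The percentile cutoff is an arbitrary illustrative choice; a production threshold would be tuned to the fraud domain's tolerance for false positives.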

Dependencies

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from kneed import KneeLocator
from file_operations import file_methods

Performance Considerations

K-Means complexity is O(n * k * i * d) where:
  • n = number of samples
  • k = number of clusters
  • i = number of iterations
  • d = number of dimensions
For large datasets, consider:
  • Using Mini-Batch K-Means for faster processing
  • Reducing dimensionality with PCA before clustering
  • Limiting the maximum number of iterations
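For example, scikit-learn's MiniBatchKMeans is a near drop-in replacement that updates centers from small random batches instead of the full dataset on every iteration (a sketch; batch_size here is an arbitrary illustrative value):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 8))

# Each update step uses a random batch of 256 samples, trading a small
# amount of cluster quality for a large speedup on big datasets
mbk = MiniBatchKMeans(n_clusters=4, init='k-means++', batch_size=256,
                      n_init=3, random_state=42)
labels = mbk.fit_predict(X)
```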
