Overview
The KMeansClustering class divides the training data into clusters before model training. This approach allows the system to train specialized models for each data segment, improving overall prediction accuracy.
Class: KMeansClustering
Location: source/data_preprocessing/clustering.py
Version: 1.0
Constructor
KMeansClustering(file_object, logger_object)
File object for logging operations
Logger instance for tracking clustering operations
Methods
elbow_plot()
Determines the optimal number of clusters using the Elbow Method and KneeLocator algorithm.
Preprocessed feature data for clustering
Optimal number of clusters determined by the elbow method
How It Works:
Tests cluster counts from 1 to 10
Calculates WCSS (Within-Cluster Sum of Squares) for each
Creates an elbow plot visualization
Uses KneeLocator to programmatically identify the optimal point
Example Usage:
from data_preprocessing.clustering import KMeansClustering

kmeans_clustering = KMeansClustering(file_object, logger_object)
optimal_clusters = kmeans_clustering.elbow_plot(X_train)
print(f"Optimal number of clusters: {optimal_clusters}")
Implementation:
wcss = []
try:
    for i in range(1, 11):
        kmeans = KMeans(
            n_clusters=i,
            init='k-means++',
            random_state=42
        )
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    plt.plot(range(1, 11), wcss)
    plt.title('The Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.savefig('preprocessing_data/K-Means_Elbow.PNG')
    # Find optimal clusters programmatically
    self.kn = KneeLocator(
        range(1, 11),
        wcss,
        curve='convex',
        direction='decreasing'
    )
    return self.kn.knee
except Exception as e:
    # Preserve the original error instead of raising a bare Exception()
    raise Exception(f'elbow_plot failed: {e}') from e
Output Files:
preprocessing_data/K-Means_Elbow.PNG - Elbow plot visualization
The elbow plot helps visualize the point where adding more clusters provides diminishing returns
create_clusters()
Creates clusters using K-Means algorithm and adds cluster labels to the dataset.
create_clusters(data, number_of_clusters)
Feature data to be clustered
Number of clusters to create (typically from elbow_plot())
Original DataFrame with additional ‘Cluster’ column containing cluster assignments
Example Usage:
# Step 1: Find optimal clusters
optimal_k = kmeans_clustering.elbow_plot(X_train)

# Step 2: Create clusters
data_with_clusters = kmeans_clustering.create_clusters(X_train, optimal_k)

# Access cluster information
print(data_with_clusters['Cluster'].value_counts())
Implementation:
self.data = data
try:
    self.kmeans = KMeans(
        n_clusters=number_of_clusters,
        init='k-means++',
        random_state=42
    )
    # Divide data into clusters
    self.y_kmeans = self.kmeans.fit_predict(data)
    # Save KMeans model
    self.file_op = file_methods.File_Operation(
        self.file_object,
        self.logger_object
    )
    self.save_model = self.file_op.save_model(self.kmeans, 'KMeans')
    # Add cluster column to dataset
    self.data['Cluster'] = self.y_kmeans
    return self.data
except Exception as e:
    # Preserve the original error instead of raising a bare Exception()
    raise Exception(f'create_clusters failed: {e}') from e
Model Persistence:
The trained KMeans model is automatically saved to models/KMeans/ for use during prediction.
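The exact on-disk format used by File_Operation is project-specific. As a minimal sketch, assuming the model is pickled with joblib (a common scikit-learn convention), loading the persisted KMeans back at prediction time to route new records to their clusters might look like:

```python
# Sketch: persisting and reloading a KMeans model for prediction.
# Assumes joblib pickling; the project's File_Operation helper may differ.
import os
import tempfile

import numpy as np
import joblib
from sklearn.cluster import KMeans

# Fit and save a small model (stands in for create_clusters()).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
model = KMeans(n_clusters=2, init='k-means++', random_state=42, n_init=10).fit(X)
path = os.path.join(tempfile.mkdtemp(), 'KMeans.joblib')
joblib.dump(model, path)

# At prediction time: load the model and assign new records to clusters,
# so each record can be scored by its cluster's specialized model.
loaded = joblib.load(path)
new_records = np.array([[0.05, 0.1], [5.1, 5.0]])
clusters = loaded.predict(new_records)
print(clusters)  # the two records land in different clusters
```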
Complete Clustering Workflow
Here’s the typical clustering pipeline:
from data_preprocessing.clustering import KMeansClustering
import pandas as pd

# Initialize clustering object
kmeans = KMeansClustering(file_object, logger_object)

# Step 1: Determine optimal number of clusters
print("Finding optimal clusters...")
optimal_clusters = kmeans.elbow_plot(X_preprocessed)
print(f"Optimal clusters: {optimal_clusters}")

# Step 2: Create clusters and add to dataset
print("Creating clusters...")
data_clustered = kmeans.create_clusters(X_preprocessed, optimal_clusters)

# Step 3: Train separate models for each cluster
for cluster_num in data_clustered['Cluster'].unique():
    cluster_data = data_clustered[data_clustered['Cluster'] == cluster_num]
    print(f"Cluster {cluster_num}: {len(cluster_data)} samples")
    # Train model for this cluster
    # ...
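The per-cluster training step elided above might look like the following self-contained sketch. The classifier choice (RandomForestClassifier) and the synthetic labels are illustrative assumptions, not the project's actual training code:

```python
# Sketch: one specialized model per cluster. Classifier and labels are
# illustrative assumptions; the real pipeline trains its own models.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=['f1', 'f2', 'f3'])
y = pd.Series(rng.integers(0, 2, size=60))

# Cluster the features, as create_clusters() does.
X['Cluster'] = KMeans(
    n_clusters=2, init='k-means++', random_state=42, n_init=10
).fit_predict(X)

# Train one model per cluster on that cluster's rows only.
cluster_models = {}
for cluster_num in X['Cluster'].unique():
    mask = X['Cluster'] == cluster_num
    features = X.loc[mask].drop(columns=['Cluster'])
    labels = y.loc[mask]
    cluster_models[cluster_num] = RandomForestClassifier(
        random_state=42
    ).fit(features, labels)

print(len(cluster_models))  # one model per cluster
```

At prediction time, a record is first assigned to a cluster by the saved KMeans model, then scored by that cluster's model.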
Algorithm Details
K-Means++ Initialization
The implementation uses k-means++ initialization, which:
Selects initial cluster centers intelligently
Reduces the chance of poor clustering
Typically converges faster than random initialization
Random State
Fixed random_state=42 ensures:
Reproducible results across runs
Consistent cluster assignments
Easier debugging and comparison
WCSS (Within-Cluster Sum of Squares)
The elbow method uses WCSS (accessed via kmeans.inertia_) which measures:
How compact the clusters are
Total within-cluster variance
Lower WCSS indicates tighter clusters
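The equivalence between `kmeans.inertia_` and the within-cluster sum of squares can be checked directly, as in this small sketch:

```python
# Sketch: kmeans.inertia_ equals the sum of squared distances from each
# sample to its assigned cluster center (the WCSS used by the elbow method).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
km = KMeans(n_clusters=2, init='k-means++', random_state=42, n_init=10).fit(X)

# Recompute WCSS by hand from the labels and centers.
manual_wcss = sum(
    np.sum((X[km.labels_ == c] - km.cluster_centers_[c]) ** 2)
    for c in range(km.n_clusters)
)
print(round(km.inertia_, 6), round(manual_wcss, 6))  # 1.0 1.0
```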
Use Cases
Fraud Type Segmentation Group similar fraud patterns together for specialized detection models
Customer Segmentation Cluster customers by behavior for targeted risk assessment
Model Specialization Train separate models optimized for each cluster’s characteristics
Anomaly Detection Identify outliers that don’t fit well into any cluster
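The anomaly-detection use case can be sketched with `KMeans.transform()`, which returns each sample's distance to every cluster center; points far from all centers are outlier candidates. The distance threshold below is an assumption chosen for the toy data, not a value from the project:

```python
# Sketch: flagging records that sit far from every cluster center.
# The threshold of 3.0 is an illustrative assumption for this toy data.
import numpy as np
from sklearn.cluster import KMeans

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
km = KMeans(n_clusters=2, init='k-means++', random_state=42, n_init=10).fit(X_train)

# transform() gives distances to all centers; the minimum is the
# distance to the sample's own (nearest) cluster.
new_points = np.array([[0.1, 0.1], [20.0, 20.0]])
dist_to_nearest = km.transform(new_points).min(axis=1)
outlier_mask = dist_to_nearest > 3.0
print(outlier_mask)  # only the point at (20, 20) is flagged
```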
Dependencies
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from kneed import KneeLocator
from file_operations import file_methods
K-Means complexity is O(n * k * i * d) where:
n = number of samples
k = number of clusters
i = number of iterations
d = number of dimensions
For large datasets, consider:
Using Mini-Batch K-Means for faster processing
Reducing dimensionality with PCA before clustering
Limiting the maximum number of iterations
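The first two suggestions can be combined in a scikit-learn pipeline, sketched below on synthetic data; the component counts and batch size are illustrative assumptions:

```python
# Sketch: PCA for dimensionality reduction followed by MiniBatchKMeans,
# a faster alternative to full KMeans on large datasets.
# n_components, batch_size, and n_clusters are illustrative assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 50))  # synthetic wide dataset

pipeline = make_pipeline(
    PCA(n_components=10, random_state=42),  # 50 -> 10 dimensions
    MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=42, n_init=10),
)
labels = pipeline.fit_predict(X)
print(labels.shape)  # one cluster label per sample
```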