Overview
The KMeansClustering class divides the training data into clusters before model training. This lets the system train a specialized model for each data segment, which can improve overall prediction accuracy.
Class: KMeansClustering
Location: source/data_preprocessing/clustering.py
Version: 1.0
Constructor
Parameters:
- File object for logging operations
- Logger instance for tracking clustering operations
Methods
elbow_plot()
Determines the optimal number of clusters using the Elbow Method and the KneeLocator algorithm.
Parameter: preprocessed feature data for clustering
Returns: optimal number of clusters determined by the elbow method
- Tests cluster counts from 1 to 10
- Calculates WCSS (Within-Cluster Sum of Squares) for each cluster count
- Creates an elbow plot visualization
- Uses KneeLocator to programmatically identify the optimal point
preprocessing_data/K-Means_Elbow.PNG - Elbow plot visualization
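The elbow search described above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: the function names compute_wcss and find_elbow, the synthetic data, and n_init=10 are all hypothetical, the plot is omitted, and a simple "farthest point below the chord" rule stands in for KneeLocator so that the sketch depends only on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans


def compute_wcss(X, max_k=10):
    """Fit K-Means for k = 1..max_k and return the WCSS (inertia) of each fit."""
    wcss = []
    for k in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    return wcss


def find_elbow(wcss):
    """Return the k whose WCSS lies farthest below the straight line (chord)
    joining the first and last points of the curve.

    Assumption: a simplified stand-in for KneeLocator, not its algorithm.
    """
    ks = np.arange(1, len(wcss) + 1, dtype=float)
    y = np.asarray(wcss, dtype=float)
    # Normalise both axes to [0, 1] so neither scale dominates the comparison.
    x_n = (ks - ks[0]) / (ks[-1] - ks[0])
    y_n = (y - y.min()) / (y.max() - y.min())
    # Height of the chord at each k, minus the height of the curve.
    chord = y_n[0] + (y_n[-1] - y_n[0]) * x_n
    return int(ks[np.argmax(chord - y_n)])


# Synthetic data: three well-separated 2-D blobs (purely illustrative).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

wcss = compute_wcss(X)
optimal_k = find_elbow(wcss)
print(optimal_k)
```

With the kneed package installed, a call along the lines of KneeLocator(list(range(1, 11)), wcss, curve="convex", direction="decreasing").knee would normally take the place of find_elbow.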
create_clusters()
Creates clusters using the K-Means algorithm and adds cluster labels to the dataset.
Parameters:
- Feature data to be clustered
- Number of clusters to create (typically from elbow_plot())
Returns: original DataFrame with an additional ‘Cluster’ column containing cluster assignments
The fitted K-Means model is saved under models/KMeans/ for use during prediction.
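A minimal sketch of this behaviour is shown below. The function name create_clusters_sketch, the model_dir parameter, the pickle format, and the file name KMeans.pkl are assumptions made for illustration; the source only states that the model is stored under models/KMeans/. The sketch also returns a copy rather than modifying the input, which may differ from the real method.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def create_clusters_sketch(data, number_of_clusters, model_dir):
    """Fit K-Means, save the fitted model, and return a copy of `data`
    with a 'Cluster' column holding each row's cluster assignment."""
    kmeans = KMeans(
        n_clusters=number_of_clusters, init="k-means++", n_init=10, random_state=42
    )
    labels = kmeans.fit_predict(data)

    # Persist the fitted model so new rows can later be assigned to the same clusters.
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / "KMeans.pkl", "wb") as f:  # hypothetical file name
        pickle.dump(kmeans, f)

    clustered = data.copy()
    clustered["Cluster"] = labels
    return clustered


# Illustrative data: two separated groups with made-up feature names.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))]),
    columns=["feature_a", "feature_b"],
)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "KMeans"
    clustered = create_clusters_sketch(features, 2, target)
    # Reload the model the way a prediction step might.
    with open(target / "KMeans.pkl", "rb") as f:
        reloaded = pickle.load(f)
    reloaded_labels = reloaded.predict(features)
```

Saving the fitted model matters because prediction has to route each new record to the same cluster definitions that were used during training.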
Complete Clustering Workflow
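The sketch below illustrates how clustering can feed per-cluster model training and prediction routing. It is a hedged example rather than the project's pipeline: the feature names, the synthetic labels, the fixed cluster count, and the choice of LogisticRegression as the per-cluster model are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic training data: two segments with made-up features and labels.
rng = np.random.default_rng(42)
X = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))]),
    columns=["amount", "frequency"],
)
y = (X["amount"] > X["frequency"]).astype(int)

# Step 1: number of clusters. In the real pipeline this would come from
# elbow_plot(); it is fixed here to keep the sketch short.
number_of_clusters = 2

# Step 2: assign every training row to a cluster.
kmeans = KMeans(n_clusters=number_of_clusters, init="k-means++", n_init=10, random_state=42)
data = X.copy()
data["Cluster"] = kmeans.fit_predict(X)
data["Label"] = y

# Step 3: train one model per cluster (LogisticRegression is a placeholder).
models = {}
for cluster_id, group in data.groupby("Cluster"):
    clf = LogisticRegression()
    clf.fit(group[["amount", "frequency"]], group["Label"])
    models[int(cluster_id)] = clf

# Prediction: route each new row to the model trained for its cluster.
new_rows = pd.DataFrame({"amount": [0.5, 6.5], "frequency": [-0.5, 5.0]})
assigned = kmeans.predict(new_rows)
predictions = [
    int(models[int(c)].predict(new_rows.iloc[[i]])[0]) for i, c in enumerate(assigned)
]
```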
In the typical pipeline, elbow_plot() chooses the number of clusters and create_clusters() then uses that number to label the data.
Algorithm Details
K-Means++ Initialization
The implementation uses k-means++ initialization, which:
- Selects initial cluster centers intelligently
- Reduces the chance of converging to a poor local optimum
- Typically converges faster than random initialization
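To make "intelligent" seeding concrete, here is a minimal NumPy sketch of the basic k-means++ idea: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function name kmeans_plus_plus_init is hypothetical, and scikit-learn's own implementation differs in detail, so treat this as an explanation of the principle only.

```python
import numpy as np


def kmeans_plus_plus_init(X, n_clusters, rng):
    """Choose initial centers by squared-distance-weighted sampling.

    Assumes X contains at least n_clusters distinct points.
    """
    n_samples = X.shape[0]
    # First center: a uniformly random data point.
    centers = [X[rng.integers(n_samples)]]
    # Squared distance from every point to its nearest chosen center.
    closest_sq = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, n_clusters):
        # Far-away points are more likely to be picked; chosen points have probability 0.
        probs = closest_sq / closest_sq.sum()
        idx = rng.choice(n_samples, p=probs)
        centers.append(X[idx])
        new_sq = np.sum((X - X[idx]) ** 2, axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return np.array(centers)


# Illustrative data: three separated 2-D blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in blob_centers])

initial_centers = kmeans_plus_plus_init(X, 3, rng)
```

Because nearby points get low probability, the initial centers tend to spread across the data, which is why the method usually needs fewer iterations than purely random seeding.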
Random State
A fixed random_state=42 ensures:
- Reproducible results across runs
- Consistent cluster assignments
- Easier debugging and comparison
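The effect of a fixed seed can be checked directly. The following sketch uses random illustrative data and a hypothetical helper, fit_labels, and fits the same configuration twice with the same random_state.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data without any strong cluster structure.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))


def fit_labels(seed):
    """Fit K-Means with the given seed and return the cluster label of each row."""
    model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=seed)
    return model.fit_predict(X)


labels_a = fit_labels(42)
labels_b = fit_labels(42)

# Same data, same settings, same seed: the assignments should match exactly.
same = np.array_equal(labels_a, labels_b)
print(same)
```

Without a fixed seed, two runs may number the clusters differently or settle on different partitions, which makes cluster-specific models hard to compare between runs.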
WCSS (Within-Cluster Sum of Squares)
The elbow method uses WCSS (accessed via kmeans.inertia_), which measures how compact the clusters are:
- It is the total squared distance from each sample to its assigned cluster center
- Lower WCSS indicates tighter clusters
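As a worked example, WCSS can be computed by hand as the sum of squared Euclidean distances between each point and the center of the cluster it belongs to. The helper name wcss and the four toy points below are illustrative assumptions.

```python
import numpy as np


def wcss(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = X - centers[labels]  # centers[labels] picks each point's own center
    return float(np.sum(diffs ** 2))


X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])  # mean of each cluster

print(wcss(X, labels, centers))  # 1 + 1 + 4 + 4 = 10.0
```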
Use Cases
Fraud Type Segmentation
Group similar fraud patterns together for specialized detection models
Customer Segmentation
Cluster customers by behavior for targeted risk assessment
Model Specialization
Train separate models optimized for each cluster’s characteristics
Anomaly Detection
Identify outliers that don’t fit well into any cluster
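For the anomaly detection use case, one common approach is to flag points that lie unusually far from their nearest cluster center. The sketch below is an assumption about how this could be done with scikit-learn, not something the KMeansClustering class is stated to provide; the data and the 99th-percentile threshold are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two dense groups plus one point far from both (row index 200).
rng = np.random.default_rng(0)
inliers = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(10, 1, (100, 2))])
outlier = np.array([[5.0, 30.0]])
X = np.vstack([inliers, outlier])

kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X)

# transform() gives the distance from each sample to every cluster center;
# the row-wise minimum is the distance to the nearest center.
nearest_dist = kmeans.transform(X).min(axis=1)

# Flag samples beyond an (arbitrary) distance threshold as candidate outliers.
threshold = np.percentile(nearest_dist, 99)
outlier_idx = np.where(nearest_dist > threshold)[0]
print(outlier_idx)
```

A percentile threshold always flags a fixed share of the data, so in practice the cut-off would need to be tuned against known cases.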
Dependencies
Performance Considerations
K-Means complexity is O(n * k * i * d) where:
- n = number of samples
- k = number of clusters
- i = number of iterations
- d = number of dimensions
For large datasets, consider:
- Using Mini-Batch K-Means for faster processing
- Reducing dimensionality with PCA before clustering
- Limiting the maximum number of iterations
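These three options can be combined, as in the following sketch. It assumes scikit-learn's MiniBatchKMeans and PCA; the data shape, component count, batch size, and iteration cap are illustrative values, not settings taken from the project.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

# Illustrative data: 5,000 samples with 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))

# Reduce d: project the features onto 10 principal components.
X_reduced = PCA(n_components=10, random_state=42).fit_transform(X)

# Mini-Batch K-Means updates centers from small random batches instead of the
# full dataset, and max_iter caps the number of passes (i).
mbk = MiniBatchKMeans(
    n_clusters=4,
    init="k-means++",
    batch_size=1024,
    max_iter=50,
    n_init=3,
    random_state=42,
)
labels = mbk.fit_predict(X_reduced)
```

Each option trades some clustering quality for speed, so it is worth comparing the resulting WCSS against a full K-Means run on a sample of the data.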