Overview
The fraud detection system uses K-Means clustering to divide the training data into groups before model training. This approach allows the system to train separate models for each cluster, improving prediction accuracy by capturing different fraud patterns.
Why Use Clustering?
Clustering provides several benefits:
- Specialized Models: Each cluster gets its own model optimized for specific fraud patterns
- Better Performance: Models trained on similar data perform better than one-size-fits-all models
- Pattern Recognition: Different clusters may represent different types of insurance claims
- Scalability: New clusters can be added as fraud patterns evolve
Instead of training one model for all data, we train multiple specialized models, each expert in detecting fraud for a specific type of claim.
KMeansClustering Class
Implemented in data_preprocessing/clustering.py:
from sklearn.cluster import KMeans
from kneed import KneeLocator
import matplotlib.pyplot as plt

class KMeansClustering:
    def __init__(self, file_object, logger_object):
        self.file_object = file_object
        self.logger_object = logger_object
Clustering Workflow
- Find Optimal Clusters: Use the elbow method to determine the best number of clusters
- Create Clusters: Apply K-Means to divide the data into clusters
- Save Cluster Model: Persist the K-Means model for prediction
- Add Cluster Labels: Add cluster assignments as a new column to the dataset
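The whole workflow can be sketched end to end with scikit-learn alone. This is an illustrative stand-in, not the project's code: the data is synthetic, the cluster count is fixed at 3 rather than found via the elbow method, and `pickle` stands in for the project's model-persistence helper.

```python
import pickle
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed training features
X_arr, _ = make_blobs(n_samples=300, centers=3, random_state=42)
data = pd.DataFrame(X_arr, columns=['feat_a', 'feat_b'])

# Steps 1-2: choose a cluster count (fixed at 3 here) and create clusters
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42, n_init=10)
labels = kmeans.fit_predict(data)

# Step 3: persist the K-Means model for prediction time
blob = pickle.dumps(kmeans)

# Step 4: add cluster assignments as a new column
data['Cluster'] = labels
print(data['Cluster'].nunique())  # 3
```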
Elbow Method
The elbow method determines the optimal number of clusters by analyzing the Within-Cluster Sum of Squares (WCSS):
def elbow_plot(self, data):
    self.logger_object.log(self.file_object,
        'Entered the elbow_plot method of the KMeansClustering class')
    wcss = []  # within-cluster sum of squares for each cluster count
    try:
        # Test cluster counts from 1 to 10
        for i in range(1, 11):
            # Initialize KMeans with i clusters
            kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
            # Fit the data
            kmeans.fit(data)
            # Store the inertia (WCSS)
            wcss.append(kmeans.inertia_)
        # Plot WCSS against the number of clusters
        plt.plot(range(1, 11), wcss)
        plt.title('The Elbow Method')
        plt.xlabel('Number of clusters')
        plt.ylabel('WCSS')
        plt.savefig('preprocessing_data/K-Means_Elbow.PNG')
        # Find the optimum cluster count programmatically
        self.kn = KneeLocator(range(1, 11), wcss, curve='convex', direction='decreasing')
        self.logger_object.log(self.file_object,
            'The optimum number of clusters is: ' + str(self.kn.knee) +
            '. Exited the elbow_plot method of the KMeansClustering class')
        return self.kn.knee
    except Exception as e:
        self.logger_object.log(self.file_object,
            'Exception occurred in elbow_plot method of the KMeansClustering class. Exception message: ' + str(e))
        raise
How It Works
- Test Multiple Cluster Counts: Run K-Means with 1 to 10 clusters
- Calculate WCSS: For each cluster count, calculate the within-cluster sum of squares
- Plot Results: Create a plot showing WCSS vs. number of clusters
- Find Elbow: Use KneeLocator to programmatically find the "elbow" point
WCSS (Within-Cluster Sum of Squares): Measures the compactness of clusters; lower values indicate tighter clusters. The "elbow" is the point where adding more clusters no longer significantly reduces WCSS.
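To make WCSS concrete: scikit-learn's `inertia_` is exactly the sum of squared distances from each point to its assigned centroid, which can be verified by hand on synthetic data (a small sketch, not project code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42, n_init=10).fit(X)

# WCSS computed manually: squared distance of each point to its own centroid
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = ((X - assigned_centroids) ** 2).sum()

print(np.isclose(wcss_manual, kmeans.inertia_))  # True
```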
K-Means++ Initialization
kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
- init='k-means++': Smart initialization that spreads out initial centroids
- random_state=42: Ensures reproducible results
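A quick check of what `random_state` buys: two independent fits with the same seed produce identical cluster assignments (sketch on synthetic data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=4, random_state=0)

labels_a = KMeans(n_clusters=4, init='k-means++', random_state=42, n_init=10).fit_predict(X)
labels_b = KMeans(n_clusters=4, init='k-means++', random_state=42, n_init=10).fit_predict(X)

print(np.array_equal(labels_a, labels_b))  # True
```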
Output
Creates preprocessing_data/K-Means_Elbow.PNG showing the elbow plot:
WCSS
  |
  |\
  | \
  |  \___
  |      \______________
  +----------------------> Number of Clusters
    1  2  3  4  5  6  7  8  9  10
          ^ elbow
Creating Clusters
Once the optimal number of clusters is determined, apply K-Means:
def create_clusters(self, data, number_of_clusters):
    self.logger_object.log(self.file_object,
        'Entered the create_clusters method of the KMeansClustering class')
    self.data = data
    try:
        # Initialize KMeans with the optimal cluster count
        self.kmeans = KMeans(n_clusters=number_of_clusters,
                             init='k-means++',
                             random_state=42)
        # Divide the data into clusters
        self.y_kmeans = self.kmeans.fit_predict(data)
        # Save the KMeans model to directory
        self.file_op = file_methods.File_Operation(self.file_object, self.logger_object)
        self.save_model = self.file_op.save_model(self.kmeans, 'KMeans')
        # Store the cluster assignment for each row in a new column
        self.data['Cluster'] = self.y_kmeans
        self.logger_object.log(self.file_object,
            'Successfully created ' + str(number_of_clusters) +
            ' clusters. Exited the create_clusters method of the KMeansClustering class')
        return self.data
    except Exception as e:
        self.logger_object.log(self.file_object,
            'Exception occurred in create_clusters method of the KMeansClustering class. Exception message: ' + str(e))
        raise
Key Operations
- fit_predict(): Fits K-Means and assigns cluster labels in one step
- Save Model: Persists the trained K-Means model for use during prediction
- Add Cluster Column: Adds cluster assignments to the dataset
The K-Means model must be saved because prediction data will need to be assigned to the same clusters.
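This round trip can be sketched with `pickle` (the project routes persistence through its `File_Operation` helper; pickle here is an illustrative stand-in): a reloaded model assigns new rows to exactly the same clusters as the original.

```python
import pickle
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Train and persist the clusterer
X_train, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42, n_init=10).fit(X_train)
blob = pickle.dumps(kmeans)

# Prediction time: reload and assign new data to the same clusters
reloaded = pickle.loads(blob)
X_new, _ = make_blobs(n_samples=50, centers=3, random_state=7)
print(np.array_equal(kmeans.predict(X_new), reloaded.predict(X_new)))  # True
```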
Integration with Training Pipeline
From trainingModel.py:56-68, here’s how clustering is used:
# Apply the clustering approach
kmeans = clustering.KMeansClustering(self.file_object, self.log_writer)
# Using the elbow plot to find the number of optimum clusters
number_of_clusters = kmeans.elbow_plot(X)
# Divide the data into clusters
X = kmeans.create_clusters(X, number_of_clusters)
# Add the label column back so each cluster keeps its targets
X['Labels'] = Y
# Get the unique clusters from our dataset
list_of_clusters = X['Cluster'].unique()
Training Per Cluster
After clustering, separate models are trained for each cluster:
# Iterate over the clusters and find the best ML algorithm for each
for i in list_of_clusters:
    # Filter the data for one cluster
    cluster_data = X[X['Cluster'] == i]
    # Prepare the feature and label columns
    cluster_features = cluster_data.drop(['Labels', 'Cluster'], axis=1)
    cluster_label = cluster_data['Labels']
    # Split into training and test sets for this cluster
    x_train, x_test, y_train, y_test = train_test_split(
        cluster_features, cluster_label, test_size=1/3, random_state=355
    )
    # Get the best model for this cluster
    model_finder = tuner.Model_Finder(self.file_object, self.log_writer)
    best_model_name, best_model = model_finder.get_best_model(
        x_train, y_train, x_test, y_test
    )
    # Save the best model for this cluster
    file_op = file_methods.File_Operation(self.file_object, self.log_writer)
    save_model = file_op.save_model(best_model, best_model_name + str(i))
Each cluster gets:
- Its own train-test split
- Its own model selection process
- Its own saved model file (e.g., XGBoost0, SVM1)
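The loop above can be sketched in miniature, with `LogisticRegression` standing in for the project's `Model_Finder` search and a fabricated binary fraud label (purely illustrative data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features, cluster assignments, and a fabricated binary label
X_arr, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = pd.DataFrame(X_arr, columns=['f1', 'f2'])
X['Cluster'] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_arr)
X['Labels'] = np.random.default_rng(0).integers(0, 2, size=len(X))

models = {}  # cluster id -> trained model
for i in X['Cluster'].unique():
    cluster_data = X[X['Cluster'] == i]
    feats = cluster_data.drop(['Labels', 'Cluster'], axis=1)
    labels = cluster_data['Labels']
    # Each cluster gets its own train-test split and its own model
    x_tr, x_te, y_tr, y_te = train_test_split(feats, labels, test_size=1/3, random_state=355)
    models[i] = LogisticRegression().fit(x_tr, y_tr)

print(len(models))  # 3
```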
Clustering Example
Suppose the elbow method determines 3 optimal clusters:
| Cluster | Characteristics | Model Type | Use Case |
|---|---|---|---|
| 0 | High-value claims, severe damage | XGBoost | Detects sophisticated fraud in expensive claims |
| 1 | Low-value claims, minor damage | SVM | Identifies patterns in small fraudulent claims |
| 2 | Medium-value claims, mixed severity | XGBoost | General fraud detection |
Benefits of Cluster-Based Training
- Pattern Specialization: Each model learns fraud patterns specific to its cluster type
- Improved Accuracy: Models perform better on similar data than on diverse data
- Reduced False Positives: Specialized models make fewer mistakes on their cluster type
- Interpretability: Easier to understand why a claim was flagged when you know its cluster
Prediction with Clusters
During prediction:
- Load the saved K-Means model
- Assign new data to existing clusters using predict()
- Route each prediction to its cluster-specific model
- Combine predictions from all clusters
# Load the saved KMeans model
kmeans_model = file_op.load_model('KMeans')
# Assign new data to the existing clusters
clusters = kmeans_model.predict(new_data)
# Collect predictions from each cluster-specific model
predictions = {}
for i in set(clusters):
    cluster_data = new_data[clusters == i]
    # best_model_name is the algorithm chosen for this cluster during training (e.g. 'XGBoost')
    model = file_op.load_model(f'{best_model_name}{i}')
    predictions[i] = model.predict(cluster_data)
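One detail when combining per-cluster predictions: each model only scores a slice of the input, so the results must be scattered back into the original row order. A self-contained sketch on synthetic data (with `LogisticRegression` standing in for the cluster-specific models):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Train a clusterer and one stand-in model per cluster
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
y = np.random.default_rng(0).integers(0, 2, size=len(X))
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
models = {i: LogisticRegression().fit(X[kmeans.labels_ == i], y[kmeans.labels_ == i])
          for i in range(3)}

# Prediction time: score each cluster's slice, then scatter back by position
new_data, _ = make_blobs(n_samples=60, centers=3, random_state=7)
clusters = kmeans.predict(new_data)
combined = np.empty(len(new_data), dtype=int)
for i in np.unique(clusters):
    mask = clusters == i
    combined[mask] = models[i].predict(new_data[mask])

print(combined.shape)  # (60,)
```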
Next Steps
After clustering:
- Data is divided into optimal clusters
- Each cluster is ready for independent model training
- The K-Means model is saved for prediction
Proceed to model training to train models for each cluster.