
Insurance Fraud Overview

Insurance fraud costs the industry billions annually. This system detects fraudulent claims by analyzing patterns in claim characteristics, policyholder information, and incident details.
The system uses a clustering + classification approach to handle heterogeneous claim types and improve detection accuracy across different customer segments.

Fraud Indicators

The model learns to identify fraud based on these categories of indicators:

Policyholder Profile Indicators

Customer Tenure

Feature: months_as_customer
New customers may have different fraud patterns than long-term policyholders. Short tenure combined with high-value claims can indicate fraud.

Education Level

Feature: insured_education_level
Encoded: JD (1), High School (2), College (3), Masters (4), Associate (5), MD (6), PhD (7)
Correlates with claim patterns and fraud likelihood.
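The ordinal codes above can be applied with a simple mapping. This is a minimal sketch; the actual encoding logic lives in the preprocessing module:

```python
import pandas as pd

# Ordinal codes from the table above (JD=1 ... PhD=7)
EDUCATION_CODES = {
    'JD': 1, 'High School': 2, 'College': 3,
    'Masters': 4, 'Associate': 5, 'MD': 6, 'PhD': 7,
}

def encode_education(series: pd.Series) -> pd.Series:
    """Map raw education labels to their integer codes."""
    return series.map(EDUCATION_CODES)

df = pd.DataFrame({'insured_education_level': ['College', 'PhD', 'JD']})
df['insured_education_level'] = encode_education(df['insured_education_level'])
# df['insured_education_level'] is now [3, 7, 1]
```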

Occupation

Feature: insured_occupation
One-hot encoded. Certain occupations may correlate with specific fraud schemes.

Relationship Status

Feature: insured_relationship
Family structure can influence claim behavior and fraud patterns.

Policy Characteristics

Combined Single Limit (policy_csl) - Encoded as 1, 2.5, 5 for 100/300, 250/500, 500/1000. Higher coverage limits combined with maximum claims can indicate staged fraud.
Deductible (policy_deductable) - Low deductibles with frequent claims may signal fraud, while high deductibles with maximum claims warrant scrutiny.
Premium-to-Claim Ratio - Relationship between the premium paid and claims filed. Abnormally high claims relative to premium can indicate fraud.
Umbrella Limit (umbrella_limit) - Additional coverage amount. Can interact with other policy features to identify suspicious patterns.

Incident Characteristics

incident_type - Type of incident (one-hot encoded)
  • Single Vehicle Collision
  • Multi-vehicle Collision
  • Vehicle Theft
  • Parked Car
collision_type - Nature of collision
  • Front Collision
  • Rear Collision
  • Side Collision
incident_severity - Damage level (1-4)
  • Trivial Damage (1)
  • Minor Damage (2)
  • Major Damage (3)
  • Total Loss (4)
Mismatch between severity and claim amounts is a red flag.

Claim Financial Indicators

# Financial features used in the model
financial_features = [
    'injury_claim',    # Amount claimed for injuries
    'property_claim',  # Amount claimed for property damage
    'vehicle_claim',   # Amount claimed for vehicle damage
    'capital-gains',   # Policyholder capital gains
    'capital-loss'     # Policyholder capital losses
]
total_claim_amount is removed during preprocessing to prevent data leakage, as it’s the sum of injury_claim + property_claim + vehicle_claim.
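The leakage-prevention step amounts to dropping the derived total before training. A minimal sketch, assuming the data arrives as a pandas DataFrame with the column names used in this document:

```python
import pandas as pd

def drop_leaky_columns(df: pd.DataFrame) -> pd.DataFrame:
    # total_claim_amount = injury_claim + property_claim + vehicle_claim,
    # so keeping it would leak a near-perfect proxy of the claim components.
    return df.drop(columns=['total_claim_amount'], errors='ignore')

df = pd.DataFrame({
    'injury_claim': [5000], 'property_claim': [3000],
    'vehicle_claim': [12000], 'total_claim_amount': [20000],
})
clean = drop_leaky_columns(df)
# 'total_claim_amount' is no longer present; the components remain
```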
Fraud Patterns:
  • Maximum claims across all categories
  • Round number amounts (e.g., exactly $10,000)
  • Claims disproportionate to incident severity
  • Financial distress (high capital losses) combined with severe claims
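The model learns these patterns statistically; purely as an illustration, a rule-of-thumb screen for two of them (round amounts, severity/claim mismatch) could look like the sketch below. The thresholds are hypothetical and not part of the system:

```python
import pandas as pd

def red_flags(row: pd.Series) -> list:
    """Heuristic screen illustrating two of the patterns above (not the model)."""
    flags = []
    total = row['injury_claim'] + row['property_claim'] + row['vehicle_claim']
    # Suspiciously round totals (e.g., exactly $10,000)
    if total > 0 and total % 1000 == 0:
        flags.append('round_amount')
    # High total claim on a trivial/minor incident (severity 1-2)
    if row['incident_severity'] <= 2 and total > 50_000:
        flags.append('severity_mismatch')
    return flags

claim = pd.Series({'injury_claim': 4_000, 'property_claim': 3_000,
                   'vehicle_claim': 53_000, 'incident_severity': 1})
print(red_flags(claim))  # ['round_amount', 'severity_mismatch']
```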

Features Used for Detection

After preprocessing, the model trains on approximately 35-40 features (exact count depends on one-hot encoding cardinality):

Numerical Features (12)

scaled_numerical_features = [
    'months_as_customer',           # Tenure indicator
    'policy_deductable',            # Risk threshold
    'umbrella_limit',               # Additional coverage
    'capital-gains',                # Financial status
    'capital-loss',                 # Financial stress
    'incident_hour_of_the_day',     # Temporal pattern
    'number_of_vehicles_involved',  # Incident complexity
    'bodily_injuries',              # Severity proxy
    'witnesses',                    # Verification potential
    'injury_claim',                 # Claim component
    'property_claim',               # Claim component
    'vehicle_claim'                 # Claim component
]
All scaled using StandardScaler (mean=0, std=1).
Reference: data_preprocessing/preprocessing.py:170-174
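The scaling step follows the standard scikit-learn pattern: fit the scaler on training data only, then reuse those statistics on the test data. A small self-contained sketch (the exact call sites are in the referenced file):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[12.0, 500.0], [48.0, 1000.0], [120.0, 2000.0]])
X_test = np.array([[24.0, 750.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics on test

# Each training column now has mean ~0 and std ~1
```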

Encoded Categorical Features

Label Encoded (6):
  • policy_csl (1, 2.5, 5)
  • insured_education_level (1-7)
  • incident_severity (1-4)
  • insured_sex (0-1)
  • property_damage (0-1)
  • police_report_available (0-1)
One-Hot Encoded (Variable):
  • insured_occupation → ~10-15 features
  • insured_relationship → ~5 features
  • incident_type → ~3 features
  • collision_type → ~2 features
  • authorities_contacted → ~3 features
Reference: data_preprocessing/preprocessing.py:191-237
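One-hot encoding of these columns follows the usual pandas pattern, which also shows why the final feature count varies with cardinality. A sketch; the project's preprocessing may use a different API for the same transformation:

```python
import pandas as pd

df = pd.DataFrame({'incident_type': ['Vehicle Theft', 'Parked Car', 'Vehicle Theft']})

# Each category becomes its own 0/1 column, so feature count grows with cardinality
encoded = pd.get_dummies(df, columns=['incident_type'])
print(encoded.columns.tolist())
# ['incident_type_Parked Car', 'incident_type_Vehicle Theft']
```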

Model Approach: Clustering + Classification

The system uses a two-stage approach:

Stage 1: K-Means Clustering

Purpose: Segment claims into homogeneous groups before classification
Algorithm: K-Means with k-means++ initialization
Optimal Cluster Selection:
# Elbow method with automated knee detection
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)  # Features only, no labels
    wcss.append(kmeans.inertia_)

# KneeLocator finds the elbow point
kn = KneeLocator(range(1, 11), wcss, curve='convex', direction='decreasing')
optimal_k = kn.knee
Reference: data_preprocessing/clustering.py:19-47
Why Clustering? Insurance claims are heterogeneous. Clustering groups similar claims (e.g., minor vs. major incidents, different demographics) allowing specialized models per segment.

Stage 2: Supervised Classification

For each cluster, the system trains and compares two algorithms:
Algorithm: XGBoost (Extreme Gradient Boosting)
Hyperparameters Tuned:
param_grid = {
    'n_estimators': [100, 130],
    'criterion': ['gini', 'entropy'],
    'max_depth': range(8, 10, 1)
}
Grid Search: 5-fold cross-validation
Objective: binary:logistic (fraud vs. non-fraud)
Advantages:
  • Handles non-linear relationships
  • Built-in regularization
  • Feature importance scores
  • Robust to imbalanced data
Reference: best_model_finder/tuner.py:66-107

Model Selection Process

from sklearn.metrics import accuracy_score, roc_auc_score

def get_best_model(train_x, train_y, test_x, test_y):
    # Train both models
    xgboost = get_best_params_for_xgboost(train_x, train_y)
    svm = get_best_params_for_svm(train_x, train_y)
    
    # Evaluate on test set
    pred_xgb = xgboost.predict(test_x)
    pred_svm = svm.predict(test_x)
    
    # Score using ROC-AUC (or accuracy if single class)
    if len(test_y.unique()) == 1:
        score_xgb = accuracy_score(test_y, pred_xgb)
        score_svm = accuracy_score(test_y, pred_svm)
    else:
        score_xgb = roc_auc_score(test_y, pred_xgb)
        score_svm = roc_auc_score(test_y, pred_svm)
    
    # Return best model
    if score_xgb > score_svm:
        return 'XGBoost', xgboost
    else:
        return 'SVM', svm
Reference: best_model_finder/tuner.py:117-165
Evaluation Metric: ROC-AUC is used as the primary metric because it’s robust to class imbalance (fraud cases are typically rare). Accuracy is used as a fallback when only one class is present in the test set.

Training Workflow per Cluster

1. Cluster Assignment: After K-Means clustering, each data point is assigned a cluster ID (0, 1, 2, etc.)
2. Cluster Isolation: For cluster i, extract all rows: cluster_data = X[X['Cluster'] == i]
3. Feature-Label Split:
   cluster_features = cluster_data.drop(['Labels', 'Cluster'], axis=1)
   cluster_label = cluster_data['Labels']  # fraud_reported: 0 or 1
4. Train-Test Split (67% training, 33% testing):
   x_train, x_test, y_train, y_test = train_test_split(
       cluster_features, cluster_label,
       test_size=1/3, random_state=355
   )
5. Scaling: Apply StandardScaler to numerical features in both train and test sets
6. Model Training & Selection: Train XGBoost and SVM, select the best model based on AUC score
7. Model Persistence: Save the best model as {ModelName}{ClusterID} (e.g., XGBoost0, SVM1)
Reference: trainingModel.py:72-93
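The {ModelName}{ClusterID} naming convention can be sketched with pickle. The directory layout and the .sav extension here are assumptions; the project routes persistence through its own file-operations module:

```python
import os
import pickle
import tempfile

def save_cluster_model(model, model_name: str, cluster_id: int, model_dir: str) -> str:
    """Persist a model under the {ModelName}{ClusterID} convention, e.g. XGBoost0."""
    path = os.path.join(model_dir, f"{model_name}{cluster_id}.sav")
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    return path

with tempfile.TemporaryDirectory() as model_dir:
    path = save_cluster_model({'stub': 'model'}, 'XGBoost', 0, model_dir)
    print(os.path.basename(path))  # XGBoost0.sav
```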

Prediction Interpretation

The system outputs binary fraud predictions:

Prediction Process

for i in clusters:
    cluster_data = data[data['clusters'] == i]
    cluster_data = cluster_data.drop(['clusters'], axis=1)
    
    # Load cluster-specific best model
    model_name = file_loader.find_correct_model_file(i)
    model = file_loader.load_model(model_name)
    
    # Predict
    result = model.predict(cluster_data)
    
    # Map to Y/N
    for res in result:
        if res == 0:
            predictions.append('N')  # Not Fraud
        else:
            predictions.append('Y')  # Fraud
Reference: predictFromModel.py:57-67

Output Format

File: Prediction_Output_File/Predictions.csv
Predictions
N
Y
N
N
Y
  • N - Claim predicted as NOT FRAUDULENT (model output = 0)
  • Y - Claim predicted as FRAUDULENT (model output = 1)

Interpretation Guidelines

Action Required:
  • Flag for manual investigation
  • Request additional documentation
  • Verify witness statements
  • Check for previous fraud history
  • Validate claim amounts against incident severity
Not Automatic Rejection: Human review required to confirm fraud.
Processing:
  • Proceed with standard claim processing
  • May still undergo random audits
  • Not guaranteed legitimate (false negatives possible)
Risk Tiering: Consider implementing confidence scores for borderline cases.
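The risk-tiering suggestion could be built on predicted probabilities rather than hard labels. A sketch with hypothetical thresholds; it assumes a classifier exposing predict_proba:

```python
def risk_tier(fraud_probability: float) -> str:
    """Map a fraud probability to a review tier (thresholds are illustrative)."""
    if fraud_probability >= 0.8:
        return 'high'      # prioritize for investigation
    if fraud_probability >= 0.5:
        return 'medium'    # flag, request documentation
    if fraud_probability >= 0.3:
        return 'low'       # borderline: route to the random audit pool
    return 'clear'         # standard processing

# probs = model.predict_proba(cluster_data)[:, 1]  # P(fraud) per claim
tiers = [risk_tier(p) for p in (0.92, 0.55, 0.31, 0.05)]
print(tiers)  # ['high', 'medium', 'low', 'clear']
```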
Model Limitations:
  • Predictions are probabilistic, not definitive
  • False positives/negatives will occur
  • Novel fraud schemes may not be detected
  • Model should be retrained periodically with new data

Handling Class Imbalance

Fraud cases are typically rare (often less than 10% of claims). The system addresses this in two ways:

During Training (Optional)

The Preprocessor class includes a handle_imbalanced_dataset() method:
from imblearn.over_sampling import RandomOverSampler

rdsmple = RandomOverSampler()
# fit_sample was renamed fit_resample in imblearn 0.4 and removed in 0.8
x_sampled, y_sampled = rdsmple.fit_resample(x, y)
Reference: data_preprocessing/preprocessing.py:239-266
This method is defined but not currently used in the main training pipeline. It’s available for future implementation if class imbalance becomes problematic.

Built-in Robustness

  • XGBoost: Inherently handles imbalance through weighted loss functions
  • ROC-AUC Metric: Robust to class imbalance, evaluates across all thresholds
  • Clustering: Naturally segments data, potentially creating more balanced cluster-specific datasets
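For XGBoost specifically, imbalance can also be addressed through its scale_pos_weight parameter, computed from the class ratio. A sketch of the computation; this is not part of the current pipeline:

```python
def pos_weight(y) -> float:
    """Negatives-to-positives ratio, the convention for XGBoost's scale_pos_weight."""
    pos = sum(1 for label in y if label == 1)
    neg = len(y) - pos
    return neg / pos

y = [0] * 90 + [1] * 10   # a 10% fraud rate
print(pos_weight(y))  # 9.0
# XGBClassifier(scale_pos_weight=pos_weight(y)) would weight fraud errors 9x
```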

Model Performance Considerations

Cluster-Specific Models

Each cluster gets a specialized model, improving accuracy for heterogeneous claims compared to a single global model.

Hyperparameter Tuning

GridSearchCV with 5-fold CV ensures robust parameter selection and reduces overfitting.

Algorithm Diversity

Comparing XGBoost (tree-based) vs. SVM (kernel-based) increases the likelihood of finding the optimal approach for each cluster.

Automated Selection

The best model per cluster is selected automatically based on test-set performance; no manual intervention is needed.

Fraud Detection Best Practices

Model Monitoring:
  • Track prediction distribution (% flagged as fraud)
  • Monitor false positive/negative rates via claim audits
  • Retrain quarterly or when fraud patterns shift
  • Log feature importance to understand decision factors
Human-in-the-Loop:
  • Use predictions as decision support, not automatic rejection
  • Maintain investigation team for flagged claims
  • Collect feedback to improve model over time
  • Document model limitations for compliance
Continuous Improvement:
  • Incorporate new fraud schemes into training data
  • A/B test model versions before deployment
  • Consider ensemble methods combining multiple models
  • Add explainability tools (SHAP, LIME) for transparency
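As a concrete example of the monitoring points above, tracking the flagged-fraud rate per prediction batch is a cheap drift signal. A sketch; the baseline and tolerance values are hypothetical:

```python
def flagged_rate(predictions: list) -> float:
    """Fraction of claims flagged 'Y' in a prediction batch."""
    return predictions.count('Y') / len(predictions)

def drift_alert(rate: float, baseline: float = 0.10, tolerance: float = 0.05) -> bool:
    """Alert when the flagged rate drifts beyond tolerance of the historical baseline."""
    return abs(rate - baseline) > tolerance

batch = ['N', 'Y', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N']
rate = flagged_rate(batch)
print(rate, drift_alert(rate))  # 0.2 True
```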

Next Steps

System Overview

Review the high-level system workflow and use cases

Architecture

Explore the technical implementation and module organization

Data Pipeline

Deep dive into data preprocessing and validation
