Insurance Fraud Overview
Insurance fraud costs the industry billions of dollars annually. This system detects fraudulent claims by analyzing patterns in claim characteristics, policyholder information, and incident details. It uses a clustering + classification approach to handle heterogeneous claim types and improve detection accuracy across different customer segments.
Fraud Indicators
The model learns to identify fraud based on these categories of indicators:
Policyholder Profile Indicators
Customer Tenure
Feature: months_as_customer
New customers may have different fraud patterns than long-term policyholders. Short tenure combined with high-value claims can indicate fraud.
Education Level
Feature: insured_education_level
Encoded: JD (1), High School (2), College (3), Masters (4), Associate (5), MD (6), PhD (7). Correlates with claim patterns and fraud likelihood.
Occupation
Feature: insured_occupation
One-hot encoded. Certain occupations may correlate with specific fraud schemes.
Relationship Status
Feature: insured_relationship
Family structure can influence claim behavior and fraud patterns.
Policy Characteristics
Coverage Limits (policy_csl)
Combined Single Limit. Encoded as 1, 2.5, and 5 for 100/300, 250/500, and 500/1000 respectively. Higher coverage limits combined with maximum claims can indicate staged fraud.
Deductible (policy_deductable)
Low deductibles with frequent claims may signal fraud, while high deductibles with maximum claims warrant scrutiny.
Premium (policy_annual_premium)
Umbrella Limit (umbrella_limit)
Additional coverage amount. Can interact with other policy features to identify suspicious patterns.
Incident Characteristics
- Incident Details
- Timing & Location
- Involvement Details
- Response
incident_type - Type of incident (one-hot encoded)
- Single Vehicle Collision
- Multi-vehicle Collision
- Vehicle Theft
- Parked Car
collision_type - Type of collision (one-hot encoded)
- Front Collision
- Rear Collision
- Side Collision
incident_severity - Severity of damage (label encoded)
- Trivial Damage (1)
- Minor Damage (2)
- Major Damage (3)
- Total Loss (4)
Claim Financial Indicators
- Maximum claims across all categories
- Round number amounts (e.g., exactly $10,000)
- Claims disproportionate to incident severity
- Financial distress (high capital losses) combined with severe claims
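These indicators can also be expressed as simple rule-based pre-screens that run alongside the learned model. The function and thresholds below are hypothetical illustrations, not part of the actual pipeline:

```python
def financial_red_flags(total_claim, severity, capital_loss):
    """Heuristic pre-screen for the indicators listed above.

    Thresholds are illustrative only; severity uses the
    1 (Trivial Damage) to 4 (Total Loss) encoding.
    """
    flags = []
    # Round-number amounts (e.g., exactly $10,000).
    if total_claim > 0 and total_claim % 1000 == 0:
        flags.append("round_amount")
    # Claims disproportionate to incident severity.
    if severity <= 2 and total_claim > 50_000:
        flags.append("claim_exceeds_severity")
    # Financial distress combined with a severe claim.
    if capital_loss < -40_000 and severity == 4:
        flags.append("distress_with_severe_claim")
    return flags
```

In practice such rules would only route claims for extra scrutiny; the classifier makes the actual prediction.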
Features Used for Detection
After preprocessing, the model trains on approximately 35-40 features (the exact count depends on one-hot encoding cardinality).
Numerical Features (12)
StandardScaler (mean=0, std=1)
Reference: data_preprocessing/preprocessing.py:170-174
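As a sketch of what this scaling does (numpy only, equivalent in effect to sklearn's StandardScaler; the two columns and their values are illustrative):

```python
import numpy as np

# StandardScaler computes, per numerical column: z = (x - mean) / std.
X_num = np.array([[12.0, 1000.0],
                  [120.0, 2000.0],
                  [240.0, 1500.0]])  # e.g., months_as_customer, policy_annual_premium

mu = X_num.mean(axis=0)
sigma = X_num.std(axis=0)
X_scaled = (X_num - mu) / sigma  # each column now has mean 0, std 1
```

In the real pipeline the scaler is fit on training data only and reused at prediction time.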
Encoded Categorical Features
Label Encoded (6):
- policy_csl (1, 2.5, 5)
- insured_education_level (1-7)
- incident_severity (1-4)
- insured_sex (0-1)
- property_damage (0-1)
- police_report_available (0-1)
One-Hot Encoded:
- insured_occupation → ~10-15 features
- insured_relationship → ~5 features
- incident_type → ~3 features
- collision_type → ~2 features
- authorities_contacted → ~3 features
data_preprocessing/preprocessing.py:191-237
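A minimal sketch of the two encoding styles using pandas and the mappings stated above. The mini-frame and column choice are illustrative, not the project's actual code:

```python
import pandas as pd

# Hypothetical mini-frame with one label-encoded and one one-hot column.
df = pd.DataFrame({
    "policy_csl": ["100/300", "250/500", "500/1000"],
    "incident_type": ["Vehicle Theft", "Parked Car", "Vehicle Theft"],
})

# Label encoding with the stated mapping: 100/300 -> 1, 250/500 -> 2.5, 500/1000 -> 5.
df["policy_csl"] = df["policy_csl"].map({"100/300": 1, "250/500": 2.5, "500/1000": 5})

# One-hot encoding for nominal categories such as incident_type.
df = pd.get_dummies(df, columns=["incident_type"])
```

Label encoding is used where the categories have a natural order or magnitude; one-hot encoding where they do not.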
Model Approach: Clustering + Classification
The system uses a two-stage approach:
Stage 1: K-Means Clustering
Purpose: Segment claims into homogeneous groups before classification
Algorithm: K-Means with k-means++ initialization
Optimal Cluster Selection:
data_preprocessing/clustering.py:19-47
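A common way to select the number of clusters is the elbow method on the within-cluster sum of squares (WCSS). The sketch below implements the max-distance-to-chord knee heuristic, which libraries like kneed automate; the inertia values are hypothetical and the project's actual selection logic in clustering.py may differ:

```python
import numpy as np

def elbow_k(inertias):
    """Pick k at the elbow of a decreasing inertia (WCSS) curve.

    Chooses the k whose point lies farthest from the straight line
    joining the curve's first and last points.
    """
    ks = np.arange(1, len(inertias) + 1, dtype=float)
    y = np.asarray(inertias, dtype=float)
    x1, y1, x2, y2 = ks[0], y[0], ks[-1], y[-1]
    # Perpendicular distance from each (k, inertia) point to the chord.
    num = np.abs((y2 - y1) * ks - (x2 - x1) * y + x2 * y1 - y2 * x1)
    den = np.hypot(x2 - x1, y2 - y1)
    return int(ks[np.argmax(num / den)])

# Hypothetical WCSS values for k = 1..8; the curve flattens after k = 3.
wcss = [1000, 420, 180, 150, 135, 125, 118, 113]
```

In practice the WCSS values come from fitting KMeans for each candidate k and reading its inertia.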
Stage 2: Supervised Classification
For each cluster, the system trains and compares two algorithms:
- XGBoost
- SVM (Support Vector Machine)
XGBoost
Algorithm: Extreme Gradient Boosting
Hyperparameters Tuned: Grid search with 5-fold cross-validation
Objective: binary:logistic (fraud vs. non-fraud)
Advantages:
- Handles non-linear relationships
- Built-in regularization
- Feature importance scores
- Robust to imbalanced data
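The grid-search pattern described above can be sketched with scikit-learn's GridSearchCV. Because xgboost may not be installed everywhere, this sketch uses GradientBoostingClassifier as a stand-in; the project's tuner uses xgboost.XGBClassifier with the same pattern, and the toy data and grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for one cluster's preprocessed claims (~10% fraud).
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Same pattern as the tuner: exhaustive grid search with 5-fold CV,
# scored by ROC-AUC to cope with the class imbalance.
param_grid = {"n_estimators": [50, 100],
              "max_depth": [2, 3],
              "learning_rate": [0.1, 0.5]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring="roc_auc")
grid.fit(X_tr, y_tr)
best = grid.best_estimator_  # refit on all of X_tr with the best params
```

The held-out split (X_te, y_te) is what the final model comparison is scored on.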
best_model_finder/tuner.py:66-107
Model Selection Process
best_model_finder/tuner.py:117-165
Evaluation Metric: ROC-AUC is used as the primary metric because it’s robust to class imbalance (fraud cases are typically rare). Accuracy is used as a fallback when only one class is present in the test set.
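A minimal sketch of that metric-selection rule; the actual logic in best_model_finder/tuner.py may differ in details:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def score_model(y_true, y_pred, y_score):
    """ROC-AUC when both classes appear in y_true, accuracy otherwise.

    y_pred are hard 0/1 labels; y_score are probability-like scores
    (ROC-AUC is undefined when y_true contains a single class).
    """
    if len(np.unique(y_true)) == 1:
        return "accuracy", accuracy_score(y_true, y_pred)
    return "roc_auc", roc_auc_score(y_true, y_score)
```

The same score is computed for both candidate models, and the higher one wins the cluster.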
Training Workflow per Cluster
Cluster Assignment
After K-Means clustering, each data point is assigned a cluster ID (0, 1, 2, etc.)
trainingModel.py:72-93
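The per-cluster training loop can be sketched as follows; make_and_fit is a hypothetical stand-in for the tuner's XGBoost-vs-SVM selection:

```python
import numpy as np

def train_per_cluster(X, y, cluster_ids, make_and_fit):
    """Train one model per cluster ID (the pattern used in trainingModel.py).

    make_and_fit(X_c, y_c) is any callable that returns a fitted model
    for one cluster's rows.
    """
    models = {}
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        models[int(c)] = make_and_fit(X[mask], y[mask])
    return models

# Usage with a trivial stand-in "model": the cluster's majority label.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 1, 1])
clusters = np.array([0, 0, 1, 1, 2, 2])
models = train_per_cluster(X, y, clusters,
                           lambda Xc, yc: int(round(yc.mean())))
```

At prediction time the same K-Means model assigns each new claim a cluster ID, which selects the model to use.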
Prediction Interpretation
The system outputs binary fraud predictions:
Prediction Process
predictFromModel.py:57-67
Output Format
File: Prediction_Output_File/Predictions.csv
- N - Claim predicted as NOT FRAUDULENT (model output = 0)
- Y - Claim predicted as FRAUDULENT (model output = 1)
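A sketch of how the 0/1 model outputs map to the N/Y labels when writing the output file. The claim_id column name is an assumption; check the actual Predictions.csv for the exact header:

```python
import csv
import io

def write_predictions(claim_ids, preds, out):
    """Write one row per claim, mapping model output 0 -> N, 1 -> Y."""
    writer = csv.writer(out)
    writer.writerow(["claim_id", "prediction"])
    for cid, p in zip(claim_ids, preds):
        writer.writerow([cid, "Y" if p == 1 else "N"])

# Usage: write to an in-memory buffer instead of a real file.
buf = io.StringIO()
write_predictions([101, 102], [0, 1], buf)
```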
Interpretation Guidelines
Y - Fraudulent Prediction
Action Required:
- Flag for manual investigation
- Request additional documentation
- Verify witness statements
- Check for previous fraud history
- Validate claim amounts against incident severity
N - Non-Fraudulent Prediction
Processing:
- Proceed with standard claim processing
- May still undergo random audits
- Not guaranteed legitimate (false negatives possible)
Handling Class Imbalance
Fraud cases are typically rare (often less than 10% of claims). The system addresses this as follows:
During Training (Optional)
The Preprocessor class includes a handle_imbalanced_dataset() method:
data_preprocessing/preprocessing.py:239-266
This method is defined but not currently used in the main training pipeline. It’s available for future implementation if class imbalance becomes problematic.
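If it were wired in, one simple implementation is random oversampling of the minority class, which imbalanced-learn's RandomOverSampler automates. A minimal numpy version as a sketch (not the method's actual code):

```python
import numpy as np

def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until classes are balanced."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority_count = counts.max()
    idx = np.flatnonzero(y == minority)
    # Sample (with replacement) enough minority rows to match the majority.
    extra = rng.choice(idx, size=majority_count - len(idx), replace=True)
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])
    return X_bal, y_bal
```

Oversampling must happen only on training folds, never on the test set, to avoid leaking duplicates across the split.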
Built-in Robustness
- XGBoost: Can weight the rare fraud class in its loss function (e.g., via scale_pos_weight)
- ROC-AUC Metric: Robust to class imbalance, evaluates across all thresholds
- Clustering: Naturally segments data, potentially creating more balanced cluster-specific datasets
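For the XGBoost point, the usual knob is scale_pos_weight, set to the negative-to-positive ratio of the training labels. The label counts below are illustrative:

```python
import numpy as np

# scale_pos_weight = (number of negatives) / (number of positives),
# computed on the training labels. With ~10% fraud this is about 9.
y_train = np.array([0] * 90 + [1] * 10)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

# It would then be passed to the classifier, e.g.:
# xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight, ...)
```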
Model Performance Considerations
Cluster-Specific Models
Each cluster gets a specialized model, improving accuracy for heterogeneous claims compared to a single global model.
Hyperparameter Tuning
GridSearchCV with 5-fold CV ensures robust parameter selection and reduces overfitting.
Algorithm Diversity
Comparing XGBoost (tree-based) against SVM (kernel-based) increases the likelihood of finding the best-performing approach for each cluster.
Automated Selection
The best model per cluster is selected automatically based on test-set performance; no manual intervention is needed.
Fraud Detection Best Practices
Next Steps
System Overview
Review the high-level system workflow and use cases
Architecture
Explore the technical implementation and module organization
Data Pipeline
Deep dive into data preprocessing and validation