What is the Fraud Detection System?
The fraud detection system is an end-to-end machine learning application designed to identify fraudulent insurance claims. Built with Flask, it exposes automated fraud detection through a REST API.
The system uses a hybrid approach: unsupervised clustering (K-Means) first groups similar claims, and supervised classifiers (XGBoost and SVM) are then trained on each group, so that every claim is scored by a model fitted to its own segment.
High-Level Workflow
The system operates through two main pipelines:
Training Pipeline
- Data Validation - Validates incoming CSV files against schema (schema_training.json)
- Data Preprocessing - Handles missing values, encodes categorical features, scales numerical features
- K-Means Clustering - Groups similar insurance claims, using the elbow method to choose the number of clusters
- Model Training - Trains XGBoost and SVM models on each cluster
- Model Selection - Selects best performing model per cluster based on AUC/accuracy scores
- Model Persistence - Saves trained models for prediction
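The clustering, per-cluster training, and model selection steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the data is already validated and preprocessed, uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost to keep dependencies small, and reads "AUC/accuracy" as "use AUC, falling back to accuracy when AUC is undefined". The function name `train_per_cluster` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train_per_cluster(X, y, n_clusters=2, seed=0):
    """Cluster the rows, then keep the better-scoring model for each cluster."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    best = {}
    for c in range(n_clusters):
        mask = kmeans.labels_ == c
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=0.3, random_state=seed
        )
        candidates = {
            "boosting": GradientBoostingClassifier(random_state=seed),  # XGBoost stand-in
            "svm": SVC(kernel="rbf"),
        }
        scores = {}
        for name, model in candidates.items():
            model.fit(X_tr, y_tr)
            if len(np.unique(y_te)) == 2:
                scores[name] = roc_auc_score(y_te, model.decision_function(X_te))
            else:  # AUC is undefined with a single class in the test split
                scores[name] = accuracy_score(y_te, model.predict(X_te))
        winner = max(scores, key=scores.get)
        best[c] = (winner, candidates[winner], scores[winner])
    return kmeans, best


# Synthetic stand-in for preprocessed claim features (not real claim data).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
kmeans, best = train_per_cluster(X, y)
for c, (name, _, score) in best.items():
    print(f"cluster {c}: {name} (score={score:.3f})")
```

The returned K-Means model and the per-cluster winners are the objects a persistence step would save for later prediction.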
Prediction Pipeline
- Data Validation - Validates prediction data against schema (schema_prediction.json)
- Preprocessing - Applies same transformations as training
- Cluster Assignment - Uses saved K-Means model to assign clusters
- Prediction - Applies cluster-specific best model
- Output Generation - Returns fraud predictions (Y/N) in CSV format
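The cluster assignment, prediction, and output steps above amount to routing each row to the model trained for its cluster. The sketch below shows that routing under stated assumptions: the two toy classes stand in for a saved K-Means model and saved classifiers, the label `1` is taken to mean fraud, and the `prediction` column name is hypothetical.

```python
import csv
import io

import numpy as np


def predict_by_cluster(X, cluster_model, models_by_cluster):
    """Assign each row to a cluster, then apply that cluster's model."""
    clusters = cluster_model.predict(X)
    preds = np.empty(len(X), dtype=int)
    for c in np.unique(clusters):
        mask = clusters == c
        preds[mask] = models_by_cluster[int(c)].predict(X[mask])
    return preds


def to_csv(preds):
    """Write predictions as a one-column CSV with Y/N labels."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["prediction"])  # hypothetical column name
    for p in preds:
        writer.writerow(["Y" if p == 1 else "N"])
    return buf.getvalue()


class SignCluster:
    """Toy stand-in for a fitted K-Means model: cluster 1 if first feature > 0."""

    def predict(self, X):
        return (X[:, 0] > 0).astype(int)


class Constant:
    """Toy stand-in for a fitted classifier that always returns one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return np.full(len(X), self.label)


X = np.array([[-1.0, 0.2], [2.0, 0.1], [-0.5, 0.9]])
preds = predict_by_cluster(X, SignCluster(), {0: Constant(0), 1: Constant(1)})
print(to_csv(preds))
```

Any objects with a scikit-learn style `predict` method can be passed in place of the toy classes.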
Key Technologies
Flask
Web framework providing REST API endpoints (/train, /predict) and monitoring dashboard integration
XGBoost
Gradient boosting classifier with hyperparameter tuning via GridSearchCV for binary fraud classification
SVM
Support Vector Machine with RBF/sigmoid kernels, used as alternative classifier in model selection
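One plausible way to choose between the RBF and sigmoid kernels is a cross-validated grid search. The source names GridSearchCV only for XGBoost, so applying it to the SVM is an assumption here, as are the `C` values and the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for scaled claim features (not real claim data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical search space: both kernels mentioned above, a few C values.
param_grid = {"kernel": ["rbf", "sigmoid"], "C": [0.1, 1.0, 10.0]}

search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the SVM refit on all of the data with the winning parameters.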
K-Means
Unsupervised clustering algorithm that segments claims into groups before classification
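The elbow method mentioned for K-Means picks the cluster count at which adding more clusters stops reducing inertia by much. The source does not say how the elbow is located, so the sketch below uses one common heuristic as an assumption: take the point on the normalised inertia curve farthest from the straight line joining its endpoints. The function name `elbow_k` and the blob data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def elbow_k(X, k_max=8, seed=0):
    """Pick k at the point of the inertia curve farthest from the chord joining its ends."""
    ks = np.arange(1, k_max + 1)
    inertia = np.array([
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in ks
    ])
    # Normalise both axes to [0, 1] so the distance does not depend on scale.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (inertia - inertia[-1]) / (inertia[0] - inertia[-1])
    # The chord runs from (0, 1) to (1, 0); distance to it is proportional to |x + y - 1|.
    return int(ks[np.argmax(np.abs(x + y - 1))])


# Three well-separated synthetic groups (not real claim data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0.0, 0.0], [10.0, 0.0], [5.0, 8.66]],
    cluster_std=1.0,
    random_state=0,
)
print("chosen k:", elbow_k(X))
```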
Additional Libraries
- scikit-learn - Model training, preprocessing, and evaluation
- Pandas & NumPy - Data manipulation and numerical operations
- Flask-MonitoringDashboard - Performance monitoring and API analytics
- imbalanced-learn - Handling imbalanced datasets with RandomOverSampler
Use Cases for Insurance Fraud Detection
Automated Claims Review
Screen incoming insurance claims automatically before manual review, flagging high-risk claims for detailed investigation.
Batch Processing
Process large batches of historical claims to identify patterns and potentially fraudulent activities that were previously missed.
Risk Scoring
Integrate predictions into risk assessment workflows to prioritize investigator resources on claims most likely to be fraudulent.
Pattern Detection
Use the clustering step to surface unusual claim patterns and emerging fraud schemes across different customer segments.
System Features
- Schema-based Validation - Enforces data quality with JSON schema validation
- Automated Preprocessing - Handles missing values, categorical encoding, and feature scaling
- Cluster-based Modeling - Trains specialized models for different claim segments
- Model Comparison - Automatically selects best performing algorithm (XGBoost vs SVM)
- REST API Interface - Easy integration with existing systems
- Monitoring Dashboard - Track API performance and usage metrics
- Batch Predictions - Process multiple claims efficiently
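To make the schema-based validation feature concrete, here is a small sketch that checks a CSV's header and column types against a JSON schema. The schema layout, the column names, and the `validate_csv` helper are all hypothetical; the actual format of schema_training.json and schema_prediction.json is not shown in this document.

```python
import csv
import io
import json

# Hypothetical schema: column name -> expected type.
SCHEMA_JSON = """
{"columns": {"claim_id": "int", "claim_amount": "float", "incident_type": "str"}}
"""
CASTS = {"int": int, "float": float, "str": str}


def validate_csv(text, schema):
    """Return a list of error strings; an empty list means the file passed."""
    expected = schema["columns"]
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(expected):
        return [f"header mismatch: expected {list(expected)}, got {reader.fieldnames}"]
    errors = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col, kind in expected.items():
            try:
                CASTS[kind](row[col])
            except (TypeError, ValueError):
                errors.append(f"line {line_no}: {col}={row[col]!r} is not {kind}")
    return errors


schema = json.loads(SCHEMA_JSON)
good = "claim_id,claim_amount,incident_type\n1,2500.0,collision\n"
bad = "claim_id,claim_amount,incident_type\n1,lots,collision\n"
print(validate_csv(good, schema))
print(validate_csv(bad, schema))
```

Returning a list of errors rather than raising on the first one lets a caller log every problem in a rejected file at once.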
Production Ready - The system includes comprehensive logging and error handling, and can be deployed behind Gunicorn/Nginx.
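As a rough picture of the REST interface, the sketch below defines a `/predict` route in Flask. Only the endpoint path comes from this document; the JSON request shape, the `claims` key, and the placeholder scoring rule are assumptions made for illustration, and the real service may accept files or paths instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def score_claims(claims):
    """Placeholder for the real prediction pipeline (hypothetical rule)."""
    return ["Y" if c.get("claim_amount", 0) > 10000 else "N" for c in claims]


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid JSON body.
    payload = request.get_json(silent=True)
    if not payload or "claims" not in payload:
        return jsonify(error="expected a JSON body with a 'claims' list"), 400
    return jsonify(predictions=score_claims(payload["claims"]))
```

If this were saved as `app.py` (a hypothetical file name), it could be served with `gunicorn --workers 2 --bind 0.0.0.0:8000 app:app`, with Nginx proxying requests to that port.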
Next Steps
Architecture
Explore the technical architecture and module organization
Data Pipeline
Deep dive into data ingestion and transformation processes
Fraud Detection
Learn about the fraud detection methodology and features