System Components
The fraud detection system follows a modular architecture with clear separation of concerns.

Core Application Layer

Flask API Server
Location: main.py
Entry point handling HTTP requests, routing, and response formatting. Runs on port 5001 with Flask-MonitoringDashboard integration.

Training Engine
Location: trainingModel.py
Orchestrates the complete training pipeline from data loading through model persistence. Implements the trainModel class.

Prediction Engine
Location: predictFromModel.py
Handles prediction requests, loads the appropriate models, and generates fraud predictions. Implements the prediction class.

Validation Layer
Location: training_Validation_Insertion.py, prediction_Validation_Insertion.py
Validates incoming data against schemas before it enters the training or prediction pipelines.

Data Processing Modules
Data Ingestion
Location: data_ingestion/
- data_loader.py - Loads training data from Training_FileFromDB/InputFile.csv
- data_loader_prediction.py - Loads prediction data from validated sources
Data Preprocessing
Location: data_preprocessing/
- preprocessing.py - The Preprocessor class handles:
  - Column removal (irrelevant features)
  - Missing value imputation using CategoricalImputer
  - Categorical encoding (label encoding + one-hot encoding)
  - Feature scaling with StandardScaler
  - Label separation
  - Dataset balancing with RandomOverSampler
- clustering.py - The KMeansClustering class provides:
  - Elbow plot generation for optimal cluster count
  - K-Means model training and prediction
  - Cluster assignment to data points
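The preprocessing steps above can be sketched end to end as follows. This is a minimal illustration, not the actual Preprocessor code: the column names are invented for the demo, and median imputation stands in for CategoricalImputer (oversampling with imbalanced-learn's RandomOverSampler would follow the same pattern and is noted in a comment only):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sketch of the Preprocessor pipeline; column names are illustrative.
df = pd.DataFrame({
    "policy_number": [101, 102, 103, 104],          # irrelevant identifier
    "months_as_customer": [12.0, 30.0, None, 45.0],
    "incident_severity": ["Major", "Minor", "Minor", "Major"],
    "fraud_reported": ["Y", "N", "N", "Y"],
})

# 1. Remove irrelevant columns
df = df.drop(columns=["policy_number"])
# 2. Impute missing values (the project uses CategoricalImputer; median shown here)
df["months_as_customer"] = df["months_as_customer"].fillna(df["months_as_customer"].median())
# 3. Encode categoricals: label-encode the binary target, one-hot the rest
df["fraud_reported"] = df["fraud_reported"].map({"Y": 1, "N": 0})
df = pd.get_dummies(df, columns=["incident_severity"])
# 4. Separate the label, then scale the remaining features
y = df.pop("fraud_reported")
X = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
# 5. Balancing with RandomOverSampler (imbalanced-learn) would follow here.
print(X.shape, list(y))  # (4, 3) [1, 0, 0, 1]
```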
Model Selection
Location: best_model_finder/
tuner.py - The Model_Finder class implements:
- SVM Hyperparameters: kernel (rbf, sigmoid), C (0.1, 0.5, 1.0), random_state
- XGBoost Hyperparameters: n_estimators (100, 130), criterion (gini, entropy), max_depth (8-10)
- Selection Metric: ROC-AUC score (falls back to accuracy for single-class scenarios)
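The SVM arm of this search can be sketched with scikit-learn's GridSearchCV using the grid listed above; this is an illustration rather than the actual Model_Finder code (the real tuner.py also searches the XGBoost grid and keeps whichever model scores higher):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed fraud data
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Grid from the documentation: kernel in {rbf, sigmoid}, C in {0.1, 0.5, 1.0}
param_grid = {"kernel": ["rbf", "sigmoid"], "C": [0.1, 0.5, 1.0]}
grid = GridSearchCV(SVC(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)
best_svm = grid.best_estimator_

y_pred = best_svm.predict(X_test)
# ROC-AUC is undefined when the test labels contain a single class,
# so fall back to plain accuracy in that case (the doc's selection rule).
if len(set(y_test)) == 1:
    score = accuracy_score(y_test, y_pred)
else:
    score = roc_auc_score(y_test, y_pred)
print(best_svm.kernel, round(score, 3))
```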
File Operations
Location: file_operations/
file_methods.py - The File_Operation class manages:
- Model serialization and deserialization
- Finding correct model files for cluster assignments
- Model versioning per cluster
Validation & Transformation Modules
- Training Path
- Prediction Path
- DataTypeValidation_Insertion_Training/ - Schema validation against schema_training.json
- DataTransform_Training/ - Data transformation specific to training
- Training_Raw_data_validation/ - Raw data quality checks
- Training_Batch_Files/ - Batch file storage
- TrainingArchiveBadData/ - Archive for rejected data
Logging & Monitoring
Location: application_logging/
logger.py - The App_Logger class provides:
- Centralized logging across all modules
- Training logs in Training_Logs/ModelTrainingLog.txt
- Prediction logs in Prediction_Logs/Prediction_Log.txt
- Structured log format for debugging and auditing
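A centralized logger of this kind is typically a small class that stamps each message and appends it to a caller-supplied file handle. The sketch below assumes that shape; the method name and timestamp format are illustrative, not taken from logger.py:

```python
import io
from datetime import datetime

class AppLoggerSketch:
    """Minimal stand-in for App_Logger; method name and log format are assumed."""

    def log(self, file_object, message):
        # One line per event: date/time, separator, message
        now = datetime.now()
        stamp = now.strftime("%Y-%m-%d") + "/" + now.strftime("%H:%M:%S")
        file_object.write(stamp + "\t\t" + message + "\n")

# Usage with an in-memory buffer instead of Training_Logs/ModelTrainingLog.txt
buf = io.StringIO()
AppLoggerSketch().log(buf, "Start of Training")
print(buf.getvalue().strip())
```

Passing the file object in (rather than opening it inside the logger) is what lets every module share one logger class while writing to its own log file.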
Data Flow
Training Flow
Columns Removed During Preprocessing:
policy_number, policy_bind_date, policy_state, insured_zip, incident_location, incident_date, incident_state, incident_city, insured_hobbies, auto_make, auto_model, auto_year, age, total_claim_amount

These columns are removed because they don't contribute to prediction or may cause overfitting.

Prediction Flow
Module Organization
The source code in ~/workspace/source/ is organized into the modules described above.
Training and Prediction Pipeline Interaction
Model Training Phase
The training pipeline creates cluster-specific models. For example, if K-Means identifies 3 optimal clusters, the system trains 6 models total:
- XGBoost0, SVM0 (cluster 0) → best saved
- XGBoost1, SVM1 (cluster 1) → best saved
- XGBoost2, SVM2 (cluster 2) → best saved
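The per-cluster training loop can be sketched as below. This is an illustration under stated assumptions: a RandomForestClassifier stands in for XGBoost so the snippet runs with scikit-learn alone, the data is synthetic, and the accuracy fallback mirrors the selection rule described for tuner.py:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier  # stand-in for XGBoost here
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed training data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 2, 300)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

best_models = {}
for cluster_id in range(3):
    mask = kmeans.labels_ == cluster_id
    Xc, yc = X[mask], y[mask]
    candidates = {"SVM": SVC(random_state=0),
                  "RF": RandomForestClassifier(random_state=0)}
    scores = {}
    for name, model in candidates.items():
        model.fit(Xc, yc)
        pred = model.predict(Xc)
        # Same fallback as tuner.py: accuracy when only one class is present
        scores[name] = (accuracy_score(yc, pred) if len(set(yc)) == 1
                        else roc_auc_score(yc, pred))
    winner = max(scores, key=scores.get)
    # Saved under {ModelName}{ClusterID}, e.g. "RF0"
    best_models[f"{winner}{cluster_id}"] = candidates[winner]
print(sorted(best_models))
```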
Model Persistence
The File_Operation.save_model() method saves:
- the K-Means model as KMeans
- the best classifier per cluster as {ModelName}{ClusterID}
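A persistence helper of this shape is typically a thin wrapper around pickle. The sketch below is an assumption about file_methods.py, not a copy of it: the per-model subdirectory layout and the .sav extension follow a common convention and may differ in the real code:

```python
import os
import pickle

def save_model(model, model_name, directory):
    """Sketch of a File_Operation.save_model-style helper (layout assumed)."""
    path = os.path.join(directory, model_name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, model_name + ".sav"), "wb") as f:
        pickle.dump(model, f)

def load_model(model_name, directory):
    """Inverse operation used later by the prediction pipeline."""
    with open(os.path.join(directory, model_name, model_name + ".sav"), "rb") as f:
        return pickle.load(f)
```

Training would then call something like save_model(kmeans, "KMeans", models_dir) once, plus one save_model(best, f"XGBoost{cluster_id}", models_dir) per cluster.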
Prediction Phase
The prediction pipeline:
- Loads the saved K-Means model
- Assigns each prediction row to a cluster
- Loads the corresponding best model for that cluster
- Generates predictions using the cluster-specific model
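The four steps above amount to routing each incoming row through the saved K-Means model to its cluster-specific classifier. A minimal sketch of that routing (function name and demo data are illustrative, not from predictFromModel.py):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def predict_with_cluster_models(kmeans, models_by_cluster, X):
    """Assign each row to a cluster, then predict with that cluster's model."""
    clusters = kmeans.predict(X)
    preds = np.empty(len(X), dtype=int)
    for cluster_id in np.unique(clusters):
        mask = clusters == cluster_id
        preds[mask] = models_by_cluster[cluster_id].predict(X[mask])
    return preds

# Tiny demo: two clusters, one fitted model per cluster
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = rng.integers(0, 2, 40)
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
models = {c: LogisticRegression().fit(X[km.labels_ == c], y[km.labels_ == c])
          for c in (0, 1)}
preds = predict_with_cluster_models(km, models, X)
print(preds.shape)
```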
API Endpoints
POST /train
Request handler: main.py:64-92
POST /predict
Request handler: main.py:25-60
GET /dashboard
Access Flask-MonitoringDashboard for performance metrics and API monitoring.

Deployment Considerations
Production Deployment:
- Use Gunicorn as the WSGI server (see Procfile)
- Deploy behind an Nginx reverse proxy
- Configure environment variables for paths and ports
- Enable CORS for cross-origin requests (already configured)
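For reference, a Procfile for a Gunicorn-served Flask app typically contains a single line like the one below. This is a hypothetical example, assuming the Flask application object is named app inside main.py; check the project's actual Procfile for the real entry:

```
web: gunicorn main:app
```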
Next Steps
Data Pipeline
Learn about data ingestion and schema validation
Fraud Detection
Explore fraud detection methodology and features