Data Ingestion Overview
The data pipeline handles CSV file ingestion, schema validation, data transformation, and quality assurance for both training and prediction workflows. The system uses separate schemas for training (schema_training.json) and prediction (schema_prediction.json) to account for the presence or absence of the target variable fraud_reported.
Data Sources
Training Data
Input Location: Training_FileFromDB/InputFile.csv
Schema Reference: schema_training.json
Expected Columns: 39 (including fraud_reported target)
Implementation: data_ingestion/data_loader.py:16-30
Prediction Data
Input Location: Specified via the API request's filepath
Schema Reference: schema_prediction.json
Expected Columns: 38 (excluding fraud_reported)
Implementation:
Schema Validation
Training Schema Structure
The schema_training.json file defines the expected data format:
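The schema's exact key names aren't reproduced on this page; the sketch below is illustrative (and abridged to three of the 39 columns), built only from the documented values: a 9-character date stamp, a 6-character time stamp, 39 columns, and Integer/varchar column types.

```json
{
  "SampleFileName": "fraudDetection_021119920_010222.csv",
  "LengthOfDateStampInFile": 9,
  "LengthOfTimeStampInFile": 6,
  "NumberofColumns": 39,
  "ColName": {
    "months_as_customer": "Integer",
    "policy_csl": "varchar",
    "fraud_reported": "varchar"
  }
}
```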
Validation Rules
File Naming Convention
Files must follow the pattern: fraudDetection_{DateStamp}_{TimeStamp}.csv
- DateStamp Length: 9 characters
- TimeStamp Length: 6 characters
- Example: fraudDetection_021119920_010222.csv
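A minimal sketch of this check in Python, assuming the stamps are purely numeric (as in the example above); the project's actual logic lives in the validation modules described below:

```python
import re

# Naming rule: 9-character date stamp and 6-character time stamp,
# assumed numeric here based on the documented example.
FILENAME_PATTERN = re.compile(r"fraudDetection_\d{9}_\d{6}\.csv")

def is_valid_filename(name: str) -> bool:
    return FILENAME_PATTERN.fullmatch(name) is not None

print(is_valid_filename("fraudDetection_021119920_010222.csv"))  # True
print(is_valid_filename("fraudDetection_2021_010222.csv"))       # False
```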
Column Count Validation
- Training: Must have exactly 39 columns
- Prediction: Must have exactly 38 columns (no fraud_reported)
Data Type Validation
Each column’s data type is validated against the schema:
- Integer: Numeric values (can contain NaN)
- varchar: String values (categorical or text)
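A sketch of a per-column type check consistent with these rules; since Integer columns may contain NaN, pandas typically stores them as floats, so the check accepts any numeric dtype:

```python
import pandas as pd

def column_type_ok(series: pd.Series, schema_type: str) -> bool:
    # Integer columns may contain NaN (stored as float), so accept any
    # numeric dtype; varchar columns are strings/objects.
    if schema_type == "Integer":
        return pd.api.types.is_numeric_dtype(series)
    return pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series)
```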
Schema Consistency
Column names must exactly match the schema definition. Missing or extra columns cause validation failure.
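A hedged sketch of this check (the ColName and NumberofColumns keys are assumptions carried over from the schema sketch above; the real logic lives in the validation modules below):

```python
import json
import pandas as pd

def check_schema_consistency(csv_path: str, schema_path: str) -> None:
    with open(schema_path) as f:
        schema = json.load(f)
    expected = set(schema["ColName"])  # assumed key, see schema sketch above

    df = pd.read_csv(csv_path)
    if len(df.columns) != schema["NumberofColumns"]:
        raise ValueError(
            f"Expected {schema['NumberofColumns']} columns, got {len(df.columns)}")

    missing = expected - set(df.columns)
    extra = set(df.columns) - expected
    if missing or extra:
        raise ValueError(f"Schema mismatch: missing={missing}, extra={extra}")
```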
Validation Modules
The pipeline uses parallel validation modules for training and prediction input; the training workflow is detailed below.
Training Validation
Module: DataTypeValidation_Insertion_Training/
Workflow:
- Read schema from schema_training.json
- Validate file name format
- Check column count (39 expected)
- Validate data types per column
- Move valid files to processing queue
- Archive bad data to TrainingArchiveBadData/
Implementation: training_Validation_Insertion.py
Data Transformation Steps
The preprocessing pipeline applies identical transformations to training and prediction data.
Step 1: Column Removal
Method: Preprocessor.remove_columns()
Columns Removed: 14 features that don’t contribute to prediction
trainingModel.py:40, predictFromModel.py:30-34
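The 14 dropped column names are listed at the cited lines; functionally, the step is a plain drop, roughly:

```python
import pandas as pd

def remove_columns(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    # Drop the non-predictive features named in trainingModel.py.
    return data.drop(columns=columns)
```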
Step 2: Missing Value Handling
Method: Preprocessor.is_null_present() → Preprocessor.impute_missing_values()
Process:
- Detection: Identify columns with missing values (NaN)
- Logging: Save null value report to preprocessing_data/null_values.csv
- Imputation: Use CategoricalImputer for all columns with missing data
data_preprocessing/preprocessing.py:97-155
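A sketch of the detect-log-impute sequence; a pandas most-frequent fill stands in for CategoricalImputer, and the report's header names are illustrative:

```python
import pandas as pd

def handle_missing_values(data: pd.DataFrame) -> pd.DataFrame:
    # Detection: count NaNs per column.
    null_counts = data.isna().sum()

    # Logging: persist the null report (header names are illustrative).
    pd.DataFrame({
        "columns": data.columns,
        "missing values count": null_counts.values,
    }).to_csv("preprocessing_data/null_values.csv", index=False)

    # Imputation: most-frequent fill as a stand-in for CategoricalImputer.
    for col in null_counts[null_counts > 0].index:
        data[col] = data[col].fillna(data[col].mode()[0])
    return data
```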
Step 3: Categorical Encoding
Method: Preprocessor.encode_categorical_columns()
Encoding Strategy: Hybrid approach using label encoding + one-hot encoding
Label Encoding (Ordinal Features)
data_preprocessing/preprocessing.py:207-217
One-Hot Encoding (Nominal Features)
Remaining categorical columns are one-hot encoded:
- insured_occupation
- insured_relationship
- incident_type
- collision_type
- authorities_contacted
data_preprocessing/preprocessing.py:226-227
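A sketch of the hybrid strategy; the label-encoded column and its value map are illustrative (the real mappings sit at the cited lines), while the one-hot column list matches the one above:

```python
import pandas as pd

def encode_categorical_columns(data: pd.DataFrame) -> pd.DataFrame:
    # Label encoding for an ordinal/binary feature (illustrative mapping).
    data["police_report_available"] = data["police_report_available"].map(
        {"NO": 0, "YES": 1})

    # One-hot encoding for the nominal features; drop_first=True drops
    # one dummy column per feature.
    nominal = ["insured_occupation", "insured_relationship", "incident_type",
               "collision_type", "authorities_contacted"]
    return pd.get_dummies(data, columns=nominal, drop_first=True)
```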
drop_first=True prevents multicollinearity by dropping one category per feature.
Step 4: Feature Scaling
Method: Preprocessor.scale_numerical_columns()
Scaler: StandardScaler (z-score normalization)
Numerical Features Scaled:
data_preprocessing/preprocessing.py:156-190
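A minimal sketch of the scaling step (the actual column list is defined at the cited lines):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_numerical_columns(data: pd.DataFrame, numerical_cols: list) -> pd.DataFrame:
    # z-score normalization: (x - mean) / std for each listed column.
    scaler = StandardScaler()
    data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
    return data
```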
Step 5: Label Separation (Training Only)
Method: Preprocessor.separate_label_feature()
data_preprocessing/preprocessing.py:74-95
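Functionally this splits the preprocessed frame into features and target; a minimal sketch:

```python
import pandas as pd

def separate_label_feature(data: pd.DataFrame, label_column_name: str = "fraud_reported"):
    # X: all predictor columns; y: the fraud_reported target.
    X = data.drop(columns=[label_column_name])
    y = data[label_column_name]
    return X, y
```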
Data Quality Checks
Missing Value Report
Output: preprocessing_data/null_values.csv
Content: per-column counts of missing values
data_preprocessing/preprocessing.py:121-124
Elbow Plot
Output: preprocessing_data/K-Means_Elbow.PNG
Purpose: Visual confirmation of optimal cluster count
Method: K-Means WCSS (Within-Cluster Sum of Squares) analysis
data_preprocessing/clustering.py:19-47
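A self-contained sketch of how such a plot is produced; toy data stands in for the preprocessed claims, and the output path assumes preprocessing_data/ already exists:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for batch runs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the preprocessed claim features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# WCSS (inertia) for k = 1..10; the "elbow" marks diminishing returns.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in range(1, 11)]

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.savefig("preprocessing_data/K-Means_Elbow.PNG")
```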
Output Formats
Training Output
Model Files: Saved to the models/ directory (via File_Operation)
Naming Convention:
- K-Means model: KMeans
- Cluster models: XGBoost0, SVM1, etc.
Prediction Output
File: Prediction_Output_File/Predictions.csv
Format:
- N: Not fraudulent (model output = 0)
- Y: Fraudulent (model output = 1)
predictFromModel.py:64-71
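A sketch of the output step as documented (the column header and Series of raw outputs are stand-ins for the real model predictions):

```python
import pandas as pd

# Map raw model outputs to the documented labels and append to the file,
# mirroring the documented mode='a+' behavior.
raw_outputs = pd.Series([0, 1, 0])
labels = raw_outputs.map({0: "N", 1: "Y"})
labels.to_frame("Predictions").to_csv(
    "Prediction_Output_File/Predictions.csv", header=True, mode="a+")
```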
The prediction file is created with mode='a+', which appends results. The deletePredictionFile() method ensures old predictions are cleared before new runs.
Data Pipeline Error Handling
Logging Strategy
All pipeline operations are logged via App_Logger:
- Training: Training_Logs/ModelTrainingLog.txt
- Prediction: Prediction_Logs/Prediction_Log.txt
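App_Logger's interface isn't shown on this page; assuming the common pattern of a log(file_object, message) method, usage would look roughly like:

```python
# Hypothetical usage; the import path and log() signature are assumptions.
from application_logging.logger import App_Logger

logger = App_Logger()
with open("Training_Logs/ModelTrainingLog.txt", "a+") as log_file:
    logger.log(log_file, "Training validation started")
```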
Bad Data Handling
Training: Files failing validation are moved to TrainingArchiveBadData/ with subdirectories by date
Prediction: Bad files are logged and rejected without archiving
Performance Considerations
Batch Processing
CSV files are processed in batch mode, allowing efficient handling of thousands of claims simultaneously.
Memory Management
Pandas DataFrames are used throughout, providing efficient memory usage for tabular data operations.
Preprocessing Efficiency
StandardScaler and one-hot encoding are vectorized operations, ensuring fast transformation.
Model Loading
Models are loaded once per prediction request and cached within the request lifecycle.
Next Steps
Fraud Detection
Learn about the fraud detection methodology and model approach
Architecture
Explore the technical architecture and system components