Overview
The fraud detection system requires training data in a specific CSV format with 39 columns. This guide covers the input requirements, schema structure, and file naming conventions.
Input Data Format
Training data must be provided as CSV files following these requirements:
- All CSV files must contain exactly 39 columns matching the schema definition.
File Naming Convention
Files must follow this pattern: fraudDetection_021119920_010222.csv
- Prefix: fraudDetection_
- Date stamp: 9 characters (e.g., 021119920)
- Time stamp: 6 characters (e.g., 010222)
- Extension: .csv
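The naming rule above can be expressed as a regular expression. The following is a minimal sketch, not the system's actual validator: the names FILENAME_PATTERN and is_valid_filename are hypothetical, and it assumes both stamps consist of digits only, since the guide specifies only their lengths.

```python
import re

# Assumption: the date and time stamps are digits only. The guide states
# their lengths (9 and 6 characters) but not the allowed characters.
FILENAME_PATTERN = re.compile(r"fraudDetection_\d{9}_\d{6}\.csv")


def is_valid_filename(name: str) -> bool:
    """Return True if name is exactly fraudDetection_<9 digits>_<6 digits>.csv."""
    return FILENAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    print(is_valid_filename("fraudDetection_021119920_010222.csv"))
    print(is_valid_filename("fraudDetection_02111992_010222.csv"))
```

Using fullmatch rather than search means extra characters before the prefix or after the extension cause a rejection, which matches the "exact pattern" wording in this guide.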
Schema Definition
The system uses schema_training.json to validate incoming data.
Required Columns (39 Total)
The dataset must include these columns with their respective data types:
Customer Information
Number of months the customer has been with the insurance company
Age of the insured person
ZIP code of the insured
Gender of the insured (MALE/FEMALE)
Education level (JD, High School, College, Masters, Associate, MD, PhD)
Occupation of the insured
Hobbies of the insured
Relationship status
Policy Information
Unique policy identifier
Date when policy was bound
State where policy is issued
Combined single limit (100/300, 250/500, 500/1000)
Policy deductible amount
Annual premium amount
Umbrella policy limit
Financial Information
Capital gains of the insured
Capital losses of the insured
Total claim amount
Injury claim amount
Property damage claim amount
Vehicle damage claim amount
Incident Information
Date of the incident
Type of incident
Type of collision
Severity level (Trivial Damage, Minor Damage, Major Damage, Total Loss)
Which authorities were contacted
State where incident occurred
City where incident occurred
Specific location of incident
Hour when incident occurred (0-23)
Number of vehicles in the incident
Whether property damage occurred (YES/NO)
Number of bodily injuries
Number of witnesses
Whether police report is available (YES/NO)
Vehicle Information
Make of the vehicle
Model of the vehicle
Year of the vehicle
Target Variable
Whether fraud was reported (Y/N) - this is the target variable for training
Sample Data Structure
Here’s an example of how your CSV data should be structured:
Data Validation
After preparing your data files:
- Place them in the Training_Batch_Files/ directory
- The validation process will automatically:
  - Check file naming conventions
  - Verify column count (must be 39)
  - Validate data types
  - Move valid files to the Good_Raw/ folder
  - Move invalid files to the Bad_Raw/ folder
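The validation steps above can be sketched roughly as follows. This illustrates the described behavior and is not the system's own code: route_files and has_expected_columns are hypothetical names, the filename regex assumes digit-only stamps, and data type validation is left out because the schema contents are not shown in this guide.

```python
import csv
import re
import shutil
from pathlib import Path

EXPECTED_COLUMNS = 39
# Assumption: date and time stamps are digits only.
NAME_RE = re.compile(r"fraudDetection_\d{9}_\d{6}\.csv")


def has_expected_columns(path: Path) -> bool:
    """Return True if the file has at least one row and every row has 39 fields."""
    row_count = 0
    with path.open(newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != EXPECTED_COLUMNS:
                return False
            row_count += 1
    return row_count > 0


def route_files(batch_dir: Path, good_dir: Path, bad_dir: Path) -> dict:
    """Move each file in batch_dir to good_dir or bad_dir.

    Returns a mapping of file name to the name of the destination folder.
    """
    good_dir.mkdir(parents=True, exist_ok=True)
    bad_dir.mkdir(parents=True, exist_ok=True)
    destinations = {}
    # sorted() materializes the listing before any file is moved.
    for path in sorted(batch_dir.iterdir()):
        if not path.is_file():
            continue
        name_ok = NAME_RE.fullmatch(path.name) is not None
        valid = name_ok and has_expected_columns(path)
        target = good_dir if valid else bad_dir
        shutil.move(str(path), str(target / path.name))
        destinations[path.name] = target.name
    return destinations
```

A call such as route_files(Path("Training_Batch_Files"), Path("Good_Raw"), Path("Bad_Raw")) would process one batch. Where the Good_Raw and Bad_Raw folders live relative to the batch directory is an assumption here.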
Missing values are acceptable and will be handled during preprocessing. Use ? or leave cells empty for missing values.
Next Steps
Once your data is prepared:
Common Issues
- File rejected during validation: Check that the filename matches the exact pattern
- Column count mismatch: Ensure the CSV has all 39 columns in the correct order
- Data type errors: Verify that numeric fields contain only numbers (except for missing values)
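To track down data type errors before submitting a file, a check along the following lines may help. It is a sketch under stated assumptions: parse_numeric and invalid_numeric_cells are hypothetical helpers, the caller supplies the indexes of the numeric columns (this guide does not list them by position), and a cell counts as missing when it is empty or contains ?.

```python
# Missing-value markers described in this guide: an empty cell or "?".
MISSING_MARKERS = {"", "?"}


def parse_numeric(cell: str):
    """Return None for a missing cell, a float for a numeric cell.

    Raises ValueError for anything else. Note that float() also accepts
    forms such as "1e3", "nan" and "inf"; tighten this if those should fail.
    """
    text = cell.strip()
    if text in MISSING_MARKERS:
        return None
    return float(text)


def invalid_numeric_cells(rows, numeric_indexes):
    """Return (row_number, column_index, value) for each bad numeric cell."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for index in numeric_indexes:
            try:
                parse_numeric(row[index])
            except ValueError:
                problems.append((row_number, index, row[index]))
    return problems


if __name__ == "__main__":
    sample = [["12", "?", "abc"], ["", "3.5", "7"]]
    print(invalid_numeric_cells(sample, [0, 1, 2]))
```

Row numbers are 1-based and count every row passed in, so skip the header row before calling the function if the file has one.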