Overview

The fraud detection system requires training data in a specific CSV format with 39 columns. This guide covers the input requirements, schema structure, and file naming conventions.

Input Data Format

Training data must be provided as CSV files. Every file must contain exactly 39 columns matching the schema definition below.

File Naming Convention

Files must follow this pattern:
fraudDetection_[DATESTAMP]_[TIMESTAMP].csv
Example: fraudDetection_021119920_010222.csv
  • Prefix: fraudDetection_
  • Date stamp: 9 characters (e.g., 021119920)
  • Time stamp: 6 characters (e.g., 010222)
  • Extension: .csv
Files that don’t match this naming pattern will be rejected and moved to the Bad_Raw folder during validation.
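The naming convention above maps directly to a regular expression: a fixed prefix, a 9-digit date stamp, a 6-digit time stamp, and a .csv extension. The following is a minimal sketch of such a check; the function name is illustrative, not part of the project's actual code:

```python
import re

# Prefix, 9-digit date stamp, 6-digit time stamp, .csv extension,
# per the naming convention described above.
FILENAME_PATTERN = re.compile(r"^fraudDetection_\d{9}_\d{6}\.csv$")

def is_valid_filename(name: str) -> bool:
    """Return True if the file name matches the expected pattern."""
    return FILENAME_PATTERN.match(name) is not None
```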

Schema Definition

The system uses schema_training.json to validate incoming data:
{
  "SampleFileName": "fraudDetection_021119920_010222.csv",
  "LengthOfDateStampInFile": 9,
  "LengthOfTimeStampInFile": 6,
  "NumberofColumns": 39
}
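Because the stamp lengths live in the schema file, a validator can derive the filename pattern from schema_training.json rather than hard-coding it. This sketch assumes the helper names; only the JSON keys shown above are taken from the source:

```python
import json
import re

def load_schema(path: str = "schema_training.json") -> dict:
    """Load the training schema used for validation."""
    with open(path) as f:
        return json.load(f)

def build_name_regex(schema: dict) -> "re.Pattern":
    """Construct the filename regex from the schema's stamp lengths."""
    date_len = schema["LengthOfDateStampInFile"]
    time_len = schema["LengthOfTimeStampInFile"]
    return re.compile(rf"^fraudDetection_\d{{{date_len}}}_\d{{{time_len}}}\.csv$")
```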

Required Columns (39 Total)

The dataset must include these columns with their respective data types:

Customer Information

  • months_as_customer (Integer): Number of months the customer has been with the insurance company
  • age (Integer): Age of the insured person
  • insured_zip (Integer): ZIP code of the insured
  • insured_sex (varchar): Gender of the insured (MALE/FEMALE)
  • insured_education_level (varchar): Education level (JD, High School, College, Masters, Associate, MD, PhD)
  • insured_occupation (varchar): Occupation of the insured
  • insured_hobbies (varchar): Hobbies of the insured
  • insured_relationship (varchar): Relationship status

Policy Information

  • policy_number (Integer): Unique policy identifier
  • policy_bind_date (varchar): Date when the policy was bound
  • policy_state (varchar): State where the policy is issued
  • policy_csl (varchar): Combined single limit (100/300, 250/500, 500/1000)
  • policy_deductable (Integer): Policy deductible amount
  • policy_annual_premium (Integer): Annual premium amount
  • umbrella_limit (Integer): Umbrella policy limit

Financial Information

  • capital-gains (Integer): Capital gains of the insured
  • capital-loss (Integer): Capital losses of the insured
  • total_claim_amount (Integer): Total claim amount
  • injury_claim (Integer): Injury claim amount
  • property_claim (Integer): Property damage claim amount
  • vehicle_claim (Integer): Vehicle damage claim amount

Incident Information

  • incident_date (varchar): Date of the incident
  • incident_type (varchar): Type of incident
  • collision_type (varchar): Type of collision
  • incident_severity (varchar): Severity level (Trivial Damage, Minor Damage, Major Damage, Total Loss)
  • authorities_contacted (varchar): Which authorities were contacted
  • incident_state (varchar): State where the incident occurred
  • incident_city (varchar): City where the incident occurred
  • incident_location (varchar): Specific location of the incident
  • incident_hour_of_the_day (Integer): Hour when the incident occurred (0-23)
  • number_of_vehicles_involved (Integer): Number of vehicles in the incident
  • property_damage (varchar): Whether property damage occurred (YES/NO)
  • bodily_injuries (Integer): Number of bodily injuries
  • witnesses (Integer): Number of witnesses
  • police_report_available (varchar): Whether a police report is available (YES/NO)

Vehicle Information

  • auto_make (varchar): Make of the vehicle
  • auto_model (varchar): Model of the vehicle
  • auto_year (Integer): Year of the vehicle

Target Variable

  • fraud_reported (varchar): Whether fraud was reported (Y/N); this is the target variable for training

Sample Data Structure

Here’s an example of how your CSV data should be structured:
months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,...,fraud_reported
328,48,521585,2014-10-17,OH,250/500,...,Y
228,42,342868,2006-04-03,IN,250/500,...,Y
134,29,687698,2000-06-22,OH,100/300,...,N

Data Validation

After preparing your data files:
  1. Place them in the Training_Batch_Files/ directory
  2. The validation process will automatically:
    • Check file naming conventions
    • Verify column count (must be 39)
    • Validate data types
    • Move valid files to Good_Raw/ folder
    • Move invalid files to Bad_Raw/ folder
Missing values are acceptable and will be handled during preprocessing. Use ? or leave cells empty for missing values.
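The validation flow above can be sketched as follows. The directory names come from this guide; the function name and the use of the header row's length as the column-count check are illustrative assumptions, not the project's actual implementation:

```python
import csv
import shutil
from pathlib import Path

EXPECTED_COLUMNS = 39

def validate_batch(src="Training_Batch_Files", good="Good_Raw", bad="Bad_Raw"):
    """Sort raw CSVs into Good_Raw/ or Bad_Raw/ based on column count."""
    Path(good).mkdir(exist_ok=True)
    Path(bad).mkdir(exist_ok=True)
    for path in Path(src).glob("*.csv"):
        # Read only the header row to count columns.
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        dest = good if len(header) == EXPECTED_COLUMNS else bad
        shutil.move(str(path), dest)
```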

Next Steps

Once your data is prepared:
  1. Validate Data: Run the data validation process to ensure files meet requirements.
  2. Review Validation Logs: Check Training_Logs/ for any validation errors.
  3. Preprocess Data: Move to the preprocessing stage for feature engineering.

Common Issues

  • File rejected during validation: Check that filename matches the exact pattern
  • Column count mismatch: Ensure CSV has all 39 columns in the correct order
  • Data type errors: Verify that numeric fields contain only numbers (except for missing values)
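For the last issue, a small helper can flag offending values in a column expected to be Integer, treating ? and empty cells as the allowed missing-value markers described above. This is a hypothetical diagnostic, not part of the project's validation code:

```python
def bad_numeric_values(values):
    """Return values that are neither integers nor allowed missing markers."""
    missing = {"?", ""}  # accepted missing-value markers per this guide
    return [v for v in values if v not in missing and not v.lstrip("-").isdigit()]
```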