Overview

The fraud detection system requires training data in a specific CSV format with 39 columns. This guide covers the input requirements, schema structure, and file naming conventions.

Input Data Format

Training data must be provided as CSV files. Every file must contain exactly 39 columns matching the schema definition below.

File Naming Convention

Files must follow this pattern:
fraudDetection_[DATESTAMP]_[TIMESTAMP].csv
Example: fraudDetection_021119920_010222.csv
  • Prefix: fraudDetection_
  • Date stamp: 9 characters (e.g., 021119920)
  • Time stamp: 6 characters (e.g., 010222)
  • Extension: .csv
Files that don’t match this naming pattern will be rejected and moved to the Bad_Raw folder during validation.
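The naming convention above maps directly to a regular expression: a fixed prefix, a 9-digit date stamp, a 6-digit time stamp, and a .csv extension. The following is a minimal sketch of such a check; the function name is illustrative, not part of the project's actual code:

```python
import re

# Prefix, 9-digit date stamp, 6-digit time stamp, .csv extension,
# per the naming convention described above.
FILENAME_PATTERN = re.compile(r"^fraudDetection_\d{9}_\d{6}\.csv$")

def is_valid_filename(name: str) -> bool:
    """Return True if the file name matches the expected pattern."""
    return FILENAME_PATTERN.match(name) is not None
```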

Schema Definition

The system uses schema_training.json to validate incoming data:
{
  "SampleFileName": "fraudDetection_021119920_010222.csv",
  "LengthOfDateStampInFile": 9,
  "LengthOfTimeStampInFile": 6,
  "NumberofColumns": 39
}
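Because the stamp lengths live in the schema file, a validator can derive the filename pattern from schema_training.json rather than hard-coding it. This sketch assumes the helper names; only the JSON keys shown above are taken from the source:

```python
import json
import re

def load_schema(path: str = "schema_training.json") -> dict:
    """Load the training schema used for validation."""
    with open(path) as f:
        return json.load(f)

def build_name_regex(schema: dict) -> "re.Pattern":
    """Construct the filename regex from the schema's stamp lengths."""
    date_len = schema["LengthOfDateStampInFile"]
    time_len = schema["LengthOfTimeStampInFile"]
    return re.compile(rf"^fraudDetection_\d{{{date_len}}}_\d{{{time_len}}}\.csv$")
```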

Required Columns (39 Total)

The dataset must include these columns with their respective data types:

Customer Information

  • months_as_customer (Integer): Number of months the customer has been with the insurance company
  • age (Integer): Age of the insured person
  • insured_zip (Integer): ZIP code of the insured
  • insured_sex (varchar): Gender of the insured (MALE/FEMALE)
  • insured_education_level (varchar): Education level (JD, High School, College, Masters, Associate, MD, PhD)
  • insured_occupation (varchar): Occupation of the insured
  • insured_hobbies (varchar): Hobbies of the insured
  • insured_relationship (varchar): Relationship status

Policy Information

  • policy_number (Integer): Unique policy identifier
  • policy_bind_date (varchar): Date when the policy was bound
  • policy_state (varchar): State where the policy is issued
  • policy_csl (varchar): Combined single limit (100/300, 250/500, 500/1000)
  • policy_deductable (Integer): Policy deductible amount
  • policy_annual_premium (Integer): Annual premium amount
  • umbrella_limit (Integer): Umbrella policy limit

Financial Information

  • capital-gains (Integer): Capital gains of the insured
  • capital-loss (Integer): Capital losses of the insured
  • total_claim_amount (Integer): Total claim amount
  • injury_claim (Integer): Injury claim amount
  • property_claim (Integer): Property damage claim amount
  • vehicle_claim (Integer): Vehicle damage claim amount

Incident Information

  • incident_date (varchar): Date of the incident
  • incident_type (varchar): Type of incident
  • collision_type (varchar): Type of collision
  • incident_severity (varchar): Severity level (Trivial Damage, Minor Damage, Major Damage, Total Loss)
  • authorities_contacted (varchar): Which authorities were contacted
  • incident_state (varchar): State where the incident occurred
  • incident_city (varchar): City where the incident occurred
  • incident_location (varchar): Specific location of the incident
  • incident_hour_of_the_day (Integer): Hour when the incident occurred (0-23)
  • number_of_vehicles_involved (Integer): Number of vehicles in the incident
  • property_damage (varchar): Whether property damage occurred (YES/NO)
  • bodily_injuries (Integer): Number of bodily injuries
  • witnesses (Integer): Number of witnesses
  • police_report_available (varchar): Whether a police report is available (YES/NO)

Vehicle Information

  • auto_make (varchar): Make of the vehicle
  • auto_model (varchar): Model of the vehicle
  • auto_year (Integer): Year of the vehicle

Target Variable

  • fraud_reported (varchar): Whether fraud was reported (Y/N); this is the target variable for training

Sample Data Structure

Here’s an example of how your CSV data should be structured:
months_as_customer,age,policy_number,policy_bind_date,policy_state,policy_csl,...,fraud_reported
328,48,521585,2014-10-17,OH,250/500,...,Y
228,42,342868,2006-04-03,IN,250/500,...,Y
134,29,687698,2000-06-22,OH,100/300,...,N

Data Validation

After preparing your data files:
  1. Place them in the Training_Batch_Files/ directory
  2. The validation process will automatically:
    • Check file naming conventions
    • Verify column count (must be 39)
    • Validate data types
    • Move valid files to Good_Raw/ folder
    • Move invalid files to Bad_Raw/ folder
Missing values are acceptable and will be handled during preprocessing. Use ? or leave cells empty for missing values.
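The validation flow above can be sketched as follows. The directory names come from this guide; the function name and the use of the header row's length as the column-count check are illustrative assumptions, not the project's actual implementation:

```python
import csv
import shutil
from pathlib import Path

EXPECTED_COLUMNS = 39

def validate_batch(src="Training_Batch_Files", good="Good_Raw", bad="Bad_Raw"):
    """Sort raw CSVs into Good_Raw/ or Bad_Raw/ based on column count."""
    Path(good).mkdir(exist_ok=True)
    Path(bad).mkdir(exist_ok=True)
    for path in Path(src).glob("*.csv"):
        # Read only the header row to count columns.
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        dest = good if len(header) == EXPECTED_COLUMNS else bad
        shutil.move(str(path), dest)
```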

Next Steps

Once your data is prepared:
  1. Validate Data: Run the data validation process to ensure files meet requirements.
  2. Review Validation Logs: Check Training_Logs/ for any validation errors.
  3. Preprocess Data: Move to the preprocessing stage for feature engineering.

Common Issues

  • File rejected during validation: Check that filename matches the exact pattern
  • Column count mismatch: Ensure CSV has all 39 columns in the correct order
  • Data type errors: Verify that numeric fields contain only numbers (except for missing values)
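For the last issue, a small helper can flag offending values in a column expected to be Integer, treating ? and empty cells as the allowed missing-value markers described above. This is a hypothetical diagnostic, not part of the project's validation code:

```python
def bad_numeric_values(values):
    """Return values that are neither integers nor allowed missing markers."""
    missing = {"?", ""}  # accepted missing-value markers per this guide
    return [v for v in values if v not in missing and not v.lstrip("-").isdigit()]
```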