How Prediction Works

The fraud detection system generates predictions for new insurance claims data through a multi-stage pipeline that validates, preprocesses, clusters, and applies trained models to identify potential fraud.

Input Requirements

Prediction data must meet the following requirements:

File Format

CSV files with a specific naming convention: fraudDetection_[DateStamp]_[TimeStamp].csv

Column Count

Must contain exactly 38 columns as defined in the prediction schema

File Location

Files must be placed in Prediction_Batch_files/ directory

Schema Compliance

All columns must match the data types defined in schema_prediction.json
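The naming and column-count checks above can be sketched as follows. This is a minimal illustration, not the pipeline's actual validation code: the 8-digit date stamp and 6-digit time stamp lengths are assumptions (the real values are defined in schema_prediction.json), and the function names are hypothetical.

```python
import re

# Assumed stamp lengths (8-digit date, 6-digit time); the authoritative
# values live in schema_prediction.json and may differ.
EXPECTED_COLUMNS = 38
FILENAME_PATTERN = re.compile(r"^fraudDetection_\d{8}_\d{6}\.csv$")

def is_valid_filename(name: str) -> bool:
    """Check a file name against fraudDetection_[DateStamp]_[TimeStamp].csv."""
    return bool(FILENAME_PATTERN.match(name))

def has_expected_columns(columns: list) -> bool:
    """Check that a file carries exactly the 38 schema-defined columns."""
    return len(columns) == EXPECTED_COLUMNS
```

A file such as `fraudDetection_28011960_120212.csv` with 38 columns would pass both checks.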

Required Input Fields

The system expects 38 fields including:
  • Customer Information: months_as_customer, age, insured_sex, insured_education_level, insured_occupation, insured_relationship
  • Policy Details: policy_number, policy_bind_date, policy_state, policy_csl, policy_deductable, policy_annual_premium, umbrella_limit, insured_zip
  • Incident Information: incident_date, incident_type, collision_type, incident_severity, authorities_contacted, incident_state, incident_city, incident_location, incident_hour_of_the_day
  • Claim Details: total_claim_amount, injury_claim, property_claim, vehicle_claim, number_of_vehicles_involved, property_damage, bodily_injuries, witnesses, police_report_available
  • Vehicle Information: auto_make, auto_model, auto_year
  • Financial Data: capital-gains, capital-loss

Prediction Flow

The prediction process follows a structured pipeline:
1. Data Validation

Validates file names, column counts, and data quality. Files are sorted into Good_Raw and Bad_Raw folders; see the Data Validation page for the full rules.
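The Good_Raw/Bad_Raw sorting can be sketched like this. The helper name and the directory arguments are illustrative; the real pipeline's validation routine decides `is_valid` from the filename, column-count, and data-quality checks.

```python
import shutil
from pathlib import Path

def sort_raw_file(path: Path, is_valid: bool,
                  good_dir: Path = Path("Good_Raw"),
                  bad_dir: Path = Path("Bad_Raw")) -> Path:
    """Move a raw prediction file into Good_Raw or Bad_Raw by validation result."""
    dest_dir = good_dir if is_valid else bad_dir
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / path.name
    shutil.move(str(path), str(dest))
    return dest
```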

2. Data Loading

Validated files are loaded into a temporary database and exported as a consolidated CSV file (Prediction_FileFromDB/InputFile.csv).
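The consolidation step can be sketched as a simple merge of all Good_Raw CSVs into one file with a single header row. This is an approximation: the actual pipeline stages the rows through a temporary database before exporting InputFile.csv, and the function name here is hypothetical.

```python
import csv
from pathlib import Path

def consolidate_csvs(good_dir: Path, out_path: Path) -> int:
    """Merge all validated CSVs into one file, keeping a single header row.

    Returns the number of data rows written."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    rows_written = 0
    header_written = False
    with out_path.open("w", newline="") as out:
        writer = csv.writer(out)
        for src in sorted(good_dir.glob("*.csv")):
            with src.open(newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if not header_written:     # write the header only once
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```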

3. Data Preprocessing

  • Removes non-predictive columns (policy_number, dates, location details, etc.)
  • Replaces '?' with NaN for missing value handling
  • Imputes missing values using appropriate strategies
  • Encodes categorical variables
  • Scales numerical features
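The preprocessing bullets above can be sketched with pandas. Note the assumptions: DROP_COLS is a representative subset of the removed columns, and mode/median imputation with integer category codes is one common strategy, not necessarily the exact imputer and encoder the pipeline uses. Feature scaling is omitted for brevity.

```python
import numpy as np
import pandas as pd

# Representative subset of the non-predictive columns that get dropped.
DROP_COLS = ["policy_number", "policy_bind_date", "incident_date",
             "incident_location", "incident_state", "incident_city",
             "insured_zip"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the preprocessing step: drop, clean '?', impute, encode."""
    df = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
    df = df.replace("?", np.nan)   # '?' marks missing values in the raw data
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].fillna(df[col].mode()[0])    # impute categoricals
            df[col] = df[col].astype("category").cat.codes # integer-encode
        else:
            df[col] = df[col].fillna(df[col].median())     # impute numerics
    return df
```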

4. Clustering

Applies the pre-trained KMeans model to assign each record to a cluster. This enables cluster-specific model selection.
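Loading the saved KMeans model and assigning clusters might look like the following. The loader path, pickle serialization, and function names are assumptions; only the use of a pre-trained KMeans `predict` call is taken from the description above.

```python
import pickle
import numpy as np
from sklearn.cluster import KMeans

def load_kmeans(path: str) -> KMeans:
    """Load the pre-trained KMeans model from disk (path is an assumption)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def assign_clusters(model: KMeans, features: np.ndarray) -> np.ndarray:
    """Label each preprocessed record with its cluster for model routing."""
    return model.predict(features)
```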

5. Model Loading & Prediction

For each cluster:
  • Loads the appropriate trained model for that cluster
  • Generates predictions (0 = not fraud, 1 = fraud)
  • Encodes results as Y (fraud) or N (not fraud)
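The per-cluster routing and Y/N encoding can be sketched as below. The function name is hypothetical, and `models` is assumed to map each cluster id to an already-loaded classifier exposing a `predict` method.

```python
import numpy as np

def predict_by_cluster(features: np.ndarray,
                       cluster_labels: np.ndarray,
                       models: dict) -> np.ndarray:
    """Route each record to its cluster's classifier; encode 1 -> 'Y', 0 -> 'N'."""
    results = np.empty(len(cluster_labels), dtype=object)
    for cid in np.unique(cluster_labels):
        mask = cluster_labels == cid               # records in this cluster
        preds = np.asarray(models[cid].predict(features[mask]))
        results[mask] = np.where(preds == 1, "Y", "N")
    return results
```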

6. Output Generation

Predictions are compiled into a CSV file saved to Prediction_Output_File/Predictions.csv; the Output Format section below describes its contents.

Output Format

The system generates a CSV file with a single column:
  • Predictions: Contains 'Y' (fraud detected) or 'N' (no fraud) for each input record
The prediction process deletes any existing prediction file from previous runs to ensure fresh results.
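Writing the single-column output, including the stale-file deletion described above, might look like this sketch (the function name is hypothetical):

```python
import csv
from pathlib import Path

def write_predictions(labels,
                      out_path: Path = Path("Prediction_Output_File/Predictions.csv")) -> Path:
    """Write Y/N labels to Predictions.csv, removing any stale file first."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if out_path.exists():
        out_path.unlink()                  # delete results from previous runs
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Predictions"])   # the single output column
        writer.writerows([label] for label in labels)
    return out_path
```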

Error Handling

All prediction operations are logged to Prediction_Logs/Prediction_Log.txt. Errors during prediction:
  • Are logged with detailed exception messages
  • Halt the prediction process
  • Raise exceptions for upstream handling
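The log-then-halt behaviour can be sketched with the standard `logging` module. The logger name, format string, and `run_step` wrapper are illustrative, not the pipeline's actual logging code; only the log file path and the halt-and-re-raise behaviour come from the description above.

```python
import logging
from pathlib import Path

def get_prediction_logger(log_path: str = "Prediction_Logs/Prediction_Log.txt") -> logging.Logger:
    """File logger mirroring the Prediction_Log.txt behaviour described above."""
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("prediction")
    if not logger.handlers:                # avoid attaching duplicate handlers
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

def run_step(logger: logging.Logger, name: str, fn, *args):
    """Run one pipeline step; on failure, log the exception and re-raise to halt."""
    try:
        return fn(*args)
    except Exception as exc:
        logger.error("Step %r failed: %s", name, exc)
        raise                              # halt and let the caller handle it
```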

Next Steps

Batch Prediction

Learn how to process batch files for prediction

Data Validation

Understand validation requirements and rules

Output Format

Learn how to interpret prediction results