Overview

The batch prediction system processes CSV files from the Prediction_Batch_files/ directory and generates fraud predictions for each record. The system uses cluster-specific models to improve prediction accuracy.

Batch File Processing

File Location

Place prediction files in the Prediction_Batch_files/ directory with the following naming convention:
fraudDetection_[DateStamp]_[TimeStamp].csv
Example: fraudDetection_021119920_010222.csv
Files that don’t match the naming convention will be moved to the Bad_Raw folder during validation.
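The convention can be checked with a simple pattern match. This is an illustrative sketch, not the project's validation code; the digit counts are assumptions inferred from the example file name rather than a confirmed spec.

```python
import re

# Hypothetical check for the documented naming convention:
# fraudDetection_[DateStamp]_[TimeStamp].csv
FILENAME_PATTERN = re.compile(r"^fraudDetection_\d+_\d+\.csv$")

def is_valid_prediction_filename(name: str) -> bool:
    """Return True if the file name matches the batch naming convention."""
    return bool(FILENAME_PATTERN.match(name))
```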

Processing Pipeline

The prediction class handles the complete batch prediction workflow:
class prediction:
    def __init__(self, path):
        self.file_object = open("Prediction_Logs/Prediction_Log.txt", 'a+')
        self.log_writer = logger.App_Logger()
        self.pred_data_val = Prediction_Data_validation(path)

    def predictionFromModel(self):
        ...  # main prediction workflow, shown in full below

How Predictions are Generated Per Cluster

The system uses a cluster-based approach for improved accuracy:
1. Load KMeans Model

The pre-trained KMeans clustering model is loaded from the models directory.
file_loader = file_methods.File_Operation(self.file_object, self.log_writer)
kmeans = file_loader.load_model('KMeans')

2. Assign Clusters

Each preprocessed record is assigned to a cluster.
clusters = kmeans.predict(data)
data['clusters'] = clusters
clusters = data['clusters'].unique()

3. Iterate Through Clusters

For each unique cluster in the dataset:
predictions = []
for i in clusters:
    cluster_data = data[data['clusters'] == i]
    cluster_data = cluster_data.drop(['clusters'], axis=1)

4. Load Cluster-Specific Model

Find and load the trained model for the specific cluster.
model_name = file_loader.find_correct_model_file(i)
model = file_loader.load_model(model_name)

5. Generate Predictions

Apply the model and encode results as Y/N.
result = model.predict(cluster_data)
for res in result:
    if res == 0:
        predictions.append('N')
    else:
        predictions.append('Y')
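The per-record loop above can also be expressed as a single vectorized step. This is an equivalent sketch for illustration, not the project's code; `result` here stands in for one cluster's model output.

```python
import numpy as np

# Example model output for one cluster (0 = not fraud, 1 = fraud)
result = np.array([0, 1, 1, 0])

# Vectorized Y/N encoding: 0 -> 'N', anything else -> 'Y'
labels = np.where(result == 0, 'N', 'Y')
```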

Complete Prediction Workflow

Here’s the complete predictionFromModel method from predictFromModel.py:17-76:
def predictionFromModel(self):
    try:
        self.pred_data_val.deletePredictionFile()  # deletes existing prediction file
        self.log_writer.log(self.file_object, 'Start of Prediction')
        
        # Load validated data
        data_getter = data_loader_prediction.Data_Getter_Pred(self.file_object, self.log_writer)
        data = data_getter.get_data()

        # Preprocess data
        preprocessor = preprocessing.Preprocessor(self.file_object, self.log_writer)
        data = preprocessor.remove_columns(data,
                                           ['policy_number', 'policy_bind_date', 'policy_state', 
                                            'insured_zip', 'incident_location', 'incident_date', 
                                            'incident_state', 'incident_city', 'insured_hobbies', 
                                            'auto_make', 'auto_model', 'auto_year', 'age',
                                            'total_claim_amount'])
        
        data.replace('?', np.nan, inplace=True)

        # Handle missing values
        is_null_present, cols_with_missing_values = preprocessor.is_null_present(data)
        if (is_null_present):
            data = preprocessor.impute_missing_values(data, cols_with_missing_values)
        
        # Encode and scale
        data = preprocessor.encode_categorical_columns(data)
        data = preprocessor.scale_numerical_columns(data)

        # Load clustering model
        file_loader = file_methods.File_Operation(self.file_object, self.log_writer)
        kmeans = file_loader.load_model('KMeans')

        # Assign clusters and predict
        clusters = kmeans.predict(data)
        data['clusters'] = clusters
        clusters = data['clusters'].unique()
        
        predictions = []
        for i in clusters:
            cluster_data = data[data['clusters'] == i]
            cluster_data = cluster_data.drop(['clusters'], axis=1)
            model_name = file_loader.find_correct_model_file(i)
            model = file_loader.load_model(model_name)
            result = model.predict(cluster_data)
            
            for res in result:
                if res == 0:
                    predictions.append('N')
                else:
                    predictions.append('Y')

        # Save predictions
        final = pd.DataFrame(predictions, columns=['Predictions'])
        path = "Prediction_Output_File/Predictions.csv"
        final.to_csv(path, header=True, mode='a+')
        
        self.log_writer.log(self.file_object, 'End of Prediction')
    except Exception as ex:
        self.log_writer.log(self.file_object,
                           'Error occurred while running the prediction!! Error:: %s' % ex)
        raise ex
    return path

Output File Location

Predictions are saved to:
Prediction_Output_File/Predictions.csv
The system automatically deletes any existing prediction file before generating new predictions to avoid confusion from previous runs.
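A minimal sketch of this delete-before-predict behavior is shown below. The real logic lives in `Prediction_Data_validation.deletePredictionFile`; the function name and default path here are illustrative assumptions.

```python
import os

def delete_existing_prediction_file(path="Prediction_Output_File/Predictions.csv"):
    """Remove a previous run's prediction file, if one exists (sketch)."""
    if os.path.exists(path):
        os.remove(path)
```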

Data Preprocessing Steps

Before prediction, the system performs several preprocessing steps:
  • Removes columns that don’t contribute to prediction:
      policy_number, policy_bind_date, policy_state
      insured_zip, insured_hobbies
      incident_location, incident_date, incident_state, incident_city
      auto_make, auto_model, auto_year
      age, total_claim_amount
  • Replaces '?' characters with NaN
  • Detects columns with missing values
  • Applies appropriate imputation strategies to those columns
  • Encodes categorical variables into numerical format for model compatibility
  • Scales numerical features to standardize value ranges
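The '?'-to-NaN replacement and imputation can be illustrated with a small pandas sketch. The most-frequent imputation shown here is an assumption for illustration; the project's `Preprocessor` may use a different strategy.

```python
import pandas as pd
import numpy as np

# Toy data with a '?' placeholder, as found in the raw batch files
df = pd.DataFrame({"collision_type": ["Rear Collision", "?", "Side Collision"]})

# Replace '?' with NaN so missing values can be detected
df.replace("?", np.nan, inplace=True)

# Impute missing values with the most frequent category (illustrative choice)
df["collision_type"] = df["collision_type"].fillna(df["collision_type"].mode()[0])
```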

Logging

All prediction activities are logged to:
Prediction_Logs/Prediction_Log.txt
Logs include:
  • Start and end of prediction process
  • Data loading status
  • Preprocessing steps
  • Model loading events
  • Any errors or exceptions
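A minimal sketch of an App_Logger-style log writer is shown below; the real `logger.App_Logger` may format entries differently, so treat the timestamp layout as an assumption.

```python
from datetime import datetime

class SimpleAppLogger:
    """Illustrative stand-in for logger.App_Logger (format is assumed)."""

    def log(self, file_object, message):
        # Write one timestamped line per log message
        now = datetime.now()
        file_object.write(f"{now.date()}/{now.strftime('%H:%M:%S')}\t\t{message}\n")
```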

Next Steps

  • Prediction Overview: understand the complete prediction workflow
  • Output Format: learn how to interpret prediction results