Overview
The batch prediction system processes CSV files from the Prediction_Batch_files/ directory and generates fraud predictions for each record. The system uses cluster-specific models to improve prediction accuracy.
Batch File Processing
File Location
Place prediction files in the Prediction_Batch_files/ directory with the following naming convention:
fraudDetection_[DateStamp]_[TimeStamp].csv
Example: fraudDetection_021119920_010222.csv
Files that don’t match the naming convention will be moved to the Bad_Raw folder during validation.
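The validation regex itself isn't reproduced in this section; a minimal sketch of the naming check, assuming both the DateStamp and TimeStamp are runs of digits (the exact stamp lengths are an assumption):

```python
import re

# Hypothetical check mirroring the documented convention:
# fraudDetection_[DateStamp]_[TimeStamp].csv, both stamps numeric
FILENAME_PATTERN = re.compile(r"^fraudDetection_\d+_\d+\.csv$")

def is_valid_prediction_file(filename):
    """Return True if the file name matches the expected convention."""
    return bool(FILENAME_PATTERN.match(filename))
```

Files failing such a check would be routed to Bad_Raw during validation.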
Processing Pipeline
The prediction class handles the complete batch prediction workflow:
```python
class prediction:

    def __init__(self, path):
        self.file_object = open("Prediction_Logs/Prediction_Log.txt", 'a+')
        self.log_writer = logger.App_Logger()
        self.pred_data_val = Prediction_Data_validation(path)

    def predictionFromModel(self):
        # Main prediction workflow
        ...
```
How Predictions are Generated Per Cluster
The system uses a cluster-based approach for improved accuracy:
Load KMeans Model
The pre-trained KMeans clustering model is loaded from the models directory.

```python
file_loader = file_methods.File_Operation(self.file_object, self.log_writer)
kmeans = file_loader.load_model('KMeans')
```
Assign Clusters
Each preprocessed record is assigned to a cluster.

```python
clusters = kmeans.predict(data)
data['clusters'] = clusters
clusters = data['clusters'].unique()
```
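If the loaded model is a scikit-learn KMeans (an assumption; this section only shows it being loaded via file_loader), cluster assignment behaves as follows. The toy data and parameters here are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the preprocessed prediction data: 6 rows, 2 features,
# forming two well-separated groups.
data = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.1],
                 [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])

# In the real pipeline the KMeans model is loaded from disk;
# here a fresh one is fitted just to illustrate predict().
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(data)
clusters = kmeans.predict(data)  # one cluster label per row
```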
Iterate Through Clusters
For each unique cluster in the dataset:

```python
predictions = []
for i in clusters:
    cluster_data = data[data['clusters'] == i]
    cluster_data = cluster_data.drop(['clusters'], axis=1)
```
Load Cluster-Specific Model
Find and load the trained model for the specific cluster.

```python
model_name = file_loader.find_correct_model_file(i)
model = file_loader.load_model(model_name)
```
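The body of find_correct_model_file isn't reproduced here; a plausible sketch (the models/ directory layout and return convention are assumptions) scans the model directory for a saved file whose name contains the cluster number:

```python
import os

def find_correct_model_file(cluster_number, model_directory="models/"):
    """Return the base name of the first saved model whose file name
    contains the given cluster number, e.g. 'XGBoost2' for cluster 2.
    Hypothetical helper; the real File_Operation method may differ."""
    for file_name in os.listdir(model_directory):
        stem = os.path.splitext(file_name)[0]
        if str(cluster_number) in stem:
            return stem
    raise FileNotFoundError(f"No model file found for cluster {cluster_number}")
```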
Generate Predictions
Apply the model and encode results as Y/N.

```python
result = model.predict(cluster_data)
for res in result:
    if res == 0:
        predictions.append('N')
    else:
        predictions.append('Y')
```
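The per-record loop above can also be written vectorised with NumPy; this is an equivalent alternative, not the project's code:

```python
import numpy as np

result = np.array([0, 1, 1, 0])                      # raw model output
predictions = np.where(result == 0, 'N', 'Y').tolist()  # 0 -> 'N', else 'Y'
```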
Complete Prediction Workflow
Here’s the complete predictionFromModel method from predictFromModel.py:17-76:
```python
def predictionFromModel(self):
    try:
        self.pred_data_val.deletePredictionFile()  # deletes existing prediction file
        self.log_writer.log(self.file_object, 'Start of Prediction')

        # Load validated data
        data_getter = data_loader_prediction.Data_Getter_Pred(self.file_object, self.log_writer)
        data = data_getter.get_data()

        # Preprocess data
        preprocessor = preprocessing.Preprocessor(self.file_object, self.log_writer)
        data = preprocessor.remove_columns(data,
            ['policy_number', 'policy_bind_date', 'policy_state',
             'insured_zip', 'incident_location', 'incident_date',
             'incident_state', 'incident_city', 'insured_hobbies',
             'auto_make', 'auto_model', 'auto_year', 'age',
             'total_claim_amount'])
        data.replace('?', np.NaN, inplace=True)

        # Handle missing values
        is_null_present, cols_with_missing_values = preprocessor.is_null_present(data)
        if is_null_present:
            data = preprocessor.impute_missing_values(data, cols_with_missing_values)

        # Encode and scale
        data = preprocessor.encode_categorical_columns(data)
        data = preprocessor.scale_numerical_columns(data)

        # Load clustering model
        file_loader = file_methods.File_Operation(self.file_object, self.log_writer)
        kmeans = file_loader.load_model('KMeans')

        # Assign clusters and predict
        clusters = kmeans.predict(data)
        data['clusters'] = clusters
        clusters = data['clusters'].unique()
        predictions = []
        for i in clusters:
            cluster_data = data[data['clusters'] == i]
            cluster_data = cluster_data.drop(['clusters'], axis=1)
            model_name = file_loader.find_correct_model_file(i)
            model = file_loader.load_model(model_name)
            result = model.predict(cluster_data)
            for res in result:
                if res == 0:
                    predictions.append('N')
                else:
                    predictions.append('Y')

        # Save predictions
        final = pd.DataFrame(list(zip(predictions)), columns=['Predictions'])
        path = "Prediction_Output_File/Predictions.csv"
        final.to_csv("Prediction_Output_File/Predictions.csv", header=True, mode='a+')
        self.log_writer.log(self.file_object, 'End of Prediction')
    except Exception as ex:
        self.log_writer.log(self.file_object,
                            'Error occured while running the prediction!! Error:: %s' % ex)
        raise ex
    return path
```
Output File Location
Predictions are saved to:
Prediction_Output_File/Predictions.csv
The system automatically deletes any existing prediction file before generating new predictions to avoid confusion from previous runs.
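The body of deletePredictionFile isn't shown in this section; a minimal sketch of the delete-before-write behaviour, assuming the output path above:

```python
import os

def delete_prediction_file(path="Prediction_Output_File/Predictions.csv"):
    """Remove a stale prediction file so the next run starts clean.
    Hypothetical helper; the real Prediction_Data_validation method may differ."""
    if os.path.exists(path):
        os.remove(path)
```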
Data Preprocessing Steps
Before prediction, the system performs several preprocessing steps:
Removes columns that don’t contribute to prediction:
policy_number, policy_bind_date, policy_state
insured_zip, insured_hobbies
incident_location, incident_date, incident_state, incident_city
auto_make, auto_model, auto_year
age, total_claim_amount
Replaces ’?’ characters with NaN
Detects columns with missing values
Applies appropriate imputation strategies
Encodes categorical variables into numerical format for model compatibility.
Scales numerical features to standardize value ranges.
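The preprocessing steps above can be sketched with pandas. The imputation strategies (mode for categoricals, median for numerics) and the integer-code encoding are assumptions; the real Preprocessor class may differ, and scaling is omitted here for brevity:

```python
import numpy as np
import pandas as pd

def preprocess(data, drop_cols):
    """Apply the documented steps to a raw prediction DataFrame."""
    data = data.drop(columns=[c for c in drop_cols if c in data.columns])
    data = data.replace('?', np.nan)  # '?' marks missing values
    # Impute: mode for object columns, median for numeric (assumed strategy)
    for col in data.columns:
        if data[col].isnull().any():
            if data[col].dtype == object:
                data[col] = data[col].fillna(data[col].mode()[0])
            else:
                data[col] = data[col].fillna(data[col].median())
    # Encode categoricals as integer codes (assumed encoding)
    for col in data.select_dtypes(include='object'):
        data[col] = data[col].astype('category').cat.codes
    return data

# Tiny illustrative input with a dropped column, a '?', and a missing number
df = pd.DataFrame({'policy_number': [1, 2, 3],
                   'incident_type': ['Theft', '?', 'Collision'],
                   'claim': [100.0, None, 300.0]})
clean = preprocess(df, ['policy_number'])
```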
Logging
All prediction activities are logged to:
Prediction_Logs/Prediction_Log.txt
Logs include:
Start and end of prediction process
Data loading status
Preprocessing steps
Model loading events
Any errors or exceptions
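The App_Logger class isn't reproduced in this section; a minimal sketch consistent with the log_writer.log(file_object, message) calls shown above (the timestamp format is an assumption):

```python
from datetime import datetime

class App_Logger:
    """Append timestamped messages to an already-open log file object."""
    def log(self, file_object, log_message):
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        file_object.write(f"{stamp}\t{log_message}\n")
```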
Next Steps
Prediction Overview: understand the complete prediction workflow
Output Format: learn how to interpret prediction results