## Overview
The Preprocessor class handles all data transformation steps required before model training. This includes handling missing values, encoding categorical features, scaling numerical features, and removing unnecessary columns.
## Preprocessor Class

Implemented in `data_preprocessing/preprocessing.py`:
```python
import numpy as np
import pandas as pd
from sklearn_pandas import CategoricalImputer
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

class Preprocessor:
    def __init__(self, file_object, logger_object):
        self.file_object = file_object
        self.logger_object = logger_object
```
## Preprocessing Pipeline

1. **Remove unwanted columns**: drop columns that don't contribute to prediction
2. **Check for missing values**: identify columns with missing data
3. **Impute missing values**: fill missing values using `CategoricalImputer`
4. **Encode categorical features**: convert categorical variables to numerical format
5. **Separate features and labels**: split data into X (features) and Y (target)
6. **Scale numerical columns**: standardize numerical features
## Column Removal

Columns that don't contribute to fraud detection are removed:
```python
    def remove_columns(self, data, columns):
        self.logger_object.log(self.file_object,
                               'Entered the remove_columns method of the Preprocessor class')
        self.data = data
        self.columns = columns
        try:
            # Drop the specified columns
            self.useful_data = self.data.drop(labels=self.columns, axis=1)
            self.logger_object.log(self.file_object,
                                   'Column removal successful. Exited the remove_columns method of the Preprocessor class')
            return self.useful_data
        except Exception as e:
            self.logger_object.log(self.file_object,
                                   'Exception occurred in remove_columns method of the Preprocessor class. Exception message: ' + str(e))
            raise Exception()
```
### Columns Removed During Training

From `trainingModel.py:40`, these columns are dropped:

```python
data = preprocessor.remove_columns(data, [
    'policy_number',        # Unique identifier, no predictive value
    'policy_bind_date',     # Date field, redundant
    'policy_state',         # High cardinality
    'insured_zip',          # High cardinality
    'incident_location',    # High cardinality
    'incident_date',        # Date field, redundant
    'incident_state',       # High cardinality
    'incident_city',        # High cardinality
    'insured_hobbies',      # High cardinality
    'auto_make',            # High cardinality
    'auto_model',           # High cardinality
    'auto_year',            # Redundant with age
    'age',                  # Redundant with months_as_customer
    'total_claim_amount'    # Target leakage
])
```
High cardinality columns are removed to prevent overfitting and reduce model complexity.
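Before dropping a column as high-cardinality, it can be worth confirming how many distinct values it actually holds. A minimal sketch of such a check with pandas (the data and the threshold here are illustrative, not from the project):

```python
import pandas as pd

# Toy frame standing in for the insurance dataset (values illustrative)
df = pd.DataFrame({
    "incident_city": ["Springfield", "Riverton", "Northbend", "Columbus", "Arlington"],
    "insured_sex": ["MALE", "FEMALE", "MALE", "MALE", "FEMALE"],
})

# Distinct-value count per column; columns at or above the threshold
# are candidates for removal
cardinality = df.nunique()
high_card_cols = cardinality[cardinality >= 5].index.tolist()
```

Here `incident_city` has five distinct values across five rows, so it is flagged, while the binary `insured_sex` is kept.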
## Missing Value Detection

Identify which columns contain missing values:
```python
    def is_null_present(self, data):
        self.logger_object.log(self.file_object,
                               'Entered the is_null_present method of the Preprocessor class')
        self.null_present = False
        self.cols_with_missing_values = []
        self.cols = data.columns
        try:
            # Count null values per column
            self.null_counts = data.isna().sum()
            for i in range(len(self.null_counts)):
                if self.null_counts[i] > 0:
                    self.null_present = True
                    self.cols_with_missing_values.append(self.cols[i])
            # Record which columns have null values
            if self.null_present:
                self.dataframe_with_null = pd.DataFrame()
                self.dataframe_with_null['columns'] = data.columns
                self.dataframe_with_null['missing values count'] = np.asarray(data.isna().sum())
                # Store the null column information to file
                self.dataframe_with_null.to_csv('preprocessing_data/null_values.csv')
            self.logger_object.log(self.file_object,
                                   'Finding missing values is a success. Data written to the null values file. Exited the is_null_present method of the Preprocessor class')
            return self.null_present, self.cols_with_missing_values
        except Exception as e:
            self.logger_object.log(self.file_object,
                                   'Exception occurred in is_null_present method of the Preprocessor class. Exception message: ' + str(e))
            raise Exception()
```
**Output:** Creates `preprocessing_data/null_values.csv` with missing value counts per column.
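The per-index loop above can also be expressed with pandas boolean indexing; a minimal equivalent sketch (column names illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", "z"],
    "c": [np.nan, np.nan, 5.0],
})

# Null count per column, and the columns where any value is missing
null_counts = df.isna().sum()
cols_with_missing_values = df.columns[df.isna().any()].tolist()
null_present = len(cols_with_missing_values) > 0
```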
## Missing Value Imputation

Uses `CategoricalImputer` to fill missing values:
```python
    def impute_missing_values(self, data, cols_with_missing_values):
        self.logger_object.log(self.file_object,
                               'Entered the impute_missing_values method of the Preprocessor class')
        self.data = data
        self.cols_with_missing_values = cols_with_missing_values
        try:
            self.imputer = CategoricalImputer()
            # Impute each column that has missing values
            for col in self.cols_with_missing_values:
                self.data[col] = self.imputer.fit_transform(self.data[col])
            self.logger_object.log(self.file_object,
                                   'Imputing missing values successful. Exited the impute_missing_values method of the Preprocessor class')
            return self.data
        except Exception as e:
            self.logger_object.log(self.file_object,
                                   'Exception occurred in impute_missing_values method of the Preprocessor class. Exception message: ' + str(e))
            raise Exception()
```
CategoricalImputer fills missing values with the most frequent value (mode) in each column. This works for both categorical and numerical features.
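Since the fill rule is simply "most frequent value", the same behavior can be sketched in plain pandas; this is a useful mental model (and a possible fallback, as recent `sklearn_pandas` releases no longer ship `CategoricalImputer`). Column name and data here are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"property_damage": ["YES", "NO", np.nan, "NO", np.nan]})

# Mode imputation: fill each missing entry with the column's most frequent value
for col in ["property_damage"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```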
## Categorical Encoding

Converts categorical variables to numerical format using two approaches:

### 1. Label Encoding (Ordinal Features)

For features with inherent order:
```python
    def encode_categorical_columns(self, data):
        self.logger_object.log(self.file_object,
                               'Entered the encode_categorical_columns method of the Preprocessor class')
        self.data = data
        try:
            # Select categorical columns
            self.cat_df = self.data.select_dtypes(include=['object']).copy()
            # Label encoding for ordinal features
            self.cat_df['policy_csl'] = self.cat_df['policy_csl'].map({
                '100/300': 1,
                '250/500': 2.5,
                '500/1000': 5
            })
            self.cat_df['insured_education_level'] = self.cat_df['insured_education_level'].map({
                'JD': 1,
                'High School': 2,
                'College': 3,
                'Masters': 4,
                'Associate': 5,
                'MD': 6,
                'PhD': 7
            })
            self.cat_df['incident_severity'] = self.cat_df['incident_severity'].map({
                'Trivial Damage': 1,
                'Minor Damage': 2,
                'Major Damage': 3,
                'Total Loss': 4
            })
            # Binary encoding
            self.cat_df['insured_sex'] = self.cat_df['insured_sex'].map({
                'FEMALE': 0,
                'MALE': 1
            })
            self.cat_df['property_damage'] = self.cat_df['property_damage'].map({
                'NO': 0,
                'YES': 1
            })
            self.cat_df['police_report_available'] = self.cat_df['police_report_available'].map({
                'NO': 0,
                'YES': 1
            })
            try:
                # Training path (target variable is present)
                self.cat_df['fraud_reported'] = self.cat_df['fraud_reported'].map({
                    'N': 0,
                    'Y': 1
                })
                self.cols_to_drop = ['policy_csl', 'insured_education_level', 'incident_severity',
                                     'insured_sex', 'property_damage', 'police_report_available',
                                     'fraud_reported']
            except:
                # Prediction path (no target variable)
                self.cols_to_drop = ['policy_csl', 'insured_education_level', 'incident_severity',
                                     'insured_sex', 'property_damage', 'police_report_available']
```
### 2. One-Hot Encoding (Nominal Features)

For features without inherent order, the same method continues:
```python
            # Dummy (one-hot) encoding for the remaining categorical columns
            for col in self.cat_df.drop(columns=self.cols_to_drop).columns:
                self.cat_df = pd.get_dummies(self.cat_df, columns=[col], prefix=[col], drop_first=True)
            # Replace original categorical columns with encoded versions
            self.data.drop(columns=self.data.select_dtypes(include=['object']).columns, inplace=True)
            self.data = pd.concat([self.cat_df, self.data], axis=1)
            self.logger_object.log(self.file_object,
                                   'Encoding for categorical values successful. Exited the encode_categorical_columns method of the Preprocessor class')
            return self.data
        except Exception as e:
            self.logger_object.log(self.file_object,
                                   'Exception occurred in encode_categorical_columns method of the Preprocessor class. Exception message: ' + str(e))
            raise Exception()
```
`drop_first=True` in one-hot encoding prevents multicollinearity by dropping one category per feature.
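The effect of `drop_first` can be seen on a toy column (the column name and values below are illustrative, not from the dataset):

```python
import pandas as pd

df = pd.DataFrame({"incident_type": ["Theft", "Collision", "Theft", "Parked Car"]})

# Three categories: full encoding yields three dummy columns,
# drop_first=True drops the first category (alphabetically, "Collision")
full = pd.get_dummies(df, columns=["incident_type"])
reduced = pd.get_dummies(df, columns=["incident_type"], drop_first=True)
```

With `drop_first=True` the dropped category is still recoverable: a row with zeros in every remaining dummy column belongs to it, which is exactly why keeping all columns would be redundant.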
## Feature-Label Separation

Split data into features (X) and target (Y):
```python
    def separate_label_feature(self, data, label_column_name):
        self.logger_object.log(self.file_object,
                               'Entered the separate_label_feature method of the Preprocessor class')
        try:
            # Drop the target column to get the features
            self.X = data.drop(labels=label_column_name, axis=1)
            # Select only the label column
            self.Y = data[label_column_name]
            self.logger_object.log(self.file_object,
                                   'Label separation successful. Exited the separate_label_feature method of the Preprocessor class')
            return self.X, self.Y
        except Exception as e:
            self.logger_object.log(self.file_object,
                                   'Exception occurred in separate_label_feature method of the Preprocessor class. Exception message: ' + str(e))
            raise Exception()
```
## Numerical Feature Scaling

Standardize numerical features using `StandardScaler`:
```python
    def scale_numerical_columns(self, data):
        self.logger_object.log(self.file_object,
                               'Entered the scale_numerical_columns method of the Preprocessor class')
        self.data = data
        # Select the numerical columns to scale
        self.num_df = self.data[[
            'months_as_customer',
            'policy_deductable',
            'umbrella_limit',
            'capital-gains',
            'capital-loss',
            'incident_hour_of_the_day',
            'number_of_vehicles_involved',
            'bodily_injuries',
            'witnesses',
            'injury_claim',
            'property_claim',
            'vehicle_claim'
        ]]
        try:
            # Apply StandardScaler
            self.scaler = StandardScaler()
            self.scaled_data = self.scaler.fit_transform(self.num_df)
            self.scaled_num_df = pd.DataFrame(data=self.scaled_data,
                                              columns=self.num_df.columns,
                                              index=self.data.index)
            # Replace original numerical columns with scaled versions
            self.data.drop(columns=self.scaled_num_df.columns, inplace=True)
            self.data = pd.concat([self.scaled_num_df, self.data], axis=1)
            self.logger_object.log(self.file_object,
                                   'Scaling for numerical values successful. Exited the scale_numerical_columns method of the Preprocessor class')
            return self.data
        except Exception as e:
            self.logger_object.log(self.file_object,
                                   'Exception occurred in scale_numerical_columns method of the Preprocessor class. Exception message: ' + str(e))
            raise Exception()
```
StandardScaler transforms features to have mean=0 and standard deviation=1, which improves model convergence and performance.
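The transform itself is straightforward to sketch in NumPy; `StandardScaler` computes `(x - mean) / std` per column, using the population standard deviation (`ddof=0`). The values below are illustrative:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization as StandardScaler applies it per column:
# subtract the mean, divide by the population standard deviation
scaled = (x - x.mean()) / x.std()
```

After the transform the column has mean 0 and standard deviation 1, so features measured on very different scales (e.g. `umbrella_limit` versus `witnesses`) contribute comparably.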
## Handling Imbalanced Data

Use random oversampling to balance the target classes:
```python
    def handle_imbalanced_dataset(self, x, y):
        self.logger_object.log(self.file_object,
                               'Entered the handle_imbalanced_dataset method of the Preprocessor class')
        try:
            self.rdsmple = RandomOverSampler()
            # Note: newer imbalanced-learn releases rename fit_sample to fit_resample
            self.x_sampled, self.y_sampled = self.rdsmple.fit_sample(x, y)
            self.logger_object.log(self.file_object,
                                   'Dataset balancing successful. Exited the handle_imbalanced_dataset method of the Preprocessor class')
            return self.x_sampled, self.y_sampled
        except Exception as e:
            self.logger_object.log(self.file_object,
                                   'Exception occurred in handle_imbalanced_dataset method of the Preprocessor class. Exception message: ' + str(e))
            raise Exception()
```
RandomOverSampler duplicates minority class samples to balance the dataset. This helps prevent models from being biased toward the majority class.
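What the oversampler does can be sketched in plain pandas and NumPy (all names and data below are illustrative): draw minority-class rows with replacement until both classes have the same count.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy imbalanced target: six majority-class (0) rows, two minority-class (1) rows
y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])
X = pd.DataFrame({"feature": range(len(y))})

# Random oversampling: resample minority-class row indices with replacement
# until the class counts are equal
counts = y.value_counts()
minority_label = counts.idxmin()
deficit = counts.max() - counts.min()
minority_idx = y.index[y == minority_label].to_numpy()
extra_idx = rng.choice(minority_idx, size=deficit, replace=True)

X_res = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
y_res = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
```

Because the extra rows are exact duplicates, oversampling must be applied only to the training split; duplicating before the train/test split would leak training rows into the test set.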
## Complete Preprocessing Example

From `trainingModel.py`, here's how preprocessing is applied:
```python
preprocessor = preprocessing.Preprocessor(self.file_object, self.log_writer)

# 1. Remove unwanted columns
data = preprocessor.remove_columns(data, ['policy_number', 'policy_bind_date', ...])

# 2. Replace '?' placeholders with NaN
data.replace('?', np.NaN, inplace=True)

# 3. Check for missing values
is_null_present, cols_with_missing_values = preprocessor.is_null_present(data)

# 4. Impute missing values if present
if is_null_present:
    data = preprocessor.impute_missing_values(data, cols_with_missing_values)

# 5. Encode categorical data
data = preprocessor.encode_categorical_columns(data)

# 6. Separate features and labels
X, Y = preprocessor.separate_label_feature(data, label_column_name='fraud_reported')

# 7. Scale numerical columns (done per cluster after the train/test split)
x_train = preprocessor.scale_numerical_columns(x_train)
x_test = preprocessor.scale_numerical_columns(x_test)
```
## Next Steps
After preprocessing:
- Data is ready for clustering
- Features are in numerical format
- Missing values are handled
- Numerical features are scaled
Proceed to clustering to create cluster-specific models.