Overview

The Preprocessor class handles all data transformation steps required before model training. This includes handling missing values, encoding categorical features, scaling numerical features, and removing unnecessary columns.

Preprocessor Class

Implemented in data_preprocessing/preprocessing.py:
import pandas as pd
import numpy as np
from sklearn_pandas import CategoricalImputer
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

class Preprocessor:
    def __init__(self, file_object, logger_object):
        self.file_object = file_object
        self.logger_object = logger_object

Preprocessing Pipeline

1. Remove Unwanted Columns: drop columns that don't contribute to prediction.
2. Check for Missing Values: identify columns with missing data.
3. Impute Missing Values: fill missing values using CategoricalImputer.
4. Encode Categorical Features: convert categorical variables to numerical format.
5. Separate Features and Labels: split data into X (features) and Y (target).
6. Scale Numerical Columns: standardize numerical features.

Column Removal

Columns that don’t contribute to fraud detection are removed:
def remove_columns(self, data, columns):
    self.logger_object.log(self.file_object, 
                          'Entered the remove_columns method of the Preprocessor class')
    self.data = data
    self.columns = columns
    
    try:
        # Drop the specified columns
        self.useful_data = self.data.drop(labels=self.columns, axis=1)
        
        self.logger_object.log(self.file_object,
                              'Column removal Successful. Exited the remove_columns method of the Preprocessor class')
        return self.useful_data
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in remove_columns method of the Preprocessor class. Exception message: ' + str(e))
        raise Exception()

Columns Removed During Training

From trainingModel.py:40, these columns are dropped:
data = preprocessor.remove_columns(data, [
    'policy_number',        # Unique identifier, no predictive value
    'policy_bind_date',     # Date field, redundant
    'policy_state',         # High cardinality
    'insured_zip',          # High cardinality
    'incident_location',    # High cardinality
    'incident_date',        # Date field, redundant
    'incident_state',       # High cardinality
    'incident_city',        # High cardinality
    'insured_hobbies',      # High cardinality
    'auto_make',            # High cardinality
    'auto_model',           # High cardinality
    'auto_year',            # Redundant with age
    'age',                  # Redundant with months_as_customer
    'total_claim_amount'    # Target leakage
])
High-cardinality columns are removed to prevent overfitting and to reduce model complexity.
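As a rough illustration of the idea (a hypothetical helper, not part of the project), high-cardinality categorical columns can be flagged by counting unique values per column:

```python
import pandas as pd

def find_high_cardinality(df, threshold=20):
    """Flag object-dtype columns whose unique-value count exceeds the threshold."""
    cat_cols = df.select_dtypes(include=["object"]).columns
    return [c for c in cat_cols if df[c].nunique() > threshold]

# Toy frame: one high-cardinality column, one low-cardinality column
df = pd.DataFrame({
    "incident_city": [f"city_{i}" for i in range(100)],  # 100 distinct values
    "insured_sex": ["MALE", "FEMALE"] * 50,              # 2 distinct values
})
print(find_high_cardinality(df))  # ['incident_city']
```

The threshold of 20 is arbitrary; in practice it depends on the dataset size and the model family.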

Missing Value Detection

Identify which columns contain missing values:
def is_null_present(self, data):
    self.logger_object.log(self.file_object, 
                          'Entered the is_null_present method of the Preprocessor class')
    self.null_present = False
    self.cols_with_missing_values = []
    self.cols = data.columns
    
    try:
        # Check for the count of null values per column
        self.null_counts = data.isna().sum()
        
        for i in range(len(self.null_counts)):
            if self.null_counts.iloc[i] > 0:  # .iloc avoids the deprecated positional [] lookup on a Series
                self.null_present = True
                self.cols_with_missing_values.append(self.cols[i])
        
        # Write the logs to see which columns have null values
        if(self.null_present):
            self.dataframe_with_null = pd.DataFrame()
            self.dataframe_with_null['columns'] = data.columns
            self.dataframe_with_null['missing values count'] = np.asarray(data.isna().sum())
            # Store the null column information to file
            self.dataframe_with_null.to_csv('preprocessing_data/null_values.csv')
        
        self.logger_object.log(self.file_object,
                              'Finding missing values is a success. Data written to the null values file. Exited the is_null_present method of the Preprocessor class')
        return self.null_present, self.cols_with_missing_values
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in is_null_present method of the Preprocessor class. Exception message: ' + str(e))
        raise Exception()
Output: Creates preprocessing_data/null_values.csv with missing value counts per column.
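For reference, the same check can be written more compactly with pandas vectorized operations (a sketch with toy data, not the project's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "property_damage": ["YES", np.nan, "NO"],
    "witnesses": [1, 2, 3],
})

# Columns that contain at least one NaN
cols_with_missing = df.columns[df.isna().any()].tolist()
null_present = len(cols_with_missing) > 0
print(null_present, cols_with_missing)  # True ['property_damage']
```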

Missing Value Imputation

Uses CategoricalImputer to fill missing values:
def impute_missing_values(self, data, cols_with_missing_values):
    self.logger_object.log(self.file_object, 
                          'Entered the impute_missing_values method of the Preprocessor class')
    self.data = data
    self.cols_with_missing_values = cols_with_missing_values
    
    try:
        self.imputer = CategoricalImputer()
        
        # Impute each column with missing values
        for col in self.cols_with_missing_values:
            self.data[col] = self.imputer.fit_transform(self.data[col])
        
        self.logger_object.log(self.file_object, 
                              'Imputing missing values Successful. Exited the impute_missing_values method of the Preprocessor class')
        return self.data
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in impute_missing_values method of the Preprocessor class. Exception message: ' + str(e))
        raise Exception()
CategoricalImputer fills missing values with the most frequent value (mode) in each column. This works for both categorical and numerical features.
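The effect is equivalent to filling each column with its mode; a minimal pandas sketch of the same operation (not the project's code):

```python
import numpy as np
import pandas as pd

s = pd.Series(["NO", "YES", np.nan, "NO"], name="property_damage")
# mode() returns the most frequent value(s); take the first as the fill value
filled = s.fillna(s.mode()[0])
print(filled.tolist())  # ['NO', 'YES', 'NO', 'NO']
```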

Categorical Encoding

Converts categorical variables to numerical format using two approaches:

1. Label Encoding (Ordinal Features)

For features with inherent order:
def encode_categorical_columns(self, data):
    self.logger_object.log(self.file_object, 
                          'Entered the encode_categorical_columns method of the Preprocessor class')
    self.data = data
    
    try:
        # Select categorical columns
        self.cat_df = self.data.select_dtypes(include=['object']).copy()
        
        # Label encoding for ordinal features
        self.cat_df['policy_csl'] = self.cat_df['policy_csl'].map({
            '100/300': 1, 
            '250/500': 2.5, 
            '500/1000': 5
        })
        
        self.cat_df['insured_education_level'] = self.cat_df['insured_education_level'].map({
            'JD': 1, 
            'High School': 2, 
            'College': 3, 
            'Masters': 4, 
            'Associate': 5, 
            'MD': 6, 
            'PhD': 7
        })
        
        self.cat_df['incident_severity'] = self.cat_df['incident_severity'].map({
            'Trivial Damage': 1, 
            'Minor Damage': 2, 
            'Major Damage': 3, 
            'Total Loss': 4
        })
        
        # Binary encoding
        self.cat_df['insured_sex'] = self.cat_df['insured_sex'].map({
            'FEMALE': 0, 
            'MALE': 1
        })
        
        self.cat_df['property_damage'] = self.cat_df['property_damage'].map({
            'NO': 0, 
            'YES': 1
        })
        
        self.cat_df['police_report_available'] = self.cat_df['police_report_available'].map({
            'NO': 0, 
            'YES': 1
        })
        
        try:
            # Code block for training (includes target variable)
            self.cat_df['fraud_reported'] = self.cat_df['fraud_reported'].map({
                'N': 0, 
                'Y': 1
            })
            self.cols_to_drop = ['policy_csl', 'insured_education_level', 'incident_severity', 
                                'insured_sex', 'property_damage', 'police_report_available', 
                                'fraud_reported']
        except:
            # Code block for Prediction (no target variable)
            self.cols_to_drop = ['policy_csl', 'insured_education_level', 'incident_severity',
                                'insured_sex', 'property_damage', 'police_report_available']

2. One-Hot Encoding (Nominal Features)

For features without inherent order:
        # Using dummy encoding for remaining categorical columns
        for col in self.cat_df.drop(columns=self.cols_to_drop).columns:
            self.cat_df = pd.get_dummies(self.cat_df, columns=[col], prefix=[col], drop_first=True)
        
        # Replace original categorical columns with encoded versions
        self.data.drop(columns=self.data.select_dtypes(include=['object']).columns, inplace=True)
        self.data = pd.concat([self.cat_df, self.data], axis=1)
        
        self.logger_object.log(self.file_object, 
                              'encoding for categorical values successful. Exited the encode_categorical_columns method of the Preprocessor class')
        return self.data
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in encode_categorical_columns method of the Preprocessor class. Exception message: ' + str(e))
        raise Exception()
drop_first=True in one-hot encoding prevents multicollinearity by dropping one category per feature.
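A small illustration of the effect of drop_first=True (hypothetical data, not one of the project's columns):

```python
import pandas as pd

df = pd.DataFrame({
    "collision_type": ["Rear Collision", "Side Collision", "Front Collision"],
})
encoded = pd.get_dummies(df, columns=["collision_type"], drop_first=True)

# The alphabetically first category ('Front Collision') is dropped;
# a row of all zeros now represents it implicitly.
print(list(encoded.columns))
# ['collision_type_Rear Collision', 'collision_type_Side Collision']
```

With k categories, only k-1 dummy columns remain, so no column is a linear combination of the others.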

Feature-Label Separation

Split data into features (X) and target (Y):
def separate_label_feature(self, data, label_column_name):
    self.logger_object.log(self.file_object, 
                          'Entered the separate_label_feature method of the Preprocessor class')
    try:
        # Drop the target column to get features
        self.X = data.drop(labels=label_column_name, axis=1)
        # Filter the Label column
        self.Y = data[label_column_name]
        
        self.logger_object.log(self.file_object,
                              'Label Separation Successful. Exited the separate_label_feature method of the Preprocessor class')
        return self.X, self.Y
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in separate_label_feature method of the Preprocessor class. Exception message: ' + str(e))
        raise Exception()

Numerical Feature Scaling

Standardize numerical features using StandardScaler:
def scale_numerical_columns(self, data):
    self.logger_object.log(self.file_object,
                          'Entered the scale_numerical_columns method of the Preprocessor class')
    self.data = data
    
    # Select numerical columns for scaling
    self.num_df = self.data[[
        'months_as_customer', 
        'policy_deductable', 
        'umbrella_limit',
        'capital-gains', 
        'capital-loss', 
        'incident_hour_of_the_day',
        'number_of_vehicles_involved', 
        'bodily_injuries', 
        'witnesses', 
        'injury_claim',
        'property_claim',
        'vehicle_claim'
    ]]
    
    try:
        # Apply StandardScaler
        self.scaler = StandardScaler()
        self.scaled_data = self.scaler.fit_transform(self.num_df)
        self.scaled_num_df = pd.DataFrame(data=self.scaled_data, 
                                         columns=self.num_df.columns, 
                                         index=self.data.index)
        
        # Replace original numerical columns with scaled versions
        self.data.drop(columns=self.scaled_num_df.columns, inplace=True)
        self.data = pd.concat([self.scaled_num_df, self.data], axis=1)
        
        self.logger_object.log(self.file_object, 
                              'scaling for numerical values successful. Exited the scale_numerical_columns method of the Preprocessor class')
        return self.data
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in scale_numerical_columns method of the Preprocessor class. Exception message: ' + str(e))
        raise Exception()
StandardScaler transforms features to have mean=0 and standard deviation=1, which improves model convergence and performance.
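The transform is a per-column z-score; a quick numpy sketch of the same arithmetic on toy values (StandardScaler uses the population standard deviation, ddof=0):

```python
import numpy as np

x = np.array([500.0, 1000.0, 2000.0])  # e.g. a deductible-like column
scaled = (x - x.mean()) / x.std()      # the z-score StandardScaler applies per column
print(scaled.mean(), scaled.std())     # ~0.0 and 1.0
```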

Handling Imbalanced Data

Use Random Over Sampling to balance the target classes:
def handle_imbalanced_dataset(self, x, y):
    self.logger_object.log(self.file_object,
                          'Entered the handle_imbalanced_dataset method of the Preprocessor class')
    
    try:
        self.rdsmple = RandomOverSampler()
        self.x_sampled, self.y_sampled = self.rdsmple.fit_resample(x, y)  # fit_sample() was removed in imbalanced-learn 0.8
        
        self.logger_object.log(self.file_object,
                              'dataset balancing successful. Exited the handle_imbalanced_dataset method of the Preprocessor class')
        return self.x_sampled, self.y_sampled
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in handle_imbalanced_dataset method of the Preprocessor class. Exception message: ' + str(e))
        raise Exception()
RandomOverSampler duplicates minority class samples to balance the dataset. This helps prevent models from being biased toward the majority class.
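What the oversampling does internally, sketched in plain numpy (RandomOverSampler handles this for you; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)      # imbalanced target: 90 majority vs 10 minority
X = np.arange(len(y)).reshape(-1, 1)

# Randomly duplicate minority rows (with replacement) until the classes match
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=90 - 10, replace=True)
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])
print(np.bincount(y_balanced))  # [90 90]
```

Because the extra rows are exact duplicates, oversampling should be applied only to the training split, never before the train/test split, to avoid leaking duplicated rows into the test set.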

Complete Preprocessing Example

From trainingModel.py, here’s how preprocessing is applied:
preprocessor = preprocessing.Preprocessor(self.file_object, self.log_writer)

# 1. Remove unwanted columns
data = preprocessor.remove_columns(data, ['policy_number', 'policy_bind_date', ...])

# 2. Replace '?' with NaN
data.replace('?', np.nan, inplace=True)  # np.NaN alias was removed in NumPy 2.0

# 3. Check for missing values
is_null_present, cols_with_missing_values = preprocessor.is_null_present(data)

# 4. Impute missing values if present
if (is_null_present):
    data = preprocessor.impute_missing_values(data, cols_with_missing_values)

# 5. Encode categorical data
data = preprocessor.encode_categorical_columns(data)

# 6. Separate features and labels
X, Y = preprocessor.separate_label_feature(data, label_column_name='fraud_reported')

# 7. Scale numerical columns (done per cluster after split)
x_train = preprocessor.scale_numerical_columns(x_train)
x_test = preprocessor.scale_numerical_columns(x_test)

Next Steps

After preprocessing:
  1. Data is ready for clustering
  2. Features are in numerical format
  3. Missing values are handled
  4. Numerical features are scaled
Proceed to clustering to create cluster-specific models.
