
Overview

The Preprocessor class provides comprehensive data cleaning and transformation capabilities for preparing raw fraud detection data for machine learning. It handles missing values, categorical encoding, feature scaling, and dataset balancing.

Class: Preprocessor

Location: source/data_preprocessing/preprocessing.py
Version: 1.0

Constructor

Preprocessor(file_object, logger_object)
Parameters:
  • file_object (File, required): File object for logging operations
  • logger_object (Logger, required): Logger instance for tracking preprocessing steps

Methods

remove_columns()

Removes specified columns from a pandas DataFrame.
remove_columns(data, columns)
Parameters:
  • data (pandas.DataFrame, required): Input DataFrame to process
  • columns (list, required): List of column names to remove
Returns:
  • pandas.DataFrame: DataFrame with the specified columns removed
Example Usage:
preprocessor = Preprocessor(file_object, logger_object)
df_cleaned = preprocessor.remove_columns(data, ['unwanted_col1', 'unwanted_col2'])
Implementation:
self.useful_data = self.data.drop(labels=self.columns, axis=1)
return self.useful_data

remove_unwanted_spaces()

Removes leading and trailing whitespace from all string columns in the DataFrame.
remove_unwanted_spaces(data)
Parameters:
  • data (pandas.DataFrame, required): Input DataFrame with potential whitespace issues
Returns:
  • pandas.DataFrame: DataFrame with stripped string values
Example Usage:
df_trimmed = preprocessor.remove_unwanted_spaces(data)
Implementation:
self.df_without_spaces = self.data.apply(
    lambda x: x.str.strip() if x.dtype == "object" else x
)
return self.df_without_spaces
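The same apply-based pattern can be tried standalone. This minimal sketch (toy column names, not from the fraud dataset) shows that object-dtype columns are stripped while numeric columns pass through unchanged:

```python
import pandas as pd

# Toy frame with padded strings and a numeric column (illustrative names only)
df = pd.DataFrame({
    "state": ["  OH", "IN  ", " IL "],
    "claim": [1200, 800, 950],
})

# Strip whitespace only from string (object-dtype) columns
stripped = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
print(stripped["state"].tolist())  # → ['OH', 'IN', 'IL']
```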

separate_label_feature()

Separates feature columns from the target label column.
separate_label_feature(data, label_column_name)
Parameters:
  • data (pandas.DataFrame, required): Complete dataset with features and labels
  • label_column_name (str, required): Name of the target column to separate
Returns:
  • tuple: (X, Y), where X is the features DataFrame and Y is the target Series
Example Usage:
X, Y = preprocessor.separate_label_feature(data, 'fraud_reported')
Implementation:
self.X = data.drop(labels=label_column_name, axis=1)
self.Y = data[label_column_name]
return self.X, self.Y

is_null_present()

Checks for null values in the DataFrame and saves a report to CSV.
is_null_present(data)
Parameters:
  • data (pandas.DataFrame, required): DataFrame to check for missing values
Returns:
  • tuple: (null_present: bool, cols_with_missing_values: list)
Example Usage:
has_nulls, null_columns = preprocessor.is_null_present(data)
if has_nulls:
    print(f"Columns with nulls: {null_columns}")
Implementation:
self.null_present = False
self.cols_with_missing_values = []
self.null_counts = data.isna().sum()
for col in data.columns:
    if self.null_counts[col] > 0:
        self.null_present = True
        self.cols_with_missing_values.append(col)

if self.null_present:
    self.dataframe_with_null = pd.DataFrame()
    self.dataframe_with_null['columns'] = data.columns
    self.dataframe_with_null['missing values count'] = np.asarray(data.isna().sum())
    self.dataframe_with_null.to_csv('preprocessing_data/null_values.csv')

return self.null_present, self.cols_with_missing_values
Output File: preprocessing_data/null_values.csv - CSV report of null value counts per column
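For reference, the same null check can be expressed without an explicit loop. This is a sketch of an equivalent vectorized approach, not the class's actual code:

```python
import pandas as pd

# Toy frame with one column containing a missing value
df = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"]})

# Count nulls per column, then keep only columns with at least one null
null_counts = df.isna().sum()
cols_with_missing = null_counts[null_counts > 0].index.tolist()
null_present = len(cols_with_missing) > 0
print(null_present, cols_with_missing)  # → True ['a']
```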

impute_missing_values()

Imputes missing values in the specified columns using sklearn_pandas's CategoricalImputer.
impute_missing_values(data, cols_with_missing_values)
Parameters:
  • data (pandas.DataFrame, required): DataFrame with missing values
  • cols_with_missing_values (list, required): List of column names containing null values
Returns:
  • pandas.DataFrame: DataFrame with imputed values
Example Usage:
has_nulls, null_cols = preprocessor.is_null_present(data)
if has_nulls:
    data = preprocessor.impute_missing_values(data, null_cols)
Implementation:
self.imputer = CategoricalImputer()
for col in self.cols_with_missing_values:
    self.data[col] = self.imputer.fit_transform(self.data[col])
return self.data
Uses sklearn_pandas.CategoricalImputer, which fills each column's missing values with its most frequent value.
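Note that CategoricalImputer was removed from later releases of sklearn_pandas. A hedged, roughly equivalent sketch using scikit-learn's SimpleImputer with the most-frequent strategy (the same fill behavior) would be:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative data only)
df = pd.DataFrame({"property_damage": ["YES", np.nan, "NO", "YES"]})

# Fill missing entries with the column's most frequent value
imputer = SimpleImputer(strategy="most_frequent")
df["property_damage"] = imputer.fit_transform(df[["property_damage"]]).ravel()
print(df["property_damage"].tolist())  # → ['YES', 'YES', 'NO', 'YES']
```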

encode_categorical_columns()

Encodes categorical variables to numerical values using ordinal encoding and one-hot encoding.
encode_categorical_columns(data)
Parameters:
  • data (pandas.DataFrame, required): DataFrame with categorical columns
Returns:
  • pandas.DataFrame: DataFrame with all categorical values converted to numerical
Encoding Mappings:
  • policy_csl: {'100/300': 1, '250/500': 2.5, '500/1000': 5}
  • insured_education_level: {'JD': 1, 'High School': 2, 'College': 3, 'Masters': 4, 'Associate': 5, 'MD': 6, 'PhD': 7}
  • incident_severity: {'Trivial Damage': 1, 'Minor Damage': 2, 'Major Damage': 3, 'Total Loss': 4}
  • insured_sex: {'FEMALE': 0, 'MALE': 1}
  • property_damage: {'NO': 0, 'YES': 1}
  • police_report_available: {'NO': 0, 'YES': 1}
  • fraud_reported: {'N': 0, 'Y': 1} (training only)
Example Usage:
df_encoded = preprocessor.encode_categorical_columns(data)
Implementation:
self.cat_df = self.data.select_dtypes(include=['object']).copy()
self.cat_df['policy_csl'] = self.cat_df['policy_csl'].map(
    {'100/300': 1, '250/500': 2.5, '500/1000': 5}
)
# ... additional ordinal mappings ...

# One-hot encoding for remaining categorical columns
for col in self.cat_df.drop(columns=self.cols_to_drop).columns:
    self.cat_df = pd.get_dummies(
        self.cat_df, 
        columns=[col], 
        prefix=[col], 
        drop_first=True
    )

self.data.drop(columns=self.data.select_dtypes(include=['object']).columns, inplace=True)
self.data = pd.concat([self.cat_df, self.data], axis=1)
return self.data
The method automatically handles both training (with fraud_reported) and prediction modes
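The mapping-plus-dummies approach can be illustrated on a toy frame. This sketch uses columns from the mappings above plus an assumed nominal column (`incident_type`); the loop and `self.` attributes of the real method are omitted:

```python
import pandas as pd

df = pd.DataFrame({
    "policy_csl": ["100/300", "500/1000"],
    "insured_sex": ["FEMALE", "MALE"],
    "incident_type": ["Parked Car", "Collision"],  # assumed nominal column
})

# Ordinal mappings for columns with a natural order
df["policy_csl"] = df["policy_csl"].map({"100/300": 1, "250/500": 2.5, "500/1000": 5})
df["insured_sex"] = df["insured_sex"].map({"FEMALE": 0, "MALE": 1})

# One-hot encode the remaining nominal column, dropping the first level
df = pd.get_dummies(df, columns=["incident_type"], drop_first=True)
print(df.columns.tolist())
```

With `drop_first=True`, the first category (alphabetically, "Collision") is dropped, leaving a single `incident_type_Parked Car` indicator column.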

scale_numerical_columns()

Scales numerical features using StandardScaler for normalization.
scale_numerical_columns(data)
Parameters:
  • data (pandas.DataFrame, required): DataFrame with numerical columns to scale
Returns:
  • pandas.DataFrame: DataFrame with scaled numerical features
Scaled Features:
  • months_as_customer
  • policy_deductable
  • umbrella_limit
  • capital-gains
  • capital-loss
  • incident_hour_of_the_day
  • number_of_vehicles_involved
  • bodily_injuries
  • witnesses
  • injury_claim
  • property_claim
  • vehicle_claim
Example Usage:
df_scaled = preprocessor.scale_numerical_columns(data)
Implementation:
self.num_df = self.data[[
    'months_as_customer', 'policy_deductable', 'umbrella_limit',
    'capital-gains', 'capital-loss', 'incident_hour_of_the_day',
    'number_of_vehicles_involved', 'bodily_injuries', 'witnesses', 
    'injury_claim', 'property_claim', 'vehicle_claim'
]]

self.scaler = StandardScaler()
self.scaled_data = self.scaler.fit_transform(self.num_df)
self.scaled_num_df = pd.DataFrame(
    data=self.scaled_data, 
    columns=self.num_df.columns, 
    index=self.data.index
)
self.data.drop(columns=self.scaled_num_df.columns, inplace=True)
self.data = pd.concat([self.scaled_num_df, self.data], axis=1)
return self.data
Uses sklearn.preprocessing.StandardScaler which standardizes features by removing the mean and scaling to unit variance
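As a quick sanity check (a sketch, not part of the class), a standardized column comes out with mean 0 and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values, e.g. a policy_deductable column
X = np.array([[500.0], [1000.0], [2000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# StandardScaler subtracts the mean and divides by the (population) std
assert abs(X_scaled.mean()) < 1e-9
assert abs(X_scaled.std() - 1.0) < 1e-9
```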

handle_imbalanced_dataset()

Balances the dataset using Random Over Sampling to address class imbalance.
handle_imbalanced_dataset(x, y)
Parameters:
  • x (pandas.DataFrame, required): Feature matrix
  • y (pandas.Series, required): Target labels
Returns:
  • tuple: (x_sampled, y_sampled) with a balanced class distribution
Example Usage:
X_balanced, Y_balanced = preprocessor.handle_imbalanced_dataset(X, Y)
print(f"Original shape: {X.shape}, Balanced shape: {X_balanced.shape}")
Implementation:
self.rdsmple = RandomOverSampler()
# fit_resample replaced fit_sample in imblearn 0.6+
self.x_sampled, self.y_sampled = self.rdsmple.fit_resample(x, y)
return self.x_sampled, self.y_sampled
Over-sampling increases the dataset size. Ensure sufficient memory is available.
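If imblearn is unavailable, the idea behind random over-sampling can be sketched with pandas alone: resample minority-class rows with replacement until the classes match. This is illustrative only; RandomOverSampler is what the class actually uses:

```python
import pandas as pd

# Toy imbalanced dataset: 4 majority rows, 2 minority rows
X = pd.DataFrame({"f1": [1, 2, 3, 4, 5, 6]})
y = pd.Series([0, 0, 0, 0, 1, 1], name="fraud_reported")

counts = y.value_counts()
minority = counts.idxmin()
deficit = counts.max() - counts.min()

# Randomly duplicate minority-class rows until the classes are balanced
extra_idx = y[y == minority].sample(n=deficit, replace=True, random_state=42).index
X_bal = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
y_bal = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
print(y_bal.value_counts().tolist())  # → [4, 4]
```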

Complete Preprocessing Pipeline

Here’s a typical preprocessing workflow:
from data_preprocessing.preprocessing import Preprocessor

# Initialize
preprocessor = Preprocessor(file_object, logger_object)

# 1. Remove unwanted spaces
data = preprocessor.remove_unwanted_spaces(data)

# 2. Remove unnecessary columns
data = preprocessor.remove_columns(data, ['policy_number'])

# 3. Check and handle missing values
has_nulls, null_cols = preprocessor.is_null_present(data)
if has_nulls:
    data = preprocessor.impute_missing_values(data, null_cols)

# 4. Encode categorical variables
data = preprocessor.encode_categorical_columns(data)

# 5. Scale numerical features
data = preprocessor.scale_numerical_columns(data)

# 6. Separate features and labels
X, Y = preprocessor.separate_label_feature(data, 'fraud_reported')

# 7. Handle imbalanced dataset
X_balanced, Y_balanced = preprocessor.handle_imbalanced_dataset(X, Y)

Dependencies

import pandas as pd
import numpy as np
from sklearn_pandas import CategoricalImputer
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
