Overview
Data preprocessing is a critical step to ensure the model receives clean, consistent, and properly formatted data. This page documents all preprocessing steps applied to the fraud detection dataset.
Feature Selection
The first step is to identify which features to use for modeling:
- Only use features present in both train and test sets
- Exclude `id`, as it’s just an identifier with no predictive value
- Exclude `FRAUDE`, as it’s the target variable
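The selection rules above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the toy frames and the column names `monto`, `canal`, and `solo_train` are made up for the example; only `id` and `FRAUDE` come from this page.

```python
import pandas as pd

# Toy frames standing in for the real train/test files (values are made up).
train = pd.DataFrame({
    "id": [1, 2, 3],
    "monto": [100.0, 250.0, 80.0],
    "canal": ["web", "app", "web"],
    "solo_train": [0, 1, 0],  # exists only in train, so it is not selected
    "FRAUDE": [0, 1, 0],
})
test = pd.DataFrame({
    "id": [4, 5],
    "monto": [60.0, 300.0],
    "canal": ["app", "web"],
})

excluded = {"id", "FRAUDE"}  # identifier and target variable

# Keep train's column order; keep only columns that also exist in test.
features = [c for c in train.columns if c in test.columns and c not in excluded]

X_train = train[features].copy()
y_train = train["FRAUDE"]
X_test = test[features].copy()
```

Building the list from `train.columns` keeps a stable column order, which a plain set intersection would not guarantee.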
Handling Missing Values
Missing data can negatively impact model performance. Different strategies are applied depending on the data type.
Numeric Features
For numeric columns, missing values are replaced with the mean of that column:
- Simple and effective for numeric data
- Preserves the distribution’s central tendency
- Uses train mean for test data to prevent data leakage
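A minimal sketch of mean imputation, assuming pandas and two hypothetical numeric columns (`monto`, `edad`). The point it illustrates is that the means are computed on the training set only and then reused for the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features with gaps (values are made up).
X_train = pd.DataFrame({"monto": [100.0, np.nan, 300.0], "edad": [30.0, 40.0, np.nan]})
X_test = pd.DataFrame({"monto": [np.nan, 50.0], "edad": [np.nan, 25.0]})

num_cols = X_train.select_dtypes(include="number").columns

# Means come from train only; mean() skips NaN by default.
train_means = X_train[num_cols].mean()

# The same train means fill both sets, so no test statistic leaks into the model.
X_train[num_cols] = X_train[num_cols].fillna(train_means)
X_test[num_cols] = X_test[num_cols].fillna(train_means)
```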
Categorical Features
For categorical columns, missing values are replaced with the label “Desconocido” (Unknown):
- Creates an explicit category for missing values
- Allows the model to learn patterns associated with missing data
- More informative than dropping rows with missing values
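The categorical case can be sketched the same way. The column name `canal` is a hypothetical stand-in; the fill label is the one named on this page.

```python
import pandas as pd

# Hypothetical categorical feature with gaps (values are made up).
X_train = pd.DataFrame({"canal": ["web", None, "app"]})
X_test = pd.DataFrame({"canal": [None, "web"]})

cat_cols = ["canal"]

# Missing entries become their own explicit category instead of being dropped.
X_train[cat_cols] = X_train[cat_cols].fillna("Desconocido")
X_test[cat_cols] = X_test[cat_cols].fillna("Desconocido")
```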
One-Hot Encoding
Most machine learning models require numeric inputs, so categorical variables are converted to binary columns:
| Original | Category_A | Category_B | Category_C |
|---|---|---|---|
| A | 1 | 0 | 0 |
| B | 0 | 1 | 0 |
| C | 0 | 0 | 1 |
With `drop_first=True`, the first category is dropped to avoid multicollinearity:
| Original | Category_B | Category_C |
|---|---|---|
| A | 0 | 0 |
| B | 1 | 0 |
| C | 0 | 1 |
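Both tables can be reproduced with `pandas.get_dummies`. The snippet below is a small sketch on a three-row frame; `dtype=int` is passed so the output shows 0/1 rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "C"]})

# One binary column per category.
full = pd.get_dummies(df, columns=["Category"], dtype=int)

# Same encoding without the first category; a row of all zeros now means "A".
reduced = pd.get_dummies(df, columns=["Category"], drop_first=True, dtype=int)
```

Dropping one column loses no information, because the dropped category is implied when all remaining columns are 0.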
Train-Test Alignment
Alignment is a critical step that ensures the test set has the same columns as the training set:
- One-hot encoding may create different columns if categories differ between train/test
- Missing columns in test are filled with 0 (indicating absence of that category)
- Extra columns in test are removed
- Ensures model receives expected input shape
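One way to implement this alignment is `DataFrame.reindex`, which adds missing columns, drops extra ones, and enforces the training column order in a single call. The sketch below uses a hypothetical `canal` column. It also makes one assumption this page does not state: the test set is encoded without `drop_first`, so that the alignment step alone decides which columns survive.

```python
import pandas as pd

# Hypothetical data: "tienda" appears only in train, "telefono" only in test.
train = pd.DataFrame({"canal": ["app", "tienda", "web"]})
test = pd.DataFrame({"canal": ["app", "telefono", "web"]})

X_train = pd.get_dummies(train, columns=["canal"], drop_first=True, dtype=int)

# Assumption of this sketch: keep every test level here and let reindex trim.
X_test = pd.get_dummies(test, columns=["canal"], dtype=int)

# Missing columns are added as 0, extra columns are dropped,
# and the column order is forced to match train.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

The reason for that assumption: `drop_first=True` drops the first level found in each frame, so when train and test contain different categories, encoding both with it can drop a different level in each. Rows of the level dropped from test would then be all zeros after alignment.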
Complete Preprocessing Pipeline
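The sketch below strings the steps from the previous sections into one function. It is an illustrative reconstruction under stated assumptions, not the project's original code: the function name `preprocess`, its parameters, and the toy columns `monto` and `canal` are hypothetical, and the test set is encoded without `drop_first` so that alignment removes the baseline column.

```python
import numpy as np
import pandas as pd


def preprocess(train, test, target="FRAUDE", id_col="id"):
    """Hypothetical helper: returns (X_train, y_train, X_test)."""
    # 1. Feature selection: shared columns, minus identifier and target.
    features = [c for c in train.columns
                if c in test.columns and c not in (id_col, target)]
    X_train, X_test = train[features].copy(), test[features].copy()
    y_train = train[target]

    # 2. Missing values: train mean for numeric columns,
    #    the label "Desconocido" for all other columns.
    num_cols = X_train.select_dtypes(include="number").columns
    cat_cols = [c for c in features if c not in num_cols]
    fill_values = X_train[num_cols].mean().to_dict()
    fill_values.update({c: "Desconocido" for c in cat_cols})
    X_train = X_train.fillna(fill_values)
    X_test = X_test.fillna(fill_values)

    # 3. One-hot encoding. Train drops its first level; test keeps every
    #    level (assumption of this sketch) and is trimmed in step 4.
    X_train = pd.get_dummies(X_train, columns=cat_cols, drop_first=True, dtype=int)
    X_test = pd.get_dummies(X_test, columns=cat_cols, dtype=int)

    # 4. Alignment: same columns, same order, absent categories filled with 0.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
    return X_train, y_train, X_test


# Toy data (values are made up).
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "monto": [100.0, np.nan, 300.0, 200.0],
    "canal": ["web", None, "app", "web"],
    "FRAUDE": [0, 1, 0, 0],
})
test = pd.DataFrame({
    "id": [5, 6],
    "monto": [np.nan, 50.0],
    "canal": ["telefono", "app"],
})

X_train, y_train, X_test = preprocess(train, test)
```

In this toy run the unseen test category `telefono` has no column of its own after alignment, so that row is encoded as all zeros in the `canal_*` columns.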
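The full pipeline applies the steps described on this page in order: feature selection, missing-value imputation, one-hot encoding, and train-test alignment.
Data Quality Checks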
After preprocessing:
- No missing values remain in the dataset
- All features are numeric
- Train and test have identical column structures
- Data is ready for model training
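These checks can be turned into assertions that run right after preprocessing. The helper below is a hypothetical sketch (the name `check_preprocessed` and the toy frames are not from the project); it fails loudly if any of the first three conditions is violated.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_preprocessed(X_train, X_test):
    """Hypothetical helper: raises AssertionError if a quality check fails."""
    for name, frame in (("train", X_train), ("test", X_test)):
        assert not frame.isna().any().any(), f"{name} still has missing values"
        assert all(is_numeric_dtype(frame[c]) for c in frame.columns), \
            f"{name} has non-numeric columns"
    assert list(X_train.columns) == list(X_test.columns), "train/test columns differ"


# Toy frames shaped like preprocessed output (values are made up).
X_train = pd.DataFrame({"monto": [100.0, 200.0], "canal_web": [1, 0]})
X_test = pd.DataFrame({"monto": [50.0], "canal_web": [0]})

check_preprocessed(X_train, X_test)  # passes silently when all checks hold
```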