Dataset Overview
The fraud detection dataset contains transaction records with features used to predict fraudulent activity.
Files
- train.csv - Training dataset with labels
- test.csv - Test dataset without labels
- test_evaluado.csv - Model predictions on test set
Input Features
The dataset includes various features describing transactions. Features are a mix of numeric and categorical variables.
Feature Selection
Feature Types
Numeric Features
- Missing values filled with column mean
- No scaling required for Random Forest model
Categorical Features
- Missing values filled with “Desconocido” (Unknown)
- One-hot encoded using pd.get_dummies()
- Test set aligned with training columns
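The preprocessing described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names `monto` and `canal` and the toy rows are hypothetical, and reusing the training-set mean to fill the test set is an assumption (the text only says "column mean").

```python
import pandas as pd


def preprocess(train: pd.DataFrame, test: pd.DataFrame, target: str = "FRAUDE"):
    """Impute missing values, one-hot encode, and align test columns to train."""
    X_train = train.drop(columns=[target])
    y_train = train[target]
    X_test = test.copy()

    num_cols = X_train.select_dtypes(include="number").columns
    cat_cols = X_train.columns.difference(num_cols)

    # Numeric features: fill with the column mean.
    # Assumption: the mean computed on the training set is reused for the test set.
    means = X_train[num_cols].mean()
    X_train[num_cols] = X_train[num_cols].fillna(means)
    X_test[num_cols] = X_test[num_cols].fillna(means)

    # Categorical features: fill with the "Desconocido" (Unknown) placeholder.
    X_train[cat_cols] = X_train[cat_cols].fillna("Desconocido")
    X_test[cat_cols] = X_test[cat_cols].fillna("Desconocido")

    # One-hot encode, then give the test set exactly the training columns:
    # categories unseen in training are dropped, missing ones are filled with 0.
    X_train = pd.get_dummies(X_train, columns=list(cat_cols), dtype=int)
    X_test = pd.get_dummies(X_test, columns=list(cat_cols), dtype=int)
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
    return X_train, y_train, X_test


# Hypothetical toy data, for illustration only.
train = pd.DataFrame({
    "monto": [100.0, None, 300.0],
    "canal": ["web", None, "app"],
    "FRAUDE": [0, 1, 0],
})
test = pd.DataFrame({"monto": [None, 50.0], "canal": ["tienda", "web"]})

X_train, y_train, X_test = preprocess(train, test)
```

Reindexing against the training columns keeps the feature matrix shape identical at fit and predict time, which is what a fitted model expects.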
Target Variable
FRAUDE Column
| Value | Description |
|---|---|
| 0 | Legitimate transaction |
| 1 | Fraudulent transaction |
Class Distribution
The dataset exhibits class imbalance: fraud cases are the minority class. This is addressed with class_weight='balanced'.
Output Format
test_evaluado.csv
Prediction file containing the model's results.
Schema
| Column | Type | Description |
|---|---|---|
| id | bigint | Transaction identifier |
| FRAUDE | integer | Predicted fraud label (0 or 1) |
Example Data
- Total records: 100
- Predicted fraud (1): Varies by model performance
- Predicted legitimate (0): Varies by model performance
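A prediction file can be checked against the schema above with a few lines of pandas. The sketch below is illustrative: `check_predictions` is a hypothetical helper, and the three in-memory rows stand in for test_evaluado.csv and are made up.

```python
import io

import pandas as pd


def check_predictions(df: pd.DataFrame) -> None:
    """Raise ValueError if df does not match the id/FRAUDE schema."""
    if list(df.columns) != ["id", "FRAUDE"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if not pd.api.types.is_integer_dtype(df["id"]):
        raise ValueError("id must be an integer column")
    if not df["FRAUDE"].isin([0, 1]).all():
        raise ValueError("FRAUDE must contain only 0 or 1")


# Stand-in for test_evaluado.csv; these rows are invented for illustration.
sample = io.StringIO("id,FRAUDE\n1001,0\n1002,1\n1003,0\n")
preds = pd.read_csv(sample)

check_predictions(preds)
counts = preds["FRAUDE"].value_counts().to_dict()  # predicted labels per class
```

On the real file, the same `value_counts()` call gives the predicted fraud and legitimate totals listed above.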
Data Processing Pipeline
1. Missing Value Treatment
2. One-Hot Encoding
3. Train-Test Split
Model Performance Metrics
Confusion Matrix
- True Negatives: 436
- False Positives: 11
- False Negatives: 17
- True Positives: 129
Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 (Legitimate) | 0.96 | 0.98 | 0.97 | 447 |
| 1 (Fraud) | 0.92 | 0.88 | 0.90 | 146 |
| Accuracy | | | 0.95 | 593 |
| Macro Avg | 0.94 | 0.93 | 0.94 | 593 |
| Weighted Avg | 0.95 | 0.95 | 0.95 | 593 |
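The per-class figures in the report follow directly from the confusion matrix. As a worked check, using only the four counts listed above (the helper name `precision_recall_f1` is illustrative):

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 436, 11, 17, 129


def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 (fraud): positives are fraud cases.
fraud = precision_recall_f1(tp, fp, fn)
# Class 0 (legitimate): the roles of the two classes are swapped.
legit = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print([round(v, 2) for v in fraud])  # [0.92, 0.88, 0.9]
print([round(v, 2) for v in legit])  # [0.96, 0.98, 0.97]
print(round(accuracy, 2))            # 0.95
```

The supports also match: 436 + 11 = 447 legitimate and 17 + 129 = 146 fraud cases, 593 in total.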
AUC-ROC Score
0.9881 - Excellent discrimination between fraudulent and legitimate transactions
Usage Example
Generate Predictions
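A minimal sketch of this step, assuming a scikit-learn `RandomForestClassifier` configured with the settings named elsewhere on this page (`class_weight='balanced'`, random state 42). The helper `generate_predictions`, the `monto` column, the toy rows, and the idea that identifiers come from an `id` column are all assumptions for illustration. Other hyperparameters are left at library defaults because they are not stated here.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def generate_predictions(X_train, y_train, X_test, ids) -> pd.DataFrame:
    """Fit a Random Forest and return predictions in the id/FRAUDE schema."""
    model = RandomForestClassifier(class_weight="balanced", random_state=42)
    model.fit(X_train, y_train)
    labels = model.predict(X_test).astype(int)
    return pd.DataFrame({"id": list(ids), "FRAUDE": labels})


# Hypothetical, already-preprocessed toy data.
X_train = pd.DataFrame({"monto": [10.0, 12.0, 11.0, 500.0, 520.0, 9.0]})
y_train = pd.Series([0, 0, 0, 1, 1, 0])
X_test = pd.DataFrame({"monto": [10.5, 510.0]})

predictions = generate_predictions(X_train, y_train, X_test, ids=[1, 2])
```

Calling `predictions.to_csv("test_evaluado.csv", index=False)` would then write the file in the output format described above.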
Data Quality Notes
- Missing Values: Present in both numeric and categorical features
- Class Imbalance: Fraud cases are the minority class (addressed via class_weight='balanced')
- Feature Consistency: Test set features aligned with the training set using reindex()
- Reproducibility: Random state set to 42 for consistent results
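To make the effect of `class_weight='balanced'` concrete: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula in plain Python. The class counts are borrowed from the classification report's support column purely as an illustration; the class counts of the data the model was actually fitted on are not stated here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * class_count), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Illustrative label vector: 447 legitimate and 146 fraud cases.
labels = [0] * 447 + [1] * 146
weights = balanced_class_weights(labels)

print({cls: round(w, 3) for cls, w in weights.items()})  # {0: 0.663, 1: 2.031}
```

With these counts, each fraud case carries roughly three times the weight of a legitimate one, so both classes contribute equally in total.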