Overview
The Model_Finder class compares two machine learning algorithms (XGBoost and SVM) for each cluster and selects the best performer based on AUC score. Both models undergo hyperparameter tuning using GridSearchCV.
Model_Finder Class
Implemented in best_model_finder/tuner.py:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score


class Model_Finder:
    def __init__(self, file_object, logger_object):
        self.file_object = file_object
        self.logger_object = logger_object
        self.sv_classifier = SVC()
        self.xgb = XGBClassifier(objective='binary:logistic', n_jobs=-1)
```
Model Selection Pipeline
1. **Train XGBoost**: tune hyperparameters with GridSearchCV and train the XGBoost model
2. **Evaluate XGBoost**: calculate the AUC score on the test set
3. **Train SVM**: tune hyperparameters with GridSearchCV and train the SVM model
4. **Evaluate SVM**: calculate the AUC score on the test set
5. **Compare Models**: select the model with the higher AUC score
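The comparison step above can be sketched as a small helper that scores each fitted model on the test set and keeps the winner. This is an illustration only; `select_best_by_auc` is a hypothetical name, not part of the project code:

```python
from sklearn.metrics import roc_auc_score


def select_best_by_auc(models, test_x, test_y):
    """Return (name, model) for the fitted model with the highest test-set AUC.

    `models` is a dict mapping a model name to a fitted estimator.
    """
    best_name, best_model, best_score = None, None, -1.0
    for name, model in models.items():
        score = roc_auc_score(test_y, model.predict(test_x))
        if score > best_score:
            best_name, best_model, best_score = name, model, score
    return best_name, best_model
```

The project's `get_best_model()` (shown later) does the same thing inline for exactly two models.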
XGBoost Hyperparameter Tuning
Parameter Grid
```python
def get_best_params_for_xgboost(self, train_x, train_y):
    self.logger_object.log(self.file_object,
                           'Entered the get_best_params_for_xgboost method of the Model_Finder class')
    try:
        # Initializing with different combinations of parameters
        self.param_grid_xgboost = {
            "n_estimators": [100, 130],
            "criterion": ['gini', 'entropy'],
            "max_depth": range(8, 10, 1)  # [8, 9]
        }
```
**n_estimators** (number of boosting rounds):
- 100: faster training, may underfit
- 130: more iterations, better fit

**criterion** (split quality measure):
- gini: Gini impurity (faster)
- entropy: information gain (more computationally expensive)

Note: criterion is a scikit-learn decision-tree parameter; XGBClassifier does not actually use it, and recent xgboost versions warn that such unknown parameters may be ignored.

**max_depth** (maximum tree depth):
- 8: shallower trees, less overfitting
- 9: deeper trees, more complex patterns
GridSearchCV for XGBoost
```python
        # Creating an object of the Grid Search class
        self.grid = GridSearchCV(
            XGBClassifier(objective='binary:logistic'),
            self.param_grid_xgboost,
            verbose=3,
            cv=5
        )
        # Finding the best parameters
        self.grid.fit(train_x, train_y)

        # Extracting the best parameters
        self.criterion = self.grid.best_params_['criterion']
        self.max_depth = self.grid.best_params_['max_depth']
        self.n_estimators = self.grid.best_params_['n_estimators']

        # Creating a new model with the best parameters
        self.xgb = XGBClassifier(
            criterion=self.criterion,
            max_depth=self.max_depth,
            n_estimators=self.n_estimators,
            n_jobs=-1
        )
        # Training the new model
        self.xgb.fit(train_x, train_y)

        self.logger_object.log(self.file_object,
                               'XGBoost best params: ' + str(self.grid.best_params_) +
                               '. Exited the get_best_params_for_xgboost method of the Model_Finder class')
        return self.xgb
    except Exception as e:
        self.logger_object.log(self.file_object,
                               'Exception occurred in get_best_params_for_xgboost method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
```
GridSearchCV Configuration:
- cv=5: 5-fold cross-validation for robust parameter selection
- verbose=3: detailed logging of grid search progress
- objective='binary:logistic': binary classification task
- n_jobs=-1: use all CPU cores for parallel training

GridSearchCV tests all combinations: 2 × 2 × 2 = 8 parameter combinations × 5 folds = 40 model fits.
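The combination count can be verified directly with scikit-learn's ParameterGrid, which enumerates the same candidates that GridSearchCV will fit:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in get_best_params_for_xgboost above
param_grid_xgboost = {
    "n_estimators": [100, 130],
    "criterion": ['gini', 'entropy'],
    "max_depth": range(8, 10, 1),  # [8, 9]
}

n_combinations = len(list(ParameterGrid(param_grid_xgboost)))
print(n_combinations)      # 2 * 2 * 2 = 8 candidates
print(n_combinations * 5)  # 8 candidates * 5 folds = 40 fits
```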
SVM Hyperparameter Tuning
Parameter Grid
```python
def get_best_params_for_svm(self, train_x, train_y):
    self.logger_object.log(self.file_object,
                           'Entered the get_best_params_for_svm method of the Model_Finder class')
    try:
        # Initializing with different combinations of parameters
        self.param_grid = {
            "kernel": ['rbf', 'sigmoid'],
            "C": [0.1, 0.5, 1.0],
            "random_state": [0, 100, 200, 300]
        }
```
**kernel** (kernel function for transforming the input space):
- rbf: radial basis function (good for non-linear patterns)
- sigmoid: sigmoid kernel (neural-network-like)

**C** (regularization parameter, controls overfitting):
- 0.1: strong regularization (simpler model)
- 0.5: moderate regularization
- 1.0: weak regularization (more complex model)

**random_state** (random seed):
- Note: random_state is a seed rather than a capacity parameter; grid-searching over it amounts to picking a lucky seed and does not genuinely improve robustness.
GridSearchCV for SVM
```python
        # Creating an object of the Grid Search class
        self.grid = GridSearchCV(
            estimator=self.sv_classifier,
            param_grid=self.param_grid,
            cv=5,
            verbose=3
        )
        # Finding the best parameters
        self.grid.fit(train_x, train_y)

        # Extracting the best parameters
        self.kernel = self.grid.best_params_['kernel']
        self.C = self.grid.best_params_['C']
        self.random_state = self.grid.best_params_['random_state']

        # Creating a new model with the best parameters
        self.sv_classifier = SVC(
            kernel=self.kernel,
            C=self.C,
            random_state=self.random_state
        )
        # Training the new model
        self.sv_classifier.fit(train_x, train_y)

        self.logger_object.log(self.file_object,
                               'SVM best params: ' + str(self.grid.best_params_) +
                               '. Exited the get_best_params_for_svm method of the Model_Finder class')
        return self.sv_classifier
    except Exception as e:
        self.logger_object.log(self.file_object,
                               'Exception occurred in get_best_params_for_svm method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
```
GridSearchCV Configuration:
- cv=5: 5-fold cross-validation
- verbose=3: detailed progress logging

GridSearchCV tests all combinations: 2 × 3 × 4 = 24 parameter combinations × 5 folds = 120 model fits.
Model Comparison
After training both models, compare their AUC scores:
```python
def get_best_model(self, train_x, train_y, test_x, test_y):
    self.logger_object.log(self.file_object,
                           'Entered the get_best_model method of the Model_Finder class')
    try:
        # Create best model for XGBoost
        self.xgboost = self.get_best_params_for_xgboost(train_x, train_y)
        self.prediction_xgboost = self.xgboost.predict(test_x)

        # Calculate XGBoost score
        if len(test_y.unique()) == 1:
            # If there is only one label in y, use accuracy instead of AUC
            self.xgboost_score = accuracy_score(test_y, self.prediction_xgboost)
            self.logger_object.log(self.file_object,
                                   'Accuracy for XGBoost: ' + str(self.xgboost_score))
        else:
            # Use AUC score for normal cases
            self.xgboost_score = roc_auc_score(test_y, self.prediction_xgboost)
            self.logger_object.log(self.file_object,
                                   'AUC for XGBoost: ' + str(self.xgboost_score))

        # Create best model for SVM
        self.svm = self.get_best_params_for_svm(train_x, train_y)
        self.prediction_svm = self.svm.predict(test_x)

        # Calculate SVM score
        if len(test_y.unique()) == 1:
            self.svm_score = accuracy_score(test_y, self.prediction_svm)
            self.logger_object.log(self.file_object,
                                   'Accuracy for SVM: ' + str(self.svm_score))
        else:
            self.svm_score = roc_auc_score(test_y, self.prediction_svm)
            self.logger_object.log(self.file_object,
                                   'AUC for SVM: ' + str(self.svm_score))

        # Comparing the two models
        if self.svm_score < self.xgboost_score:
            return 'XGBoost', self.xgboost
        else:
            return 'SVM', self.sv_classifier
    except Exception as e:
        self.logger_object.log(self.file_object,
                               'Exception occurred in get_best_model method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
```
AUC Score vs Accuracy
**Special Case Handling:** If the test set contains only one class (all fraud or all legitimate), the system falls back to accuracy because AUC is undefined when only a single class is present.

```python
if len(test_y.unique()) == 1:
    score = accuracy_score(test_y, predictions)  # AUC undefined: fall back to accuracy
else:
    score = roc_auc_score(test_y, predictions)
```
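The fallback logic can be isolated into a small standalone helper. This is a sketch; `score_predictions` is a hypothetical name, not a function in the project:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score


def score_predictions(test_y, predictions):
    """Return AUC when both classes are present, otherwise fall back to accuracy."""
    if len(pd.Series(test_y).unique()) == 1:
        # roc_auc_score raises ValueError with a single class, so use accuracy
        return accuracy_score(test_y, predictions)
    return roc_auc_score(test_y, predictions)
```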
Evaluation Metrics
AUC (Area Under ROC Curve)
Primary Metric: Used for model comparison
- Range: 0.0 to 1.0
- Interpretation:
  - 0.5 = Random guessing
  - 0.7-0.8 = Acceptable
  - 0.8-0.9 = Excellent
  - 0.9+ = Outstanding
- Advantage: Threshold-independent, evaluates across all classification thresholds

Note: the implementation above passes hard predict() labels to roc_auc_score; computing AUC from probability scores instead (e.g. predict_proba(test_x)[:, 1]) is what makes the metric truly threshold-independent.
Why AUC for Fraud Detection?
- Class Imbalance: Fraud cases are rare; AUC handles imbalanced datasets well
- Threshold Flexibility: Allows adjusting sensitivity vs. specificity based on business needs
- Comprehensive: Evaluates model performance across all possible thresholds
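To see why accuracy alone is misleading under class imbalance, consider a useless classifier that never flags fraud (the numbers below are illustrative only):

```python
from sklearn.metrics import roc_auc_score, accuracy_score

# 95 legitimate transactions, 5 fraudulent ones
y_true = [0] * 95 + [1] * 5

# A model that always predicts "legitimate"
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(roc_auc_score(y_true, y_pred))   # 0.5  -- no better than random guessing
```

AUC exposes the model as worthless, while accuracy rewards it for matching the majority class.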
Model Selection Example
Cluster 0 Results

```
XGBoost Training:
├─ Best Params: {n_estimators: 130, criterion: 'gini', max_depth: 9}
├─ Training Time: 45s
└─ AUC Score: 0.94

SVM Training:
├─ Best Params: {kernel: 'rbf', C: 1.0, random_state: 100}
├─ Training Time: 78s
└─ AUC Score: 0.89

🏆 Winner: XGBoost (saved as XGBoost0)
```

Cluster 1 Results

```
XGBoost Training:
├─ Best Params: {n_estimators: 100, criterion: 'entropy', max_depth: 8}
├─ Training Time: 38s
└─ AUC Score: 0.88

SVM Training:
├─ Best Params: {kernel: 'rbf', C: 0.5, random_state: 200}
├─ Training Time: 65s
└─ AUC Score: 0.91

🏆 Winner: SVM (saved as SVM1)
```
Hyperparameter Tuning Comparison
| Model | Parameters Tested | Combinations | CV Folds | Total Fits |
|---|---|---|---|---|
| XGBoost | 3 (n_estimators, criterion, max_depth) | 8 | 5 | 40 |
| SVM | 3 (kernel, C, random_state) | 24 | 5 | 120 |
Grid Search Progress
With verbose=3, you'll see detailed output:

```
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.847
[CV 2/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.852
[CV 3/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.839
...
```
Best Practices
- **Cross-Validation**: use 5-fold CV to ensure parameter selection is robust
- **Multiple Algorithms**: compare different algorithms (XGBoost vs SVM) rather than assuming one is best
- **Comprehensive Grid**: test multiple parameter values to explore the parameter space
- **Log Everything**: record best parameters and scores for reproducibility and debugging
Model Selection Output
The get_best_model() method returns:
```python
best_model_name, best_model = model_finder.get_best_model(
    x_train, y_train, x_test, y_test
)
# Returns:
# best_model_name: 'XGBoost' or 'SVM'
# best_model: trained model object ready for prediction
```

This is then saved with the cluster ID:

```python
file_op.save_model(best_model, best_model_name + str(cluster_id))
# Examples: 'XGBoost0', 'SVM1', 'XGBoost2'
```
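save_model belongs to the project's file-operations utility. A minimal equivalent using pickle might look like the sketch below; the models directory, the .sav extension, and the function body are assumptions, not the project's actual implementation:

```python
import os
import pickle


def save_model(model, filename, model_dir="models"):
    """Pickle a trained model to model_dir/<filename>.sav (sketch only).

    The directory layout and extension are assumed, not taken from the project.
    """
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename + ".sav")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

# e.g. save_model(best_model, 'XGBoost' + str(0)) would write models/XGBoost0.sav
```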
Training Time Factors
- GridSearchCV: Tests many parameter combinations (8-24 combinations)
- Cross-Validation: Each combination tested 5 times
- Dataset Size: Larger clusters take longer to train
- Model Complexity: SVM typically slower than XGBoost
Optimization Tips
```python
# XGBoost uses all cores
XGBClassifier(objective='binary:logistic', n_jobs=-1)

# GridSearchCV can also parallelize across candidates and folds
GridSearchCV(..., n_jobs=-1)  # add this parameter
```
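A quick end-to-end check of the parallel setting on synthetic data; the dataset size and the reduced grid are chosen only to keep the run fast, and are not the project's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic binary-classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(),
    {"kernel": ["rbf", "sigmoid"], "C": [0.1, 0.5, 1.0]},
    cv=5,
    n_jobs=-1,  # parallelize the 6 candidates x 5 folds = 30 fits
)
grid.fit(X, y)
print(grid.best_params_)  # the winning combination depends on the data
```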
Next Steps
After model selection:
- Best model for each cluster is saved to disk
- Models are ready for deployment and prediction
- Review AUC scores to assess model quality
- Consider retraining if AUC scores are below acceptable thresholds
Summary
The model selection process:
- Trains two algorithms (XGBoost and SVM) per cluster
- Uses GridSearchCV with 5-fold cross-validation for hyperparameter tuning
- Compares models using AUC score
- Selects and saves the best performing model for each cluster
- Logs all parameters and scores for transparency and reproducibility