Overview
The Model_Finder class compares two machine learning algorithms (XGBoost and SVM) for each cluster and selects the best performer based on AUC score. Both models undergo hyperparameter tuning using GridSearchCV.
Model_Finder Class
Implemented in best_model_finder/tuner.py:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score


class Model_Finder:
    def __init__(self, file_object, logger_object):
        self.file_object = file_object
        self.logger_object = logger_object
        self.sv_classifier = SVC()
        self.xgb = XGBClassifier(objective='binary:logistic', n_jobs=-1)
```
Model Selection Pipeline
1. **Train XGBoost**: tune hyperparameters with GridSearchCV and train the XGBoost model
2. **Evaluate XGBoost**: calculate the AUC score on the test set
3. **Train SVM**: tune hyperparameters with GridSearchCV and train the SVM model
4. **Evaluate SVM**: calculate the AUC score on the test set
5. **Compare Models**: select the model with the higher AUC score
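The comparison step above can be sketched as a small helper that scores each fitted model on the test set and keeps the winner. This is an illustration only; `select_best_by_auc` is a hypothetical name, not part of the project code:

```python
from sklearn.metrics import roc_auc_score


def select_best_by_auc(models, test_x, test_y):
    """Return (name, model) for the fitted model with the highest test-set AUC.

    `models` is a dict mapping a model name to a fitted estimator.
    """
    best_name, best_model, best_score = None, None, -1.0
    for name, model in models.items():
        score = roc_auc_score(test_y, model.predict(test_x))
        if score > best_score:
            best_name, best_model, best_score = name, model, score
    return best_name, best_model
```

The project's `get_best_model()` (shown later) does the same thing inline for exactly two models.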
XGBoost Hyperparameter Tuning
Parameter Grid
```python
def get_best_params_for_xgboost(self, train_x, train_y):
    self.logger_object.log(self.file_object,
                           'Entered the get_best_params_for_xgboost method of the Model_Finder class')
    try:
        # Initializing with different combinations of parameters
        self.param_grid_xgboost = {
            "n_estimators": [100, 130],
            "criterion": ['gini', 'entropy'],
            "max_depth": range(8, 10, 1)  # [8, 9]
        }
```
**n_estimators** (number of boosting rounds):
- 100: faster training, may underfit
- 130: more iterations, better fit

**criterion** (split quality measure):
- gini: Gini impurity (faster)
- entropy: information gain (more computationally expensive)

Note: criterion is a scikit-learn decision-tree parameter; XGBClassifier does not actually use it, and recent xgboost versions warn that such unknown parameters may be ignored.

**max_depth** (maximum tree depth):
- 8: shallower trees, less overfitting
- 9: deeper trees, more complex patterns
GridSearchCV for XGBoost
```python
        # Creating an object of the Grid Search class
        self.grid = GridSearchCV(
            XGBClassifier(objective='binary:logistic'),
            self.param_grid_xgboost,
            verbose=3,
            cv=5
        )
        # Finding the best parameters
        self.grid.fit(train_x, train_y)

        # Extracting the best parameters
        self.criterion = self.grid.best_params_['criterion']
        self.max_depth = self.grid.best_params_['max_depth']
        self.n_estimators = self.grid.best_params_['n_estimators']

        # Creating a new model with the best parameters
        self.xgb = XGBClassifier(
            criterion=self.criterion,
            max_depth=self.max_depth,
            n_estimators=self.n_estimators,
            n_jobs=-1
        )
        # Training the new model
        self.xgb.fit(train_x, train_y)

        self.logger_object.log(self.file_object,
                               'XGBoost best params: ' + str(self.grid.best_params_) +
                               '. Exited the get_best_params_for_xgboost method of the Model_Finder class')
        return self.xgb
    except Exception as e:
        self.logger_object.log(self.file_object,
                               'Exception occurred in get_best_params_for_xgboost method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
```
GridSearchCV Configuration:
- cv=5: 5-fold cross-validation for robust parameter selection
- verbose=3: detailed logging of grid search progress
- objective='binary:logistic': binary classification task
- n_jobs=-1: use all CPU cores for parallel training

GridSearchCV tests all combinations: 2 × 2 × 2 = 8 parameter combinations × 5 folds = 40 model fits.
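The combination count can be verified directly with scikit-learn's ParameterGrid, which enumerates the same candidates that GridSearchCV will fit:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in get_best_params_for_xgboost above
param_grid_xgboost = {
    "n_estimators": [100, 130],
    "criterion": ['gini', 'entropy'],
    "max_depth": range(8, 10, 1),  # [8, 9]
}

n_combinations = len(list(ParameterGrid(param_grid_xgboost)))
print(n_combinations)      # 2 * 2 * 2 = 8 candidates
print(n_combinations * 5)  # 8 candidates * 5 folds = 40 fits
```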
SVM Hyperparameter Tuning
Parameter Grid
```python
def get_best_params_for_svm(self, train_x, train_y):
    self.logger_object.log(self.file_object,
                           'Entered the get_best_params_for_svm method of the Model_Finder class')
    try:
        # Initializing with different combinations of parameters
        self.param_grid = {
            "kernel": ['rbf', 'sigmoid'],
            "C": [0.1, 0.5, 1.0],
            "random_state": [0, 100, 200, 300]
        }
```
**kernel** (kernel function for transforming the input space):
- rbf: radial basis function (good for non-linear patterns)
- sigmoid: sigmoid kernel (neural-network-like)

**C** (regularization parameter, controls overfitting):
- 0.1: strong regularization (simpler model)
- 0.5: moderate regularization
- 1.0: weak regularization (more complex model)

**random_state** (random seed):
- Note: random_state is a seed rather than a capacity parameter; grid-searching over it amounts to picking a lucky seed and does not genuinely improve robustness.
GridSearchCV for SVM
```python
        # Creating an object of the Grid Search class
        self.grid = GridSearchCV(
            estimator=self.sv_classifier,
            param_grid=self.param_grid,
            cv=5,
            verbose=3
        )
        # Finding the best parameters
        self.grid.fit(train_x, train_y)

        # Extracting the best parameters
        self.kernel = self.grid.best_params_['kernel']
        self.C = self.grid.best_params_['C']
        self.random_state = self.grid.best_params_['random_state']

        # Creating a new model with the best parameters
        self.sv_classifier = SVC(
            kernel=self.kernel,
            C=self.C,
            random_state=self.random_state
        )
        # Training the new model
        self.sv_classifier.fit(train_x, train_y)

        self.logger_object.log(self.file_object,
                               'SVM best params: ' + str(self.grid.best_params_) +
                               '. Exited the get_best_params_for_svm method of the Model_Finder class')
        return self.sv_classifier
    except Exception as e:
        self.logger_object.log(self.file_object,
                               'Exception occurred in get_best_params_for_svm method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
```
GridSearchCV Configuration:
- cv=5: 5-fold cross-validation
- verbose=3: detailed progress logging

GridSearchCV tests all combinations: 2 × 3 × 4 = 24 parameter combinations × 5 folds = 120 model fits.
Model Comparison
After training both models, compare their AUC scores:
```python
def get_best_model(self, train_x, train_y, test_x, test_y):
    self.logger_object.log(self.file_object,
                           'Entered the get_best_model method of the Model_Finder class')
    try:
        # Create best model for XGBoost
        self.xgboost = self.get_best_params_for_xgboost(train_x, train_y)
        self.prediction_xgboost = self.xgboost.predict(test_x)

        # Calculate XGBoost score
        if len(test_y.unique()) == 1:
            # If there is only one label in y, use accuracy instead of AUC
            self.xgboost_score = accuracy_score(test_y, self.prediction_xgboost)
            self.logger_object.log(self.file_object,
                                   'Accuracy for XGBoost: ' + str(self.xgboost_score))
        else:
            # Use AUC score for normal cases
            self.xgboost_score = roc_auc_score(test_y, self.prediction_xgboost)
            self.logger_object.log(self.file_object,
                                   'AUC for XGBoost: ' + str(self.xgboost_score))

        # Create best model for SVM
        self.svm = self.get_best_params_for_svm(train_x, train_y)
        self.prediction_svm = self.svm.predict(test_x)

        # Calculate SVM score
        if len(test_y.unique()) == 1:
            self.svm_score = accuracy_score(test_y, self.prediction_svm)
            self.logger_object.log(self.file_object,
                                   'Accuracy for SVM: ' + str(self.svm_score))
        else:
            self.svm_score = roc_auc_score(test_y, self.prediction_svm)
            self.logger_object.log(self.file_object,
                                   'AUC for SVM: ' + str(self.svm_score))

        # Comparing the two models
        if self.svm_score < self.xgboost_score:
            return 'XGBoost', self.xgboost
        else:
            return 'SVM', self.sv_classifier
    except Exception as e:
        self.logger_object.log(self.file_object,
                               'Exception occurred in get_best_model method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
```
AUC Score vs Accuracy
**Special Case Handling:** If the test set contains only one class (all fraud or all legitimate), the system falls back to accuracy because AUC is undefined when only a single class is present.

```python
if len(test_y.unique()) == 1:
    score = accuracy_score(test_y, predictions)  # AUC undefined: fall back to accuracy
else:
    score = roc_auc_score(test_y, predictions)
```
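The fallback logic can be isolated into a small standalone helper. This is a sketch; `score_predictions` is a hypothetical name, not a function in the project:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score


def score_predictions(test_y, predictions):
    """Return AUC when both classes are present, otherwise fall back to accuracy."""
    if len(pd.Series(test_y).unique()) == 1:
        # roc_auc_score raises ValueError with a single class, so use accuracy
        return accuracy_score(test_y, predictions)
    return roc_auc_score(test_y, predictions)
```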
Evaluation Metrics
AUC (Area Under ROC Curve)
Primary Metric: Used for model comparison
- Range: 0.0 to 1.0
- Interpretation:
  - 0.5 = Random guessing
  - 0.7-0.8 = Acceptable
  - 0.8-0.9 = Excellent
  - 0.9+ = Outstanding
- Advantage: Threshold-independent, evaluates across all classification thresholds

Note: the implementation above passes hard predict() labels to roc_auc_score; computing AUC from probability scores instead (e.g. predict_proba(test_x)[:, 1]) is what makes the metric truly threshold-independent.
Why AUC for Fraud Detection?
- Class Imbalance: Fraud cases are rare; AUC handles imbalanced datasets well
- Threshold Flexibility: Allows adjusting sensitivity vs. specificity based on business needs
- Comprehensive: Evaluates model performance across all possible thresholds
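To see why accuracy alone is misleading under class imbalance, consider a useless classifier that never flags fraud (the numbers below are illustrative only):

```python
from sklearn.metrics import roc_auc_score, accuracy_score

# 95 legitimate transactions, 5 fraudulent ones
y_true = [0] * 95 + [1] * 5

# A model that always predicts "legitimate"
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(roc_auc_score(y_true, y_pred))   # 0.5  -- no better than random guessing
```

AUC exposes the model as worthless, while accuracy rewards it for matching the majority class.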
Model Selection Example
Cluster 0 Results

```
XGBoost Training:
├─ Best Params: {n_estimators: 130, criterion: 'gini', max_depth: 9}
├─ Training Time: 45s
└─ AUC Score: 0.94

SVM Training:
├─ Best Params: {kernel: 'rbf', C: 1.0, random_state: 100}
├─ Training Time: 78s
└─ AUC Score: 0.89

🏆 Winner: XGBoost (saved as XGBoost0)
```

Cluster 1 Results

```
XGBoost Training:
├─ Best Params: {n_estimators: 100, criterion: 'entropy', max_depth: 8}
├─ Training Time: 38s
└─ AUC Score: 0.88

SVM Training:
├─ Best Params: {kernel: 'rbf', C: 0.5, random_state: 200}
├─ Training Time: 65s
└─ AUC Score: 0.91

🏆 Winner: SVM (saved as SVM1)
```
Hyperparameter Tuning Comparison
| Model | Parameters Tested | Combinations | CV Folds | Total Fits |
|---|---|---|---|---|
| XGBoost | 3 (n_estimators, criterion, max_depth) | 8 | 5 | 40 |
| SVM | 3 (kernel, C, random_state) | 24 | 5 | 120 |
Grid Search Progress
With verbose=3, you'll see detailed output:

```
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.847
[CV 2/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.852
[CV 3/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.839
...
```
Best Practices
- **Cross-Validation**: use 5-fold CV to ensure parameter selection is robust
- **Multiple Algorithms**: compare different algorithms (XGBoost vs SVM) rather than assuming one is best
- **Comprehensive Grid**: test multiple parameter values to explore the parameter space
- **Log Everything**: record best parameters and scores for reproducibility and debugging
Model Selection Output
The get_best_model() method returns:
```python
best_model_name, best_model = model_finder.get_best_model(
    x_train, y_train, x_test, y_test
)
# Returns:
# best_model_name: 'XGBoost' or 'SVM'
# best_model: trained model object ready for prediction
```

This is then saved with the cluster ID:

```python
file_op.save_model(best_model, best_model_name + str(cluster_id))
# Examples: 'XGBoost0', 'SVM1', 'XGBoost2'
```
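save_model belongs to the project's file-operations utility. A minimal equivalent using pickle might look like the sketch below; the models directory, the .sav extension, and the function body are assumptions, not the project's actual implementation:

```python
import os
import pickle


def save_model(model, filename, model_dir="models"):
    """Pickle a trained model to model_dir/<filename>.sav (sketch only).

    The directory layout and extension are assumed, not taken from the project.
    """
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename + ".sav")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

# e.g. save_model(best_model, 'XGBoost' + str(0)) would write models/XGBoost0.sav
```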
Training Time Factors
- GridSearchCV: Tests many parameter combinations (8-24 combinations)
- Cross-Validation: Each combination tested 5 times
- Dataset Size: Larger clusters take longer to train
- Model Complexity: SVM typically slower than XGBoost
Optimization Tips
```python
# XGBoost uses all cores
XGBClassifier(objective='binary:logistic', n_jobs=-1)

# GridSearchCV can also parallelize across candidates and folds
GridSearchCV(..., n_jobs=-1)  # add this parameter
```
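A quick end-to-end check of the parallel setting on synthetic data; the dataset size and the reduced grid are chosen only to keep the run fast, and are not the project's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic binary-classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(),
    {"kernel": ["rbf", "sigmoid"], "C": [0.1, 0.5, 1.0]},
    cv=5,
    n_jobs=-1,  # parallelize the 6 candidates x 5 folds = 30 fits
)
grid.fit(X, y)
print(grid.best_params_)  # the winning combination depends on the data
```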
Next Steps
After model selection:
- Best model for each cluster is saved to disk
- Models are ready for deployment and prediction
- Review AUC scores to assess model quality
- Consider retraining if AUC scores are below acceptable thresholds
Summary
The model selection process:
- Trains two algorithms (XGBoost and SVM) per cluster
- Uses GridSearchCV with 5-fold cross-validation for hyperparameter tuning
- Compares models using AUC score
- Selects and saves the best performing model for each cluster
- Logs all parameters and scores for transparency and reproducibility