Overview

The Model_Finder class compares two machine learning algorithms (XGBoost and SVM) for each cluster and selects the best performer based on AUC score. Both models undergo hyperparameter tuning using GridSearchCV.

Model_Finder Class

Implemented in best_model_finder/tuner.py:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

class Model_Finder:
    def __init__(self, file_object, logger_object):
        self.file_object = file_object
        self.logger_object = logger_object
        self.sv_classifier = SVC()
        self.xgb = XGBClassifier(objective='binary:logistic', n_jobs=-1)
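The constructor expects a file handle and a logger object exposing `log(file_object, message)`. A minimal stand-in for local experimentation (the `StubLogger` class here is illustrative, not the project's logger class):

```python
import io

class StubLogger:
    """Illustrative stand-in for the project's logger_object.
    Assumed interface: log(file_object, message)."""
    def log(self, file_object, message):
        file_object.write(message + "\n")

log_file = io.StringIO()   # stands in for an open log file handle
logger = StubLogger()
logger.log(log_file, "Entered the Model_Finder constructor")

# finder = Model_Finder(log_file, logger)   # as used by the training pipeline
```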

Model Selection Pipeline

1. Train XGBoost: tune hyperparameters and train the XGBoost model
2. Evaluate XGBoost: calculate the AUC score on the test set
3. Train SVM: tune hyperparameters and train the SVM model
4. Evaluate SVM: calculate the AUC score on the test set
5. Compare Models: select the model with the higher AUC score
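The comparison in steps 2, 4, and 5 is algorithm-agnostic. A minimal sketch, using two scikit-learn classifiers as stand-ins for the tuned XGBoost and SVM models (`select_best_model` is an illustration, not the repository's code; like the repository, it scores hard label predictions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_best_model(candidates, test_x, test_y):
    """Return (name, model) for the candidate with the highest AUC."""
    scored = []
    for name, model in candidates:
        auc = roc_auc_score(test_y, model.predict(test_x))  # labels, as in the repo code
        scored.append((auc, name, model))
    best_auc, best_name, best_model = max(scored[:2], key=lambda t: t[0])
    return best_name, best_model

# Stand-in candidates (the real pipeline tunes XGBoost and SVM via GridSearchCV)
X, y = make_classification(n_samples=400, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)
candidates = [
    ("LogReg", LogisticRegression(max_iter=1000).fit(train_x, train_y)),
    ("SVM", SVC().fit(train_x, train_y)),
]
name, model = select_best_model(candidates, test_x, test_y)
```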

XGBoost Hyperparameter Tuning

Parameter Grid

def get_best_params_for_xgboost(self, train_x, train_y):
    self.logger_object.log(self.file_object,
                          'Entered the get_best_params_for_xgboost method of the Model_Finder class')
    try:
        # Initializing with different combination of parameters
        self.param_grid_xgboost = {
            "n_estimators": [100, 130],
            "criterion": ['gini', 'entropy'],
            "max_depth": range(8, 10, 1)  # [8, 9]
        }
n_estimators (list): Number of boosting rounds
  • 100: faster training, may underfit
  • 130: more iterations, better fit
criterion (list): Split quality measure
  • gini: Gini impurity (faster)
  • entropy: information gain (more computationally expensive)
  • Note: criterion is a scikit-learn tree parameter rather than a native XGBoost one; recent XGBoost versions warn that it may not be used
max_depth (range): Maximum tree depth
  • 8: shallower trees, less overfitting
  • 9: deeper trees, more complex patterns

GridSearchCV for XGBoost

        # Creating an object of the Grid Search class
        self.grid = GridSearchCV(
            XGBClassifier(objective='binary:logistic'),
            self.param_grid_xgboost,
            verbose=3,
            cv=5
        )
        
        # Finding the best parameters
        self.grid.fit(train_x, train_y)
        
        # Extracting the best parameters
        self.criterion = self.grid.best_params_['criterion']
        self.max_depth = self.grid.best_params_['max_depth']
        self.n_estimators = self.grid.best_params_['n_estimators']
        
        # Creating a new model with the best parameters
        self.xgb = XGBClassifier(
            criterion=self.criterion,
            max_depth=self.max_depth,
            n_estimators=self.n_estimators,
            n_jobs=-1
        )
        
        # Training the new model
        self.xgb.fit(train_x, train_y)
        
        self.logger_object.log(self.file_object,
                              'XGBoost best params: ' + str(self.grid.best_params_) + 
                              '. Exited the get_best_params_for_xgboost method of the Model_Finder class')
        return self.xgb
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in get_best_params_for_xgboost method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
GridSearchCV Configuration:
  • cv=5: 5-fold cross-validation for robust parameter selection
  • verbose=3: detailed logging of grid search progress
  • objective='binary:logistic': binary classification task
  • n_jobs=-1: use all CPU cores for parallel training
GridSearchCV tests all combinations: 2 × 2 × 2 = 8 parameter combinations × 5 folds = 40 model fits.

SVM Hyperparameter Tuning

Parameter Grid

def get_best_params_for_svm(self, train_x, train_y):
    self.logger_object.log(self.file_object, 
                          'Entered the get_best_params_for_svm method of the Model_Finder class')
    try:
        # Initializing with different combination of parameters
        self.param_grid = {
            "kernel": ['rbf', 'sigmoid'],
            "C": [0.1, 0.5, 1.0],
            "random_state": [0, 100, 200, 300]
        }
kernel (list): Kernel function for transforming the input space
  • rbf: Radial Basis Function (good for non-linear patterns)
  • sigmoid: Sigmoid kernel (neural-network-like)
C (list): Regularization parameter (controls overfitting)
  • 0.1: strong regularization (simpler model)
  • 0.5: moderate regularization
  • 1.0: weak regularization (more complex model)
random_state (list): Random seed
  • Note: grid-searching a random seed is unusual; the seed does not change model capacity, so tuning it selects a lucky seed rather than a genuinely better model

GridSearchCV for SVM

        # Creating an object of the Grid Search class
        self.grid = GridSearchCV(
            estimator=self.sv_classifier,
            param_grid=self.param_grid,
            cv=5,
            verbose=3
        )
        
        # Finding the best parameters
        self.grid.fit(train_x, train_y)
        
        # Extracting the best parameters
        self.kernel = self.grid.best_params_['kernel']
        self.C = self.grid.best_params_['C']
        self.random_state = self.grid.best_params_['random_state']
        
        # Creating a new model with the best parameters
        self.sv_classifier = SVC(
            kernel=self.kernel,
            C=self.C,
            random_state=self.random_state
        )
        
        # Training the new model
        self.sv_classifier.fit(train_x, train_y)
        
        self.logger_object.log(self.file_object,
                              'SVM best params: ' + str(self.grid.best_params_) + 
                              '. Exited the get_best_params_for_svm method of the Model_Finder class')
        return self.sv_classifier
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occurred in get_best_params_for_svm method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()
GridSearchCV Configuration:
  • cv=5: 5-fold cross-validation
  • verbose=3: Detailed progress logging
GridSearchCV tests all combinations: 2 × 3 × 4 = 24 parameter combinations × 5 folds = 120 model fits

Model Comparison

After training both models, compare their AUC scores:
def get_best_model(self, train_x, train_y, test_x, test_y):
    self.logger_object.log(self.file_object,
                          'Entered the get_best_model method of the Model_Finder class')
    
    try:
        # Create best model for XGBoost
        self.xgboost = self.get_best_params_for_xgboost(train_x, train_y)
        self.prediction_xgboost = self.xgboost.predict(test_x)
        
        # Calculate XGBoost score
        if len(test_y.unique()) == 1:
            # If there is only one label in y, use accuracy instead of AUC
            self.xgboost_score = accuracy_score(test_y, self.prediction_xgboost)
            self.logger_object.log(self.file_object, 
                                  'Accuracy for XGBoost:' + str(self.xgboost_score))
        else:
            # Use AUC score for normal cases
            self.xgboost_score = roc_auc_score(test_y, self.prediction_xgboost)
            self.logger_object.log(self.file_object, 
                                  'AUC for XGBoost:' + str(self.xgboost_score))
        
        # Create best model for SVM
        self.svm = self.get_best_params_for_svm(train_x, train_y)
        self.prediction_svm = self.svm.predict(test_x)
        
        # Calculate SVM score
        if len(test_y.unique()) == 1:
            self.svm_score = accuracy_score(test_y, self.prediction_svm)
            self.logger_object.log(self.file_object, 
                                  'Accuracy for SVM:' + str(self.svm_score))
        else:
            self.svm_score = roc_auc_score(test_y, self.prediction_svm)
            self.logger_object.log(self.file_object, 
                                  'AUC for SVM:' + str(self.svm_score))
        
        # Comparing the two models
        if(self.svm_score < self.xgboost_score):
            return 'XGBoost', self.xgboost
        else:
            return 'SVM', self.sv_classifier
        
    except Exception as e:
        self.logger_object.log(self.file_object,
                              'Exception occured in get_best_model method of the Model_Finder class. Exception message: ' + str(e))
        raise Exception()

AUC Score vs Accuracy

Special Case Handling: If the test set contains only one class (all fraud or all legitimate), the system falls back to accuracy score, because AUC is undefined when y_true holds a single class. Note also that the code scores hard label predictions from predict(); passing probabilities from predict_proba() to roc_auc_score would give a more informative, threshold-independent AUC.
if len(test_y.unique()) == 1:
    # Use accuracy_score
else:
    # Use roc_auc_score

Evaluation Metrics

AUC (Area Under ROC Curve)

Primary Metric: Used for model comparison
  • Range: 0.0 to 1.0
  • Interpretation:
    • 0.5 = Random guessing
    • 0.7-0.8 = Acceptable
    • 0.8-0.9 = Excellent
    • 0.9+ = Outstanding
  • Advantage: Threshold-independent, evaluates across all classification thresholds

Why AUC for Fraud Detection?

  1. Class Imbalance: Fraud cases are rare; AUC handles imbalanced datasets well
  2. Threshold Flexibility: Allows adjusting sensitivity vs. specificity based on business needs
  3. Comprehensive: Evaluates model performance across all possible thresholds
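Point 1 is easy to demonstrate: on imbalanced data, accuracy rewards a model that never flags fraud, while AUC exposes it as no better than chance. A small illustrative example:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# 95 legitimate transactions, 5 fraudulent ones
y_true = [0] * 95 + [1] * 5
always_legit = [0] * 100   # a "classifier" that never flags fraud

print(accuracy_score(y_true, always_legit))  # 0.95 - looks great
print(roc_auc_score(y_true, always_legit))   # 0.5  - no better than chance
```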

Model Selection Example

Cluster 0 Results

XGBoost Training:
├─ Best Params: {n_estimators: 130, criterion: 'gini', max_depth: 9}
├─ Training Time: 45s
└─ AUC Score: 0.94

SVM Training:
├─ Best Params: {kernel: 'rbf', C: 1.0, random_state: 100}
├─ Training Time: 78s
└─ AUC Score: 0.89

🏆 Winner: XGBoost (saved as XGBoost0)

Cluster 1 Results

XGBoost Training:
├─ Best Params: {n_estimators: 100, criterion: 'entropy', max_depth: 8}
├─ Training Time: 38s
└─ AUC Score: 0.88

SVM Training:
├─ Best Params: {kernel: 'rbf', C: 0.5, random_state: 200}
├─ Training Time: 65s
└─ AUC Score: 0.91

🏆 Winner: SVM (saved as SVM1)

Hyperparameter Tuning Comparison

| Model   | Parameters Tested                      | Combinations | CV Folds | Total Fits |
|---------|----------------------------------------|--------------|----------|------------|
| XGBoost | 3 (n_estimators, criterion, max_depth) | 8            | 5        | 40         |
| SVM     | 3 (kernel, C, random_state)            | 24           | 5        | 120        |
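The combination counts can be verified with scikit-learn's ParameterGrid, using the two grids defined earlier in this section:

```python
from sklearn.model_selection import ParameterGrid

xgb_grid = {
    "n_estimators": [100, 130],
    "criterion": ['gini', 'entropy'],
    "max_depth": range(8, 10, 1),
}
svm_grid = {
    "kernel": ['rbf', 'sigmoid'],
    "C": [0.1, 0.5, 1.0],
    "random_state": [0, 100, 200, 300],
}

print(len(ParameterGrid(xgb_grid)))  # 8  -> 8 * 5 folds = 40 fits
print(len(ParameterGrid(svm_grid)))  # 24 -> 24 * 5 folds = 120 fits
```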

Grid Search Progress

With verbose=3, you’ll see detailed output:
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.847
[CV 2/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.852
[CV 3/5] END ...C=0.1, kernel=rbf, random_state=0;, score=0.839
...

Best Practices

1. Cross-Validation: use 5-fold CV to ensure parameter selection is robust
2. Multiple Algorithms: compare different algorithms (XGBoost vs SVM) rather than assuming one is best
3. Comprehensive Grid: test multiple parameter values to explore the parameter space
4. Log Everything: record best parameters and scores for reproducibility and debugging

Model Selection Output

The get_best_model() method returns:
best_model_name, best_model = model_finder.get_best_model(
    x_train, y_train, x_test, y_test
)

# Returns:
# best_model_name: 'XGBoost' or 'SVM'
# best_model: Trained model object ready for prediction
This is then saved with cluster ID:
file_op.save_model(best_model, best_model_name + str(cluster_id))
# Examples: 'XGBoost0', 'SVM1', 'XGBoost2'
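The repository's save_model helper is not shown in this section. A minimal sketch of what such a helper might look like, using plain pickle; the models/&lt;name&gt;/&lt;name&gt;.sav directory layout and extension are assumptions, not the project's confirmed behavior:

```python
import os
import pickle

def save_model(model, filename, model_dir="models"):
    """Persist a trained model as <model_dir>/<filename>/<filename>.sav (assumed layout)."""
    path = os.path.join(model_dir, filename)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, filename + ".sav"), "wb") as f:
        pickle.dump(model, f)

def load_model(filename, model_dir="models"):
    """Load a model previously stored by save_model."""
    with open(os.path.join(model_dir, filename, filename + ".sav"), "rb") as f:
        return pickle.load(f)

# save_model(best_model, best_model_name + str(cluster_id))  # e.g. 'XGBoost0'
```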

Performance Considerations

Training Time Factors

  • GridSearchCV: Tests many parameter combinations (8-24 combinations)
  • Cross-Validation: Each combination tested 5 times
  • Dataset Size: Larger clusters take longer to train
  • Model Complexity: SVM typically slower than XGBoost

Optimization Tips

# XGBoost uses all cores
XGBClassifier(objective='binary:logistic', n_jobs=-1)

# GridSearchCV can parallelize as well
GridSearchCV(..., n_jobs=-1)  # Add this parameter
# Caution: setting n_jobs=-1 on both can oversubscribe CPU cores

Next Steps

After model selection:
  1. Best model for each cluster is saved to disk
  2. Models are ready for deployment and prediction
  3. Review AUC scores to assess model quality
  4. Consider retraining if AUC scores are below acceptable thresholds

Summary

The model selection process:
  • Trains two algorithms (XGBoost and SVM) per cluster
  • Uses GridSearchCV with 5-fold cross-validation for hyperparameter tuning
  • Compares models using AUC score
  • Selects and saves the best performing model for each cluster
  • Logs all parameters and scores for transparency and reproducibility
