Model evaluation: metrics and performance

Evaluating an insider threat model requires careful attention to class imbalance and operational trade-offs. A model that predicts “benign” for every record can achieve high accuracy while catching zero threats. ThreatDetect therefore tracks multiple complementary metrics and tunes its classification threshold explicitly against recall on the validation set, accepting a modest precision cost in exchange for fewer missed threats.

Classification metrics

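The following metrics are computed with sklearn.metrics on the held-out test set after all training and threshold tuning are complete. The one exception is cross-validation, which uses cross_val_score from sklearn.model_selection and runs on the full dataset.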

Metric	Function	What it measures
Accuracy	`accuracy_score`	Fraction of all records correctly classified
Precision	`precision_score`	Fraction of threat predictions that are correct
Recall	`recall_score`	Fraction of actual threats that are detected
F1	`f1_score`	Harmonic mean of precision and recall
Confusion matrix	`confusion_matrix`	Counts of true positives, false positives, true negatives, and false negatives
Cross-validation	`cross_val_score`	Mean F1 over k folds on the full dataset, used to check for overfitting

Insider threat datasets are heavily class-imbalanced — malicious records typically represent a small minority of all observations. In this context, F1 is the primary metric because it penalises both missing threats (low recall) and generating excessive false alarms (low precision). Accuracy alone is misleading on imbalanced data.
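
To make the accuracy trap concrete, the sketch below derives the same metrics from confusion-matrix counts. The counts are hypothetical (1,000 records, 10 of them threats) and are not ThreatDetect results; the helper name classification_metrics is illustrative only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical dataset: 1,000 records, 10 of which are threats.
# A model that always predicts "benign" finds no threats at all.
always_benign = classification_metrics(tp=0, fp=0, tn=990, fn=10)

# A model that flags 20 records and catches 8 of the 10 threats.
useful_model = classification_metrics(tp=8, fp=12, tn=978, fn=2)

print(always_benign)
print(useful_model)
```

The always-benign model reaches 0.99 accuracy with an F1 of 0.0, while the second model has slightly lower accuracy (0.986) but an F1 of about 0.53.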

Precision-recall curve and threshold selection

By default, a binary classifier predicts the positive class when the predicted probability exceeds 0.5. For insider threat detection, recall matters more than precision: a missed threat is a worse outcome than an unnecessary investigation. ThreatDetect therefore selects best_threshold by evaluating the precision-recall curve on the validation set and choosing the operating point that best balances the two under that priority.

from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay

precision, recall, thresholds = precision_recall_curve(y_val, val_probabilities)

# Select the threshold that maximises the balance between precision and recall
# for the specific dataset and threat definition
best_threshold = thresholds[...]  # chosen from the curve

PrecisionRecallDisplay is used to plot the curve and visually confirm the selected operating point. The chosen threshold is stored in the best_threshold key of the model package and applied at inference time — predictions use this threshold rather than 0.5.

Changing best_threshold shifts the precision-recall trade-off for every subsequent prediction. Raising the threshold generally increases precision and reduces recall; lowering it does the opposite. Do not modify the stored threshold without re-evaluating on a representative held-out set.
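
The exact selection rule is not specified here, so the following is only one possible sketch: it picks the point on the precision-recall curve with the highest F-beta score, where beta > 1 weights recall more heavily than precision. The F-beta criterion, the default beta=2.0, the helper name select_threshold, and the toy data are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def select_threshold(y_true, probabilities, beta=2.0):
    """Return (threshold, score) for the highest F-beta point on the PR curve.

    Assumption: F-beta with beta > 1 is used as the recall-weighted criterion;
    this is an illustrative choice, not the documented ThreatDetect rule.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, probabilities)
    # precision and recall have one more entry than thresholds (the final
    # point, precision=1 and recall=0, has no threshold), so drop it.
    precision, recall = precision[:-1], recall[:-1]
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    f_beta = np.zeros_like(denominator)
    valid = denominator > 0
    f_beta[valid] = (
        (1 + beta_sq) * precision[valid] * recall[valid] / denominator[valid]
    )
    best_index = int(np.argmax(f_beta))
    return float(thresholds[best_index]), float(f_beta[best_index])


# Toy validation data (hypothetical): 3 threats among 8 records
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1])
val_probabilities = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

best_threshold, best_score = select_threshold(y_val, val_probabilities)

# At inference time, compare probabilities against the chosen threshold, not 0.5
predictions = (val_probabilities >= best_threshold).astype(int)
```

On this toy data the selected threshold is 0.35, which catches all three threats at the cost of one false alarm. A cut-off of 0.5 would have missed the threat scored at 0.35.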

Cross-validation

cross_val_score is run with scoring="f1" across k stratified folds on the full dataset after the final model is selected. Each fold refits a fresh copy of the model on the remaining folds and scores it on the held-out fold. Consistent F1 across folds indicates that performance is stable across data splits and that the model has not overfitted to the specific train/validation/test partition used during development.
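
A minimal sketch of this check is shown below. It runs on synthetic data with a scikit-learn gradient boosting classifier as a stand-in for the XGBoost model; the dataset, the estimator, and the choice of five folds are assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (roughly 10% positives) standing in for the real dataset
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stand-in estimator; the real pipeline would pass the selected XGBoost model
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Stratified folds keep the threat/benign ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"F1 per fold: {scores.round(3)}")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A standard deviation that is large relative to the mean is a hint that performance depends on the particular split.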

SHAP global summary

ThreatDetect uses shap.TreeExplainer to compute SHAP values for the XGBoost model. The global summary plot ranks features by their mean absolute SHAP value across the test set, which reveals which behavioural signals the model relies on most heavily. Reviewing this plot during model validation helps confirm that the model is responding to genuinely suspicious behaviour rather than spurious correlations in the training data. The shap_explainer object is pre-built and stored in the model package. At inference time, SHAP values are computed per-record and surfaced in the Streamlit UI to explain individual predictions.

import shap

explainer = model_package["shap_explainer"]
shap_values = explainer.shap_values(x_test_augmented)

# Global summary — mean absolute SHAP value per feature
shap.summary_plot(shap_values, x_test_augmented, plot_type="bar")
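
The quantity behind the bar plot can also be inspected directly. The sketch below ranks features by mean absolute SHAP value with plain NumPy; the helper name rank_features_by_mean_abs_shap, the toy matrix, and the feature names are hypothetical and chosen only for illustration.

```python
import numpy as np


def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    shap_values = np.asarray(shap_values, dtype=float)
    # Average the magnitude of each feature's contribution over all records,
    # so positive and negative contributions do not cancel out
    mean_abs = np.abs(shap_values).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]
    return [(feature_names[i], float(mean_abs[i])) for i in order]


# Hypothetical SHAP matrix: 3 records x 3 features (names are illustrative)
toy_shap_values = [
    [0.5, -0.1, 0.0],
    [-0.5, 0.2, 0.1],
    [0.2, -0.3, 0.1],
]
toy_features = ["after_hours_logins", "usb_events", "anomaly_score"]

ranking = rank_features_by_mean_abs_shap(toy_shap_values, toy_features)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

The first feature ranks highest even though its contributions of +0.5 and -0.5 would cancel in a plain average, which is why the absolute value is taken first.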

SHAP values are computed on the augmented feature matrix: the 26 behavioural and engineered features plus the Isolation Forest anomaly score column. The anomaly score often appears among the top features, which suggests that unsupervised outlier detection adds signal beyond the labelled features alone.
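
How the augmented matrix is assembled is not shown here, so the following is a sketch under stated assumptions: the detector is fitted on training features only, score_samples supplies the anomaly score, and the score is appended as the last column. The random placeholder data and the helper name augment_with_anomaly_score are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Placeholder data: 26 features per record (random values, not real behaviour)
x_train = rng.normal(size=(200, 26))
x_test = rng.normal(size=(50, 26))

# Assumption: the unsupervised detector is fitted on training features only
iso_forest = IsolationForest(n_estimators=100, random_state=0).fit(x_train)


def augment_with_anomaly_score(features, detector):
    """Append the detector's anomaly score as one extra column."""
    # score_samples returns lower values for more anomalous records
    scores = detector.score_samples(features)
    return np.column_stack([features, scores])


x_test_augmented = augment_with_anomaly_score(x_test, iso_forest)
print(x_test_augmented.shape)  # (50, 27)
```

Whatever the real pipeline does, the same augmentation has to be applied at training, evaluation, and inference time so that the column order matches what the model and the SHAP explainer expect.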
