ThreatDetect’s prediction engine is a serialized model package loaded from AI_Model_Code/insider_threat_model.pkl. The package bundles a trained XGBoost classifier, an Isolation Forest anomaly detector, a SHAP TreeExplainer, fitted preprocessing objects, and a tuned probability threshold — everything needed to reproduce the training-time pipeline at inference time without re-fitting any model.

Model package structure

The pickle file deserializes into a Python dictionary with the following keys:
model_package = {
    "xgb_model":              ...,  # trained XGBoost classifier
    "shap_explainer":         ...,  # shap.TreeExplainer wrapping xgb_model
    "cat_cols":               [...], # categorical column names (e.g. employee_campus)
    "bin_cols":               [...], # binary column names
    "num_cols":               [...], # numeric column names
    "feature_columns":        [...], # ordered list of all features; last entry is
                                     # isolation_forest_anomaly_score
    "label_encoders":         {...}, # dict of LabelEncoder, one per cat_col
    "scaler":                 ...,  # StandardScaler fitted on num_cols
    "iso_forest":             ...,  # fitted IsolationForest
    "best_threshold":         ...,  # float; probability cutoff for Malicious label
}
When load_model() detects that shap_explainer is absent from the pickle (for example, if the file was saved before SHAP was added), it reconstructs the explainer on the fly with shap.TreeExplainer(model_package['xgb_model']).
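The fallback described above can be sketched as follows. This is a minimal illustration, not ThreatDetect's actual load_model(): the names load_package, EXPECTED_KEYS, and build_explainer are hypothetical, and the explainer factory is passed in as an argument so the sketch runs without SHAP installed. The demo pickles a stand-in dictionary rather than a real model.

```python
import io
import pickle

# Keys the package is documented to contain (shap_explainer is optional
# because it can be rebuilt from xgb_model).
EXPECTED_KEYS = (
    "xgb_model", "cat_cols", "bin_cols", "num_cols", "feature_columns",
    "label_encoders", "scaler", "iso_forest", "best_threshold",
)


def load_package(stream, build_explainer):
    """Unpickle a model package, check its keys, and rebuild the explainer if absent.

    `build_explainer` is a hypothetical hook: a callable that takes the
    classifier and returns an explainer object.
    """
    package = pickle.load(stream)  # only unpickle files from a trusted source
    missing = [key for key in EXPECTED_KEYS if key not in package]
    if missing:
        raise KeyError(f"model package is missing keys: {missing}")
    if package.get("shap_explainer") is None:
        package["shap_explainer"] = build_explainer(package["xgb_model"])
    return package


# Demo with a stand-in package that was "saved before SHAP was added".
demo = {key: None for key in EXPECTED_KEYS}
demo["xgb_model"] = "stand-in model"
buffer = io.BytesIO(pickle.dumps(demo))
loaded = load_package(buffer, build_explainer=lambda m: ("explainer for", m))
```

With a real package, the factory would be shap.TreeExplainer and the stream would be the opened insider_threat_model.pkl file.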

XGBoost classifier

XGBoost (xgb_model) is a gradient-boosted decision tree classifier trained on labeled employee records, each marked as malicious or normal. Gradient boosting is effective on tabular security data because it handles mixed feature types (binary flags, counts, ratios), captures non-linear interactions between features, and produces probability outputs that can be thresholded at inference time. At inference, xgb_model.predict_proba(x_append)[:, 1] returns the probability that each record belongs to the positive (malicious) class. The augmented feature matrix x_append includes the Isolation Forest score as its final column, so the classifier can use anomaly information as a feature rather than treating it as an independent signal.

Isolation Forest

The Isolation Forest (iso_forest) is an unsupervised anomaly detector that scores each record by how easily it can be isolated from the rest of the dataset using random splits. Records that require fewer splits to isolate (i.e., they are very different from their neighbors) receive lower decision_function scores, indicating higher anomalousness. In the ThreatDetect pipeline, the Isolation Forest plays a specific role: it does not classify records independently. Instead, its per-record score is computed and appended to the feature matrix before the XGBoost model receives it:
feature_cols = model_package['feature_columns'][:-1]   # all features except iso score
x_for_iso = df[feature_cols].to_numpy()
iso_scores = model_package['iso_forest'].decision_function(x_for_iso).reshape(-1, 1)
x_append = np.hstack((x_for_iso, iso_scores))         # iso score becomes last column
The combined matrix x_append is what XGBoost receives during prediction. The Isolation Forest score is also surfaced to users as the Anomaly_Score column in results output.
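To make the sign convention concrete, here is a small self-contained sketch using scikit-learn's IsolationForest on synthetic data. The data, the hyperparameters, and the choice to fit on the same rows being scored are all assumptions for illustration; in ThreatDetect the forest is already fitted and comes from the model package.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: 200 ordinary rows plus one obvious outlier.
rng = np.random.default_rng(0)
x_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
x_outlier = np.array([[8.0, 8.0, 8.0]])
x_for_iso = np.vstack((x_normal, x_outlier))

# Hypothetical hyperparameters; the real forest is loaded, not trained here.
iso_forest = IsolationForest(n_estimators=100, random_state=0).fit(x_for_iso)

# decision_function: lower values mean "easier to isolate", i.e. more anomalous.
iso_scores = iso_forest.decision_function(x_for_iso).reshape(-1, 1)

# Append the score as the final column, mirroring the pipeline step above.
x_append = np.hstack((x_for_iso, iso_scores))

outlier_score = iso_scores[-1, 0]
typical_score = np.median(iso_scores[:-1, 0])
```

Here outlier_score comes out lower than typical_score, and x_append has one more column than x_for_iso.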

SHAP TreeExplainer

The shap_explainer is a shap.TreeExplainer that wraps xgb_model. TreeExplainer computes exact SHAP values for tree-based models efficiently, making it suitable for real-time per-instance explanations in the Streamlit interface.

Global explanations

For organizational (batch) analysis, SHAP values are computed over a random sample of up to 100 records and displayed as a summary_plot:
if len(x_append) > 100:
    sample_idx = np.random.choice(len(x_append), 100, replace=False)
    x_sample = x_append[sample_idx]
else:
    x_sample = x_append

shap_values_sample = explainer.shap_values(x_sample)
shap.summary_plot(shap_values_sample, x_sample, feature_names=full_feature_names, max_display=15)
The summary plot shows the direction and magnitude of each feature’s influence across all sampled records, helping you identify which behavioral signals drive malicious predictions across your organization.
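The ordering in a summary plot reflects each feature's overall SHAP magnitude across the sampled records. The sketch below reproduces that ranking numerically with NumPy, which can be handy for logging or tests where a plot is not available. The function name rank_by_mean_abs_shap and the toy values are hypothetical, and the first two feature names are invented for illustration.

```python
import numpy as np


def rank_by_mean_abs_shap(shap_matrix, feature_names, max_display=15):
    """Return (feature, mean |SHAP|) pairs, largest first.

    shap_matrix has one row per sampled record and one column per feature.
    """
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    importance = np.abs(shap_matrix).mean(axis=0)
    order = np.argsort(importance)[::-1][:max_display]
    return [(feature_names[i], float(importance[i])) for i in order]


# Toy SHAP values for two records and three features (not real output).
toy_shap = np.array([
    [0.4, -0.1, 0.0],
    [-0.6, 0.2, 0.1],
])
toy_names = ["after_hours_logins", "usb_transfers", "isolation_forest_anomaly_score"]
ranking = rank_by_mean_abs_shap(toy_shap, toy_names)
```

Taking absolute values before averaging matters: a feature that pushes some records strongly toward Malicious and others strongly toward Normal would otherwise cancel out and look unimportant.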

Per-instance explanations

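For individual records, TreeExplainer may return SHAP values either as a two-element list (one array per class) or as a single array, depending on the model and the installed SHAP version. When both classes are present, ThreatDetect uses the class 1 (malicious) values: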
shap_vals_list = explainer.shap_values(x_append_row.reshape(1, -1))
# binary classification: shap_vals_list is [class0_shap, class1_shap]
if isinstance(shap_vals_list, list) and len(shap_vals_list) == 2:
    shap_values = shap_vals_list[1][0]   # class 1 (malicious), first row
else:
    shap_values = shap_vals_list[0]      # fallback for single-output models
Features with positive SHAP values push the prediction toward Malicious. Features with negative SHAP values push it toward Normal. The top 10 contributors are displayed as a horizontal bar chart, and the top 5 in each direction are rendered as human-readable bullets in the risk explanation panel.
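Selecting the strongest contributors in each direction could look like the sketch below. It assumes shap_values is a 1-D array for one record, as produced by the snippet above. The helper name split_top_contributors and the toy feature names are hypothetical, and this is not necessarily how ThreatDetect builds its bullets.

```python
import numpy as np


def split_top_contributors(shap_values, feature_names, k=5):
    """Return the top-k features pushing toward Malicious and toward Normal."""
    shap_values = np.asarray(shap_values, dtype=float)
    order = np.argsort(shap_values)  # ascending: most negative first

    # Largest positive values first; skip anything that is not positive.
    toward_malicious = [
        (feature_names[i], float(shap_values[i]))
        for i in order[::-1][:k]
        if shap_values[i] > 0
    ]
    # Most negative values first; skip anything that is not negative.
    toward_normal = [
        (feature_names[i], float(shap_values[i]))
        for i in order[:k]
        if shap_values[i] < 0
    ]
    return toward_malicious, toward_normal


# Toy SHAP values for a single record (not real output).
toy_values = np.array([0.30, -0.20, 0.05, -0.01])
toy_names = ["f_a", "f_b", "f_c", "f_d"]
up, down = split_top_contributors(toy_values, toy_names, k=2)
```

Filtering by sign keeps a record with fewer than k positive (or negative) contributions from listing features under the wrong heading.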

Probability threshold

best_threshold is a float stored in the model package. It was tuned during training (typically on a precision-recall curve) to balance false positives against false negatives for the specific dataset and threat definition used. At inference, the threshold is applied directly:
threshold = model_package['best_threshold']
probs = model_package['xgb_model'].predict_proba(x_append)[:, 1]
preds = (probs >= threshold).astype(int)   # 1 = Malicious, 0 = Normal
The threshold is not hardcoded — it is read from the model package at load time. If you retrain the model with a different threshold, the updated value will be picked up automatically without any code changes.
Do not compare risk probabilities across model versions unless you also account for the best_threshold. A probability of 0.6 may mean very different things depending on how the threshold was calibrated.
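A short sketch of why the threshold matters when comparing versions: the same probabilities yield different labels under two cutoffs. The threshold values 0.50 and 0.70 and the helper label_records are invented for illustration and do not come from any real ThreatDetect model.

```python
import numpy as np


def label_records(probs, threshold):
    """Apply a probability cutoff: 1 = Malicious, 0 = Normal."""
    return (np.asarray(probs, dtype=float) >= threshold).astype(int)


probs = np.array([0.35, 0.60, 0.82])

# Hypothetical thresholds for two model versions.
labels_v1 = label_records(probs, threshold=0.50)
labels_v2 = label_records(probs, threshold=0.70)

# Distance from the cutoff is one rough way to relate a probability
# to the version that produced it.
margin_v1 = probs - 0.50
margin_v2 = probs - 0.70
```

The record scored 0.60 is labeled Malicious under the first threshold and Normal under the second, so a raw probability is only meaningful alongside the best_threshold of the package that produced it.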
