ThreatDetect’s prediction engine is a serialized model package loaded from AI_Model_Code/insider_threat_model.pkl. The package bundles a trained XGBoost classifier, an Isolation Forest anomaly detector, a SHAP TreeExplainer, fitted preprocessing objects, and a tuned probability threshold: everything needed to reproduce the training-time pipeline at inference time without re-fitting any model.
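A minimal sketch of loading such a package is shown here. It also covers an older pickle that lacks shap_explainer by rebuilding the explainer from xgb_model. The function name load_model_package and the build_explainer parameter are hypothetical and are not taken from the ThreatDetect source; in practice build_explainer would presumably be shap.TreeExplainer.

```python
import pickle
from typing import Any, Callable, Dict


def load_model_package(path: str, build_explainer: Callable[[Any], Any]) -> Dict[str, Any]:
    """Load the pickled package and make sure it carries a SHAP explainer.

    `build_explainer` is a hypothetical injection point (an assumption of this
    sketch); it receives the stored classifier and returns an explainer.
    """
    # Only unpickle files from a trusted source: pickle can run code on load.
    with open(path, "rb") as fh:
        model_package = pickle.load(fh)

    # Rebuild the explainer when the key is missing or holds None.
    if model_package.get("shap_explainer") is None:
        model_package["shap_explainer"] = build_explainer(model_package["xgb_model"])

    return model_package
```

Unpickling the real file requires the libraries that produced its objects (XGBoost, scikit-learn, SHAP) to be importable. A call might look like load_model_package("AI_Model_Code/insider_threat_model.pkl", shap.TreeExplainer).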
Model package structure
The pickle file deserializes into a Python dictionary. The keys referenced on this page are xgb_model, iso_forest, shap_explainer, and best_threshold; the package also holds the fitted preprocessing objects.
When load_model() detects that shap_explainer is absent from the pickle (for example, if the file was saved before SHAP was added), it reconstructs the explainer on the fly with shap.TreeExplainer(model_package['xgb_model']).
XGBoost classifier
XGBoost (xgb_model) is a gradient-boosted decision tree classifier trained on labeled employee records, each marked as malicious or normal. Gradient boosting is effective on tabular security data because it handles mixed feature types (binary flags, counts, ratios), captures non-linear interactions between features, and produces probability scores that can be thresholded at inference time.
At inference, xgb_model.predict_proba(x_append)[:, 1] returns the probability that each record belongs to the positive (malicious) class. The augmented feature matrix x_append includes the Isolation Forest score as its final column, so the classifier has learned to use anomaly information as a feature rather than treating it as an independent signal.
Isolation Forest
The Isolation Forest (iso_forest) is an unsupervised anomaly detector that scores each record by how easily it can be isolated from the rest of the dataset using random splits. Records that require fewer splits to isolate (i.e., they are very different from their neighbors) receive lower decision_function scores, indicating higher anomalousness.
In the ThreatDetect pipeline, the Isolation Forest plays a specific role: it does not classify records independently. Instead, its per-record score is computed and appended as the final column of the feature matrix, producing the augmented matrix x_append.
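A minimal sketch of that augmentation step follows. It assumes the feature matrix is numeric and has already passed through the fitted preprocessing objects. The function name append_anomaly_score is hypothetical. The only method it relies on is decision_function, which scikit-learn's IsolationForest provides.

```python
import numpy as np


def append_anomaly_score(X, iso_forest):
    """Return (x_append, scores): X with one extra anomaly-score column."""
    X = np.asarray(X, dtype=float)
    # decision_function: lower (more negative) scores mean more anomalous records.
    scores = np.asarray(iso_forest.decision_function(X), dtype=float)
    # Append the score as the final column of the feature matrix.
    x_append = np.column_stack([X, scores])
    return x_append, scores
```

The returned scores array holds the same values as the last column of x_append, so it can feed a results column such as Anomaly_Score without recomputing anything.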
x_append is what XGBoost receives during prediction. The Isolation Forest score is also surfaced to users as the Anomaly_Score column in results output.
SHAP TreeExplainer
The shap_explainer is a shap.TreeExplainer that wraps xgb_model. TreeExplainer computes exact SHAP values for tree-based models efficiently, making it suitable for real-time per-instance explanations in the Streamlit interface.
Global explanations
For organizational (batch) analysis, SHAP values are computed over a random sample of up to 100 records and displayed as a summary_plot.
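The sampling step can be sketched as follows. This is an illustration only: the function name sample_record_indices and the seed parameter are hypothetical, and the way ThreatDetect actually draws its sample is not shown in this page.

```python
import random
from typing import List, Optional


def sample_record_indices(n_records: int, max_records: int = 100,
                          seed: Optional[int] = None) -> List[int]:
    """Pick up to `max_records` distinct row indices uniformly at random."""
    if n_records < 0 or max_records < 0:
        raise ValueError("n_records and max_records must be non-negative")
    # Small batches are used in full; larger ones are capped at max_records.
    k = min(n_records, max_records)
    rng = random.Random(seed)
    # Sampling without replacement, returned in ascending row order.
    return sorted(rng.sample(range(n_records), k))
```

With a NumPy feature matrix, the sampled rows would be x_append[indices]. Passing those rows to shap_explainer.shap_values and the result to shap.summary_plot would then produce the global plot.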
Per-instance explanations
For individual records, the TreeExplainer returns SHAP values for both classes. ThreatDetect uses the class 1 (malicious) values.
Probability threshold
best_threshold is a float stored in the model package that was tuned during training (typically on a precision-recall curve) to balance false positive and false negative rates for the specific dataset and threat definition used.
At inference, the threshold is applied directly to the malicious-class probability returned by xgb_model.predict_proba.
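A minimal sketch of that thresholding step is given here. The function name apply_threshold is hypothetical. Treating a probability exactly equal to best_threshold as malicious (>= rather than >) is an assumption, since this page does not state which comparison ThreatDetect uses.

```python
from typing import Iterable, List


def apply_threshold(probabilities: Iterable[float], best_threshold: float) -> List[int]:
    """Label each record 1 (malicious) or 0 (normal) from its class-1 probability."""
    # Assumption: a probability equal to the threshold counts as malicious.
    return [int(p >= best_threshold) for p in probabilities]
```

The probabilities argument would be the output of xgb_model.predict_proba(x_append)[:, 1], and best_threshold the float stored in the model package.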