ThreatDetect uses a stacked ensemble that combines unsupervised anomaly detection with gradient-boosted classification. The Isolation Forest produces an anomaly score that is appended directly to the XGBoost feature matrix, letting the classifier learn how to weight that signal alongside the original behavioural features. The pre-trained model ships with the application, so no retraining is required to run predictions.
The pre-trained model is already included at AI_Model_Code/insider_threat_model.pkl. You can run predictions immediately without retraining.

Training pipeline
The pipeline executes the following stages in order. Each fitted object is stored in the model package so that inference-time data is transformed identically to the training data.

Categorical encoding
Each categorical column (employee_campus) is encoded with a sklearn.preprocessing.LabelEncoder fitted on the training set. The fitted encoders are saved in the label_encoders key of the model package.

Numeric scaling
All numeric and binary columns are scaled with a sklearn.preprocessing.StandardScaler fitted on the training set. The fitted scaler is saved in the scaler key.

Isolation Forest training
A sklearn.ensemble.IsolationForest is trained on the scaled training features. It learns what typical behaviour looks like without using the labels, treating points that can be isolated with few random splits as anomalous. This makes it sensitive to outliers that may not appear in the labelled training data.

Anomaly score augmentation
IsolationForest.decision_function is called on each split (train, validation, and test), and the resulting score column is appended to the feature matrix. This augmented matrix is the input to XGBoost.

XGBoost training
An xgboost.XGBClassifier is trained on the augmented feature matrix. XGBoost learns how to weight the Isolation Forest anomaly score alongside the behavioural features, rather than combining the two models with a fixed rule.

Threshold selection
A classification threshold (best_threshold) is derived from the precision-recall curve on the validation set and stored in the model package. See the evaluation page for details on how the threshold is chosen.

Model package
The complete pipeline is serialised to AI_Model_Code/insider_threat_model.pkl as a Python dictionary. Loading this single file gives you every object needed for inference.
| Key | Contents |
|---|---|
| xgb_model | Fitted XGBClassifier |
| shap_explainer | shap.TreeExplainer built on the fitted XGBoost model |
| iso_forest | Fitted IsolationForest |
| scaler | Fitted StandardScaler |
| label_encoders | Dictionary of fitted LabelEncoder objects, keyed by column name |
| feature_columns | Ordered list of all feature names, including the appended anomaly score |
| cat_cols | List of categorical column names |
| bin_cols | List of binary column names |
| num_cols | List of numeric column names (raw and engineered) |
| best_threshold | Tuned classification threshold stored as a float |
The application loads the model package with @st.cache_resource and reuses it across all prediction requests.
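
Outside Streamlit, a similar load-once pattern can be approximated with functools.lru_cache from the standard library. The sketch below is a stand-in for illustration, not the application's code: load_package and the demo file are hypothetical, and a small placeholder dictionary is pickled so the example runs without the real .pkl file.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_package(path: str) -> dict:
    """Unpickle the model package once per path; later calls reuse the cached dict."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Placeholder package written to a temporary file for the demo.
demo_path = Path(tempfile.mkdtemp()) / "demo_package.pkl"
demo_path.write_bytes(pickle.dumps({"best_threshold": 0.5, "feature_columns": ["a", "b"]}))

first = load_package(str(demo_path))
second = load_package(str(demo_path))  # served from the cache, no second disk read
print(first is second)
```

Because every caller receives the same dictionary object, the cached package should be treated as read-only. As with any pickle file, only load packages from a source you trust, since unpickling can execute arbitrary code.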