ThreatDetect uses a stacked ensemble that combines unsupervised anomaly detection with gradient-boosted classification. The Isolation Forest produces an anomaly score that is appended directly to the XGBoost feature matrix, letting the classifier learn how to weight that signal alongside the original behavioural features.
The pre-trained model is included at AI_Model_Code/insider_threat_model.pkl, so you can run predictions immediately without retraining.

Training pipeline

The pipeline executes the following stages in order. Each fitted object is stored in the model package so that inference-time data is transformed identically to the training data.
1. Categorical encoding

Each categorical column (employee_campus) is encoded with a sklearn.preprocessing.LabelEncoder fitted on the training set. The fitted encoders are saved in the label_encoders key of the model package.
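A minimal sketch of this step, assuming hypothetical campus values (the real categories are not listed here). It fits one encoder per categorical column on training data and reuses it at inference time.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical training values for the employee_campus column.
train_campus = np.array(["Campus A", "Campus B", "Campus A", "Campus C"])

# Fit one encoder per categorical column, on the training set only.
label_encoders = {}
encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_campus)
label_encoders["employee_campus"] = encoder

# At inference time, reuse the fitted encoder so the integer codes match training.
new_campus = np.array(["Campus C", "Campus A"])
new_encoded = label_encoders["employee_campus"].transform(new_campus)
```

LabelEncoder.transform raises a ValueError for a category it did not see during fitting, so new data may need to be checked against encoder.classes_ first.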
2. Numeric scaling

All numeric and binary columns are scaled with a sklearn.preprocessing.StandardScaler fitted on the training set. The fitted scaler is saved in the scaler key.
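The sketch below illustrates fitting the scaler on the training split and reusing its statistics on another split. The two feature columns and their values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: [logins_per_day, after_hours_flag].
x_train_raw = np.array([[2.0, 0.0], [4.0, 1.0], [6.0, 0.0], [8.0, 1.0]])
x_val_raw = np.array([[5.0, 1.0]])

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train_raw)  # learn mean and std from training data only
x_val = scaler.transform(x_val_raw)          # apply the training mean and std unchanged
```

Calling transform rather than fit_transform on validation, test, and inference data keeps those rows on the same scale the models were trained on.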
3. Isolation Forest training

A sklearn.ensemble.IsolationForest is trained on the scaled training features. It does not use the labels: it partitions the feature space with random splits and scores rows that can be isolated in few splits as anomalous. This makes it sensitive to unusual behaviour that the labelled training data may not cover.
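As a rough illustration, the following fits an Isolation Forest on synthetic data standing in for the scaled features. The hyperparameters shown are assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
x_train = rng.normal(size=(500, 4))  # synthetic stand-in for scaled training features

# Assumed hyperparameters; no labels are passed to fit.
iso_forest = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
iso_forest.fit(x_train)

# decision_function returns lower values for rows that are easier to isolate.
typical_score = iso_forest.decision_function(np.zeros((1, 4)))[0]
extreme_score = iso_forest.decision_function(np.full((1, 4), 6.0))[0]
```

A row far from the bulk of the training data (extreme_score) receives a lower score than a typical row (typical_score).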
4. Anomaly score augmentation

IsolationForest.decision_function is called on each split (train, validation, and test), and the resulting score column is appended to the feature matrix. Lower scores indicate more anomalous rows. This augmented matrix is the input to XGBoost.
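import numpy as np
# Score each training row with the fitted forest, then append the score as an extra column.
train_anomaly = iso_forest.decision_function(x_train).reshape(-1, 1)
x_append_train = np.hstack((x_train, train_anomaly))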
5. XGBoost training

An xgboost.XGBClassifier is trained on the augmented feature matrix. XGBoost learns how to weight the Isolation Forest anomaly score alongside the behavioural features, rather than combining the two models with a fixed rule.
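The sketch below shows the stacking idea end to end on synthetic data. The data, the hyperparameters, and the augment helper are illustrative assumptions. If xgboost is not installed, it falls back to scikit-learn's GradientBoostingClassifier as a stand-in so that the example still runs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

try:
    from xgboost import XGBClassifier as Classifier
except ImportError:
    # Stand-in only, so the sketch runs where xgboost is unavailable.
    from sklearn.ensemble import GradientBoostingClassifier as Classifier

rng = np.random.default_rng(0)
# Synthetic stand-in data: 300 benign rows and 30 shifted "threat" rows.
x_train = np.vstack((rng.normal(0.0, 1.0, size=(300, 3)),
                     rng.normal(3.0, 1.0, size=(30, 3))))
y_train = np.array([0] * 300 + [1] * 30)

iso_forest = IsolationForest(random_state=42).fit(x_train)


def augment(x):
    """Append the Isolation Forest score as the last feature column."""
    return np.hstack((x, iso_forest.decision_function(x).reshape(-1, 1)))


# Hyperparameters are assumptions chosen for illustration.
xgb_model = Classifier(n_estimators=50, max_depth=3, learning_rate=0.1, random_state=42)
xgb_model.fit(augment(x_train), y_train)

# One importance per column; the last entry belongs to the anomaly score.
importances = xgb_model.feature_importances_

benign_prob = xgb_model.predict_proba(augment(np.zeros((1, 3))))[0, 1]
threat_prob = xgb_model.predict_proba(augment(np.full((1, 3), 3.0)))[0, 1]
```

Inspecting the last entry of importances gives a rough indication of how much the classifier relies on the anomaly score relative to the behavioural features.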
6. Threshold selection

A classification threshold (best_threshold) is derived from the precision-recall curve on the validation set and stored in the model package. See the evaluation page for details on how the threshold is chosen.
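One common way to derive a threshold from a precision-recall curve is to pick the point that maximises F1. The sketch below does that on made-up validation scores. Whether ThreatDetect uses F1 or a different criterion is an assumption here; the evaluation page describes the actual rule.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up validation labels and predicted threat probabilities.
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1])
val_proba = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)

# precision and recall have one more entry than thresholds; drop the final point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / (p + r + 1e-12)  # small constant avoids division by zero

# ASSUMPTION: choose the threshold with the highest F1 on the validation set.
best_threshold = float(thresholds[np.argmax(f1)])
val_pred = (val_proba >= best_threshold).astype(int)
```

Storing the result as a plain float matches the best_threshold entry in the model package.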

Model package

The complete pipeline is serialised to AI_Model_Code/insider_threat_model.pkl as a Python dictionary. Loading this single file gives you every object needed for inference.
| Key | Contents |
| --- | --- |
| xgb_model | Fitted XGBClassifier |
| shap_explainer | shap.TreeExplainer built on the fitted XGBoost model |
| iso_forest | Fitted IsolationForest |
| scaler | Fitted StandardScaler |
| label_encoders | Dictionary of fitted LabelEncoder objects, keyed by column name |
| feature_columns | Ordered list of all feature names, including the appended anomaly score |
| cat_cols | List of categorical column names |
| bin_cols | List of binary column names |
| num_cols | List of numeric column names (raw and engineered) |
| best_threshold | Tuned classification threshold stored as a float |
The Streamlit app loads this package once at startup with @st.cache_resource and reuses it across all prediction requests.
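The following sketch shows how the objects in the package could be chained at inference time. The predict_threat helper is hypothetical and rests on several assumptions that are not confirmed here: that the scaler was fitted on the numeric columns followed by the binary columns, that the anomaly score is the last entry of feature_columns, and that the file can be loaded with joblib or pickle.

```python
import numpy as np
import pandas as pd


def predict_threat(package, df):
    """Hypothetical helper: preprocess rows, append the anomaly score, classify."""
    features = pd.DataFrame(index=df.index)

    # Encode categorical columns with the training-time encoders.
    for col in package["cat_cols"]:
        features[col] = package["label_encoders"][col].transform(df[col])

    # ASSUMPTION: the scaler was fitted on num_cols followed by bin_cols.
    scaled_cols = list(package["num_cols"]) + list(package["bin_cols"])
    scaled = package["scaler"].transform(df[scaled_cols].to_numpy(dtype=float))
    for i, col in enumerate(scaled_cols):
        features[col] = scaled[:, i]

    # ASSUMPTION: the anomaly score is the last entry of feature_columns.
    base_cols = list(package["feature_columns"])[:-1]
    x = features[base_cols].to_numpy(dtype=float)

    # Append the Isolation Forest score, as in the training pipeline.
    anomaly = package["iso_forest"].decision_function(x).reshape(-1, 1)
    x_aug = np.hstack((x, anomaly))

    # Probability of the positive class, then the tuned cut-off.
    proba = package["xgb_model"].predict_proba(x_aug)[:, 1]
    return proba, proba >= package["best_threshold"]


# Possible usage, assuming the file was written with joblib:
# import joblib
# package = joblib.load("AI_Model_Code/insider_threat_model.pkl")
# proba, is_threat = predict_threat(package, new_rows)
```

Only load the .pkl file from a source you trust, since unpickling can execute arbitrary code.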
To retrain the model on new data, open AI_Model_Code/cos720_ai_model_FINAL.ipynb and run all cells in order. Replace the source CSV with your own labelled dataset before running. The notebook handles all preprocessing, hyperparameter search, evaluation, and serialisation steps.
