Training the insider threat detection model

ThreatDetect uses a stacked ensemble that combines unsupervised anomaly detection with gradient-boosted classification. The Isolation Forest produces an anomaly score that is appended directly to the XGBoost feature matrix, letting the classifier learn how to weight that signal alongside the original behavioural features.

The pre-trained model is already included at AI_Model_Code/insider_threat_model.pkl. You can run predictions immediately without retraining.

Training pipeline

The pipeline executes the following stages in order. Each fitted object is stored in the model package so that inference-time data is transformed identically to the training data.

Categorical encoding

Each categorical column (employee_campus) is encoded with a sklearn.preprocessing.LabelEncoder fitted on the training set. The fitted encoders are saved in the label_encoders key of the model package.
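The encoding step can be sketched as follows. This is a minimal illustration, not the notebook's code: the campus values and the names train_columns and new_columns are hypothetical, and plain lists stand in for the real dataset columns.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column data; the campus values are illustrative only.
train_columns = {"employee_campus": ["Campus A", "Campus B", "Campus A", "Campus C"]}
new_columns = {"employee_campus": ["Campus C", "Campus A"]}

cat_cols = ["employee_campus"]
label_encoders = {}
encoded_train = {}

for col in cat_cols:
    encoder = LabelEncoder()
    # Fit on the training set only, then keep the encoder for inference.
    encoded_train[col] = encoder.fit_transform(train_columns[col])
    label_encoders[col] = encoder

# At inference time, reuse the stored encoder instead of fitting a new one.
encoded_new = {
    col: label_encoders[col].transform(new_columns[col]) for col in cat_cols
}
```

Note that LabelEncoder.transform raises a ValueError for a label it did not see during fitting, so new data containing an unknown category has to be handled before this step.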

Numeric scaling

All numeric and binary columns are scaled with a sklearn.preprocessing.StandardScaler fitted on the training set. The fitted scaler is saved in the scaler key.
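A minimal sketch of the scaling step, assuming a small made-up feature array (x_train_raw and x_new_raw are hypothetical names). The point it illustrates is that the scaler is fitted once on training data and only applied, never refitted, to new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric and binary features: two columns, three training rows.
x_train_raw = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
x_new_raw = np.array([[2.0, 0.0]])

scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from the training data.
x_train_scaled = scaler.fit_transform(x_train_raw)
# transform reuses the training statistics; it does not refit on the new rows.
x_new_scaled = scaler.transform(x_new_raw)
```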

Isolation Forest training

A sklearn.ensemble.IsolationForest is trained on the scaled training features. It is fitted without the labels: it partitions the data with random splits and treats observations that can be isolated in few splits as anomalous. This makes it sensitive to outliers that may not be represented in the labelled training data.
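As a rough illustration of unsupervised fitting, the sketch below trains an Isolation Forest on synthetic data and scores one typical and one extreme row. The data, n_estimators, and random_state are assumptions for the example and are not the project's settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled training features (200 rows, 4 columns).
x_train = rng.normal(size=(200, 4))

# Illustrative hyperparameters, not the values used by the project.
iso_forest = IsolationForest(n_estimators=100, random_state=0)
iso_forest.fit(x_train)  # no labels are passed

# Score one row near the centre of the data and one far outside it.
typical = np.zeros((1, 4))
extreme = np.full((1, 4), 8.0)
scores = iso_forest.decision_function(np.vstack([typical, extreme]))
# In scikit-learn, a lower decision_function value means a more anomalous row.
```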

Anomaly score augmentation

IsolationForest.decision_function is called on each split (train, validation, and test), and the resulting score column is appended to the feature matrix. Lower scores indicate more anomalous rows. This augmented matrix is the input to XGBoost.

import numpy as np
# Score each training row, then append the score as the last feature column.
train_anomaly = iso_forest.decision_function(x_train).reshape(-1, 1)
x_append_train = np.hstack((x_train, train_anomaly))
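The same augmentation applies to the validation and test splits. One way to keep the three calls consistent is a small helper, sketched here on synthetic data; append_anomaly_score, x_val, x_test, x_append_val, and x_append_test are hypothetical names introduced for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def append_anomaly_score(iso_forest, x):
    """Return x with the Isolation Forest decision score as a final column."""
    score = iso_forest.decision_function(x).reshape(-1, 1)
    return np.hstack((x, score))


rng = np.random.default_rng(0)
# Synthetic stand-ins for the scaled splits.
x_train = rng.normal(size=(100, 3))
x_val = rng.normal(size=(20, 3))
x_test = rng.normal(size=(20, 3))

# The forest is fitted on the training split only.
iso_forest = IsolationForest(random_state=0).fit(x_train)

x_append_train = append_anomaly_score(iso_forest, x_train)
x_append_val = append_anomaly_score(iso_forest, x_val)
x_append_test = append_anomaly_score(iso_forest, x_test)
```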

XGBoost training

An xgboost.XGBClassifier is trained on the augmented feature matrix. XGBoost learns how to weight the Isolation Forest anomaly score alongside the behavioural features, rather than combining the two models with a fixed rule.

Threshold selection

A classification threshold (best_threshold) is derived from the precision-recall curve on the validation set and stored in the model package. See the evaluation page for details on how the threshold is chosen.
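The selection criterion itself is described on the evaluation page, not here. As one common approach, the sketch below picks the threshold that maximises F1 along the precision-recall curve; the F1 criterion, the toy labels, and the toy probabilities are assumptions for illustration and may differ from what the notebook does.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy validation labels and predicted positive-class probabilities.
y_val = np.array([0, 0, 0, 1, 0, 1, 1, 1])
val_proba = np.array([0.05, 0.10, 0.35, 0.40, 0.45, 0.70, 0.80, 0.95])

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)

# precision and recall have one more entry than thresholds; drop the final pair.
p, r = precision[:-1], recall[:-1]
# Small constant guards against division by zero when p + r is 0.
f1 = 2 * p * r / (p + r + 1e-12)

# Assumed criterion: the threshold with the highest F1 on the validation set.
best_threshold = float(thresholds[np.argmax(f1)])

# Apply the threshold to turn probabilities into class predictions.
y_pred = (val_proba >= best_threshold).astype(int)
```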

Model package

The complete pipeline is serialised to AI_Model_Code/insider_threat_model.pkl as a Python dictionary. Loading this single file gives you every object needed for inference.

| Key | Contents |
| --- | --- |
| `xgb_model` | Fitted `XGBClassifier` |
| `shap_explainer` | `shap.TreeExplainer` built on the fitted XGBoost model |
| `iso_forest` | Fitted `IsolationForest` |
| `scaler` | Fitted `StandardScaler` |
| `label_encoders` | Dictionary of fitted `LabelEncoder` objects, keyed by column name |
| `feature_columns` | Ordered list of all feature names, including the appended anomaly score |
| `cat_cols` | List of categorical column names |
| `bin_cols` | List of binary column names |
| `num_cols` | List of numeric column names (raw and engineered) |
| `best_threshold` | Tuned classification threshold stored as a float |
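A loader for this package might look like the sketch below. It assumes the file was written with the standard pickle module (if it was written with joblib, joblib.load would be the counterpart); load_model_package and EXPECTED_KEYS are hypothetical names, and the key check simply mirrors the table above.

```python
import pickle
from pathlib import Path

# Keys documented for the model package.
EXPECTED_KEYS = {
    "xgb_model",
    "shap_explainer",
    "iso_forest",
    "scaler",
    "label_encoders",
    "feature_columns",
    "cat_cols",
    "bin_cols",
    "num_cols",
    "best_threshold",
}


def load_model_package(path):
    """Load the pickled model package and check that the documented keys exist."""
    with Path(path).open("rb") as fh:
        package = pickle.load(fh)
    missing = EXPECTED_KEYS - set(package)
    if missing:
        raise KeyError(f"Model package is missing keys: {sorted(missing)}")
    return package
```

Calling load_model_package("AI_Model_Code/insider_threat_model.pkl") would then return the dictionary, from which package["best_threshold"] and the fitted objects can be read. Unpickling the real file requires scikit-learn, xgboost, and shap to be importable in the same environment.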

The Streamlit app loads this package once at startup with @st.cache_resource and reuses it across all prediction requests.

To retrain the model on new data, replace the source CSV with your own labelled dataset, then open AI_Model_Code/cos720_ai_model_FINAL.ipynb and run all cells in order. The notebook handles all preprocessing, hyperparameter search, evaluation, and serialisation steps.
