How ThreatDetect detects insider threats

ThreatDetect processes raw employee behavioral data through a four-stage pipeline that produces a risk probability, a binary classification label, and a plain-language explanation for every record. The pipeline is designed to handle organizational-scale CSV uploads or single-employee queries, and every prediction is traceable to the individual features that drove it.

Detection pipeline

Data ingestion

You upload a CSV file containing raw employee and behavioral records. ThreatDetect validates that all required columns are present before proceeding. Missing columns raise a descriptive error that lists exactly which fields are absent, so you can correct your input data without guessing. The app accepts data through the Organisational Search via CSV page (batch) or the Single Search page (one record at a time).
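The column check described above can be sketched with the standard library alone. This is a minimal illustration, not ThreatDetect's actual code: the names in `REQUIRED_COLUMNS` and both function names are hypothetical, since the real required columns are defined by the application.

```python
import csv
import io

# Hypothetical column names, for illustration only; the real required
# columns are defined by ThreatDetect, not by this sketch.
REQUIRED_COLUMNS = ["employee_id", "department", "total_printed_pages"]


def find_missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required columns absent from the CSV header, in order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = [name.strip() for name in (reader.fieldnames or [])]
    return [col for col in required if col not in header]


def validate_columns(csv_text, required=REQUIRED_COLUMNS):
    """Raise a descriptive error that names every missing column."""
    missing = find_missing_columns(csv_text, required)
    if missing:
        raise ValueError("Missing required columns: " + ", ".join(missing))
```

Listing every missing column in one error, rather than failing on the first, lets you fix the input file in a single pass.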

Feature engineering

Raw input columns are preprocessed and five behavioral ratio features are derived from them. Categorical columns are string-stripped and label-encoded using the encoders saved in the model package. Numeric columns are scaled with the fitted StandardScaler. Engineered features that produce inf or NaN values (for example, a division by zero in print_ratio) are replaced with the column median. See Feature engineering for the full list of inputs and the exact derivation formulas.
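As a rough sketch of the inf/NaN handling, the snippet below computes a generic ratio and then replaces non-finite results with the column median. It assumes the median is taken over the finite values only; the function names and the sample numerators and denominators are hypothetical and do not reproduce the actual print_ratio formula.

```python
import math
import statistics


def safe_ratio(numerator, denominator):
    """Divide, yielding inf (or NaN for 0/0) when the denominator is zero."""
    if denominator == 0:
        return math.nan if numerator == 0 else math.copysign(math.inf, numerator)
    return numerator / denominator


def replace_non_finite_with_median(values):
    """Replace inf and NaN entries with the median of the finite entries."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return list(values)  # no finite values to take a median from
    median = statistics.median(finite)
    return [v if math.isfinite(v) else median for v in values]


# Hypothetical raw columns: the second and fourth rows divide by zero.
numerators = [10, 4, 6, 0]
denominators = [100, 0, 60, 0]
ratios = [safe_ratio(n, d) for n, d in zip(numerators, denominators)]
cleaned = replace_non_finite_with_median(ratios)
```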

Model inference

The preprocessed feature matrix is passed through two models in sequence:

Isolation Forest — computes an anomaly score (decision_function) for each record and appends it to the feature matrix as isolation_forest_anomaly_score.
XGBoost classifier — receives the augmented feature matrix and outputs a probability of malicious behavior for each record.

The probability is compared against best_threshold (stored in the model package). Records at or above the threshold are labeled Malicious; records below are labeled Normal. Confidence is reported as the probability the model assigns to the predicted label: prob for Malicious predictions and 1 - prob for Normal ones.
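The labeling and confidence rule can be written in a few lines. In this sketch the function name `classify` is hypothetical, and the threshold is passed in explicitly because the real best_threshold value lives in the model package.

```python
def classify(prob, threshold):
    """Label one record and report confidence in the predicted label."""
    if prob >= threshold:  # "at or above the threshold" is Malicious
        return "Malicious", prob
    return "Normal", 1 - prob
```

For example, with an illustrative threshold of 0.35, a record scored at 0.8 is labeled Malicious with confidence 0.8, while a record scored at 0.2 is labeled Normal with confidence of about 0.8.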

Explainability

SHAP TreeExplainer computes feature contributions for each prediction. Positive SHAP values push the prediction toward Malicious; negative values push it toward Normal. ThreatDetect surfaces two levels of explanation:

Global — a SHAP summary plot over a random sample of up to 100 records, showing which features most consistently influence predictions across the organization.
Per-instance — a bar chart of the top 10 SHAP contributors for any individual employee record, alongside a human-readable list of the features that increased or reduced their risk score.
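The per-instance explanation can be illustrated without the SHAP library, given the SHAP values for one record as a plain dictionary. This sketch assumes that "top 10" means the ten largest contributions by absolute value; the function name `explain_record` and its output keys are hypothetical.

```python
def explain_record(shap_values, top_n=10):
    """Rank features by absolute SHAP value and split them by direction."""
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    top = ranked[:top_n]
    # Positive values push toward Malicious, negative values toward Normal.
    increased = [name for name, value in top if value > 0]
    reduced = [name for name, value in top if value < 0]
    return {"top": top, "increased_risk": increased, "reduced_risk": reduced}
```

The two name lists correspond to the human-readable summary of features that increased or reduced the risk score, while `top` holds the values a bar chart would plot.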

Prediction output

Each analyzed record produces four output values:

| Field | Description |
| --- | --- |
| `Prediction` | Binary label: `Malicious` or `Normal` |
| `Risk_Prob` | Raw XGBoost probability of malicious behavior (0–1) |
| `Anomaly_Score` | Isolation Forest `decision_function` score; lower values indicate more anomalous behavior |
| `Confidence` | Probability assigned to the predicted label: `Risk_Prob` for malicious records, `1 - Risk_Prob` for normal ones |

These four fields appear in the on-screen results table and in the downloadable CSV produced by batch analysis.

Ensemble approach

ThreatDetect combines two complementary models to improve detection robustness.

XGBoost classifier

A gradient-boosted tree model trained on labeled insider threat data. It learns non-linear relationships between behavioral features and malicious outcomes, and produces a calibrated probability score.

Isolation Forest

An unsupervised anomaly detector that scores how isolated (unusual) each record is relative to the rest of the dataset. Its output is appended as an additional feature that the XGBoost model uses at inference time.

Because the XGBoost model was trained with the Isolation Forest score as one of its input features, the score must be appended to the feature matrix before classification. The two models therefore run in a fixed sequence rather than as a true ensemble vote, and the XGBoost model has learned to weight the anomaly score alongside the other features.
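The fixed ordering of the two stages can be sketched as follows. To keep the example runnable without ML libraries, the two models are replaced by toy stand-in functions; these stand-ins, the function names, and the sample values are hypothetical, and only the `isolation_forest_anomaly_score` column name comes from the pipeline description.

```python
def augment_with_anomaly_score(rows, anomaly_scorer):
    """Return copies of the rows with the anomaly score added as a feature."""
    return [
        {**row, "isolation_forest_anomaly_score": anomaly_scorer(row)}
        for row in rows
    ]


def predict_malicious_probability(rows, anomaly_scorer, classifier):
    """Stage 1 scores anomalies; stage 2 classifies the augmented rows."""
    augmented = augment_with_anomaly_score(rows, anomaly_scorer)
    return [classifier(row) for row in augmented]


# Toy stand-ins, not the real models.
def toy_anomaly_scorer(row):
    return -abs(row["print_ratio"] - 0.25)  # lower means more anomalous


def toy_classifier(row):
    # Reads the appended score, as a classifier trained with it would.
    return min(1.0, max(0.0, -2 * row["isolation_forest_anomaly_score"]))


rows = [{"print_ratio": 0.25}, {"print_ratio": 0.75}]
probs = predict_malicious_probability(rows, toy_anomaly_scorer, toy_classifier)
```

The classifier never sees a row without the appended score, which is why the stages cannot be reordered or run independently.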
