ThreatDetect processes raw employee behavioral data through a four-stage pipeline that produces a risk probability, a binary classification label, and a plain-language explanation for every record. The pipeline is designed to handle organizational-scale CSV uploads or single-employee queries, and every prediction is traceable to the individual features that drove it.
Detection pipeline
Data ingestion
You upload a CSV file containing raw employee and behavioral records. ThreatDetect validates that all required columns are present before proceeding. Missing columns raise a descriptive error that lists exactly which fields are absent, so you can correct your input data without guessing.
The app accepts files through the Organisational Search via CSV page (batch) or the Single Search page (one record at a time).
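The column check described above can be sketched with the Python standard library. This is a minimal illustration, not ThreatDetect's own code: the names in `REQUIRED_COLUMNS` and both function names are hypothetical placeholders, and the real list of required fields is whatever the deployed model expects.

```python
import csv
import io

# Hypothetical required columns, for illustration only.
REQUIRED_COLUMNS = ["employee_id", "department", "pages_printed", "total_sessions"]


def find_missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required columns absent from the CSV header, in order."""
    reader = csv.reader(io.StringIO(csv_text))
    # An empty upload has no header row, so every required column is missing.
    header = [name.strip() for name in next(reader, [])]
    return [col for col in required if col not in header]


def validate_csv(csv_text, required=REQUIRED_COLUMNS):
    """Raise a descriptive error that lists every missing column."""
    missing = find_missing_columns(csv_text, required)
    if missing:
        raise ValueError("Missing required columns: " + ", ".join(missing))
```

Collecting all missing names before raising lets a single error message report every absent field, rather than failing on the first one.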
Feature engineering
Raw input columns are preprocessed and five behavioral ratio features are derived from them. Categorical columns are string-stripped and label-encoded using the encoders saved in the model package. Numeric columns are scaled with the fitted StandardScaler. Engineered features that produce inf or NaN values (for example, a division by zero in print_ratio) are replaced with the column median.
See Feature engineering for the full list of inputs and the exact derivation formulas.
Model inference
The preprocessed feature matrix is passed through two models in sequence:
- Isolation Forest: computes an anomaly score (decision_function) for each record and appends it to the feature matrix as isolation_forest_anomaly_score.
- XGBoost classifier: receives the augmented feature matrix and outputs a probability of malicious behavior for each record.
The XGBoost probability is compared against best_threshold (stored in the model package). Records at or above the threshold are labeled Malicious; records below are labeled Normal. Confidence is reported as the probability assigned to the predicted label: prob for malicious predictions and 1 - prob for normal ones.
Explainability
SHAP TreeExplainer computes feature contributions for each prediction. Positive SHAP values push the prediction toward Malicious; negative values push it toward Normal. ThreatDetect surfaces two levels of explanation:
- Global — a SHAP summary plot over a random sample of up to 100 records, showing which features most consistently influence predictions across the organization.
- Per-instance — a bar chart of the top 10 SHAP contributors for any individual employee record, alongside a human-readable list of the features that increased or reduced their risk score.
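The per-instance explanation can be sketched as a ranking over one record's SHAP values. The sketch below assumes the values have already been computed (for example by a tree explainer) and are available as a mapping from feature name to contribution; the function name and the feature names in the example are hypothetical.

```python
def summarize_contributions(shap_values, top_n=10):
    """Split one record's top SHAP contributors by direction of effect.

    shap_values maps feature name -> SHAP value for a single prediction.
    """
    # Rank by absolute contribution, largest first, and keep the top_n.
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    top = ranked[:top_n]
    # Positive values push toward Malicious, negative values toward Normal.
    increased = [(name, value) for name, value in top if value > 0]
    reduced = [(name, value) for name, value in top if value < 0]
    return increased, reduced


# Hypothetical feature names and values, for illustration only.
example = {"print_ratio": 0.42, "after_hours_logins": 0.10, "tenure_years": -0.25}
increased, reduced = summarize_contributions(example)
```

Ranking by absolute value before splitting keeps strong risk-reducing features in the top 10 instead of letting positive contributors crowd them out.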
Prediction output
Each analyzed record produces four output values:

| Field | Description |
|---|---|
| Prediction | Binary label: Malicious or Normal |
| Risk_Prob | Raw XGBoost probability of malicious behavior (0–1) |
| Anomaly_Score | Isolation Forest decision_function score; lower values indicate more anomalous behavior |
| Confidence | Probability assigned to the predicted label: Risk_Prob for malicious records, 1 - Risk_Prob for normal ones |
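The relationship between Risk_Prob, Prediction, and Confidence in the table above can be sketched in a few lines. The function name is hypothetical, and the threshold of 0.35 is an illustrative value only; the real best_threshold is stored in the model package.

```python
def classify(prob, best_threshold):
    """Map a malicious-behavior probability to (label, confidence)."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0 and 1")
    if prob >= best_threshold:
        # At or above the threshold: Malicious, confidence is prob itself.
        return "Malicious", prob
    # Below the threshold: Normal, confidence is 1 - prob.
    return "Normal", 1.0 - prob


# 0.35 is an illustrative threshold, not the value from the model package.
label, confidence = classify(0.8, best_threshold=0.35)
```

Because the threshold need not be 0.5, a record labeled Malicious can carry a Confidence below 0.5; for instance, a probability of 0.4 with the illustrative threshold above is labeled Malicious with a Confidence of 0.4.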
Ensemble approach
ThreatDetect combines two complementary models to improve detection robustness.
XGBoost classifier
A gradient-boosted tree model trained on labeled insider threat data. It learns non-linear relationships between behavioral features and malicious outcomes, and produces a calibrated probability score.
Isolation Forest
An unsupervised anomaly detector that scores how isolated (unusual) each record is relative to the rest of the dataset. Its output is appended as an additional feature that the XGBoost model uses at inference time.
The XGBoost model was trained with the Isolation Forest score as one of its input features, so at inference time the score is computed first and appended to the feature matrix before XGBoost runs. The two models therefore work in a fixed sequence rather than as a true ensemble vote, and the XGBoost model has learned to weight the anomaly score alongside the other features.
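The augmentation step can be sketched with scikit-learn. This is an illustration under stated assumptions: the feature matrix is synthetic, and the detector is fitted on the spot only to keep the example self-contained, whereas a deployed pipeline would normally load an already fitted detector.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for the preprocessed feature matrix:
# 200 typical rows plus one deliberately extreme row at the end.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)), [[8.0, 8.0, 8.0]]])

iso = IsolationForest(random_state=0).fit(X)
# decision_function returns one score per row; lower means more anomalous.
scores = iso.decision_function(X)

# Append the score as an extra column, playing the role of
# isolation_forest_anomaly_score in the augmented matrix.
X_augmented = np.column_stack([X, scores])
```

A downstream classifier would receive `X_augmented`, and it must have been trained on the same column layout, with the anomaly score in the same position.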