ThreatDetect transforms raw employee behavioral records into a numeric feature matrix through three sequential steps: validation of required input columns, derivation of five composite ratio features, and preprocessing (encoding and scaling) using objects fitted during training. The resulting matrix, augmented with an Isolation Forest anomaly score, is what the XGBoost model receives at inference time.
Raw input features
The following columns must be present in every CSV you submit. These are the raw signals collected from employee records, access logs, and HR data. The model cannot run if any of these columns are missing.

| Column | Type | Description |
|---|---|---|
| employee_campus | Categorical | Campus or office location of the employee |
| has_criminal_record | Binary (0/1) | Whether the employee has a known criminal record |
| is_contractor | Binary (0/1) | Whether the employee is a contractor rather than permanent staff |
| has_foreign_citizenship | Binary (0/1) | Whether the employee holds foreign citizenship |
| total_printed_pages | Numeric | Total number of pages printed by the employee |
| num_printed_pages_off_hours | Numeric | Pages printed outside standard working hours |
| total_files_burned | Numeric | Total files copied to removable media (e.g., burned to disc) |
| entry_during_weekend | Binary (0/1) | Whether the employee accessed the premises during a weekend |
| late_exit_flag | Binary (0/1) | Whether the employee has recorded late exits from secured areas |
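The presence check described above can be sketched as follows. This is a minimal illustration, not the project's actual validation code: the helper names `find_missing_columns` and `validate_input` are hypothetical, and the required columns are hard-coded from the table rather than read from the model package.

```python
import pandas as pd

# The nine raw columns from the table above, hard-coded here for illustration.
REQUIRED_RAW_COLUMNS = [
    "employee_campus",
    "has_criminal_record",
    "is_contractor",
    "has_foreign_citizenship",
    "total_printed_pages",
    "num_printed_pages_off_hours",
    "total_files_burned",
    "entry_during_weekend",
    "late_exit_flag",
]


def find_missing_columns(df, required=REQUIRED_RAW_COLUMNS):
    """Return the required columns that are absent from df, in table order."""
    return [col for col in required if col not in df.columns]


def validate_input(df, required=REQUIRED_RAW_COLUMNS):
    """Raise ValueError if any required raw column is missing (hypothetical helper)."""
    missing = find_missing_columns(df, required)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


# A frame that lacks late_exit_flag fails validation.
incomplete = pd.DataFrame(columns=REQUIRED_RAW_COLUMNS[:-1])
print(find_missing_columns(incomplete))
```

Reporting every missing column at once, rather than stopping at the first, lets a submitter fix the CSV in a single pass.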
The model package stores the authoritative list of required columns in feature_columns. The prepare_features function derives the required raw columns by excluding the five engineered features and isolation_forest_anomaly_score from that list.

Engineered features
Five composite features are derived from the raw inputs to capture behavioral ratios and compound risk signals. These features are computed after input validation and before encoding or scaling.

| Feature | Formula | What it captures |
|---|---|---|
| print_ratio | total_printed_pages / num_printed_pages_off_hours | Total printing volume relative to off-hours printing (lower values mean a larger off-hours share) |
| file_ratio | total_files_burned / num_printed_pages_off_hours | File exfiltration activity relative to off-hours printing volume |
| risk_ratio | has_criminal_record + is_contractor + has_foreign_citizenship | Additive background risk score (0–3) |
| access_ratio | num_printed_pages_off_hours × entry_during_weekend | Off-hours printing combined with weekend access |
| afterhrs_ratio | late_exit_flag × num_printed_pages_off_hours | Off-hours printing combined with late exit behavior |
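The formulas in the table translate directly into pandas column arithmetic. The sketch below assumes the raw columns are already validated and numeric; the function name `add_engineered_features` and the sample values are illustrative only.

```python
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with the five engineered columns from the table added."""
    out = df.copy()
    off_hours = out["num_printed_pages_off_hours"]
    # Division by zero yields inf or NaN here; those values are handled in a later step.
    out["print_ratio"] = out["total_printed_pages"] / off_hours
    out["file_ratio"] = out["total_files_burned"] / off_hours
    out["risk_ratio"] = (
        out["has_criminal_record"] + out["is_contractor"] + out["has_foreign_citizenship"]
    )
    out["access_ratio"] = off_hours * out["entry_during_weekend"]
    out["afterhrs_ratio"] = out["late_exit_flag"] * off_hours
    return out


# Made-up sample: the second employee has no off-hours printing.
sample = pd.DataFrame({
    "total_printed_pages": [100, 50],
    "num_printed_pages_off_hours": [20, 0],
    "total_files_burned": [10, 5],
    "has_criminal_record": [0, 1],
    "is_contractor": [1, 0],
    "has_foreign_citizenship": [1, 0],
    "entry_during_weekend": [1, 1],
    "late_exit_flag": [0, 1],
})
features = add_engineered_features(sample)
print(features[["print_ratio", "file_ratio", "risk_ratio", "access_ratio", "afterhrs_ratio"]])
```

In this sample the second row divides by zero, so its print_ratio and file_ratio come out as inf rather than raising an error.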
Handling inf and NaN values
print_ratio and file_ratio divide by num_printed_pages_off_hours, which can be zero. Any inf, -inf, or NaN values produced by the engineered feature calculations are replaced with the column median.
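One way to implement this replacement is sketched below. It assumes the median is taken over the finite values of each column, with infinities converted to NaN first; the page does not specify this detail, and the helper name `clean_engineered` is hypothetical.

```python
import numpy as np
import pandas as pd


def clean_engineered(df, columns):
    """Replace inf, -inf, and NaN in the given columns with each column's median."""
    out = df.copy()
    for col in columns:
        # Assumption: infinities become NaN first, so the median uses finite values only.
        finite = out[col].replace([np.inf, -np.inf], np.nan)
        out[col] = finite.fillna(finite.median())
    return out


# Made-up values containing each kind of problem entry.
raw = pd.DataFrame({
    "print_ratio": [5.0, np.inf, 3.0, np.nan],
    "file_ratio": [0.5, -np.inf, 1.5, 2.5],
})
cleaned = clean_engineered(raw, ["print_ratio", "file_ratio"])
print(cleaned)
```

If a column contains no finite values at all, its median is itself NaN and the column stays NaN, so a production version would need a fallback for that case.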
Preprocessing
After feature engineering, categorical and numeric columns are transformed using objects fitted during model training.

Categorical encoding
Each categorical column listed in cat_cols in the model package (this list includes employee_campus) is encoded with its corresponding LabelEncoder fitted during training.
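A minimal sketch of per-column encoding with scikit-learn follows. The `encoders` dictionary, the helper name `encode_categoricals`, and the campus names are assumptions for illustration; in practice the fitted encoders would be loaded from the model package rather than fitted inline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cat_cols = ["employee_campus"]

# Stand-in for training: fit one encoder per categorical column (campus names are made up).
train = pd.DataFrame({"employee_campus": ["North", "South", "East"]})
encoders = {col: LabelEncoder().fit(train[col]) for col in cat_cols}


def encode_categoricals(df, cat_cols, encoders):
    """Return a copy of df with each categorical column mapped to integer codes."""
    out = df.copy()
    for col in cat_cols:
        # transform() raises ValueError for labels the encoder did not see when fitted.
        out[col] = encoders[col].transform(out[col])
    return out


new = pd.DataFrame({"employee_campus": ["South", "East", "East"]})
encoded = encode_categoricals(new, cat_cols, encoders)
print(encoded["employee_campus"].tolist())  # [2, 0, 0]
```

LabelEncoder assigns codes in sorted label order, so East, North, and South map to 0, 1, and 2. A campus value that was absent at training time raises an error rather than receiving a code, which is worth checking before submitting a CSV.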
Numeric scaling
All numeric columns (stored in num_cols) are transformed using the StandardScaler fitted during training. This ensures that each feature at inference is on the same scale the model was trained on.
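The scaling step can be sketched as below. The scaler is fitted inline only to keep the example self-contained; the two-column `num_cols` list, the helper name `scale_numeric`, and the data values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Shortened, illustrative column list; the real one lives in the model package.
num_cols = ["total_printed_pages", "total_files_burned"]

# Stand-in for training data.
train = pd.DataFrame({
    "total_printed_pages": [10.0, 20.0, 30.0],
    "total_files_burned": [0.0, 2.0, 4.0],
})
scaler = StandardScaler().fit(train[num_cols])


def scale_numeric(df, num_cols, scaler):
    """Apply an already fitted scaler; never refit on inference data."""
    out = df.copy()
    out[num_cols] = scaler.transform(out[num_cols])
    return out


new = pd.DataFrame({
    "total_printed_pages": [20.0, 40.0],
    "total_files_burned": [2.0, 0.0],
})
scaled = scale_numeric(new, num_cols, scaler)
print(scaled)
```

Calling `transform` rather than `fit_transform` is the important part: the means and standard deviations come from the training data, so a value equal to the training mean (20 pages here) maps to 0 regardless of what else is in the submitted batch.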
NaN values in raw numeric columns are filled with the column median to prevent the scaler from failing on missing data.
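A sketch of the median fill is shown below. It assumes the median is computed from the submitted batch itself, which the page does not state explicitly; the helper name `fill_numeric_nans` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_numeric_nans(df, num_cols):
    """Fill NaN in each numeric column with that column's median."""
    out = df.copy()
    # Assumption: medians come from the current batch, not from training data.
    medians = out[num_cols].median()
    out[num_cols] = out[num_cols].fillna(medians)
    return out


batch = pd.DataFrame({
    "total_printed_pages": [10.0, np.nan, 30.0],
    "total_files_burned": [1.0, 3.0, np.nan],
})
filled = fill_numeric_nans(batch, ["total_printed_pages", "total_files_burned"])
print(filled)
```

Passing a Series of medians to `fillna` fills each column with its own value, so no explicit loop over columns is needed.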
Final feature matrix
After preprocessing, the feature matrix is assembled in the order defined by feature_columns (excluding the last entry, isolation_forest_anomaly_score). The Isolation Forest then scores the matrix, and its output is appended as the final column before the XGBoost model receives the data.
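The assembly step might look like the sketch below. Using `decision_function` as the anomaly score is an assumption, since the page does not say which IsolationForest method produces the score. The two-feature `feature_columns` list, the helper name `assemble_matrix`, and the random training data are for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def assemble_matrix(df, feature_columns, iso_forest):
    """Order columns per feature_columns and append the anomaly score last."""
    base_cols = feature_columns[:-1]  # everything except the anomaly score
    X = df[base_cols].copy()          # selects and reorders; extra columns are dropped
    # Assumption: decision_function is the scoring method (lower means more anomalous).
    X[feature_columns[-1]] = iso_forest.decision_function(X[base_cols])
    return X[feature_columns]


# Shortened, hypothetical feature list and a forest fitted on random stand-in data.
feature_columns = ["print_ratio", "risk_ratio", "isolation_forest_anomaly_score"]
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(50, 2)), columns=feature_columns[:-1])
iso_forest = IsolationForest(random_state=0).fit(train)

# Columns arrive in a different order and with an unrelated extra column.
batch = pd.DataFrame({
    "risk_ratio": [0.0, 5.0],
    "print_ratio": [0.1, 8.0],
    "extra": [1, 2],
})
X = assemble_matrix(batch, feature_columns, iso_forest)
print(X.columns.tolist())
```

The resulting frame has its columns in feature_columns order, which is the layout the XGBoost model would expect under these assumptions.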