ThreatDetect accepts a single CSV file where each row represents one employee observation period. The pipeline validates column presence, encodes the categorical campus column using the label encoder fitted at training time, scales numeric values with the stored `StandardScaler`, and derives five additional features automatically, so you never need to compute them yourself. Understanding the expected schema helps you catch data issues before they reach the model.
Required input columns
Your CSV must include all nine columns listed below. Column names are case-sensitive and must match exactly. Extra columns are ignored; any missing column raises a `KeyError` before inference begins.
| Column | Type | Description | Valid values |
|---|---|---|---|
| `employee_campus` | string | Physical campus where the employee is based | Any campus value present in the training set (see warning below) |
| `has_criminal_record` | integer | Whether the employee has a known criminal record | 0 or 1 |
| `is_contractor` | integer | Whether the employee is a contractor rather than permanent staff | 0 or 1 |
| `has_foreign_citizenship` | integer | Whether the employee holds citizenship of a foreign country | 0 or 1 |
| `entry_during_weekend` | integer | Whether the employee badged in on a weekend during the observation period | 0 or 1 |
| `late_exit_flag` | integer | Whether the employee exited the facility after normal working hours | 0 or 1 |
| `total_printed_pages` | integer | Total pages printed across the observation period | Non-negative integer |
| `num_printed_pages_off_hours` | integer | Pages printed outside standard business hours | Non-negative integer |
| `total_files_burned` | integer | Files written to removable media during the observation period | Non-negative integer |
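
The schema rules above can be checked before a file ever reaches the pipeline. The following is a minimal pre-flight sketch that uses only the Python standard library. It is not part of ThreatDetect: the function name `find_schema_problems`, the `known_campuses` parameter, and the message wording are all hypothetical, and the list of campuses seen in training is something you would have to supply yourself.

```python
import csv
import io

# Column names come from the table above. Everything else in this sketch
# (function name, parameters, message format) is hypothetical.
BINARY_COLUMNS = [
    "has_criminal_record",
    "is_contractor",
    "has_foreign_citizenship",
    "entry_during_weekend",
    "late_exit_flag",
]
COUNT_COLUMNS = [
    "total_printed_pages",
    "num_printed_pages_off_hours",
    "total_files_burned",
]
REQUIRED_COLUMNS = ["employee_campus", *BINARY_COLUMNS, *COUNT_COLUMNS]


def find_schema_problems(csv_file, known_campuses=None):
    """Return a list of problems found in an open CSV file; empty means none found."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        # Row-level checks are meaningless until every required column exists.
        return [f"missing columns: {missing}"]

    problems = []
    for row in reader:
        line = reader.line_num  # line of the source file this row ended on
        for col in BINARY_COLUMNS:
            value = (row[col] or "").strip()
            if value not in ("0", "1"):
                problems.append(f"line {line}: {col} must be 0 or 1, got {value!r}")
        for col in COUNT_COLUMNS:
            value = (row[col] or "").strip()
            # Digits only: rejects empty cells, minus signs, and decimals.
            if not (value.isascii() and value.isdigit()):
                problems.append(
                    f"line {line}: {col} must be a non-negative integer, got {value!r}"
                )
        campus = row["employee_campus"]
        if known_campuses is not None and campus not in known_campuses:
            problems.append(f"line {line}: employee_campus {campus!r} not in known_campuses")
    return problems
```

To check a file on disk, open it with `newline=""` as the `csv` module recommends and pass the file object in, for example `find_schema_problems(open("observations.csv", newline=""))`, where the filename is a placeholder. An empty list means none of these checks failed; it does not guarantee the pipeline will accept the file.
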
Engineered features
The pipeline derives five features from the raw input columns immediately after loading your CSV. Do not include these columns in your file: the pipeline computes and appends them internally before scaling and prediction.
| Feature | Formula | What it captures |
|---|---|---|
| `print_ratio` | `total_printed_pages / num_printed_pages_off_hours` | Total printing relative to off-hours printing; lower values mean a larger off-hours share |
| `file_ratio` | `total_files_burned / num_printed_pages_off_hours` | File exfiltration relative to off-hours printing activity |
| `risk_ratio` | `has_criminal_record + is_contractor + has_foreign_citizenship` | Aggregate background-risk score (0–3) |
| `access_ratio` | `num_printed_pages_off_hours * entry_during_weekend` | Off-hours printing volume, counted only when the employee entered on a weekend |
| `afterhrs_ratio` | `late_exit_flag * num_printed_pages_off_hours` | Off-hours printing volume, counted only when the employee exited late |
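
For readers who want to reproduce these values when debugging, the formulas in the table can be sketched in pandas as follows. This is an illustration under stated assumptions, not ThreatDetect's own code: the function name `add_engineered_features` is hypothetical, the median is taken from the batch being processed (this page does not say whether the pipeline uses the batch median or a value stored at training time), and only the two division-based columns are cleaned because they are the only ones that can produce `inf`.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with the five derived columns appended (illustrative only)."""
    out = df.copy()
    off_hours = out["num_printed_pages_off_hours"]

    # Formulas copied from the table above.
    out["print_ratio"] = out["total_printed_pages"] / off_hours
    out["file_ratio"] = out["total_files_burned"] / off_hours
    out["risk_ratio"] = (
        out["has_criminal_record"]
        + out["is_contractor"]
        + out["has_foreign_citizenship"]
    )
    out["access_ratio"] = off_hours * out["entry_during_weekend"]
    out["afterhrs_ratio"] = out["late_exit_flag"] * off_hours

    # A zero denominator gives inf (or NaN for 0 / 0). Map inf to NaN, then
    # fill each column with its own median. Assumption: the median comes from
    # this batch, which may differ from what the real pipeline does.
    ratio_cols = ["print_ratio", "file_ratio"]
    out[ratio_cols] = out[ratio_cols].replace([np.inf, -np.inf], np.nan)
    out[ratio_cols] = out[ratio_cols].fillna(out[ratio_cols].median())
    return out
```

Because the fill value is a median over the rows present, a row with a zero denominator gets a different `print_ratio` depending on what else is in the batch. Keep that in mind when comparing scores across files.
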
When `num_printed_pages_off_hours` is 0, division produces `inf` (or `NaN` when the numerator is also 0). The pipeline replaces all `inf` and `-inf` values with `NaN` and then fills the missing values with the column median, so a zero denominator does not abort inference.
Output columns
After scoring, ThreatDetect appends four columns to the input data and returns the enriched DataFrame.
| Column | Type | Description |
|---|---|---|
| `Prediction` | string | Classification result: "Malicious" or "Normal" |
| `Risk_Prob` | float (0–1) | XGBoost probability that the observation belongs to the threat class |
| `Anomaly_Score` | float | Isolation Forest `decision_function` score; more negative values indicate stronger anomalies |
| `Confidence` | float (0–1) | Model confidence derived from `Risk_Prob` relative to `best_threshold` |
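
Once you have the enriched DataFrame, one way to review it is to rank the flagged rows. The sketch below relies only on the output column names listed above. The helper name `triage` and its `top_n` parameter are hypothetical, and breaking ties by `Anomaly_Score` is one reasonable choice rather than anything ThreatDetect prescribes.

```python
import pandas as pd


def triage(scored, top_n=10):
    """Return the highest-risk rows labelled "Malicious" (illustrative only)."""
    flagged = scored[scored["Prediction"] == "Malicious"]
    # Highest Risk_Prob first; ties go to the most negative Anomaly_Score,
    # since more negative values indicate stronger anomalies.
    ranked = flagged.sort_values(
        by=["Risk_Prob", "Anomaly_Score"], ascending=[False, True]
    )
    return ranked.head(top_n)
```

Calling `triage(scored, top_n=20)` on the returned DataFrame yields at most 20 rows, all with `Prediction` equal to "Malicious", ordered so the rows most worth a manual look come first.
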