Input data schema for ThreatDetect CSV files

ThreatDetect accepts a single CSV file in which each row represents one employee observation period. The pipeline validates column presence, derives five additional features automatically (you never need to compute them yourself), encodes the categorical `employee_campus` column using the label encoder fitted at training time, and scales numeric values with the stored StandardScaler before prediction. Understanding the expected schema helps you catch data issues before they reach the model.

Required input columns

Your CSV must include all nine columns listed below. Column names are case-sensitive and must match exactly. Extra columns are ignored; any missing column raises a KeyError before inference begins.

Column	Type	Description	Valid values
`employee_campus`	string	Physical campus where the employee is based	Any campus value present in the training set (see warning below)
`has_criminal_record`	integer	Whether the employee has a known criminal record	`0` or `1`
`is_contractor`	integer	Whether the employee is a contractor rather than permanent staff	`0` or `1`
`has_foreign_citizenship`	integer	Whether the employee holds citizenship of a foreign country	`0` or `1`
`entry_during_weekend`	integer	Whether the employee badged in on a weekend during the observation period	`0` or `1`
`late_exit_flag`	integer	Whether the employee exited the facility after normal working hours	`0` or `1`
`total_printed_pages`	integer	Total pages printed across the observation period	Non-negative integer
`num_printed_pages_off_hours`	integer	Pages printed outside standard business hours	Non-negative integer
`total_files_burned`	integer	Files written to removable media during the observation period	Non-negative integer
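
The rules in the table above can be checked locally before a file is uploaded. The following is a minimal sketch using only the Python standard library; `check_csv` and its message format are illustrative names, not part of ThreatDetect, and the check is a pre-flight aid rather than a copy of the pipeline's own validation.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "employee_campus",
    "has_criminal_record",
    "is_contractor",
    "has_foreign_citizenship",
    "entry_during_weekend",
    "late_exit_flag",
    "total_printed_pages",
    "num_printed_pages_off_hours",
    "total_files_burned",
]
BINARY_COLUMNS = REQUIRED_COLUMNS[1:6]   # the five 0/1 flags
COUNT_COLUMNS = REQUIRED_COLUMNS[6:]     # the three non-negative counts


def check_csv(csv_text):
    """Return a list of human-readable schema problems (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        # Names are case-sensitive, so "Is_Contractor" counts as missing.
        return [f"missing column: {c}" for c in missing]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in BINARY_COLUMNS:
            if row[col] not in ("0", "1"):
                problems.append(f"line {line_no}: {col} must be 0 or 1")
        for col in COUNT_COLUMNS:
            value = row[col] or ""  # short rows yield None for absent cells
            if not (value.isascii() and value.isdigit()):
                problems.append(
                    f"line {line_no}: {col} must be a non-negative integer"
                )
    return problems
```

Extra columns pass through this check untouched, which matches the pipeline's behavior of ignoring them. Campus values are not checked here because the set of valid values depends on the fitted encoder.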

Warning: The `employee_campus` column is encoded with a `LabelEncoder` fitted on the training dataset. Passing a campus value that was not seen during training raises a `ValueError` at inference time. Validate all campus values against the encoder’s known classes before uploading a CSV.
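
One way to run that check is to compare the campus values in your file with the encoder's known classes. The sketch below is an assumption-laden illustration: `unknown_campuses` is a hypothetical helper, and the campus names are made up. If the fitted encoder is a scikit-learn `LabelEncoder`, its known values are available through the `classes_` attribute, and matching is exact, so differences in case or spacing count as unseen values.

```python
def unknown_campuses(rows, known_classes):
    """Return the campus values in `rows` that are not in `known_classes`, sorted."""
    known = set(known_classes)
    return sorted({row["employee_campus"] for row in rows} - known)


# Hypothetical campus names. In practice, take them from the fitted encoder,
# for example list(encoder.classes_) for a scikit-learn LabelEncoder.
known_classes = ["Campus A", "Campus B", "Campus C"]

rows = [
    {"employee_campus": "Campus A"},
    {"employee_campus": "campus b"},  # case differs, so it is not a known class
    {"employee_campus": "Campus D"},
]

print(unknown_campuses(rows, known_classes))  # ['Campus D', 'campus b']
```

An empty list means every campus value in the file was seen during training.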

Engineered features

The pipeline derives five features from the raw input columns immediately after loading your CSV. Do not include these columns in your file — the pipeline computes and appends them internally before scaling and prediction.

Feature	Formula	What it captures
`print_ratio`	`total_printed_pages / num_printed_pages_off_hours`	Total printing relative to off-hours printing; values close to 1 mean most printing happens off hours
`file_ratio`	`total_files_burned / num_printed_pages_off_hours`	File exfiltration relative to off-hours printing activity
`risk_ratio`	`has_criminal_record + is_contractor + has_foreign_citizenship`	Aggregate background-risk score (0–3)
`access_ratio`	`num_printed_pages_off_hours * entry_during_weekend`	Combined off-hours and weekend access indicator
`afterhrs_ratio`	`late_exit_flag * num_printed_pages_off_hours`	Off-hours printing weighted by late exits

When `num_printed_pages_off_hours` is 0, the division in `print_ratio` and `file_ratio` produces `inf` (or `NaN` when the numerator is also 0). The pipeline replaces all `inf` and `-inf` values with `NaN` and then fills them with the column median, so a zero denominator does not abort inference.

Output columns

After scoring, ThreatDetect appends four columns to the input data and returns the enriched DataFrame.

Column	Type	Description
`Prediction`	string	Classification result: `"Malicious"` or `"Normal"`
`Risk_Prob`	float (0–1)	XGBoost probability that the observation belongs to the threat class
`Anomaly_Score`	float	Isolation Forest `decision_function` score; more negative values indicate stronger anomalies
`Confidence`	float (0–1)	Model confidence derived from `Risk_Prob` relative to `best_threshold`
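
As an illustration of how these columns might be used together, the sketch below ranks flagged rows for review. The `triage` helper, the `row_id` field, and all score values are hypothetical; the rows are shown as plain dictionaries rather than a DataFrame to keep the example dependency-free.

```python
def triage(scored_rows, top_n=10):
    """Return rows predicted "Malicious", highest Risk_Prob first.

    Ties on Risk_Prob are broken by Anomaly_Score, most negative first,
    because more negative scores indicate stronger anomalies.
    """
    flagged = [r for r in scored_rows if r["Prediction"] == "Malicious"]
    flagged.sort(key=lambda r: (-r["Risk_Prob"], r["Anomaly_Score"]))
    return flagged[:top_n]


# Made-up scores for illustration only.
scored_rows = [
    {"row_id": 1, "Prediction": "Normal", "Risk_Prob": 0.10, "Anomaly_Score": 0.08},
    {"row_id": 2, "Prediction": "Malicious", "Risk_Prob": 0.72, "Anomaly_Score": -0.05},
    {"row_id": 3, "Prediction": "Malicious", "Risk_Prob": 0.91, "Anomaly_Score": -0.12},
    {"row_id": 4, "Prediction": "Malicious", "Risk_Prob": 0.72, "Anomaly_Score": -0.20},
]

print([r["row_id"] for r in triage(scored_rows)])  # [3, 4, 2]
```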

Example CSV header

The line below shows a valid header row. Column order is flexible as long as all nine names are present.

employee_campus,has_criminal_record,is_contractor,has_foreign_citizenship,entry_during_weekend,late_exit_flag,total_printed_pages,num_printed_pages_off_hours,total_files_burned
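
If you generate the file programmatically, writing it from named fields avoids header typos. This is a minimal sketch with the standard `csv` module; `to_csv` is a hypothetical helper and the record values, including the campus name, are invented.

```python
import csv
import io

FIELDNAMES = [
    "employee_campus",
    "has_criminal_record",
    "is_contractor",
    "has_foreign_citizenship",
    "entry_during_weekend",
    "late_exit_flag",
    "total_printed_pages",
    "num_printed_pages_off_hours",
    "total_files_burned",
]


def to_csv(records):
    """Serialize a list of dicts to CSV text with the nine required columns."""
    buffer = io.StringIO()
    # With the default extrasaction="raise", a record containing a key that
    # is not in FIELDNAMES raises ValueError instead of being silently written.
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented example record; key order in the dict does not matter.
record = {
    "total_files_burned": 3,
    "employee_campus": "Campus A",
    "has_criminal_record": 0,
    "is_contractor": 1,
    "has_foreign_citizenship": 0,
    "entry_during_weekend": 1,
    "late_exit_flag": 0,
    "total_printed_pages": 120,
    "num_printed_pages_off_hours": 15,
}

print(to_csv([record]))
```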
