ThreatDetect accepts a single CSV file in which each row represents one employee observation period. The pipeline validates column presence, encodes the categorical employee_campus column using the label encoder fitted at training time, derives five additional features automatically, and scales numeric values with the stored StandardScaler before prediction. You never need to compute the derived features yourself. Understanding the expected schema helps you catch data issues before they reach the model.

Required input columns

Your CSV must include all nine columns listed below. Column names are case-sensitive and must match exactly. Extra columns are ignored; any missing column raises a KeyError before inference begins.

| Column | Type | Description | Valid values |
| --- | --- | --- | --- |
| employee_campus | string | Physical campus where the employee is based | Any campus value present in the training set (see warning below) |
| has_criminal_record | integer | Whether the employee has a known criminal record | 0 or 1 |
| is_contractor | integer | Whether the employee is a contractor rather than permanent staff | 0 or 1 |
| has_foreign_citizenship | integer | Whether the employee holds citizenship of a foreign country | 0 or 1 |
| entry_during_weekend | integer | Whether the employee badged in on a weekend during the observation period | 0 or 1 |
| late_exit_flag | integer | Whether the employee exited the facility after normal working hours | 0 or 1 |
| total_printed_pages | integer | Total pages printed across the observation period | Non-negative integer |
| num_printed_pages_off_hours | integer | Pages printed outside standard business hours | Non-negative integer |
| total_files_burned | integer | Files written to removable media during the observation period | Non-negative integer |

Warning: The employee_campus column is encoded with a LabelEncoder fitted on the training dataset. Passing a campus value that was not seen during training raises a ValueError at inference time. Validate all campus values against the encoder’s known classes before uploading a CSV.
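
The checks described in this section can be run on your side before uploading. The following is a minimal sketch, not ThreatDetect's own validation code: the function name `find_schema_problems`, the constants, and the `known_campuses` parameter are hypothetical names introduced for illustration, and the sketch assumes you have the list of campus values from training available.

```python
import pandas as pd

# The nine required column names, exactly as listed in the table above.
REQUIRED_COLUMNS = [
    "employee_campus",
    "has_criminal_record",
    "is_contractor",
    "has_foreign_citizenship",
    "entry_during_weekend",
    "late_exit_flag",
    "total_printed_pages",
    "num_printed_pages_off_hours",
    "total_files_burned",
]
BINARY_COLUMNS = REQUIRED_COLUMNS[1:6]  # the five 0/1 flags
COUNT_COLUMNS = REQUIRED_COLUMNS[6:]    # the three non-negative counts


def find_schema_problems(df, known_campuses):
    """Return a list of problems; an empty list means the frame looks valid.

    Hypothetical helper: `known_campuses` is assumed to hold the campus
    values seen during training.
    """
    problems = []

    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need every column

    known = set(known_campuses)
    unseen = sorted({str(v) for v in df["employee_campus"].unique() if v not in known})
    if unseen:
        problems.append(f"campus values not seen in training: {unseen}")

    for col in BINARY_COLUMNS:
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} must contain only 0 or 1")

    for col in COUNT_COLUMNS:
        values = pd.to_numeric(df[col], errors="coerce")
        if values.isna().any() or (values < 0).any() or (values % 1 != 0).any():
            problems.append(f"{col} must contain non-negative integers")

    return problems
```

If you have access to the fitted scikit-learn LabelEncoder, its `classes_` attribute is a natural source for `known_campuses`. Extra columns are left alone here, matching the pipeline's behavior of ignoring them.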

Engineered features

The pipeline derives five features from the raw input columns immediately after loading your CSV. Do not include these columns in your file; the pipeline computes and appends them internally before scaling and prediction.

| Feature | Formula | What it captures |
| --- | --- | --- |
| print_ratio | total_printed_pages / num_printed_pages_off_hours | Ratio of total printing to off-hours printing (lower values mean a larger share of printing happens off hours) |
| file_ratio | total_files_burned / num_printed_pages_off_hours | File exfiltration relative to off-hours printing activity |
| risk_ratio | has_criminal_record + is_contractor + has_foreign_citizenship | Aggregate background-risk score (0–3) |
| access_ratio | num_printed_pages_off_hours * entry_during_weekend | Combined off-hours and weekend access indicator |
| afterhrs_ratio | late_exit_flag * num_printed_pages_off_hours | Off-hours printing weighted by late exits |

When num_printed_pages_off_hours is 0, the division in print_ratio and file_ratio produces inf (or NaN when the numerator is also 0). The pipeline replaces all inf and -inf values with NaN and then fills them with the column median, so a zero denominator does not abort inference.
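
To make the formulas and the zero-denominator handling concrete, here is a minimal pandas sketch of the same derivation. It is an illustration under stated assumptions, not the pipeline's actual code: the function name `add_engineered_features` is hypothetical, the median is assumed to be computed over the batch being processed, and the cleanup is applied only to the five engineered columns.

```python
import numpy as np
import pandas as pd

ENGINEERED = ["print_ratio", "file_ratio", "risk_ratio", "access_ratio", "afterhrs_ratio"]


def add_engineered_features(df):
    """Return a copy of df with the five derived columns appended (hypothetical helper)."""
    out = df.copy()
    off_hours = out["num_printed_pages_off_hours"]

    # Division by zero yields inf (or NaN for 0 / 0) rather than raising.
    out["print_ratio"] = out["total_printed_pages"] / off_hours
    out["file_ratio"] = out["total_files_burned"] / off_hours
    out["risk_ratio"] = (
        out["has_criminal_record"] + out["is_contractor"] + out["has_foreign_citizenship"]
    )
    out["access_ratio"] = off_hours * out["entry_during_weekend"]
    out["afterhrs_ratio"] = out["late_exit_flag"] * off_hours

    # Replace +/-inf with NaN, then fill NaN with each column's median.
    # Assumption: the median comes from this batch, not from training data.
    cleaned = out[ENGINEERED].replace([np.inf, -np.inf], np.nan)
    out[ENGINEERED] = cleaned.fillna(cleaned.median())
    return out
```

Because the median skips NaN values, a row with zero off-hours printing receives the median ratio of the remaining rows in the batch.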

Output columns

After scoring, ThreatDetect appends four columns to the input data and returns the enriched DataFrame.

| Column | Type | Description |
| --- | --- | --- |
| Prediction | string | Classification result: "Malicious" or "Normal" |
| Risk_Prob | float (0–1) | XGBoost probability that the observation belongs to the threat class |
| Anomaly_Score | float | Isolation Forest decision_function score; more negative values indicate stronger anomalies |
| Confidence | float (0–1) | Model confidence derived from Risk_Prob relative to best_threshold |
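
Once you have the enriched DataFrame, a common next step is to triage the flagged rows. The sketch below shows one possible ordering that relies only on the output columns described above; the function name `rank_flagged` is hypothetical, and the sort order is an illustrative choice rather than anything ThreatDetect prescribes.

```python
import pandas as pd


def rank_flagged(scored):
    """Return rows predicted "Malicious", most concerning first (hypothetical helper).

    Rows are ordered by Risk_Prob (highest first); ties are broken by
    Anomaly_Score (most negative first, since more negative values
    indicate stronger anomalies).
    """
    flagged = scored[scored["Prediction"] == "Malicious"]
    return flagged.sort_values(
        ["Risk_Prob", "Anomaly_Score"], ascending=[False, True]
    )
```

Calling `rank_flagged(scored_df)` on the returned DataFrame keeps every input column, so reviewers can still see the raw observations next to the scores.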

Example CSV header

The line below shows a valid header row. Column order is flexible as long as all nine names are present.
employee_campus,has_criminal_record,is_contractor,has_foreign_citizenship,entry_during_weekend,late_exit_flag,total_printed_pages,num_printed_pages_off_hours,total_files_burned
