ThreatDetect transforms raw employee behavioral records into a numeric feature matrix through three sequential steps: validation of required input columns, derivation of five composite ratio features, and preprocessing (encoding and scaling) using objects fitted during training. The resulting matrix — augmented with an Isolation Forest anomaly score — is what the XGBoost model receives at inference time.

Raw input features

The following columns must be present in every CSV you submit. These are the raw signals collected from employee records, access logs, and HR data. The model cannot run if any of these columns are missing.
| Column | Type | Description |
| --- | --- | --- |
| employee_campus | Categorical | Campus or office location of the employee |
| has_criminal_record | Binary (0/1) | Whether the employee has a known criminal record |
| is_contractor | Binary (0/1) | Whether the employee is a contractor rather than permanent staff |
| has_foreign_citizenship | Binary (0/1) | Whether the employee holds foreign citizenship |
| total_printed_pages | Numeric | Total number of pages printed by the employee |
| num_printed_pages_off_hours | Numeric | Pages printed outside standard working hours |
| total_files_burned | Numeric | Total files copied to removable media (e.g., burned to disc) |
| entry_during_weekend | Binary (0/1) | Whether the employee accessed the premises during a weekend |
| late_exit_flag | Binary (0/1) | Whether the employee has recorded late exits from secured areas |
The model package stores the authoritative list of required columns in feature_columns. The prepare_features function derives the required raw columns by excluding the five engineered features and isolation_forest_anomaly_score from that list.
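The derivation of required columns described above can be sketched as follows. This is a minimal illustration, not ThreatDetect's actual implementation: the helper names `required_raw_columns` and `validate_input`, the error message, and the shortened `feature_columns` list are all assumptions made for the example.

```python
import pandas as pd

# Derived column names as listed on this page.
ENGINEERED = ["print_ratio", "file_ratio", "risk_ratio",
              "access_ratio", "afterhrs_ratio"]
ISO_SCORE = "isolation_forest_anomaly_score"


def required_raw_columns(feature_columns):
    """Return feature_columns minus the engineered features and the iso score."""
    excluded = set(ENGINEERED) | {ISO_SCORE}
    return [col for col in feature_columns if col not in excluded]


def validate_input(df, feature_columns):
    """Raise ValueError naming every required raw column missing from df."""
    missing = [col for col in required_raw_columns(feature_columns)
               if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")


# Hypothetical, shortened feature_columns list for illustration only.
feature_columns = ["employee_campus", "is_contractor", "total_printed_pages",
                   "print_ratio", ISO_SCORE]
df = pd.DataFrame({"employee_campus": ["A"], "is_contractor": [0]})
raw = required_raw_columns(feature_columns)
```

Calling `validate_input(df, feature_columns)` on this frame would raise a `ValueError` that names `total_printed_pages`, because that column is required but absent.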

Engineered features

Five composite features are derived from the raw inputs to capture behavioral ratios and compound risk signals. These features are computed after input validation and before encoding or scaling.
df['print_ratio'] = df['total_printed_pages'] / df['num_printed_pages_off_hours']
df['file_ratio'] = df['total_files_burned'] / df['num_printed_pages_off_hours']
df['risk_ratio'] = (
    df['has_criminal_record'] + df['is_contractor'] + df['has_foreign_citizenship']
)
df['access_ratio'] = df['num_printed_pages_off_hours'] * df['entry_during_weekend']
df['afterhrs_ratio'] = df['late_exit_flag'] * df['num_printed_pages_off_hours']
| Feature | Formula | What it captures |
| --- | --- | --- |
| print_ratio | total_printed_pages / num_printed_pages_off_hours | Total printing volume relative to off-hours printing (a larger value means a smaller off-hours share) |
| file_ratio | total_files_burned / num_printed_pages_off_hours | File exfiltration activity relative to off-hours printing volume |
| risk_ratio | has_criminal_record + is_contractor + has_foreign_citizenship | Additive background risk score (0–3) |
| access_ratio | num_printed_pages_off_hours × entry_during_weekend | Off-hours printing combined with weekend access |
| afterhrs_ratio | late_exit_flag × num_printed_pages_off_hours | Off-hours printing combined with late exit behavior |
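As a worked example, the formulas can be applied to two made-up employee rows. The values below are invented for illustration; the second row has no off-hours printing, so its two division-based features come out as `inf` before any cleanup.

```python
import pandas as pd

# Two hypothetical employees; values are invented for this example.
df = pd.DataFrame({
    "total_printed_pages": [100, 40],
    "num_printed_pages_off_hours": [20, 0],
    "total_files_burned": [5, 2],
    "has_criminal_record": [0, 1],
    "is_contractor": [1, 1],
    "has_foreign_citizenship": [0, 1],
    "entry_during_weekend": [1, 0],
    "late_exit_flag": [0, 1],
})

df["print_ratio"] = df["total_printed_pages"] / df["num_printed_pages_off_hours"]
df["file_ratio"] = df["total_files_burned"] / df["num_printed_pages_off_hours"]
df["risk_ratio"] = (
    df["has_criminal_record"] + df["is_contractor"] + df["has_foreign_citizenship"]
)
df["access_ratio"] = df["num_printed_pages_off_hours"] * df["entry_during_weekend"]
df["afterhrs_ratio"] = df["late_exit_flag"] * df["num_printed_pages_off_hours"]

# Row 0: print_ratio 5.0, file_ratio 0.25, risk_ratio 1, access_ratio 20, afterhrs_ratio 0
# Row 1: print_ratio inf, file_ratio inf, risk_ratio 3, access_ratio 0, afterhrs_ratio 0
```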

Handling inf and NaN values

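print_ratio and file_ratio divide by num_printed_pages_off_hours, which can be zero. A nonzero numerator divided by zero yields inf, and 0 / 0 yields NaN. Infinite values in the engineered columns are first converted to NaN, and every NaN is then replaced with the median of its column, computed over the rows in the submitted file: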
df[engineered_cols] = df[engineered_cols].replace([np.inf, -np.inf], np.nan)
df[engineered_cols] = df[engineered_cols].fillna(df[engineered_cols].median())
This ensures zero-division edge cases do not propagate through to the model.
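The effect of the two cleanup lines can be checked on a small frame. The column values here are invented for illustration; only the `replace` and `fillna` calls mirror the snippet above.

```python
import numpy as np
import pandas as pd

engineered_cols = ["print_ratio", "file_ratio"]

# Invented values: inf stands in for x / 0, NaN for 0 / 0.
df = pd.DataFrame({
    "print_ratio": [5.0, np.inf, 3.0, np.nan],
    "file_ratio": [0.25, 0.75, np.inf, 0.5],
})

# Step 1: turn +/-inf into NaN so they are ignored by median().
df[engineered_cols] = df[engineered_cols].replace([np.inf, -np.inf], np.nan)
# Step 2: fill each column's NaN with that column's median of the finite values.
df[engineered_cols] = df[engineered_cols].fillna(df[engineered_cols].median())

# print_ratio: median of [5.0, 3.0] is 4.0        -> [5.0, 4.0, 3.0, 4.0]
# file_ratio:  median of [0.25, 0.75, 0.5] is 0.5 -> [0.25, 0.75, 0.5, 0.5]
```

Because the median is taken over the batch being scored, the fill value depends on the other rows in the same file. If a column has no finite values at all, its median is itself NaN and the fill leaves the column unchanged.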

Preprocessing

After feature engineering, categorical and numeric columns are transformed using objects fitted during model training.

Categorical encoding

Each categorical column (stored in cat_cols in the model package, which includes employee_campus) is encoded with its corresponding LabelEncoder:
for col in cat_cols:
    le = model_package['label_encoders'][col]
    values = df[col].astype(str).str.strip()
    unseen = ~values.isin(le.classes_)
    if unseen.any():
        raise ValueError(
            f"Column '{col}' contains unseen categories: "
            f"{sorted(values[unseen].unique())}."
        )
    df[col] = le.transform(values)
If a categorical column contains a value the encoder has never seen (for example, a new campus code not present in the training data), ThreatDetect raises a ValueError and halts processing. You must remap or remove unseen categories before re-submitting the file.
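One way to catch unseen categories before submitting a file is to run the same comparison locally. The sketch below is an assumption-laden illustration: `find_unseen` is a hypothetical helper, and the `known` list stands in for the encoder's `classes_`, which you would need to obtain from the model package.

```python
import pandas as pd


def find_unseen(values, known_classes):
    """Return the sorted unique values (after str/strip) not in known_classes."""
    cleaned = values.astype(str).str.strip()
    return sorted(cleaned[~cleaned.isin(known_classes)].unique())


# Hypothetical campus codes standing in for le.classes_.
known = ["East", "North", "West"]
campus = pd.Series([" North", "East", "Central", "West "])

unseen = find_unseen(campus, known)   # "Central" was never seen in training

# One possible remedy: keep only the rows whose category is known.
cleaned = campus.astype(str).str.strip()
kept = cleaned[cleaned.isin(known)]
```

Whether to drop such rows or remap them to an existing category is a data-quality decision. The sketch shows dropping only because it needs no extra assumptions about how campuses relate to each other.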

Numeric scaling

Raw numeric columns (those in num_cols that are present in the input) are first coerced to numeric types. Entries that cannot be parsed become NaN, and NaN values are then filled with the column median so that the scaler does not fail on missing data:
raw_num_cols = [col for col in num_cols if col in df.columns]
df[raw_num_cols] = df[raw_num_cols].apply(pd.to_numeric, errors='coerce')
df[raw_num_cols] = df[raw_num_cols].fillna(df[raw_num_cols].median())
All numeric columns (stored in num_cols) are then transformed using the StandardScaler fitted during training. This ensures that the scale of each feature at inference matches the scale the model was trained on:
df[num_cols] = model_package['scaler'].transform(df[num_cols])
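To illustrate what reusing a scaler fitted during training means, the sketch below fits scikit-learn's `StandardScaler` on a tiny synthetic training column and applies it to a new batch. The numbers are invented. The point is that `transform` uses the stored training mean and standard deviation, not statistics of the batch being scored.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one numeric training column.
train = np.array([[10.0], [20.0], [30.0]])
scaler = StandardScaler().fit(train)   # stores mean_ and scale_ from training data

# A new batch at inference time; its own mean (30) is not used.
batch = np.array([[20.0], [40.0]])
scaled = scaler.transform(batch)

# transform() is equivalent to (x - training mean) / training std.
manual = (batch - scaler.mean_) / scaler.scale_
```

A value equal to the training mean (20 here) maps to 0 regardless of what else is in the batch, which is why the fitted object, rather than a freshly fitted one, has to be applied at inference.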

Final feature matrix

After preprocessing, the feature matrix is assembled in the order defined by feature_columns (excluding the last entry, isolation_forest_anomaly_score). The Isolation Forest then scores the matrix and its output is appended as the final column before the XGBoost model receives the data:
feature_cols = model_package['feature_columns'][:-1]   # all except iso score
x_for_iso = df[feature_cols].to_numpy()
iso_scores = model_package['iso_forest'].decision_function(x_for_iso).reshape(-1, 1)
x_append = np.hstack((x_for_iso, iso_scores))
See Model architecture for details on how the Isolation Forest score is used by the XGBoost classifier.
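The assembly step can be reproduced end to end on synthetic data. In this sketch the training matrix, the three-feature width, and the `IsolationForest` settings are assumptions for illustration; only the `decision_function`, `reshape`, and `np.hstack` calls mirror the snippet above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed training matrix (3 features).
x_train = rng.normal(size=(200, 3))
iso_forest = IsolationForest(random_state=0).fit(x_train)

# Synthetic stand-in for df[feature_cols].to_numpy() at inference time.
x_for_iso = rng.normal(size=(5, 3))

# decision_function returns one score per row; lower means more anomalous.
iso_scores = iso_forest.decision_function(x_for_iso).reshape(-1, 1)

# Append the score as the last column: shape goes from (5, 3) to (5, 4).
x_append = np.hstack((x_for_iso, iso_scores))
```

The `reshape(-1, 1)` turns the one-dimensional score vector into a column so that `np.hstack` can place it next to the existing features, matching the position of isolation_forest_anomaly_score as the final entry in feature_columns.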
