Feature engineering: behavioral risk signals

ThreatDetect transforms raw employee behavioral records into a numeric feature matrix through three sequential steps: validation of required input columns, derivation of five composite ratio features, and preprocessing (encoding and scaling) using objects fitted during training. The resulting matrix — augmented with an Isolation Forest anomaly score — is what the XGBoost model receives at inference time.

Raw input features

The following columns must be present in every CSV you submit. These are the raw signals collected from employee records, access logs, and HR data. The model cannot run if any of these columns are missing.

Column	Type	Description
`employee_campus`	Categorical	Campus or office location of the employee
`has_criminal_record`	Binary (0/1)	Whether the employee has a known criminal record
`is_contractor`	Binary (0/1)	Whether the employee is a contractor rather than permanent staff
`has_foreign_citizenship`	Binary (0/1)	Whether the employee holds foreign citizenship
`total_printed_pages`	Numeric	Total number of pages printed by the employee
`num_printed_pages_off_hours`	Numeric	Pages printed outside standard working hours
`total_files_burned`	Numeric	Total files copied to removable media (e.g., burned to disc)
`entry_during_weekend`	Binary (0/1)	Whether the employee accessed the premises during a weekend
`late_exit_flag`	Binary (0/1)	Whether the employee has recorded late exits from secured areas

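The model package stores the full, ordered feature list in `feature_columns`. That list contains the raw columns above, the five engineered features, and `isolation_forest_anomaly_score`. The `prepare_features` function derives the required raw columns by excluding the five engineered features and `isolation_forest_anomaly_score` from that list, then checks that each remaining column is present in the input.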
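
The derivation and the missing-column check can be sketched as follows. This is a minimal illustration, not the actual `prepare_features` code: the helper names `required_raw_columns` and `validate_input`, the error message, and the ordering of the example `feature_columns` list are all assumptions.

```python
import pandas as pd

# Names taken from this page; the helpers below are hypothetical.
ENGINEERED = ["print_ratio", "file_ratio", "risk_ratio", "access_ratio", "afterhrs_ratio"]
ISO_SCORE = "isolation_forest_anomaly_score"

def required_raw_columns(feature_columns):
    """Return the raw input columns: everything that is neither engineered nor the iso score."""
    excluded = set(ENGINEERED) | {ISO_SCORE}
    return [col for col in feature_columns if col not in excluded]

def validate_input(df, feature_columns):
    """Raise ValueError listing every required raw column missing from df."""
    missing = [col for col in required_raw_columns(feature_columns) if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

# Example feature list (the ordering here is assumed for illustration only).
raw = [
    "employee_campus", "has_criminal_record", "is_contractor",
    "has_foreign_citizenship", "total_printed_pages",
    "num_printed_pages_off_hours", "total_files_burned",
    "entry_during_weekend", "late_exit_flag",
]
feature_columns = raw + ENGINEERED + [ISO_SCORE]

incomplete = pd.DataFrame({"employee_campus": ["HQ"], "is_contractor": [1]})
try:
    validate_input(incomplete, feature_columns)
except ValueError as exc:
    print(exc)  # lists the seven raw columns that are absent
```

Reporting all missing columns at once, rather than stopping at the first, lets you fix a CSV in a single pass.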

Engineered features

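Five composite features are derived from the raw inputs to capture compound risk signals. Despite the shared `_ratio` suffix, only `print_ratio` and `file_ratio` are true ratios; `risk_ratio` is a sum of three binary flags, and `access_ratio` and `afterhrs_ratio` are products. These features are computed after input validation and before encoding or scaling.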

df['print_ratio'] = df['total_printed_pages'] / df['num_printed_pages_off_hours']
df['file_ratio'] = df['total_files_burned'] / df['num_printed_pages_off_hours']
df['risk_ratio'] = (
    df['has_criminal_record'] + df['is_contractor'] + df['has_foreign_citizenship']
)
df['access_ratio'] = df['num_printed_pages_off_hours'] * df['entry_during_weekend']
df['afterhrs_ratio'] = df['late_exit_flag'] * df['num_printed_pages_off_hours']

Feature	Formula	What it captures
`print_ratio`	`total_printed_pages / num_printed_pages_off_hours`	Total pages printed per off-hours page; values close to 1 mean almost all printing occurs off-hours
`file_ratio`	`total_files_burned / num_printed_pages_off_hours`	File exfiltration activity relative to off-hours printing volume
`risk_ratio`	`has_criminal_record + is_contractor + has_foreign_citizenship`	Additive background risk score (0–3)
`access_ratio`	`num_printed_pages_off_hours × entry_during_weekend`	Off-hours printing combined with weekend access
`afterhrs_ratio`	`late_exit_flag × num_printed_pages_off_hours`	Off-hours printing combined with late exit behavior
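
As a worked example, the sketch below applies the five formulas to two made-up employee rows. The wrapper function `add_engineered_features` and the sample values are hypothetical; only the formulas come from this page.

```python
import pandas as pd

def add_engineered_features(df):
    """Return a copy of df with the five engineered columns added (formulas as documented)."""
    out = df.copy()
    out["print_ratio"] = out["total_printed_pages"] / out["num_printed_pages_off_hours"]
    out["file_ratio"] = out["total_files_burned"] / out["num_printed_pages_off_hours"]
    out["risk_ratio"] = (
        out["has_criminal_record"] + out["is_contractor"] + out["has_foreign_citizenship"]
    )
    out["access_ratio"] = out["num_printed_pages_off_hours"] * out["entry_during_weekend"]
    out["afterhrs_ratio"] = out["late_exit_flag"] * out["num_printed_pages_off_hours"]
    return out

# Two illustrative rows (values invented for the example).
sample = pd.DataFrame({
    "has_criminal_record":         [0, 0],
    "is_contractor":               [1, 0],
    "has_foreign_citizenship":     [1, 0],
    "total_printed_pages":         [120.0, 50.0],
    "num_printed_pages_off_hours": [30.0, 10.0],
    "total_files_burned":          [6.0, 0.0],
    "entry_during_weekend":        [1, 0],
    "late_exit_flag":              [0, 1],
})

features = add_engineered_features(sample)
print(features[["print_ratio", "file_ratio", "risk_ratio", "access_ratio", "afterhrs_ratio"]])
```

For the first row, `print_ratio` is 120 / 30 = 4.0 and `access_ratio` is 30 × 1 = 30. For the second row, `afterhrs_ratio` is 1 × 10 = 10 while `access_ratio` is 0 because there was no weekend entry.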

Handling inf and NaN values

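`print_ratio` and `file_ratio` divide by `num_printed_pages_off_hours`, which can be zero. A nonzero numerator divided by zero yields inf, and 0 / 0 yields NaN. Any inf, -inf, or NaN values produced by the engineered feature calculations are replaced with the column median, which is computed over the rows in the submitted file: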

df[engineered_cols] = df[engineered_cols].replace([np.inf, -np.inf], np.nan)
df[engineered_cols] = df[engineered_cols].fillna(df[engineered_cols].median())

This ensures zero-division edge cases do not propagate through to the model.
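
The following sketch shows the replacement on a small made-up column. The helper name `clean_engineered` and the sample values are assumptions; the two pandas calls inside it are the ones shown above.

```python
import numpy as np
import pandas as pd

def clean_engineered(df, engineered_cols):
    """Replace inf/-inf with NaN, then fill NaN with each column's median."""
    out = df.copy()
    out[engineered_cols] = out[engineered_cols].replace([np.inf, -np.inf], np.nan)
    out[engineered_cols] = out[engineered_cols].fillna(out[engineered_cols].median())
    return out

batch = pd.DataFrame({
    "total_printed_pages":         [100.0, 40.0, 0.0, 30.0],
    "num_printed_pages_off_hours": [20.0, 0.0, 0.0, 10.0],
})
# Row 1 is 40 / 0 (inf) and row 2 is 0 / 0 (NaN).
batch["print_ratio"] = batch["total_printed_pages"] / batch["num_printed_pages_off_hours"]

cleaned = clean_engineered(batch, ["print_ratio"])
print(cleaned["print_ratio"].tolist())  # the median of the finite values 5.0 and 3.0 fills both gaps

# Edge case: if no row in the column is finite, the median is itself NaN
# and the fill leaves the NaN in place.
single = pd.DataFrame({"print_ratio": [np.inf]})
still_missing = clean_engineered(single, ["print_ratio"])
```

Because the median comes from the submitted rows rather than from training data, the fill value can differ between files, and a file in which every row divides by zero (for example, a single-row file) keeps NaN in that column.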

Preprocessing

After feature engineering, categorical and numeric columns are transformed using objects fitted during model training.

Categorical encoding

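Each categorical column (listed in `cat_cols` in the model package, which includes `employee_campus`) is encoded with the `LabelEncoder` fitted for that column during training. Values are cast to strings and stripped of surrounding whitespace before they are checked against the encoder's known classes: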

for col in cat_cols:
    le = model_package['label_encoders'][col]
    values = df[col].astype(str).str.strip()
    unseen = ~values.isin(le.classes_)
    if unseen.any():
        raise ValueError(
            f"Column '{col}' contains unseen categories: "
            f"{sorted(values[unseen].unique())}."
        )
    df[col] = le.transform(values)

If a categorical column contains a value the encoder has never seen (for example, a new campus code not present in the training data), ThreatDetect raises a ValueError and halts processing. You must remap or remove unseen categories before re-submitting the file.
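
One way to catch and remap unseen values before submitting a file is sketched below. The helpers `find_unseen` and `remap_campus`, the campus codes, and the mapping are hypothetical. In practice the known classes would come from the fitted encoder's `classes_` attribute; a plain list stands in for it here.

```python
import pandas as pd

def find_unseen(series, known_classes):
    """Return the sorted distinct values that are not among the known classes."""
    values = series.astype(str).str.strip()
    return sorted(values[~values.isin(known_classes)].unique())

def remap_campus(df, known_classes, mapping):
    """Strip whitespace, apply an explicit old->new mapping, and verify nothing unseen remains."""
    out = df.copy()
    cleaned = out["employee_campus"].astype(str).str.strip()
    out["employee_campus"] = cleaned.replace(mapping)
    leftover = find_unseen(out["employee_campus"], known_classes)
    if leftover:
        raise ValueError(f"Still unseen after remapping: {leftover}")
    return out

known = ["HQ", "North", "South"]            # stand-in for le.classes_
records = pd.DataFrame({"employee_campus": [" HQ", "North", "Annex-2"]})

print(find_unseen(records["employee_campus"], known))   # only 'Annex-2' is unknown
fixed = remap_campus(records, known, {"Annex-2": "North"})
```

An explicit mapping keeps the decision about where a new campus belongs in your hands. Whether remapping or dropping the affected rows is more appropriate depends on your data, since either choice changes what the model sees.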

Numeric scaling

Raw numeric columns are first coerced to numeric types, so any value that cannot be parsed as a number becomes NaN. NaN values are then filled with the column median, computed over the submitted rows, to prevent the scaler from failing on missing data:

raw_num_cols = [col for col in num_cols if col in df.columns]
df[raw_num_cols] = df[raw_num_cols].apply(pd.to_numeric, errors='coerce')
df[raw_num_cols] = df[raw_num_cols].fillna(df[raw_num_cols].median())

All numeric columns (listed in `num_cols`) are then transformed using the `StandardScaler` fitted during training. The scaler is not refitted at inference, so each feature is scaled with the same mean and standard deviation the model was trained on:

df[num_cols] = model_package['scaler'].transform(df[num_cols])
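
The sketch below walks through coercion, median fill, and scaling on three made-up rows. To keep it dependency-light it does not use a real `StandardScaler`; instead it applies the same arithmetic a fitted scaler performs with default settings, `(x - mean_) / scale_`, using invented training statistics. The arrays `train_mean` and `train_scale` are assumptions that stand in for the fitted scaler's `mean_` and `scale_` attributes.

```python
import numpy as np
import pandas as pd

num_cols = ["total_printed_pages", "num_printed_pages_off_hours"]

# Invented training statistics standing in for scaler.mean_ and scaler.scale_.
train_mean = np.array([100.0, 10.0])
train_scale = np.array([50.0, 5.0])

df = pd.DataFrame({
    "total_printed_pages":         ["150", "n/a", "50"],   # strings, one unparseable
    "num_printed_pages_off_hours": [20, None, 10],          # one missing value
})

# 1. Coerce to numeric: "n/a" becomes NaN.
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")
# 2. Fill NaN with the median of the submitted rows (100.0 and 15.0 here).
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
# 3. Scale with the training statistics, not statistics of this batch.
scaled = (df[num_cols].to_numpy() - train_mean) / train_scale
print(scaled)
```

The filled row lands at 0.0 in the first column only because the batch median happens to equal the assumed training mean. In general the batch median and the training mean differ, which is why filled values are not guaranteed to be neutral after scaling.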

Final feature matrix

After preprocessing, the feature matrix is assembled in the order defined by `feature_columns` (excluding the last entry, `isolation_forest_anomaly_score`). The Isolation Forest then scores the matrix, and its output is appended as the final column before the XGBoost model receives the data:

feature_cols = model_package['feature_columns'][:-1]   # all except iso score
x_for_iso = df[feature_cols].to_numpy()
iso_scores = model_package['iso_forest'].decision_function(x_for_iso).reshape(-1, 1)
x_append = np.hstack((x_for_iso, iso_scores))
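
To make the column ordering and shapes concrete, here is a small sketch of the same assembly. The class `StandInScorer` is a hypothetical replacement for the fitted Isolation Forest: it only mimics the `decision_function` interface (one score per row, lower meaning more anomalous) and is not how the real model scores data. The shortened `feature_columns` list and the sample values are also invented.

```python
import numpy as np
import pandas as pd

class StandInScorer:
    """Hypothetical stand-in for the fitted Isolation Forest."""
    def decision_function(self, X):
        # One score per row; rows farther from the origin score lower.
        return -np.linalg.norm(X, axis=1)

model_package = {
    # Shortened list for illustration; the iso score must be the last entry.
    "feature_columns": ["risk_ratio", "access_ratio", "isolation_forest_anomaly_score"],
    "iso_forest": StandInScorer(),
}

# The DataFrame's own column order differs and it carries an extra column.
df = pd.DataFrame({
    "access_ratio": [3.0, 0.0],
    "risk_ratio":   [4.0, 0.0],
    "extra":        [9.0, 9.0],
})

# The [:-1] slice is only correct if the iso score really is the last entry.
assert model_package["feature_columns"][-1] == "isolation_forest_anomaly_score"

feature_cols = model_package["feature_columns"][:-1]
x_for_iso = df[feature_cols].to_numpy()                      # shape (2, 2), reordered
iso_scores = model_package["iso_forest"].decision_function(x_for_iso).reshape(-1, 1)
x_append = np.hstack((x_for_iso, iso_scores))                # shape (2, 3)
print(x_append.shape)
```

Selecting `df[feature_cols]` both reorders the columns to match training and drops anything not in the list, so extra columns in the input do not reach the model.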

See Model architecture for details on how the Isolation Forest score is used by the XGBoost classifier.
