CEREBRO.py is the full end-to-end training pipeline. Run it once after generating the dataset and it produces every .pkl artifact the live IDS needs for inference. The script trains a soft-voting ensemble of Random Forest, MLP, and XGBoost inside an ImbPipeline that handles class balancing and feature scaling automatically.
How to run
Dataset/escanerpuertos.csv must exist in the working directory. Generate it first with python generar_dataset.py.
The script prints training progress and evaluation metrics to stdout, then writes the model and encoder files to the working directory.
Training pipeline
Load and clean data
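A minimal sketch of this step, assuming a pandas-based loader (the helper name load_and_clean is illustrative, not the name used in CEREBRO.py):

```python
import pandas as pd

def load_and_clean(path: str) -> pd.DataFrame:
    """Read the raw CSV with latin-1 encoding, skip malformed lines,
    and drop every row that contains a NaN."""
    df = pd.read_csv(path, encoding="latin-1", on_bad_lines="skip")
    return df.dropna().reset_index(drop=True)
```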
The CSV is read with latin-1 encoding to handle special characters, and bad lines are skipped rather than raising an error. Rows with any NaN value are dropped before any transformation.
Feature engineering
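The two engineered features for this step can be sketched as follows; the timestamp column name and the add_hour helper are assumptions, but the network-byte-order conversion matches the description:

```python
import socket
import struct

import pandas as pd

def ip_to_int(ip: str) -> int:
    """Dotted-quad IPv4 -> 32-bit unsigned integer in network byte order
    ('!I' = big-endian unsigned 32-bit)."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def add_hour(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Derive the hour-of-day (0-23) feature from the timestamp column
    (the column name is an assumption)."""
    df["hour"] = pd.to_datetime(df[ts_col]).dt.hour
    return df
```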
IP addresses are not usable as strings. The ip_to_int() function converts each IP to a 32-bit unsigned integer using network byte order, producing a numeric value the model can reason about ordinally. The hour field is extracted from the timestamp column to give the model temporal context.
Label encoding
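A sketch of fitting and persisting one such encoder (the helper encode_and_save is illustrative; the .pkl filename mirrors the files listed under Output files):

```python
import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_and_save(df: pd.DataFrame, col: str, out_path: str) -> pd.Series:
    """Fit a LabelEncoder on one categorical column, serialize it for
    inference-time reuse, and return the encoded integer series."""
    enc = LabelEncoder()
    codes = enc.fit_transform(df[col])
    joblib.dump(enc, out_path)  # e.g. protocol_encoder.pkl
    return pd.Series(codes, index=df.index, name=f"{col}_encoded")
```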
The three categorical string columns are encoded to integers using LabelEncoder. Each encoder is fitted once and immediately serialized so inference-time code uses the identical mapping.
Class grouping
After encoding, classes with extremely few samples (encoded as 0, 2, and 3) are remapped to a synthetic class 9 labeled “Otros”. This reduces extreme imbalance while preserving the meaningful distinctions between the majority attack types.
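The remap can be sketched as follows (the class IDs 0, 2, 3 and the synthetic label 9 come from the description above; the helper name is illustrative):

```python
import numpy as np

RARE_CLASSES = (0, 2, 3)  # encoded labels with very few samples
OTROS = 9                 # synthetic "Otros" class

def group_rare_classes(y: np.ndarray) -> np.ndarray:
    """Remap the rare encoded classes to the single 'Otros' label."""
    return np.where(np.isin(y, RARE_CLASSES), OTROS, y)
```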
The encoder is re-fitted and re-saved after grouping. The final tipo_ataque_encoder.pkl reflects the post-grouping class space, not the original seven classes.
Feature selection
SelectKBest with the ANOVA F-test scores each feature by its statistical relationship to the target variable. With k=6 and six features available, all features are retained, but the selector still computes per-feature scores and enforces the feature ordering expected by the downstream pipeline.

| Position | Feature | Description |
|---|---|---|
| 0 | src_ip_int | Source IP as uint32 |
| 1 | dst_ip_int | Destination IP as uint32 |
| 2 | dst_port | Destination port number |
| 3 | protocol_encoded | TCP/UDP encoded as integer |
| 4 | flag_encoded | TCP flag encoded as integer |
| 5 | hour | Hour of day (0–23) |
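On toy data with the same feature count, the selection step looks like this (make_classification stands in for the real feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy stand-in for the real feature matrix: six features, as in the table above.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           random_state=0)

# ANOVA F-test scoring; with k=6 every feature survives, but the selector
# still exposes a per-feature score via selector.scores_.
selector = SelectKBest(score_func=f_classif, k=6)
X_sel = selector.fit_transform(X, y)
```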
Train/test split
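An illustrative stratified split on toy imbalanced labels (variable names are placeholders, not those in CEREBRO.py):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 90/10 imbalance: class 1 is the minority
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```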
The dataset is split 80/20. stratify=y_raw ensures that every class keeps the same proportion in both partitions, which is critical with an imbalanced dataset.
SMOTE oversampling
SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic feature vectors for under-represented classes by interpolating between existing minority samples. It runs as the first step inside the ImbPipeline, so it only sees training data, never the test set. This ensures the model is trained on a balanced class distribution without duplicating real records.
StandardScaler normalization
After SMOTE, all features are normalized to mean=0, std=1. This step is required for the MLP, which is sensitive to feature scale. The tree-based models (Random Forest and XGBoost) are scale-invariant, so normalization leaves them unaffected but does no harm. Scaling happens inside the pipeline, so the scaler's parameters are learned only from training data and applied consistently at inference time.
VotingClassifier (soft voting)
Three classifiers are combined via soft voting. Each model produces a probability vector over all classes; the vectors are averaged and the class with the highest averaged probability wins.
| Model | Role |
|---|---|
| Random Forest | Robust to overfitting; handles non-linear relationships; class_weight='balanced' adds an internal correction for imbalance |
| MLP (64×64) | Captures complex interactions between features; early_stopping prevents overfitting on minority classes |
| XGBoost | High accuracy on tabular data; gradient boosting progressively corrects previous errors |
Output files
After a successful run, the following files are written to the working directory:

| File | Contents |
|---|---|
| modelo_ensamble_optimizado.pkl | Full ImbPipeline: SMOTE + StandardScaler + VotingClassifier |
| features_seleccionadas.pkl | Ordered list of feature names used during training |
| protocol_encoder.pkl | LabelEncoder fitted on protocol values |
| flag_encoder.pkl | LabelEncoder fitted on flag values |
| tipo_ataque_encoder.pkl | LabelEncoder fitted on the post-grouping class labels |
Inference preprocessing
At inference time, preprocesar_datos() in CEREBRO.py transforms a raw packet tuple into the exact feature vector the pipeline expects:
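A hedged reconstruction of that transform (the function name preprocess_packet and its signature are illustrative; the feature order matches the selection table above):

```python
import socket
import struct
from datetime import datetime

import numpy as np

def preprocess_packet(src_ip: str, dst_ip: str, dst_port: int,
                      protocol: str, flag: str, ts: datetime,
                      protocol_enc, flag_enc) -> np.ndarray:
    """Build the 1x6 feature vector in the order the pipeline expects:
    src_ip_int, dst_ip_int, dst_port, protocol_encoded, flag_encoded, hour."""
    ip_to_int = lambda ip: struct.unpack("!I", socket.inet_aton(ip))[0]
    return np.array([[
        ip_to_int(src_ip),
        ip_to_int(dst_ip),
        dst_port,
        protocol_enc.transform([protocol])[0],  # loaded from protocol_encoder.pkl
        flag_enc.transform([flag])[0],          # loaded from flag_encoder.pkl
        ts.hour,
    ]], dtype=float)
```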