CEREBRO.py trains a soft-voting ensemble of three classifiers on a labeled network-traffic dataset, then serializes the complete pipeline to disk. At runtime, ids.py loads those serialized artifacts and calls clasificar_ataque_ml() for every detected event.
Architecture
The model is a VotingClassifier with voting='soft': each sub-estimator outputs class probabilities, and the final prediction is the class with the highest probability averaged across all three estimators.
| Estimator | Strengths |
|---|---|
| RandomForestClassifier(n_estimators=100, class_weight='balanced') | Robust to noise; handles class imbalance with internal weighting |
| MLPClassifier(hidden_layer_sizes=(64, 64), early_stopping=True) | Captures non-linear relationships; early stopping limits overfitting |
| XGBClassifier(eval_metric='mlogloss') | High accuracy on tabular data; gradient boosting evaluated with multiclass log-loss |
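As a rough illustration of the soft-voting rule, the sketch below averages per-class probabilities from three estimators and picks the class with the highest mean. The probability values, the three-class label list, and the helper name soft_vote are made up for this example; they are not outputs or code of the real model.

```python
# Hypothetical per-class probabilities from the three sub-estimators
# for a single event (illustrative numbers, not real model output).
classes = ["Normal", "SYN Flood", "UDP Flood"]
probabilities = {
    "random_forest": [0.10, 0.80, 0.10],
    "mlp": [0.30, 0.50, 0.20],
    "xgboost": [0.05, 0.90, 0.05],
}


def soft_vote(probas_by_estimator):
    """Average the probability vectors; return (winning index, mean vector)."""
    vectors = list(probas_by_estimator.values())
    n = len(vectors)
    # zip(*vectors) groups the probabilities class by class.
    mean = [sum(column) / n for column in zip(*vectors)]
    winner = max(range(len(mean)), key=mean.__getitem__)
    return winner, mean


index, mean = soft_vote(probabilities)
print(classes[index], [round(p, 3) for p in mean])
```

Because the vote is taken over averaged probabilities rather than over hard labels, one very confident estimator can outweigh two lukewarm ones.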
Training pipeline
The full training pipeline is an ImbPipeline (imblearn) that chains three stages:
- SMOTE: Synthetic Minority Over-sampling Technique. Generates synthetic samples for under-represented attack classes so every class has equal representation during training.
- StandardScaler: Normalizes each feature to zero mean and unit variance. This matters most for the MLP estimator; the tree-based Random Forest and XGBoost estimators are largely insensitive to feature scaling.
- VotingClassifier: The ensemble described above.
SMOTE is applied only at training time. The serialized pipeline object does not re-apply SMOTE on inference calls: predict() and predict_proba() skip directly to the scaler and classifier steps.
Feature set
The model trains on six features, all numeric.
SelectKBest(f_classif, k=6) confirms these as the six most statistically relevant features via ANOVA F-test. The selected feature list is saved to features_seleccionadas.pkl to guarantee consistent column ordering at inference time.
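The selection and persistence step might look like the sketch below. The column names (feature_0 to feature_7), the synthetic data, and the use of pickle for the saved list are assumptions made for illustration; the real feature names are defined in CEREBRO.py.

```python
import pickle

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Hypothetical candidate columns standing in for the real feature names.
candidate_names = [f"feature_{i}" for i in range(8)]
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 8))
# Make the first six columns depend on the class; the last two stay noise.
X[:, :6] += y[:, None] * np.linspace(1.0, 2.0, 6)

# Score every column with the ANOVA F-test and keep the six best.
selector = SelectKBest(score_func=f_classif, k=6).fit(X, y)
selected = [
    name for name, keep in zip(candidate_names, selector.get_support()) if keep
]

# Persist the ordered list, then restore it as an inference script might.
blob = pickle.dumps(selected)
restored = pickle.loads(blob)
print(restored)
```

get_support() returns a mask in the original column order, so the saved list preserves the order the model was trained on.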
Label encoding
Three LabelEncoder instances map string values to integers:
| Encoder | Column | Example mapping |
|---|---|---|
| protocol_encoder | protocol | "TCP" → 0, "UDP" → 1 |
| flag_encoder | flag | "S" → some integer, "SA" → another |
| tipo_ataque_encoder | tipo_ataque (target) | "Normal" → 1, "SYN Flood" → 3 |
The fitted encoders are saved as .pkl files and loaded by ids.py at startup so that inference uses the identical vocabulary.
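A short sketch of why the same fitted encoder must be reused: LabelEncoder assigns codes in sorted order of the labels it saw during fit(), so the integers depend entirely on the training vocabulary. The sample values here are illustrative, not taken from the real dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the training vocabulary (illustrative values).
protocol_encoder = LabelEncoder().fit(["UDP", "TCP", "TCP", "UDP"])

# classes_ is sorted, so codes follow the sorted order of the labels.
mapping = {
    str(label): code for code, label in enumerate(protocol_encoder.classes_)
}
print(mapping)

# At inference, the same fitted encoder yields the codes used in training.
codes = protocol_encoder.transform(["UDP", "TCP"])

# Labels that were never seen during fit() are rejected with ValueError.
try:
    protocol_encoder.transform(["ICMP"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
```

Since an unseen label raises ValueError, the inference path needs to handle values outside the training vocabulary.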
ip_to_int conversion
During training, IP addresses are converted to 32-bit unsigned integers using the standard network-byte-order representation:
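One common way to implement such a conversion is sketched here. The body of ip_to_int is not shown in this text, so the use of socket.inet_aton and struct.unpack is an assumption.

```python
import socket
import struct


def ip_to_int(ip_str: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit unsigned integer value."""
    packed = socket.inet_aton(ip_str)      # 4 bytes in network (big-endian) order
    return struct.unpack("!I", packed)[0]  # "!I" = big-endian unsigned 32-bit int


print(ip_to_int("192.168.1.1"))  # 3232235777
```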
For example, "192.168.1.1" → 3232235777.
At inference time, ids.py uses hash(ip_str) % (10**8) instead of ip_to_int(). This is a deliberate design choice to avoid a dependency on socket in the hot path. The resulting integers differ from the training values and, because Python salts str hashes per process unless PYTHONHASHSEED is fixed, they also change between runs; the design relies on the model generalising well enough to accept them.
Attack classes
The model distinguishes seven output classes:
| Class | Description |
|---|---|
| Normal | Legitimate traffic, no threat |
| SYN Flood | TCP SYN packet storm exhausting server connection tables |
| DDoS Distribuido | High-volume traffic from multiple sources to one destination |
| PORT scanner | Systematic probing of many destination ports |
| Posible Exploit | Connection attempt to historically vulnerable ports (SMB, RDP, FTP, …) |
| UDP Flood | Volumetric UDP packet storm |
| Inyección SQL | HTTP payload containing SQL injection patterns |
Labels with too few samples are grouped into an "Otros" bucket during training to reduce extreme class imbalance.
clasificar_ataque_ml — inference
The function returns a (tipo_str, confianza) tuple where confianza is the maximum probability across all classes (0.0 to 1.0).
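A minimal sketch of how such a function could derive that tuple from a fitted pipeline and the target encoder. The argument names and the FakeModel and FakeEncoder stand-ins are hypothetical; the real signature lives in ids.py and may differ.

```python
def clasificar_ataque_ml(features, model, label_encoder):
    """Return (tipo_str, confianza) for one event.

    `features` is one row of numeric values in training column order;
    `model` and `label_encoder` are the unpickled artifacts (assumed arguments).
    """
    proba = model.predict_proba([features])[0]
    best = max(range(len(proba)), key=lambda i: proba[i])
    # Columns of predict_proba follow model.classes_, which hold encoded labels.
    encoded = model.classes_[best]
    tipo_str = label_encoder.inverse_transform([encoded])[0]
    return str(tipo_str), float(proba[best])


# Tiny stand-ins so the sketch runs without the real .pkl files.
class FakeModel:
    classes_ = [0, 1, 2]

    def predict_proba(self, rows):
        return [[0.01, 0.006, 0.984] for _ in rows]


class FakeEncoder:
    labels = ["DDoS Distribuido", "Normal", "SYN Flood"]

    def inverse_transform(self, codes):
        return [self.labels[c] for c in codes]


result = clasificar_ataque_ml([0, 1, 2, 3, 4, 5], FakeModel(), FakeEncoder())
print(result)
```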
Confidence threshold
The caller (guardar_ataque) applies a 70% threshold before accepting the ML verdict:
- ≥ 70%: ML label is used. The event is displayed as, e.g., SYN Flood (ML: 98.4%).
- < 70%: The heuristic label that triggered the detector is used instead. The IPS can still block the IP based on heuristic certainty alone.
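The two branches above can be sketched as a small helper. The name choose_label and its arguments are hypothetical; the real logic sits inside guardar_ataque and may be structured differently.

```python
ML_CONFIDENCE_THRESHOLD = 0.70  # the 70% cut-off described above


def choose_label(heuristic_label, ml_result, threshold=ML_CONFIDENCE_THRESHOLD):
    """Use the ML verdict if it is confident enough, else the heuristic label."""
    if ml_result is not None:
        tipo_str, confianza = ml_result
        if confianza >= threshold:
            return f"{tipo_str} (ML: {confianza:.1%})"
    return heuristic_label


print(choose_label("Posible Exploit", ("SYN Flood", 0.984)))  # SYN Flood (ML: 98.4%)
print(choose_label("PORT scanner", ("Normal", 0.55)))         # PORT scanner
```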
Model files
All five artifacts are placed in the project root directory alongside ids.py:
| File | Contents |
|---|---|
| modelo_ensamble_optimizado.pkl | Full ImbPipeline (SMOTE → StandardScaler → VotingClassifier) |
| features_seleccionadas.pkl | list[str] of the six selected feature names in training order |
| flag_encoder.pkl | LabelEncoder fitted on TCP flag strings |
| protocol_encoder.pkl | LabelEncoder fitted on protocol strings (TCP, UDP, …) |
| tipo_ataque_encoder.pkl | LabelEncoder fitted on attack class labels |
Each file is loaded at ids.py startup with a try/except FileNotFoundError. If any file is missing, the corresponding variable is set to None and the system falls back to heuristic-only detection without crashing.
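A hedged sketch of that loading pattern follows. The helper names load_artifact and load_all, the dictionary keys, and the use of pickle (rather than, say, joblib) are assumptions; only the file names come from the table above.

```python
import pickle
import tempfile
from pathlib import Path

ARTIFACT_FILES = {
    "modelo": "modelo_ensamble_optimizado.pkl",
    "features": "features_seleccionadas.pkl",
    "flag_encoder": "flag_encoder.pkl",
    "protocol_encoder": "protocol_encoder.pkl",
    "tipo_ataque_encoder": "tipo_ataque_encoder.pkl",
}


def load_artifact(path):
    """Unpickle one artifact, or return None if the file does not exist."""
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        return None


def load_all(base_dir):
    """Load every artifact from base_dir; missing files map to None."""
    base = Path(base_dir)
    return {
        name: load_artifact(base / filename)
        for name, filename in ARTIFACT_FILES.items()
    }


# Demo in a temporary directory that contains only one of the five files.
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "features_seleccionadas.pkl", "wb") as fh:
        pickle.dump(["f1", "f2"], fh)
    artifacts = load_all(tmp)

# ML classification is usable only when every artifact was found.
ml_available = all(value is not None for value in artifacts.values())
print(ml_available)
```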