After training, CEREBRO.py evaluates the pipeline on the 20% hold-out test set that was never seen during training. The achieved accuracy is 91.90%.
Evaluation code
```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Generate predictions on the held-out test set
y_pred = pipeline.predict(X_test)

# Overall percentage of correct predictions
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")

# Per-class F1 averaged without frequency weighting
print(f"F1-macro: {f1_score(y_test, y_pred, average='macro')*100:.2f}%")

# Precision, recall, and F1 for every class
# zero_division=1 avoids errors for classes with zero predicted samples
print(classification_report(y_test, y_pred, zero_division=1))
```
To re-run the evaluation, execute:

```shell
python CEREBRO.py
```

The script trains from scratch and prints the accuracy, F1-macro, and the full classification report to stdout.
Metrics
| Metric | Value | What it measures |
|---|---|---|
| accuracy_score | 91.90% | Percentage of all predictions that are correct |
| f1_score(average='macro') | — | Per-class F1 averaged without weighting by class frequency |
| classification_report | — | Precision, recall, and F1 broken down per class |
| confusion_matrix | — | True/false positives and negatives for each class pair |
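The table above lists confusion_matrix, which the evaluation snippet does not print. A minimal sketch of how it breaks errors down per class pair, using toy labels rather than the CEREBRO.py pipeline:

```python
# Sketch: confusion matrix for a multi-class classifier (illustrative data,
# not the CEREBRO.py test set).
from sklearn.metrics import confusion_matrix

y_test = ["Normal", "Normal", "SYN Flood", "UDP Flood", "Normal", "SYN Flood"]
y_pred = ["Normal", "SYN Flood", "SYN Flood", "UDP Flood", "Normal", "SYN Flood"]

labels = ["Normal", "SYN Flood", "UDP Flood"]
cm = confusion_matrix(y_test, y_pred, labels=labels)
# Row i = true class, column j = predicted class; off-diagonal cells are errors.
print(cm)
```

Here the single off-diagonal cell shows one Normal record misclassified as SYN Flood; everything on the diagonal was predicted correctly.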
Why macro F1 matters
With a dataset where 60% of records are Normal and only 3% are Inyección SQL, a naive classifier that predicts Normal for every packet would achieve 60% accuracy — which looks acceptable but misses every real attack.
Macro F1 averages the per-class F1 scores without weighting, so a minority class like Inyección SQL counts as much toward the final number as Normal does. A high macro F1 confirms the model is genuinely learning the minority classes rather than exploiting the majority class.
SMOTE-balanced training and class_weight='balanced' in the Random Forest are specifically aimed at improving minority-class recall, which directly raises macro F1.
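The accuracy-versus-macro-F1 gap is easy to demonstrate. A toy sketch (invented counts, not the CEREBRO dataset) of the naive majority-class predictor described above:

```python
# Sketch: why accuracy misleads on imbalanced data. A classifier that always
# predicts "Normal" scores 60% accuracy on a 60%-Normal dataset, but its
# macro F1 collapses because every attack class gets an F1 of zero.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["Normal"] * 60 + ["Inyección SQL"] * 3 + ["SYN Flood"] * 37
y_pred = ["Normal"] * 100  # naive majority-class predictor

print(accuracy_score(y_true, y_pred))  # 0.6
# zero_division=0 scores never-predicted classes as 0 instead of erroring
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.25
```

Accuracy looks tolerable at 0.6, but macro F1 is 0.25: only Normal contributes a nonzero F1 (0.75), averaged over three classes.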
Class labels
After label encoding and the minority-class grouping step in CEREBRO.py, the model operates on the following class space:
| Label | Description | Grouping note |
|---|---|---|
| Normal | Legitimate network traffic | — |
| SYN Flood | High-rate SYN packets targeting one host | — |
| DDoS Distribuido | Volumetric flood from many source IPs | — |
| PORT scanner | Sequential port probing from persistent IPs | — |
| Posible Exploit | Traffic to known exploit ports (SMB, RDP, FTP, Telnet) | — |
| UDP Flood | High-rate UDP to high port range | — |
| Inyección SQL | TCP traffic to database ports | — |
| Otros | Aggregated minority classes (encoded 0, 2, 3) | Remapped to class 9 |
The Otros class is an internal grouping label. During live inference the IDS label shown in the SOC panel reflects the heuristic classification when the ML confidence is below 70% (see confidence thresholding below).
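The grouping step can be sketched as a simple label remap. This is a minimal illustration under the assumption stated in the table (classes encoded 0, 2, and 3 collapse into class 9); the actual implementation lives in CEREBRO.py:

```python
# Sketch: remap minority-class codes 0, 2, 3 into the aggregated "Otros"
# class, encoded as 9 (per the class table above). Toy label array.
import numpy as np

y_encoded = np.array([0, 1, 2, 5, 3, 7, 0])
otros_codes = [0, 2, 3]
y_grouped = np.where(np.isin(y_encoded, otros_codes), 9, y_encoded)
print(y_grouped)  # [9 1 9 5 9 7 9]
```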
Confidence thresholding at inference time
The model returns a probability vector for each prediction. The live IDS in ids.py applies a 70% confidence threshold to decide which classification to trust:
| ML confidence | Action |
|---|---|
| ≥ 70% | ML label is used for the event; block decision follows ML verdict |
| < 70% | Heuristic label takes over; if the heuristic confirms a critical attack (SYN Flood, Exploit, etc.) the IP is blocked immediately |
This prevents the model’s uncertainty from creating a gap where neither classifier acts. Heuristic rules are deterministic and always produce a verdict, so they serve as a reliable fallback.
The live SOC panel displays the ML confidence alongside the label, for example (ML: 98.4%). Events where confidence is below 70% are marked with the heuristic label instead, making the source of the classification transparent to the operator.
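The threshold logic above can be sketched as follows. The function name resolve_label and the heuristic_label argument are hypothetical stand-ins; the real decision logic lives in ids.py:

```python
# Sketch of the 70% confidence threshold: trust the ML label when the top
# class probability clears the threshold, otherwise fall back to the
# deterministic heuristic label. Names are illustrative, not from ids.py.
import numpy as np

CONFIDENCE_THRESHOLD = 0.70

def resolve_label(proba, classes, heuristic_label):
    """Return (label, source) for one prediction's probability vector."""
    confidence = float(proba.max())
    if confidence >= CONFIDENCE_THRESHOLD:
        return classes[int(proba.argmax())], f"ML: {confidence:.1%}"
    return heuristic_label, "heuristic"

classes = ["Normal", "SYN Flood", "UDP Flood"]
print(resolve_label(np.array([0.05, 0.92, 0.03]), classes, "Normal"))
print(resolve_label(np.array([0.40, 0.35, 0.25]), classes, "SYN Flood"))
```

The first call is confident enough to use the ML verdict; the second falls below the threshold, so the heuristic label wins and the source is marked accordingly, mirroring what the SOC panel displays.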
Limitations and considerations
- Synthetic training data: The 91.90% accuracy is measured on held-out synthetic data generated by the same distribution used for training. Performance on real network captures may differ and should be validated once the system is deployed in a live environment.
- Class grouping: Remapping minority classes to Otros improves overall stability but means the model cannot distinguish between the individual grouped attack types. If those distinctions become operationally important, the training data for those classes should be expanded before removing the grouping.
- Retraining cadence: As guardar_dataset.py accumulates real events from live operation, periodically retraining with the merged dataset will improve the model's alignment with actual network patterns.
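A minimal sketch of that merge step, assuming both the synthetic dataset and the events captured by guardar_dataset.py share the same columns (the DataFrames here are illustrative placeholders for the real CSV files):

```python
# Sketch: merge synthetic training data with accumulated live events before
# retraining. Column names and values are invented for illustration.
import pandas as pd

synthetic = pd.DataFrame({"pkt_rate": [10, 900], "label": ["Normal", "SYN Flood"]})
live = pd.DataFrame({"pkt_rate": [15, 850], "label": ["Normal", "SYN Flood"]})

# Concatenate and drop exact duplicate rows, then feed the result into the
# same train/test split and pipeline fit used by CEREBRO.py.
merged = pd.concat([synthetic, live], ignore_index=True).drop_duplicates()
print(len(merged))  # 4
```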