After training, CEREBRO.py evaluates the pipeline on the 20% hold-out test set, which was never seen during training, and achieves 91.90% accuracy.

Evaluation code

from sklearn.metrics import accuracy_score, classification_report, f1_score

# Generate predictions on the held-out test set
y_pred = pipeline.predict(X_test)

# Overall percentage of correct predictions
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")

# Per-class F1 averaged without frequency weighting
print(f"F1-macro: {f1_score(y_test, y_pred, average='macro')*100:.2f}%")

# Precision, recall, and F1 for every class
# zero_division=1: reports 1.0 instead of emitting a warning for classes with no predicted samples
print(classification_report(y_test, y_pred, zero_division=1))
To re-run evaluation, execute:
python CEREBRO.py
The script trains from scratch and prints accuracy, F1-macro, and the full classification report to stdout.

Metrics

| Metric | Value | What it measures |
| --- | --- | --- |
| accuracy_score | 91.90% | Percentage of all predictions that are correct |
| f1_score(average='macro') | | Per-class F1 averaged without weighting by class frequency |
| classification_report | | Precision, recall, and F1 broken down per class |
| confusion_matrix | | True/false positives and negatives for each class pair |
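The confusion matrix is the only metric above not printed by the evaluation snippet; a minimal sketch on toy labels (the two class names here are illustrative, not the project's full class space):

```python
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (illustrative only)
y_true = ["Normal", "Normal", "SYN Flood", "SYN Flood", "Normal"]
y_pred = ["Normal", "SYN Flood", "SYN Flood", "SYN Flood", "Normal"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=["Normal", "SYN Flood"])
print(cm)
# Row 0: 2 Normal classified correctly, 1 mislabeled as SYN Flood
# Row 1: both SYN Flood samples caught
```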

Why macro F1 matters

With a dataset where 60% of records are Normal and only 3% are Inyección SQL, a naive classifier that predicts Normal for every packet would achieve 60% accuracy — which looks acceptable but misses every real attack. Macro F1 averages the per-class F1 score without weighting, so a minority class like Inyección SQL contributes equally to the final number as Normal. A high macro F1 confirms the model is genuinely learning minority classes rather than exploiting the majority.
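The gap between the two metrics is easy to reproduce; a minimal sketch with a hypothetical 90/10 class split and a naive classifier that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 90 Normal, 10 attack samples
y_true = ["Normal"] * 90 + ["Inyección SQL"] * 10
# Naive classifier: always predicts the majority class
y_pred = ["Normal"] * 100

print(f"Accuracy: {accuracy_score(y_true, y_pred)*100:.1f}%")  # 90.0%
f1m = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"F1-macro: {f1m*100:.1f}%")  # 47.4% — the missed minority class halves the score
```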
SMOTE-balanced training and class_weight='balanced' in the Random Forest are specifically aimed at improving minority-class recall, which directly raises macro F1.

Class labels

After label encoding and the minority-class grouping step in CEREBRO.py, the model operates on the following class space:
| Label | Description | Grouping note |
| --- | --- | --- |
| Normal | Legitimate network traffic | |
| SYN Flood | High-rate SYN packets targeting one host | |
| DDoS Distribuido | Volumetric flood from many source IPs | |
| PORT scanner | Sequential port probing from persistent IPs | |
| Posible Exploit | Traffic to known exploit ports (SMB, RDP, FTP, Telnet) | |
| UDP Flood | High-rate UDP to high port range | |
| Inyección SQL | TCP traffic to database ports | |
| Otros | Aggregated minority classes (encoded 0, 2, 3) | Remapped to class 9 |
The Otros class is an internal grouping label. During live inference the IDS label shown in the SOC panel reflects the heuristic classification when the ML confidence is below 70% (see confidence thresholding below).
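The grouping step can be sketched as a simple label remap, assuming integer-encoded labels in a NumPy array (the codes 0, 2, 3 → 9 come from the table above; the variable names are illustrative):

```python
import numpy as np

# Integer-encoded labels; codes 0, 2 and 3 are the minority classes
y_encoded = np.array([0, 1, 2, 3, 4, 1, 0, 5])

# Remap the minority codes to the aggregated "Otros" class (code 9)
OTROS = 9
y_grouped = np.where(np.isin(y_encoded, [0, 2, 3]), OTROS, y_encoded)
print(y_grouped)  # [9 1 9 9 4 1 9 5]
```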

Confidence thresholding at inference time

The model returns a probability vector for each prediction. The live IDS in ids.py applies a 70% confidence threshold to decide which classification to trust:
| ML confidence | Action |
| --- | --- |
| ≥ 70% | ML label is used for the event; the block decision follows the ML verdict |
| < 70% | Heuristic label takes over; if the heuristic confirms a critical attack (SYN Flood, Exploit, etc.) the IP is blocked immediately |
This prevents the model’s uncertainty from creating a gap where neither classifier acts. Heuristic rules are deterministic and always produce a verdict, so they serve as a reliable fallback.
The live SOC panel displays the ML confidence alongside the label, for example (ML: 98.4%). Events where confidence is below 70% are marked with the heuristic label instead, making the source of the classification transparent to the operator.
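The decision rule can be sketched as follows; `choose_label`, `heuristic_label`, and the class list are hypothetical stand-ins for the actual logic in ids.py:

```python
import numpy as np

CONF_THRESHOLD = 0.70  # 70% confidence cut-off

def choose_label(proba: np.ndarray, classes: list, heuristic_label: str) -> tuple:
    """Return (label, source): the ML verdict if its top probability
    clears the threshold, otherwise the deterministic heuristic."""
    best = int(np.argmax(proba))
    if proba[best] >= CONF_THRESHOLD:
        return classes[best], "ML"
    return heuristic_label, "heuristic"

classes = ["Normal", "SYN Flood", "UDP Flood"]
# Confident model: the ML label wins
print(choose_label(np.array([0.05, 0.92, 0.03]), classes, "Normal"))
# Uncertain model: the heuristic verdict takes over
print(choose_label(np.array([0.40, 0.35, 0.25]), classes, "SYN Flood"))
```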

Limitations and considerations

  • Synthetic training data: The 91.90% accuracy is measured on held-out synthetic data generated by the same distribution used for training. Performance on real network captures may differ and should be validated once the system is deployed in a live environment.
  • Class grouping: Remapping minority classes to Otros improves overall stability but means the model cannot distinguish between the individual grouped attack types. If those distinctions become operationally important, the training data for those classes should be expanded before removing the grouping.
  • Retraining cadence: As guardar_dataset.py accumulates real events from live operation, periodically retraining with the merged dataset will improve the model’s alignment with actual network patterns.
