After training, CEREBRO.py evaluates the pipeline on the 20% hold-out test set that was never seen during training. The achieved accuracy is 91.90%.
Evaluation code
```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Generate predictions on the held-out test set
y_pred = pipeline.predict(X_test)

# Overall percentage of correct predictions
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")

# Per-class F1 averaged without frequency weighting
print(f"F1-macro: {f1_score(y_test, y_pred, average='macro')*100:.2f}%")

# Precision, recall, and F1 for every class
# zero_division=1 avoids errors for classes with zero predicted samples
print(classification_report(y_test, y_pred, zero_division=1))
```
To re-run the evaluation, execute:

```shell
python CEREBRO.py
```

The script trains from scratch and prints the accuracy, F1-macro, and the full classification report to stdout.
Metrics
| Metric | Value | What it measures |
|---|---|---|
| accuracy_score | 91.90% | Percentage of all predictions that are correct |
| f1_score(average='macro') | — | Per-class F1 averaged without weighting by class frequency |
| classification_report | — | Precision, recall, and F1 broken down per class |
| confusion_matrix | — | True/false positives and negatives for each class pair |
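The table above lists confusion_matrix, which the evaluation snippet does not print. A minimal sketch of how it breaks errors down per class pair, using toy labels rather than the CEREBRO.py pipeline:

```python
# Sketch: confusion matrix for a multi-class classifier (illustrative data,
# not the CEREBRO.py test set).
from sklearn.metrics import confusion_matrix

y_test = ["Normal", "Normal", "SYN Flood", "UDP Flood", "Normal", "SYN Flood"]
y_pred = ["Normal", "SYN Flood", "SYN Flood", "UDP Flood", "Normal", "SYN Flood"]

labels = ["Normal", "SYN Flood", "UDP Flood"]
cm = confusion_matrix(y_test, y_pred, labels=labels)
# Row i = true class, column j = predicted class; off-diagonal cells are errors.
print(cm)
```

Here the single off-diagonal cell shows one Normal record misclassified as SYN Flood; everything on the diagonal was predicted correctly.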
Why macro F1 matters
With a dataset where 60% of records are Normal and only 3% are Inyección SQL, a naive classifier that predicts Normal for every packet would achieve 60% accuracy — which looks acceptable but misses every real attack.
Macro F1 averages the per-class F1 scores without weighting, so a minority class like Inyección SQL counts as much toward the final number as Normal does. A high macro F1 confirms the model is genuinely learning the minority classes rather than exploiting the majority class.
SMOTE-balanced training and class_weight='balanced' in the Random Forest are specifically aimed at improving minority-class recall, which directly raises macro F1.
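The accuracy-versus-macro-F1 gap is easy to demonstrate. A toy sketch (invented counts, not the CEREBRO dataset) of the naive majority-class predictor described above:

```python
# Sketch: why accuracy misleads on imbalanced data. A classifier that always
# predicts "Normal" scores 60% accuracy on a 60%-Normal dataset, but its
# macro F1 collapses because every attack class gets an F1 of zero.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["Normal"] * 60 + ["Inyección SQL"] * 3 + ["SYN Flood"] * 37
y_pred = ["Normal"] * 100  # naive majority-class predictor

print(accuracy_score(y_true, y_pred))  # 0.6
# zero_division=0 scores never-predicted classes as 0 instead of erroring
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.25
```

Accuracy looks tolerable at 0.6, but macro F1 is 0.25: only Normal contributes a nonzero F1 (0.75), averaged over three classes.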
Class labels
After label encoding and the minority-class grouping step in CEREBRO.py, the model operates on the following class space:
| Label | Description | Grouping note |
|---|---|---|
| Normal | Legitimate network traffic | — |
| SYN Flood | High-rate SYN packets targeting one host | — |
| DDoS Distribuido | Volumetric flood from many source IPs | — |
| PORT scanner | Sequential port probing from persistent IPs | — |
| Posible Exploit | Traffic to known exploit ports (SMB, RDP, FTP, Telnet) | — |
| UDP Flood | High-rate UDP to high port range | — |
| Inyección SQL | TCP traffic to database ports | — |
| Otros | Aggregated minority classes (encoded 0, 2, 3) | Remapped to class 9 |
The Otros class is an internal grouping label. During live inference the IDS label shown in the SOC panel reflects the heuristic classification when the ML confidence is below 70% (see confidence thresholding below).
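The grouping step can be sketched as a simple label remap. This is a minimal illustration under the assumption stated in the table (classes encoded 0, 2, and 3 collapse into class 9); the actual implementation lives in CEREBRO.py:

```python
# Sketch: remap minority-class codes 0, 2, 3 into the aggregated "Otros"
# class, encoded as 9 (per the class table above). Toy label array.
import numpy as np

y_encoded = np.array([0, 1, 2, 5, 3, 7, 0])
otros_codes = [0, 2, 3]
y_grouped = np.where(np.isin(y_encoded, otros_codes), 9, y_encoded)
print(y_grouped)  # [9 1 9 5 9 7 9]
```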
Confidence thresholding at inference time
The model returns a probability vector for each prediction. The live IDS in ids.py applies a 70% confidence threshold to decide which classification to trust:
| ML confidence | Action |
|---|---|
| ≥ 70% | ML label is used for the event; block decision follows ML verdict |
| < 70% | Heuristic label takes over; if the heuristic confirms a critical attack (SYN Flood, Exploit, etc.) the IP is blocked immediately |
This prevents the model’s uncertainty from creating a gap where neither classifier acts. Heuristic rules are deterministic and always produce a verdict, so they serve as a reliable fallback.
The live SOC panel displays the ML confidence alongside the label, for example (ML: 98.4%). Events where confidence is below 70% are marked with the heuristic label instead, making the source of the classification transparent to the operator.
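The threshold logic above can be sketched as follows. The function name resolve_label and the heuristic_label argument are hypothetical stand-ins; the real decision logic lives in ids.py:

```python
# Sketch of the 70% confidence threshold: trust the ML label when the top
# class probability clears the threshold, otherwise fall back to the
# deterministic heuristic label. Names are illustrative, not from ids.py.
import numpy as np

CONFIDENCE_THRESHOLD = 0.70

def resolve_label(proba, classes, heuristic_label):
    """Return (label, source) for one prediction's probability vector."""
    confidence = float(proba.max())
    if confidence >= CONFIDENCE_THRESHOLD:
        return classes[int(proba.argmax())], f"ML: {confidence:.1%}"
    return heuristic_label, "heuristic"

classes = ["Normal", "SYN Flood", "UDP Flood"]
print(resolve_label(np.array([0.05, 0.92, 0.03]), classes, "Normal"))
print(resolve_label(np.array([0.40, 0.35, 0.25]), classes, "SYN Flood"))
```

The first call is confident enough to use the ML verdict; the second falls below the threshold, so the heuristic label wins and the source is marked accordingly, mirroring what the SOC panel displays.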
Limitations and considerations
- Synthetic training data: The 91.90% accuracy is measured on held-out synthetic data generated by the same distribution used for training. Performance on real network captures may differ and should be validated once the system is deployed in a live environment.
- Class grouping: Remapping minority classes to Otros improves overall stability but means the model cannot distinguish between the individual grouped attack types. If those distinctions become operationally important, the training data for those classes should be expanded before removing the grouping.
- Retraining cadence: As guardar_dataset.py accumulates real events from live operation, periodically retraining with the merged dataset will improve the model's alignment with actual network patterns.
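A minimal sketch of that merge step, assuming both the synthetic dataset and the events captured by guardar_dataset.py share the same columns (the DataFrames here are illustrative placeholders for the real CSV files):

```python
# Sketch: merge synthetic training data with accumulated live events before
# retraining. Column names and values are invented for illustration.
import pandas as pd

synthetic = pd.DataFrame({"pkt_rate": [10, 900], "label": ["Normal", "SYN Flood"]})
live = pd.DataFrame({"pkt_rate": [15, 850], "label": ["Normal", "SYN Flood"]})

# Concatenate and drop exact duplicate rows, then feed the result into the
# same train/test split and pipeline fit used by CEREBRO.py.
merged = pd.concat([synthetic, live], ignore_index=True).drop_duplicates()
print(len(merged))  # 4
```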