The ML model (CEREBRO.py) trains a soft-voting ensemble of three classifiers on a labeled network-traffic dataset, then serializes the complete pipeline to disk. At runtime, ids.py loads those serialized artifacts and calls clasificar_ataque_ml() for every detected event.

Architecture

The model is a VotingClassifier with voting='soft', meaning each sub-estimator outputs class probabilities, and the final prediction is the class with the highest averaged probability across all three estimators.
CEREBRO.py
ensemble_model = VotingClassifier(
    estimators=[
        ('rf',  RandomForestClassifier(
                    n_estimators=100,
                    class_weight='balanced',
                    random_state=42)),
        ('mlp', MLPClassifier(
                    hidden_layer_sizes=(64, 64),
                    max_iter=300,
                    early_stopping=True,
                    random_state=42)),
        ('xgb', XGBClassifier(
                    eval_metric='mlogloss',
                    use_label_encoder=False,
                    random_state=42))
    ],
    voting='soft',
    n_jobs=-1
)
| Estimator | Strengths |
| --- | --- |
| RandomForestClassifier(n_estimators=100, class_weight='balanced') | Robust to noise; handles class imbalance with internal weighting |
| MLPClassifier(hidden_layer_sizes=(64, 64), early_stopping=True) | Captures non-linear relationships; early stopping prevents overfitting |
| XGBClassifier(eval_metric='mlogloss') | High accuracy on tabular data; gradient boosting with log-loss optimization |
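The mechanics of soft voting can be illustrated by hand: average the three estimators' per-class probability vectors and take the argmax. The probability values below are made up purely for illustration.

```python
import numpy as np

# Made-up per-class probability outputs from three sub-estimators for one
# sample (columns: Normal, SYN Flood, Port Scan) -- illustrative only.
probs_rf  = np.array([0.10, 0.70, 0.20])
probs_mlp = np.array([0.20, 0.60, 0.20])
probs_xgb = np.array([0.05, 0.80, 0.15])

# Soft voting: average the probability vectors, then take the argmax.
avg = np.mean([probs_rf, probs_mlp, probs_xgb], axis=0)
predicted_class = int(np.argmax(avg))  # class index 1 wins here
```

Note that a class can win the soft vote even if it is not every estimator's top choice, as long as its averaged probability is highest.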

Training pipeline

The full training pipeline is an ImbPipeline (imblearn) that chains three stages:
CEREBRO.py
pipeline = ImbPipeline([
    ('smote',  SMOTE(random_state=42)),
    ('scaler', StandardScaler()),
    ('clf',    ensemble_model)
])
  1. SMOTE — Synthetic Minority Over-sampling Technique. Generates synthetic samples for under-represented attack classes so every class has equal representation during training.
  2. StandardScaler — Normalizes each feature to zero mean and unit variance. Required by the MLP estimator; the tree-based estimators (Random Forest, XGBoost) are insensitive to monotonic feature scaling, so the step is harmless but not strictly needed for them.
  3. VotingClassifier — The ensemble described above.
SMOTE is applied only at training time. The serialized pipeline object does not re-apply SMOTE on inference calls — predict() and predict_proba() skip directly to the scaler and classifier steps.

Feature set

The model trains on six features, all numeric:
CEREBRO.py
features = [
    'src_ip_int',        # Source IP as 32-bit integer
    'dst_ip_int',        # Destination IP as 32-bit integer
    'dst_port',          # Destination port number
    'protocol_encoded',  # Protocol label-encoded (TCP=0, UDP=1, ...)
    'flag_encoded',      # TCP flag label-encoded (S=0, A=1, SA=2, ...)
    'hour'               # Hour of day extracted from packet timestamp (0–23)
]
SelectKBest(f_classif, k=6) confirms these as the six most statistically relevant features via ANOVA F-test. The selected feature list is saved to features_seleccionadas.pkl to guarantee consistent column ordering at inference time.
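The persisted feature list is what guarantees identical column ordering between training and inference. A minimal sketch of the round-trip using stdlib pickle and a temp directory (CEREBRO.py may use joblib instead; the effect is the same):

```python
import os
import pickle
import tempfile

# The six training features, in order (inference must match this exactly).
features = ['src_ip_int', 'dst_ip_int', 'dst_port',
            'protocol_encoded', 'flag_encoded', 'hour']

# Persist the ordered list; ids.py loads it to rebuild columns identically.
path = os.path.join(tempfile.mkdtemp(), 'features_seleccionadas.pkl')
with open(path, 'wb') as f:
    pickle.dump(features, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)  # same names, same order
```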

Label encoding

Three LabelEncoder instances map string values to integers:
| Encoder | Column | Example mapping |
| --- | --- | --- |
| protocol_encoder | protocol | "TCP" → 0, "UDP" → 1 |
| flag_encoder | flag | "S" → some integer, "SA" → another |
| tipo_ataque_encoder | tipo_ataque (target) | "Normal" → 1, "SYN Flood" → 3 |
All three encoders are fit during training and saved to .pkl files. They are loaded by ids.py at startup so that inference uses the identical vocabulary.
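Conceptually, a fitted LabelEncoder is just a stable string-to-integer mapping plus its inverse. This dependency-free sketch mimics sklearn's behavior (it sorts the observed classes before assigning consecutive integers), here for the two protocols from the table:

```python
# Mimic a fitted LabelEncoder: sorted classes, consecutive integer codes.
classes = sorted({"UDP", "TCP"})                   # ['TCP', 'UDP']
to_int = {c: i for i, c in enumerate(classes)}     # transform analogue
from_int = {i: c for c, i in to_int.items()}       # inverse_transform analogue

encoded = [to_int[p] for p in ["TCP", "UDP", "TCP"]]
```

Because the integer codes depend on the sorted vocabulary seen at fit time, re-fitting an encoder on different data silently changes the mapping, which is why the fitted encoders are serialized and reloaded rather than re-created.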

ip_to_int conversion

During training, IP addresses are converted to 32-bit unsigned integers using the standard network-byte-order representation:
CEREBRO.py
def ip_to_int(ip):
    try:
        # socket.inet_aton: IP string → 4 network-order bytes
        # struct.unpack("!I", ...): 4 bytes → unsigned 32-bit int
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # Invalid IP defaults to 0
Example: "192.168.1.1" → 3232235777.
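For debugging, the conversion is easy to invert. The int_to_ip helper below is hypothetical (not part of CEREBRO.py) and simply reverses ip_to_int:

```python
import socket
import struct

def ip_to_int(ip):
    # Same conversion as CEREBRO.py: dotted quad -> unsigned 32-bit int.
    try:
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # Invalid IP defaults to 0

def int_to_ip(n):
    # Hypothetical inverse (not in CEREBRO.py): 32-bit int -> dotted quad.
    return socket.inet_ntoa(struct.pack("!I", n))
```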
At inference time, ids.py uses hash(ip_str) % (10**8) instead of ip_to_int(). This is a deliberate design choice to avoid a dependency on socket in the hot path. Note that Python salts string hashes per process (see PYTHONHASHSEED), so these integers differ not only from the training values but also between runs; the model nevertheless generalises well enough to accept them.

Attack classes

The model distinguishes seven output classes:
| Class | Description |
| --- | --- |
| Normal | Legitimate traffic; no threat |
| SYN Flood | TCP SYN packet storm exhausting server connection tables |
| DDoS Distribuido | High-volume traffic from multiple sources to one destination |
| PORT scanner | Systematic probing of many destination ports |
| Posible Exploit | Connection attempt to historically vulnerable ports (SMB, RDP, FTP, …) |
| UDP Flood | Volumetric UDP packet storm |
| Inyección SQL | HTTP payload containing SQL injection patterns |
Minority classes were grouped into a catch-all "Otros" bucket during training to reduce extreme class imbalance.

clasificar_ataque_ml — inference

ids.py
def clasificar_ataque_ml(ip_src, ip_dst, puerto, protocolo, flag):
Returns a (tipo_str, confianza) tuple where confianza is the maximum probability across all classes (0.0 – 1.0).
ids.py
# The string inputs are first mapped through the LabelEncoders loaded at
# startup before the feature row is built.
protocolo_encoded = protocol_encoder.transform([protocolo])[0]
flag_encoded      = flag_encoder.transform([flag])[0]

df_entrada = pd.DataFrame({
    'src_ip_int':       [hash(str(ip_src)) % (10**8)],
    'dst_ip_int':       [hash(str(ip_dst)) % (10**8)],
    'dst_port':         [puerto],
    'protocol_encoded': [protocolo_encoded],
    'flag_encoded':     [flag_encoded],
    'hour':             [pd.Timestamp.now().hour]
})

tipo_pred = modelo_ml.predict(df_entrada)[0]
probs     = modelo_ml.predict_proba(df_entrada)[0]
confianza = probs.max()
tipo_str  = tipo_ataque_encoder.inverse_transform([tipo_pred])[0]

Confidence threshold

The caller (guardar_ataque) applies a 70% threshold before accepting the ML verdict:
ids.py
if pred_ml and pred_ml != "Normal" and confianza >= 0.70:
    tipo_final = f"{pred_ml} (ML: {confianza*100:.1f}%)"
else:
    tipo_final = f"{tipo_ataque} (Heurística)"
  • ≥ 70% — ML label is used. The event is displayed as e.g. SYN Flood (ML: 98.4%).
  • < 70% — The heuristic label that triggered the detector is used instead. The IPS can still block the IP based on heuristic certainty alone.
The 70% threshold was chosen based on the model’s 91.90% training accuracy. Lowering it increases ML-labeled detections but may also increase false positives in edge cases.
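The threshold logic can be factored into a small pure function. This is a sketch of the branch shown above (guardar_ataque inlines it rather than defining a helper):

```python
def etiqueta_final(pred_ml, confianza, tipo_ataque, umbral=0.70):
    # Accept the ML verdict only when it is non-Normal and clears the
    # confidence threshold; otherwise fall back to the heuristic label
    # that originally triggered the detector.
    if pred_ml and pred_ml != "Normal" and confianza >= umbral:
        return f"{pred_ml} (ML: {confianza*100:.1f}%)"
    return f"{tipo_ataque} (Heurística)"
```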

Model files

All five artifacts are placed in the project root directory alongside ids.py:
| File | Contents |
| --- | --- |
| modelo_ensamble_optimizado.pkl | Full ImbPipeline (SMOTE → StandardScaler → VotingClassifier) |
| features_seleccionadas.pkl | list[str] of the six selected feature names in training order |
| flag_encoder.pkl | LabelEncoder fitted on TCP flag strings |
| protocol_encoder.pkl | LabelEncoder fitted on protocol strings (TCP, UDP, …) |
| tipo_ataque_encoder.pkl | LabelEncoder fitted on attack class labels |
Each file is loaded at ids.py startup with a try/except FileNotFoundError. If any file is missing, the corresponding variable is set to None and the system falls back to heuristic-only detection without crashing.
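The fail-soft loading pattern can be sketched as follows (assuming pickle; ids.py may use joblib, whose load call fails the same way on a missing file):

```python
import pickle

def cargar_artefacto(ruta):
    # Load one serialized artifact; return None when the file is missing so
    # the caller can degrade to heuristic-only detection instead of crashing.
    try:
        with open(ruta, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None
```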
Deleting or replacing a single .pkl file without regenerating the others will cause a mismatch between the encoder vocabularies and the model’s class indices, leading to incorrect label decoding. Always regenerate all five files together by re-running CEREBRO.py.
