The ML model (CEREBRO.py) trains a soft-voting ensemble of three classifiers on a labeled network-traffic dataset, then serializes the complete pipeline to disk. At runtime, ids.py loads those serialized artifacts and calls clasificar_ataque_ml() for every detected event.

Architecture

The model is a VotingClassifier with voting='soft', meaning each sub-estimator outputs class probabilities, and the final prediction is the class with the highest averaged probability across all three estimators.
CEREBRO.py
ensemble_model = VotingClassifier(
    estimators=[
        ('rf',  RandomForestClassifier(
                    n_estimators=100,
                    class_weight='balanced',
                    random_state=42)),
        ('mlp', MLPClassifier(
                    hidden_layer_sizes=(64, 64),
                    max_iter=300,
                    early_stopping=True,
                    random_state=42)),
        ('xgb', XGBClassifier(
                    eval_metric='mlogloss',
                    use_label_encoder=False,
                    random_state=42))
    ],
    voting='soft',
    n_jobs=-1
)
| Estimator | Strengths |
| --- | --- |
| RandomForestClassifier(n_estimators=100, class_weight='balanced') | Robust to noise; handles class imbalance with internal weighting |
| MLPClassifier(hidden_layer_sizes=(64, 64), early_stopping=True) | Captures non-linear relationships; early stopping prevents overfitting |
| XGBClassifier(eval_metric='mlogloss') | High accuracy on tabular data; gradient boosting with log-loss optimization |
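The mechanics of soft voting can be illustrated by hand: average the three estimators' per-class probability vectors and take the argmax. The probability values below are made up purely for illustration.

```python
import numpy as np

# Made-up per-class probability outputs from three sub-estimators for one
# sample (columns: Normal, SYN Flood, Port Scan) -- illustrative only.
probs_rf  = np.array([0.10, 0.70, 0.20])
probs_mlp = np.array([0.20, 0.60, 0.20])
probs_xgb = np.array([0.05, 0.80, 0.15])

# Soft voting: average the probability vectors, then take the argmax.
avg = np.mean([probs_rf, probs_mlp, probs_xgb], axis=0)
predicted_class = int(np.argmax(avg))  # class index 1 wins here
```

Note that a class can win the soft vote even if it is not every estimator's top choice, as long as its averaged probability is highest.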

Training pipeline

The full training pipeline is an ImbPipeline (imblearn) that chains three stages:
CEREBRO.py
pipeline = ImbPipeline([
    ('smote',  SMOTE(random_state=42)),
    ('scaler', StandardScaler()),
    ('clf',    ensemble_model)
])
  1. SMOTE — Synthetic Minority Over-sampling Technique. Generates synthetic samples for under-represented attack classes so every class has equal representation during training.
  2. StandardScaler — Normalizes each feature to zero mean and unit variance. Required by the MLP estimator; the tree-based estimators (Random Forest, XGBoost) are insensitive to monotonic feature scaling, so the step is harmless but not strictly needed for them.
  3. VotingClassifier — The ensemble described above.
SMOTE is applied only at training time. The serialized pipeline object does not re-apply SMOTE on inference calls — predict() and predict_proba() skip directly to the scaler and classifier steps.

Feature set

The model trains on six features, all numeric:
CEREBRO.py
features = [
    'src_ip_int',        # Source IP as 32-bit integer
    'dst_ip_int',        # Destination IP as 32-bit integer
    'dst_port',          # Destination port number
    'protocol_encoded',  # Protocol label-encoded (TCP=0, UDP=1, ...)
    'flag_encoded',      # TCP flag label-encoded (S=0, A=1, SA=2, ...)
    'hour'               # Hour of day extracted from packet timestamp (0–23)
]
SelectKBest(f_classif, k=6) confirms these as the six most statistically relevant features via ANOVA F-test. The selected feature list is saved to features_seleccionadas.pkl to guarantee consistent column ordering at inference time.
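The persisted feature list is what guarantees identical column ordering between training and inference. A minimal sketch of the round-trip using stdlib pickle and a temp directory (CEREBRO.py may use joblib instead; the effect is the same):

```python
import os
import pickle
import tempfile

# The six training features, in order (inference must match this exactly).
features = ['src_ip_int', 'dst_ip_int', 'dst_port',
            'protocol_encoded', 'flag_encoded', 'hour']

# Persist the ordered list; ids.py loads it to rebuild columns identically.
path = os.path.join(tempfile.mkdtemp(), 'features_seleccionadas.pkl')
with open(path, 'wb') as f:
    pickle.dump(features, f)

with open(path, 'rb') as f:
    loaded = pickle.load(f)  # same names, same order
```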

Label encoding

Three LabelEncoder instances map string values to integers:
| Encoder | Column | Example mapping |
| --- | --- | --- |
| protocol_encoder | protocol | "TCP" → 0, "UDP" → 1 |
| flag_encoder | flag | "S" → some integer, "SA" → another |
| tipo_ataque_encoder | tipo_ataque (target) | "Normal" → 1, "SYN Flood" → 3 |
All three encoders are fit during training and saved to .pkl files. They are loaded by ids.py at startup so that inference uses the identical vocabulary.
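Conceptually, a fitted LabelEncoder is just a stable string-to-integer mapping plus its inverse. This dependency-free sketch mimics sklearn's behavior (it sorts the observed classes before assigning consecutive integers), here for the two protocols from the table:

```python
# Mimic a fitted LabelEncoder: sorted classes, consecutive integer codes.
classes = sorted({"UDP", "TCP"})                   # ['TCP', 'UDP']
to_int = {c: i for i, c in enumerate(classes)}     # transform analogue
from_int = {i: c for c, i in to_int.items()}       # inverse_transform analogue

encoded = [to_int[p] for p in ["TCP", "UDP", "TCP"]]
```

Because the integer codes depend on the sorted vocabulary seen at fit time, re-fitting an encoder on different data silently changes the mapping, which is why the fitted encoders are serialized and reloaded rather than re-created.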

ip_to_int conversion

During training, IP addresses are converted to 32-bit unsigned integers using the standard network-byte-order representation:
CEREBRO.py
def ip_to_int(ip):
    try:
        # socket.inet_aton: IP string → 4 network-order bytes
        # struct.unpack("!I", ...): 4 bytes → unsigned 32-bit int
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # Invalid IP defaults to 0
Example: "192.168.1.1" → 3232235777.
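For debugging, the conversion is easy to invert. The int_to_ip helper below is hypothetical (not part of CEREBRO.py) and simply reverses ip_to_int:

```python
import socket
import struct

def ip_to_int(ip):
    # Same conversion as CEREBRO.py: dotted quad -> unsigned 32-bit int.
    try:
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # Invalid IP defaults to 0

def int_to_ip(n):
    # Hypothetical inverse (not in CEREBRO.py): 32-bit int -> dotted quad.
    return socket.inet_ntoa(struct.pack("!I", n))
```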
At inference time, ids.py uses hash(ip_str) % (10**8) instead of ip_to_int(). This is a deliberate design choice to avoid a dependency on socket in the hot path. Note that Python salts string hashes per process (see PYTHONHASHSEED), so these integers differ not only from the training values but also between runs; the model nevertheless generalises well enough to accept them.

Attack classes

The model distinguishes seven output classes:
| Class | Description |
| --- | --- |
| Normal | Legitimate traffic; no threat |
| SYN Flood | TCP SYN packet storm exhausting server connection tables |
| DDoS Distribuido | High-volume traffic from multiple sources to one destination |
| PORT scanner | Systematic probing of many destination ports |
| Posible Exploit | Connection attempt to historically vulnerable ports (SMB, RDP, FTP, …) |
| UDP Flood | Volumetric UDP packet storm |
| Inyección SQL | HTTP payload containing SQL injection patterns |
Minority classes were grouped into a catch-all "Otros" bucket during training to reduce extreme class imbalance.

clasificar_ataque_ml — inference

ids.py
def clasificar_ataque_ml(ip_src, ip_dst, puerto, protocolo, flag):
Returns a (tipo_str, confianza) tuple where confianza is the maximum probability across all classes (0.0 – 1.0).
ids.py
# The string inputs are first mapped through the LabelEncoders loaded at
# startup before the feature row is built.
protocolo_encoded = protocol_encoder.transform([protocolo])[0]
flag_encoded      = flag_encoder.transform([flag])[0]

df_entrada = pd.DataFrame({
    'src_ip_int':       [hash(str(ip_src)) % (10**8)],
    'dst_ip_int':       [hash(str(ip_dst)) % (10**8)],
    'dst_port':         [puerto],
    'protocol_encoded': [protocolo_encoded],
    'flag_encoded':     [flag_encoded],
    'hour':             [pd.Timestamp.now().hour]
})

tipo_pred = modelo_ml.predict(df_entrada)[0]
probs     = modelo_ml.predict_proba(df_entrada)[0]
confianza = probs.max()
tipo_str  = tipo_ataque_encoder.inverse_transform([tipo_pred])[0]

Confidence threshold

The caller (guardar_ataque) applies a 70% threshold before accepting the ML verdict:
ids.py
if pred_ml and pred_ml != "Normal" and confianza >= 0.70:
    tipo_final = f"{pred_ml} (ML: {confianza*100:.1f}%)"
else:
    tipo_final = f"{tipo_ataque} (Heurística)"
  • ≥ 70% — ML label is used. The event is displayed as e.g. SYN Flood (ML: 98.4%).
  • < 70% — The heuristic label that triggered the detector is used instead. The IPS can still block the IP based on heuristic certainty alone.
The 70% threshold was chosen based on the model’s 91.90% training accuracy. Lowering it increases ML-labeled detections but may also increase false positives in edge cases.
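The threshold logic can be factored into a small pure function. This is a sketch of the branch shown above (guardar_ataque inlines it rather than defining a helper):

```python
def etiqueta_final(pred_ml, confianza, tipo_ataque, umbral=0.70):
    # Accept the ML verdict only when it is non-Normal and clears the
    # confidence threshold; otherwise fall back to the heuristic label
    # that originally triggered the detector.
    if pred_ml and pred_ml != "Normal" and confianza >= umbral:
        return f"{pred_ml} (ML: {confianza*100:.1f}%)"
    return f"{tipo_ataque} (Heurística)"
```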

Model files

All five artifacts are placed in the project root directory alongside ids.py:
| File | Contents |
| --- | --- |
| modelo_ensamble_optimizado.pkl | Full ImbPipeline (SMOTE → StandardScaler → VotingClassifier) |
| features_seleccionadas.pkl | list[str] of the six selected feature names in training order |
| flag_encoder.pkl | LabelEncoder fitted on TCP flag strings |
| protocol_encoder.pkl | LabelEncoder fitted on protocol strings (TCP, UDP, …) |
| tipo_ataque_encoder.pkl | LabelEncoder fitted on attack class labels |
Each file is loaded at ids.py startup with a try/except FileNotFoundError. If any file is missing, the corresponding variable is set to None and the system falls back to heuristic-only detection without crashing.
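The fail-soft loading pattern can be sketched as follows (assuming pickle; ids.py may use joblib, whose load call fails the same way on a missing file):

```python
import pickle

def cargar_artefacto(ruta):
    # Load one serialized artifact; return None when the file is missing so
    # the caller can degrade to heuristic-only detection instead of crashing.
    try:
        with open(ruta, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None
```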
Deleting or replacing a single .pkl file without regenerating the others will cause a mismatch between the encoder vocabularies and the model’s class indices, leading to incorrect label decoding. Always regenerate all five files together by re-running CEREBRO.py.
