CEREBRO.py is the full end-to-end training pipeline. Run it once after generating the dataset and it produces every .pkl artifact the live IDS needs for inference. The script trains a soft-voting ensemble of Random Forest, MLP, and XGBoost inside an ImbPipeline that handles class balancing and feature scaling automatically.

How to run

python CEREBRO.py
Prerequisite: Dataset/escanerpuertos.csv must exist in the working directory. Generate it first with python generar_dataset.py. The script prints training progress and evaluation metrics to stdout, then writes the model and encoder files to the working directory.

Training pipeline

1. Load and clean data

The CSV is read with latin-1 encoding to handle special characters, and bad lines are skipped rather than raising an error. Rows with any NaN value are dropped before any transformation.
df = pd.read_csv('Dataset/escanerpuertos.csv', encoding='latin-1', on_bad_lines='skip')
df = df.dropna()
2. Feature engineering

IP addresses are not usable as strings. ip_to_int() converts each IP to a 32-bit unsigned integer using the network byte order, producing a numeric value the model can reason about ordinally.
def ip_to_int(ip):
    try:
        # socket.inet_aton: converts IP string to 4 binary bytes
        # struct.unpack("!I", ...): interprets bytes as a 32-bit big-endian unsigned int
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # Invalid or malformed IP — return 0 instead of raising

df['src_ip_int'] = df['src_ip'].apply(ip_to_int)
df['dst_ip_int'] = df['dst_ip'].apply(ip_to_int)
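A quick sanity check of the conversion, runnable on its own (the sample addresses are illustrative):

```python
import socket
import struct

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to a 32-bit unsigned integer."""
    try:
        # inet_aton -> 4 network-order bytes; "!I" -> big-endian uint32
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # malformed IP falls back to 0, as in the training script

print(ip_to_int("10.0.0.1"))   # 167772161 == 10 * 2**24 + 1
print(ip_to_int("not-an-ip"))  # 0 (fallback for invalid input)
```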
The hour field is extracted from the timestamp column to give the model temporal context:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df['hour'] = df['timestamp'].dt.hour.fillna(0).astype(int)
3. Label encoding

The three categorical string columns are encoded to integers using LabelEncoder. Each encoder is fitted once and immediately serialized so inference-time code uses the identical mapping.
protocol_encoder = LabelEncoder()
flag_encoder = LabelEncoder()
tipo_ataque_encoder = LabelEncoder()

df['protocol_encoded'] = protocol_encoder.fit_transform(df['protocol'])
df['flag_encoded'] = flag_encoder.fit_transform(df['flag'])
df['tipo_ataque_encoded'] = tipo_ataque_encoder.fit_transform(df['tipo_ataque'])

joblib.dump(protocol_encoder, 'protocol_encoder.pkl')
joblib.dump(flag_encoder, 'flag_encoder.pkl')
joblib.dump(tipo_ataque_encoder, 'tipo_ataque_encoder.pkl')
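A LabelEncoder only knows the values it saw at fit time; transforming an unseen value raises ValueError, which is why the inference helper later wraps every transform in try/except. A minimal illustration (the TCP/UDP/ICMP values here are examples, not the dataset's full vocabulary):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(['TCP', 'UDP'])                 # classes_ are stored sorted: ['TCP', 'UDP']
udp_code = enc.transform(['UDP'])[0]    # 1, because 'UDP' sorts after 'TCP'

try:
    enc.transform(['ICMP'])             # label never seen during fit
    unseen_raises = False
except ValueError:                      # raised for any unseen label
    unseen_raises = True
```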
4. Class grouping

After encoding, classes with extremely few samples (encoded as 0, 2, and 3) are remapped to a synthetic class 9 labeled “Otros” (“Others”). This reduces extreme imbalance while preserving the meaningful distinctions between the majority attack types.
y_raw = df['tipo_ataque_encoded']
y_raw = y_raw.replace({0: 9, 2: 9, 3: 9})

# Re-fit the encoder on the updated label set
tipo_ataque_encoder = LabelEncoder()
tipo_ataque_encoder.fit(y_raw)
joblib.dump(tipo_ataque_encoder, 'tipo_ataque_encoder.pkl')  # Overwrites the earlier save
The encoder is re-fitted and re-saved after grouping. The final tipo_ataque_encoder.pkl reflects the post-grouping class space, not the original seven classes.
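The effect of the remap on a toy label vector (a pandas sketch; the values stand in for encoded attack types):

```python
import pandas as pd

# Toy encoded labels; 0, 2, and 3 are the rare classes in this example
y = pd.Series([0, 1, 2, 3, 4, 1, 4])
y_grouped = y.replace({0: 9, 2: 9, 3: 9})   # rare classes collapse into class 9
print(y_grouped.tolist())                    # [9, 1, 9, 9, 4, 1, 4]
```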
5. Feature selection

SelectKBest with the ANOVA F-test scores each feature by its statistical relationship to the target variable. With k=6 and six features available, all features are retained — but the selector still computes importance scores and enforces the ordering expected by the downstream pipeline.
features = ['src_ip_int', 'dst_ip_int', 'dst_port',
            'protocol_encoded', 'flag_encoded', 'hour']

X_raw = df[features]
# y_raw already holds the grouped labels produced in the previous step

selector = SelectKBest(score_func=f_classif, k=6)
X_selected = selector.fit_transform(X_raw, y_raw)

selected_features = X_raw.columns[selector.get_support(indices=True)]
The ordered feature vector fed to the model is:
| Position | Feature | Description |
|---|---|---|
| 0 | src_ip_int | Source IP as uint32 |
| 1 | dst_ip_int | Destination IP as uint32 |
| 2 | dst_port | Destination port number |
| 3 | protocol_encoded | TCP/UDP encoded as integer |
| 4 | flag_encoded | TCP flag encoded as integer |
| 5 | hour | Hour of day (0–23) |
6. Train/test split

The dataset is split 80/20. stratify=y_raw ensures that every class keeps the same proportion in both partitions, which is critical with an imbalanced dataset.
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(X_selected, columns=selected_features),
    y_raw,
    test_size=0.2,
    stratify=y_raw,
    random_state=42
)
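What stratification guarantees can be seen on a toy imbalanced label vector: both partitions keep the exact 80/20 class ratio (a sketch, not the project's data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 80 samples of class 0, 20 of class 1 — an 80/20 imbalance
y = [0] * 80 + [1] * 20
X = [[i] for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

train_counts = Counter(y_tr)   # {0: 64, 1: 16} — ratio preserved
test_counts = Counter(y_te)    # {0: 16, 1: 4}  — ratio preserved
```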
7. SMOTE oversampling

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic feature vectors for under-represented classes by interpolating between existing minority samples and their nearest minority-class neighbors. It runs as the first step inside the ImbPipeline, so it only ever sees training data, never the test set. This lets the model train on a balanced class distribution without duplicating real records.
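The interpolation at the heart of SMOTE can be sketched in a few lines (a toy illustration of the idea, not the imblearn implementation, which also performs nearest-neighbor search):

```python
import random

def smote_sample(x, neighbor):
    """Create one synthetic point on the segment between a minority
    sample and one of its nearest minority-class neighbors."""
    lam = random.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
synthetic = smote_sample([1.0, 2.0], [3.0, 4.0])
# Each coordinate of `synthetic` lies between the two parent samples
```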
8. StandardScaler normalization

After SMOTE, all features are standardized to mean 0 and standard deviation 1. This step matters for the MLP, which is sensitive to feature scale and converges faster on standardized inputs. The tree-based models (Random Forest and XGBoost) split on thresholds rather than distances, so they are scale-invariant and unaffected. Scaling happens inside the pipeline, so the scaler’s parameters are learned only from training data and applied consistently at inference time.
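StandardScaler’s transform amounts to the following column-wise operation (a numpy sketch of what the pipeline step computes; the sample matrix is illustrative):

```python
import numpy as np

# Features on wildly different scales, as with uint32 IPs vs. hour of day
X_train = np.array([[1.0, 2.0e9],
                    [3.0, 4.0e9],
                    [5.0, 6.0e9]])

mean = X_train.mean(axis=0)          # learned from training data only
std = X_train.std(axis=0)
X_scaled = (X_train - mean) / std    # each column now has mean 0, std 1
```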
9. VotingClassifier (soft voting)

Three classifiers are combined via soft voting. Each model produces a probability vector over all classes; the vectors are averaged and the class with the highest averaged probability wins.
ensemble_model = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(
            n_estimators=100,
            class_weight='balanced',
            random_state=42
        )),
        ('mlp', MLPClassifier(
            hidden_layer_sizes=(64, 64),
            max_iter=300,
            early_stopping=True,
            random_state=42
        )),
        ('xgb', XGBClassifier(
            eval_metric='mlogloss',
            use_label_encoder=False,
            random_state=42
        ))
    ],
    voting='soft',
    n_jobs=-1
)
| Model | Role |
|---|---|
| Random Forest | Robust to overfitting; handles non-linear relationships; class_weight='balanced' adds an internal correction for imbalance |
| MLP (64×64) | Captures complex interactions between features; early_stopping prevents overfitting on minority classes |
| XGBoost | High accuracy on tabular data; gradient boosting progressively corrects previous errors |
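The soft-voting arithmetic itself is a simple average followed by an argmax (a sketch with made-up probability vectors for one sample over three classes):

```python
import numpy as np

# Hypothetical per-class probability vectors from the three models
p_rf  = np.array([0.70, 0.20, 0.10])
p_mlp = np.array([0.40, 0.50, 0.10])
p_xgb = np.array([0.60, 0.30, 0.10])

avg = (p_rf + p_mlp + p_xgb) / 3   # soft voting: average the probabilities
pred = int(np.argmax(avg))         # class 0 wins with mean probability ~0.567
```

Note that the MLP alone would have predicted class 1; the ensemble overrides it because the other two models agree on class 0.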
10. Fit and save the pipeline

The complete ImbPipeline (SMOTE → StandardScaler → VotingClassifier) is fitted in one call and then serialized alongside the selected feature list.
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('scaler', StandardScaler()),
    ('clf', ensemble_model)
])

pipeline.fit(X_train, y_train)

joblib.dump(pipeline, 'modelo_ensamble_optimizado.pkl')
joblib.dump(selected_features.tolist(), 'features_seleccionadas.pkl')

Output files

After a successful run, the following files are written to the working directory:
| File | Contents |
|---|---|
| modelo_ensamble_optimizado.pkl | Full ImbPipeline: SMOTE + StandardScaler + VotingClassifier |
| features_seleccionadas.pkl | Ordered list of feature names used during training |
| protocol_encoder.pkl | LabelEncoder fitted on protocol values |
| flag_encoder.pkl | LabelEncoder fitted on flag values |
| tipo_ataque_encoder.pkl | LabelEncoder fitted on the post-grouping class labels |
All five .pkl files must be present for inference to work. The live IDS loads them at startup via joblib.load(). If any file is missing or was produced by a different training run, predictions will fail or produce incorrect class names.
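The dump/load round-trip the live IDS relies on can be checked in isolation (a sketch using a temporary directory rather than the real artifact paths):

```python
import os
import tempfile

import joblib

features = ['src_ip_int', 'dst_ip_int', 'dst_port',
            'protocol_encoded', 'flag_encoded', 'hour']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'features_seleccionadas.pkl')
    joblib.dump(features, path)        # what training writes
    restored = joblib.load(path)       # what inference reads back

# `restored` is the same list in the same order — the property the
# feature-vector ordering in preprocesar_datos() depends on
```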

Inference preprocessing

At inference time, preprocesar_datos() in CEREBRO.py transforms a raw packet tuple into the exact feature vector the pipeline expects:
def preprocesar_datos(ip_src, ip_dst, puerto, protocolo, flag, hora):
    protocol_encoder = joblib.load('protocol_encoder.pkl')
    flag_encoder = joblib.load('flag_encoder.pkl')
    selected_features = joblib.load('features_seleccionadas.pkl')

    src_ip_int = ip_to_int(ip_src)
    dst_ip_int = ip_to_int(ip_dst)

    try:
        protocolo_encoded = protocol_encoder.transform([protocolo])[0]
    except ValueError:
        protocolo_encoded = -1  # Unknown protocol — fallback value

    try:
        flag_encoded = flag_encoder.transform([flag])[0]
    except ValueError:
        flag_encoded = -1  # Unknown flag — fallback value

    datos = {
        'src_ip_int': src_ip_int,
        'dst_ip_int': dst_ip_int,
        'dst_port': puerto,
        'protocol_encoded': protocolo_encoded,
        'flag_encoded': flag_encoded,
        'hour': hora
    }

    # Order values to match the exact feature order the model was trained on
    return [datos[feature] for feature in selected_features]
