CEREBRO.py is the full end-to-end training pipeline. Run it once after generating the dataset and it produces every .pkl artifact the live IDS needs for inference. The script trains a soft-voting ensemble of Random Forest, MLP, and XGBoost inside an ImbPipeline that handles class balancing and feature scaling automatically.

How to run

python CEREBRO.py
Prerequisite: Dataset/escanerpuertos.csv must exist in the working directory. Generate it first with python generar_dataset.py. The script prints training progress and evaluation metrics to stdout, then writes the model and encoder files to the working directory.

Training pipeline

1. Load and clean data

The CSV is read with latin-1 encoding to handle special characters, and bad lines are skipped rather than raising an error. Rows with any NaN value are dropped before any transformation.
df = pd.read_csv('Dataset/escanerpuertos.csv', encoding='latin-1', on_bad_lines='skip')
df = df.dropna()
2. Feature engineering

IP addresses are not usable as strings. ip_to_int() converts each IP to a 32-bit unsigned integer using the network byte order, producing a numeric value the model can reason about ordinally.
def ip_to_int(ip):
    try:
        # socket.inet_aton: converts IP string to 4 binary bytes
        # struct.unpack("!I", ...): interprets bytes as a 32-bit big-endian unsigned int
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # Invalid or malformed IP — return 0 instead of raising

df['src_ip_int'] = df['src_ip'].apply(ip_to_int)
df['dst_ip_int'] = df['dst_ip'].apply(ip_to_int)
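A quick sanity check of the conversion, runnable on its own (the sample addresses are illustrative):

```python
import socket
import struct

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to a 32-bit unsigned integer."""
    try:
        # inet_aton -> 4 network-order bytes; "!I" -> big-endian uint32
        return struct.unpack("!I", socket.inet_aton(ip))[0]
    except socket.error:
        return 0  # malformed IP falls back to 0, as in the training script

print(ip_to_int("10.0.0.1"))   # 167772161 == 10 * 2**24 + 1
print(ip_to_int("not-an-ip"))  # 0 (fallback for invalid input)
```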
The hour field is extracted from the timestamp column to give the model temporal context:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df['hour'] = df['timestamp'].dt.hour.fillna(0).astype(int)
3. Label encoding

The three categorical string columns are encoded to integers using LabelEncoder. Each encoder is fitted once and immediately serialized so inference-time code uses the identical mapping.
protocol_encoder = LabelEncoder()
flag_encoder = LabelEncoder()
tipo_ataque_encoder = LabelEncoder()

df['protocol_encoded'] = protocol_encoder.fit_transform(df['protocol'])
df['flag_encoded'] = flag_encoder.fit_transform(df['flag'])
df['tipo_ataque_encoded'] = tipo_ataque_encoder.fit_transform(df['tipo_ataque'])

joblib.dump(protocol_encoder, 'protocol_encoder.pkl')
joblib.dump(flag_encoder, 'flag_encoder.pkl')
joblib.dump(tipo_ataque_encoder, 'tipo_ataque_encoder.pkl')
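A LabelEncoder only knows the values it saw at fit time; transforming an unseen value raises ValueError, which is why the inference helper later wraps every transform in try/except. A minimal illustration (the TCP/UDP/ICMP values here are examples, not the dataset's full vocabulary):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(['TCP', 'UDP'])                 # classes_ are stored sorted: ['TCP', 'UDP']
udp_code = enc.transform(['UDP'])[0]    # 1, because 'UDP' sorts after 'TCP'

try:
    enc.transform(['ICMP'])             # label never seen during fit
    unseen_raises = False
except ValueError:                      # raised for any unseen label
    unseen_raises = True
```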
4. Class grouping

After encoding, classes with extremely few samples (encoded as 0, 2, and 3) are remapped to a synthetic class 9 labeled “Otros” (“Others”). This reduces extreme imbalance while preserving the meaningful distinctions between the majority attack types.
y_raw = df['tipo_ataque_encoded']
y_raw = y_raw.replace({0: 9, 2: 9, 3: 9})

# Re-fit the encoder on the updated label set
tipo_ataque_encoder = LabelEncoder()
tipo_ataque_encoder.fit(y_raw)
joblib.dump(tipo_ataque_encoder, 'tipo_ataque_encoder.pkl')  # Overwrites the earlier save
The encoder is re-fitted and re-saved after grouping. The final tipo_ataque_encoder.pkl reflects the post-grouping class space, not the original seven classes.
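The effect of the remap on a toy label vector (a pandas sketch; the values stand in for encoded attack types):

```python
import pandas as pd

# Toy encoded labels; 0, 2, and 3 are the rare classes in this example
y = pd.Series([0, 1, 2, 3, 4, 1, 4])
y_grouped = y.replace({0: 9, 2: 9, 3: 9})   # rare classes collapse into class 9
print(y_grouped.tolist())                    # [9, 1, 9, 9, 4, 1, 4]
```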
5. Feature selection

SelectKBest with the ANOVA F-test scores each feature by its statistical relationship to the target variable. With k=6 and six features available, all features are retained — but the selector still computes importance scores and enforces the ordering expected by the downstream pipeline.
features = ['src_ip_int', 'dst_ip_int', 'dst_port',
            'protocol_encoded', 'flag_encoded', 'hour']

X_raw = df[features]
# y_raw already holds the grouped labels produced in the previous step

selector = SelectKBest(score_func=f_classif, k=6)
X_selected = selector.fit_transform(X_raw, y_raw)

selected_features = X_raw.columns[selector.get_support(indices=True)]
The ordered feature vector fed to the model is:
| Position | Feature | Description |
|---|---|---|
| 0 | src_ip_int | Source IP as uint32 |
| 1 | dst_ip_int | Destination IP as uint32 |
| 2 | dst_port | Destination port number |
| 3 | protocol_encoded | TCP/UDP encoded as integer |
| 4 | flag_encoded | TCP flag encoded as integer |
| 5 | hour | Hour of day (0–23) |
6. Train/test split

The dataset is split 80/20. stratify=y_raw ensures that every class keeps the same proportion in both partitions, which is critical with an imbalanced dataset.
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(X_selected, columns=selected_features),
    y_raw,
    test_size=0.2,
    stratify=y_raw,
    random_state=42
)
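What stratification guarantees can be seen on a toy imbalanced label vector: both partitions keep the exact 80/20 class ratio (a sketch, not the project's data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# 80 samples of class 0, 20 of class 1 — an 80/20 imbalance
y = [0] * 80 + [1] * 20
X = [[i] for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

train_counts = Counter(y_tr)   # {0: 64, 1: 16} — ratio preserved
test_counts = Counter(y_te)    # {0: 16, 1: 4}  — ratio preserved
```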
7. SMOTE oversampling

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic feature vectors for under-represented classes by interpolating between existing minority samples and their nearest minority-class neighbors. It runs as the first step inside the ImbPipeline, so it only ever sees training data, never the test set. This lets the model train on a balanced class distribution without duplicating real records.
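The interpolation at the heart of SMOTE can be sketched in a few lines (a toy illustration of the idea, not the imblearn implementation, which also performs nearest-neighbor search):

```python
import random

def smote_sample(x, neighbor):
    """Create one synthetic point on the segment between a minority
    sample and one of its nearest minority-class neighbors."""
    lam = random.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
synthetic = smote_sample([1.0, 2.0], [3.0, 4.0])
# Each coordinate of `synthetic` lies between the two parent samples
```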
8. StandardScaler normalization

After SMOTE, all features are standardized to mean 0 and standard deviation 1. This step matters for the MLP, which is sensitive to feature scale and converges faster on standardized inputs. The tree-based models (Random Forest and XGBoost) split on thresholds rather than distances, so they are scale-invariant and unaffected. Scaling happens inside the pipeline, so the scaler’s parameters are learned only from training data and applied consistently at inference time.
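StandardScaler’s transform amounts to the following column-wise operation (a numpy sketch of what the pipeline step computes; the sample matrix is illustrative):

```python
import numpy as np

# Features on wildly different scales, as with uint32 IPs vs. hour of day
X_train = np.array([[1.0, 2.0e9],
                    [3.0, 4.0e9],
                    [5.0, 6.0e9]])

mean = X_train.mean(axis=0)          # learned from training data only
std = X_train.std(axis=0)
X_scaled = (X_train - mean) / std    # each column now has mean 0, std 1
```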
9. VotingClassifier (soft voting)

Three classifiers are combined via soft voting. Each model produces a probability vector over all classes; the vectors are averaged and the class with the highest averaged probability wins.
ensemble_model = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(
            n_estimators=100,
            class_weight='balanced',
            random_state=42
        )),
        ('mlp', MLPClassifier(
            hidden_layer_sizes=(64, 64),
            max_iter=300,
            early_stopping=True,
            random_state=42
        )),
        ('xgb', XGBClassifier(
            eval_metric='mlogloss',
            use_label_encoder=False,
            random_state=42
        ))
    ],
    voting='soft',
    n_jobs=-1
)
| Model | Role |
|---|---|
| Random Forest | Robust to overfitting; handles non-linear relationships; class_weight='balanced' adds an internal correction for imbalance |
| MLP (64×64) | Captures complex interactions between features; early_stopping prevents overfitting on minority classes |
| XGBoost | High accuracy on tabular data; gradient boosting progressively corrects previous errors |
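The soft-voting arithmetic itself is a simple average followed by an argmax (a sketch with made-up probability vectors for one sample over three classes):

```python
import numpy as np

# Hypothetical per-class probability vectors from the three models
p_rf  = np.array([0.70, 0.20, 0.10])
p_mlp = np.array([0.40, 0.50, 0.10])
p_xgb = np.array([0.60, 0.30, 0.10])

avg = (p_rf + p_mlp + p_xgb) / 3   # soft voting: average the probabilities
pred = int(np.argmax(avg))         # class 0 wins with mean probability ~0.567
```

Note that the MLP alone would have predicted class 1; the ensemble overrides it because the other two models agree on class 0.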
10. Fit and save the pipeline

The complete ImbPipeline (SMOTE → StandardScaler → VotingClassifier) is fitted in one call and then serialized alongside the selected feature list.
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('scaler', StandardScaler()),
    ('clf', ensemble_model)
])

pipeline.fit(X_train, y_train)

joblib.dump(pipeline, 'modelo_ensamble_optimizado.pkl')
joblib.dump(selected_features.tolist(), 'features_seleccionadas.pkl')

Output files

After a successful run, the following files are written to the working directory:
| File | Contents |
|---|---|
| modelo_ensamble_optimizado.pkl | Full ImbPipeline: SMOTE + StandardScaler + VotingClassifier |
| features_seleccionadas.pkl | Ordered list of feature names used during training |
| protocol_encoder.pkl | LabelEncoder fitted on protocol values |
| flag_encoder.pkl | LabelEncoder fitted on flag values |
| tipo_ataque_encoder.pkl | LabelEncoder fitted on the post-grouping class labels |
All five .pkl files must be present for inference to work. The live IDS loads them at startup via joblib.load(). If any file is missing or was produced by a different training run, predictions will fail or produce incorrect class names.
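The dump/load round-trip the live IDS relies on can be checked in isolation (a sketch using a temporary directory rather than the real artifact paths):

```python
import os
import tempfile

import joblib

features = ['src_ip_int', 'dst_ip_int', 'dst_port',
            'protocol_encoded', 'flag_encoded', 'hour']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'features_seleccionadas.pkl')
    joblib.dump(features, path)        # what training writes
    restored = joblib.load(path)       # what inference reads back

# `restored` is the same list in the same order — the property the
# feature-vector ordering in preprocesar_datos() depends on
```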

Inference preprocessing

At inference time, preprocesar_datos() in CEREBRO.py transforms a raw packet tuple into the exact feature vector the pipeline expects:
def preprocesar_datos(ip_src, ip_dst, puerto, protocolo, flag, hora):
    protocol_encoder = joblib.load('protocol_encoder.pkl')
    flag_encoder = joblib.load('flag_encoder.pkl')
    selected_features = joblib.load('features_seleccionadas.pkl')

    src_ip_int = ip_to_int(ip_src)
    dst_ip_int = ip_to_int(ip_dst)

    try:
        protocolo_encoded = protocol_encoder.transform([protocolo])[0]
    except ValueError:
        protocolo_encoded = -1  # Unknown protocol — fallback value

    try:
        flag_encoded = flag_encoder.transform([flag])[0]
    except ValueError:
        flag_encoded = -1  # Unknown flag — fallback value

    datos = {
        'src_ip_int': src_ip_int,
        'dst_ip_int': dst_ip_int,
        'dst_port': puerto,
        'protocol_encoded': protocolo_encoded,
        'flag_encoded': flag_encoded,
        'hour': hora
    }

    # Order values to match the exact feature order the model was trained on
    return [datos[feature] for feature in selected_features]
