The IDS/IPS ML pipeline starts with a labeled dataset. Because real labeled network captures are difficult to obtain at sufficient volume, the project uses synthetic data generated by generar_dataset.py. The synthetic traffic is designed to mirror realistic distributions — mostly benign, with several attack classes at minority frequencies — giving the model a representative training signal without requiring a live network tap. A secondary module, guardar_dataset.py, extends this dataset during live operation by appending every detected event to a CSV file, creating a feedback loop for future retraining.

Why synthetic data

Bootstrapping a supervised IDS model requires thousands of labeled examples per attack class. Capturing and hand-labeling real traffic is time-consuming and legally complex. Synthetic generation lets the team:
  • Control class proportions precisely
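  • Keep class proportions stable from run to run (random.choices with fixed weights); reproducing the exact same rows additionally requires seeding the random generator, which the excerpt below does not show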
  • Iterate on feature design without waiting for real attacks
  • Avoid privacy concerns associated with real network captures
Once the system is deployed, guardar_dataset.py logs real events alongside their ML predictions, allowing the dataset to grow organically over time.
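Fixed weights keep the class proportions stable, but they do not by themselves make two runs identical. The following is a minimal sketch of how a seed could make the label sequence repeatable. The sample_labels helper and the seed value are illustrative assumptions, not code taken from generar_dataset.py.

```python
import random

CLASSES = ["Normal", "SYN Flood", "DDoS Distribuido", "PORT scanner",
           "Posible Exploit", "UDP Flood", "Inyección SQL"]
WEIGHTS = [0.60, 0.08, 0.08, 0.08, 0.05, 0.08, 0.03]


def sample_labels(n, seed=None):
    """Hypothetical helper: draw n class labels with the documented weights.

    Passing the same seed yields the same sequence of labels.
    """
    rng = random.Random(seed)  # local generator, global random state is untouched
    return rng.choices(CLASSES, weights=WEIGHTS, k=n)


run_a = sample_labels(1000, seed=42)
run_b = sample_labels(1000, seed=42)
print(run_a == run_b)  # True
```

Using a local random.Random instance rather than random.seed() keeps the seeding from affecting any other code that relies on the global generator.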

generate_dataset() function

Defined in generar_dataset.py, this function builds the full training CSV in a single call.
import random
from datetime import datetime, timedelta

def generate_dataset(num_samples=20000):
    data = []
    base_time = datetime.now() - timedelta(days=7)

    # Simulated target
    target_ip = "172.10.14.181"

    # Persistent scanner IPs for Port Scan class
    scanner_ip_1 = "203.0.113.10"
    scanner_ip_2 = "203.0.113.11"

    for _ in range(num_samples):
        # Realistic class distribution — majority normal traffic
        attack_type = random.choices(
            ["Normal", "SYN Flood", "DDoS Distribuido", "PORT scanner",
             "Posible Exploit", "UDP Flood", "Inyección SQL"],
            weights=[0.60, 0.08, 0.08, 0.08, 0.05, 0.08, 0.03],
            k=1
        )[0]
        ...

Class distribution

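| Class | Weight | Approximate records (of 20 000) |
| --- | --- | --- |
| Normal | 60% | ~12 000 |
| SYN Flood | 8% | ~1 600 |
| DDoS Distribuido | 8% | ~1 600 |
| PORT scanner | 8% | ~1 600 |
| UDP Flood | 8% | ~1 600 |
| Posible Exploit | 5% | ~1 000 |
| Inyección SQL | 3% | ~600 |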
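The 60/40 normal-to-attack split is intended to approximate benign-dominated enterprise traffic: it keeps Normal as the majority class, so the model is not biased toward predicting attacks, while still giving every attack class hundreds of examples. SMOTE is applied later during training to compensate for the remaining minority-class imbalance.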
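The approximate record counts follow directly from weight × num_samples, and a sampled run lands close to them. The sketch below illustrates that arithmetic. The expected_counts helper and the seed are assumptions introduced for the example.

```python
import random
from collections import Counter

CLASS_WEIGHTS = {
    "Normal": 0.60,
    "SYN Flood": 0.08,
    "DDoS Distribuido": 0.08,
    "PORT scanner": 0.08,
    "Posible Exploit": 0.05,
    "UDP Flood": 0.08,
    "Inyección SQL": 0.03,
}


def expected_counts(num_samples):
    """Hypothetical helper: expected records per class (weight * num_samples)."""
    return {name: round(weight * num_samples)
            for name, weight in CLASS_WEIGHTS.items()}


expected = expected_counts(20000)

# Draw one sample of 20 000 labels; the seed is arbitrary and only fixes the demo.
rng = random.Random(7)
sampled = Counter(rng.choices(list(CLASS_WEIGHTS),
                              weights=list(CLASS_WEIGHTS.values()),
                              k=20000))

for name in CLASS_WEIGHTS:
    print(f"{name:18} expected={expected[name]:6} sampled={sampled[name]:6}")
```

Sampled counts differ from the expected values by a small random margin, which is why the generator prints the actual value counts after each run.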

Key generation details

  • Target IP: All generated traffic is directed at 172.10.14.181, the simulated server.
  • Port Scan IPs: Only 203.0.113.10 and 203.0.113.11 appear as source IPs for PORT scanner records, making scanner attribution deterministic and learnable.
  • Timestamp range: Events are spread over a 7-day window (604 800 seconds of jitter) to give the model varied hour values.
  • Output path: Dataset/escanerpuertos.csv (directory is created automatically via os.makedirs).
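The timestamp jitter described above can be sketched as follows. The source excerpt elides how timestamps are actually drawn, so the random_timestamp helper and the use of random.randint are assumptions. Only the 7-day base_time and the output format come from this page.

```python
import random
from datetime import datetime, timedelta

WINDOW_SECONDS = 7 * 24 * 60 * 60  # 604 800 seconds in the 7-day window


def random_timestamp(base_time, rng=random):
    """Hypothetical helper: offset base_time by a random number of seconds
    inside the 7-day window and format it like the dataset's timestamp column."""
    offset = rng.randint(0, WINDOW_SECONDS)
    return (base_time + timedelta(seconds=offset)).strftime("%Y-%m-%d %H:%M:%S")


base_time = datetime.now() - timedelta(days=7)
stamp = random_timestamp(base_time)
print(stamp)
```

Because the offset is uniform over the whole week, every hour of the day is roughly equally represented in the generated records.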

Dataset columns

| Column | Type | Description |
| --- | --- | --- |
| src_ip | string | Source IP address of the connection |
| dst_ip | string | Destination IP address (always 172.10.14.181 in synthetic data) |
| dst_port | integer | Destination port number |
| protocol | string | Transport protocol: TCP or UDP |
| flag | string | TCP control flag (S, A, PA, FA) or N/A for UDP |
| tipo_ataque | string | Ground-truth attack class label |
| timestamp | string | Event datetime in YYYY-MM-DD HH:MM:SS format |

Per-class field values

Each class has distinct field signatures that make it statistically learnable:
| Class | dst_port | protocol | flag |
| --- | --- | --- | --- |
| Normal | 80, 443, 53, 22 | TCP or UDP | A / PA / FA (TCP), N/A (UDP) |
| SYN Flood | 80, 443 | TCP | S |
| DDoS Distribuido | 80, 443, 8080 | TCP or UDP | S (TCP), N/A (UDP) |
| PORT scanner | 1–1024 (random) | TCP | S |
| Posible Exploit | 445, 3389, 21, 23 | TCP | PA |
| UDP Flood | 1024–65535 (random) | UDP | N/A |
| Inyección SQL | 3306, 5432, 1433, 80 | TCP | PA |
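One way to turn the per-class signatures into generated field values is sketched below. This is an illustration built only from the values listed on this page. The make_fields helper is hypothetical, and the project's real generation logic is not shown here and may differ (for example in how ports or flags are weighted).

```python
import random


def make_fields(attack_type, rng):
    """Hypothetical helper: return (dst_port, protocol, flag) for one record,
    following the per-class field values documented above."""
    if attack_type == "Normal":
        protocol = rng.choice(["TCP", "UDP"])
        flag = rng.choice(["A", "PA", "FA"]) if protocol == "TCP" else "N/A"
        return rng.choice([80, 443, 53, 22]), protocol, flag
    if attack_type == "SYN Flood":
        return rng.choice([80, 443]), "TCP", "S"
    if attack_type == "DDoS Distribuido":
        protocol = rng.choice(["TCP", "UDP"])
        flag = "S" if protocol == "TCP" else "N/A"
        return rng.choice([80, 443, 8080]), protocol, flag
    if attack_type == "PORT scanner":
        return rng.randint(1, 1024), "TCP", "S"
    if attack_type == "Posible Exploit":
        return rng.choice([445, 3389, 21, 23]), "TCP", "PA"
    if attack_type == "UDP Flood":
        return rng.randint(1024, 65535), "UDP", "N/A"
    if attack_type == "Inyección SQL":
        return rng.choice([3306, 5432, 1433, 80]), "TCP", "PA"
    raise ValueError(f"Unknown attack type: {attack_type}")


rng = random.Random(0)  # arbitrary seed so the demo is repeatable
print(make_fields("PORT scanner", rng))
print(make_fields("UDP Flood", rng))
```

Note that some signatures overlap (SYN Flood and DDoS Distribuido can both produce TCP/S on port 80 or 443), so per-record fields alone may not separate every class.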

Real-time event logging — guardar_dataset.py

During live IDS operation, every analyzed packet is appended to eventos_detectados.csv by guardar_evento_en_dataset(). This gives the team a growing log of real traffic that can supplement or replace the synthetic dataset in future training runs.
The output path in guardar_dataset.py is hardcoded to C:\Users\Usuario\Desktop\IDS\IDS_unipaz\Dataset. Update the carpeta variable at the top of the function to match your deployment path before using this module.
import csv
import os
import time

def guardar_evento_en_dataset(ip_src, ip_dst, puerto, protocolo, flag,
                              tipo_ataque, tipo_ataque_ml):
    carpeta = r"C:\Users\Usuario\Desktop\IDS\IDS_unipaz\Dataset"
    if not os.path.exists(carpeta):
        os.makedirs(carpeta)

    ruta_completa = os.path.join(carpeta, "eventos_detectados.csv")

    with open(ruta_completa, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([
            time.ctime(),    # Timestamp of the saved event
            ip_src,
            ip_dst,
            puerto,
            protocolo,
            flag,
            tipo_ataque,     # Heuristic classification
            tipo_ataque_ml   # ML model classification
        ])
The logged file stores both the heuristic label and the ML prediction side-by-side, making it straightforward to compare the two classifiers and identify disagreements for manual review.
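A minimal sketch of that comparison is shown below. It assumes the eight-column row layout written by guardar_evento_en_dataset() with no header row. The find_disagreements helper and the two sample rows are invented for illustration.

```python
import csv
import io

HEURISTIC_COL = 6  # tipo_ataque (heuristic classification)
ML_COL = 7         # tipo_ataque_ml (ML model classification)


def find_disagreements(csv_file):
    """Hypothetical helper: return logged rows where the two labels differ."""
    reader = csv.reader(csv_file)
    return [row for row in reader
            if len(row) == 8 and row[HEURISTIC_COL] != row[ML_COL]]


# Made-up sample rows in the logged layout; real use would open eventos_detectados.csv.
sample = io.StringIO(
    "Mon Jan  1 10:00:00 2024,203.0.113.10,172.10.14.181,22,TCP,S,PORT scanner,PORT scanner\n"
    "Mon Jan  1 10:00:05 2024,198.51.100.7,172.10.14.181,80,TCP,S,SYN Flood,Normal\n"
)
disagreements = find_disagreements(sample)
print(len(disagreements))  # 1
```

Rows returned by such a filter are natural candidates for manual labeling before they are used for retraining.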
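Events logged by guardar_dataset.py can be merged with escanerpuertos.csv before retraining to incrementally improve the model on real-world traffic patterns. The two files do not share a layout: the function above writes no header row, places a time.ctime() timestamp in the first column, and stores two label columns, so logged rows must be mapped to the synthetic schema before merging.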
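Before merging, each logged row needs to be brought into the synthetic dataset's column order and timestamp format. The sketch below shows one possible conversion. The logged_row_to_synthetic helper is hypothetical, the choice of which label to keep is an assumption left to the caller, and the column names are assumed to match the table in "Dataset columns".

```python
from datetime import datetime

SYNTHETIC_COLUMNS = ["src_ip", "dst_ip", "dst_port", "protocol",
                     "flag", "tipo_ataque", "timestamp"]


def logged_row_to_synthetic(row, use_ml_label=False):
    """Hypothetical helper: map one eventos_detectados.csv row to the
    synthetic schema. Assumes time.ctime() output in an English locale."""
    ctime_stamp, ip_src, ip_dst, puerto, protocolo, flag, heuristic, ml = row
    parsed = datetime.strptime(ctime_stamp, "%a %b %d %H:%M:%S %Y")
    return {
        "src_ip": ip_src,
        "dst_ip": ip_dst,
        "dst_port": int(puerto),
        "protocol": protocolo,
        "flag": flag,
        # Assumption: the caller decides which classifier's label to trust.
        "tipo_ataque": ml if use_ml_label else heuristic,
        "timestamp": parsed.strftime("%Y-%m-%d %H:%M:%S"),
    }


# Made-up logged row for illustration.
row = ["Mon Jan  1 10:00:05 2024", "198.51.100.7", "172.10.14.181",
       "80", "TCP", "S", "SYN Flood", "Normal"]
converted = logged_row_to_synthetic(row)
print(converted["timestamp"])  # 2024-01-01 10:00:05
```

Neither logged label is verified ground truth, so rows where the two classifiers disagree are better reviewed by hand than merged automatically.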

How to run

python generar_dataset.py
This writes Dataset/escanerpuertos.csv and prints the actual class value counts to stdout:
Generando 20000 registros de tráfico sintético...
✅ ¡Dataset generado con éxito en Dataset/escanerpuertos.csv!
Normal              12045
SYN Flood            1612
DDoS Distribuido     1598
...
Run this step before executing CEREBRO.py. The dataset file must exist at Dataset/escanerpuertos.csv relative to the working directory where CEREBRO.py is launched.
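A small pre-flight check can catch a missing dataset before training starts. The sketch below is an optional convenience, not part of the project. The dataset_ready helper is hypothetical and only tests that the file exists and is non-empty relative to the current working directory.

```python
import os

DATASET_PATH = os.path.join("Dataset", "escanerpuertos.csv")


def dataset_ready(path=DATASET_PATH):
    """Hypothetical helper: True if the dataset file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


if dataset_ready():
    print(f"Found {DATASET_PATH}; CEREBRO.py can be launched from this directory.")
else:
    print(f"{DATASET_PATH} not found; run generar_dataset.py first.")
```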
