The IDS/IPS ML pipeline starts with a labeled dataset. Because real labeled network captures are difficult to obtain at sufficient volume, the project uses synthetic data generated by generar_dataset.py. The synthetic traffic is designed to mirror realistic distributions — mostly benign, with several attack classes at minority frequencies — giving the model a representative training signal without requiring a live network tap.
A secondary module, guardar_dataset.py, extends this dataset during live operation by appending every detected event to a CSV file, creating a feedback loop for future retraining.
Why synthetic data
Bootstrapping a supervised IDS model requires thousands of labeled examples per attack class. Capturing and hand-labeling real traffic is time-consuming and legally complex. Synthetic generation lets the team:
- Control class proportions precisely
- Reproduce the dataset deterministically (`random.choices` with fixed weights, plus a seeded RNG when exact reproduction is required)
- Iterate on feature design without waiting for real attacks
- Avoid privacy concerns associated with real network captures
Once the system is deployed, guardar_dataset.py logs real events alongside their ML predictions, allowing the dataset to grow organically over time.
generate_dataset() function
Defined in generar_dataset.py, this function builds the full training CSV in a single call.
```python
def generate_dataset(num_samples=20000):
    data = []
    base_time = datetime.now() - timedelta(days=7)

    # Simulated target
    target_ip = "172.10.14.181"

    # Persistent scanner IPs for the PORT scanner class
    scanner_ip_1 = "203.0.113.10"
    scanner_ip_2 = "203.0.113.11"

    for _ in range(num_samples):
        # Realistic class distribution: majority normal traffic
        attack_type = random.choices(
            ["Normal", "SYN Flood", "DDoS Distribuido", "PORT scanner",
             "Posible Exploit", "UDP Flood", "Inyección SQL"],
            weights=[0.60, 0.08, 0.08, 0.08, 0.05, 0.08, 0.03],
            k=1
        )[0]
        ...
```
Class distribution
| Class | Weight | Approximate records (of 20 000) |
|---|---|---|
| Normal | 60% | ~12 000 |
| SYN Flood | 8% | ~1 600 |
| DDoS Distribuido | 8% | ~1 600 |
| PORT scanner | 8% | ~1 600 |
| UDP Flood | 8% | ~1 600 |
| Posible Exploit | 5% | ~1 000 |
| Inyección SQL | 3% | ~600 |
The 60/40 normal-to-attack split reflects realistic enterprise traffic and prevents the model from over-fitting to attack patterns. SMOTE is applied later during training to compensate for the remaining minority-class imbalance.
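The weight vector can be sampled directly as a quick sanity check on the expected proportions. This is an illustrative snippet, not part of generar_dataset.py; the seed is only here so the check is reproducible.

```python
import random
from collections import Counter

# Weights copied from generate_dataset(); seed only for reproducibility here.
random.seed(42)
classes = ["Normal", "SYN Flood", "DDoS Distribuido", "PORT scanner",
           "Posible Exploit", "UDP Flood", "Inyección SQL"]
weights = [0.60, 0.08, 0.08, 0.08, 0.05, 0.08, 0.03]

labels = random.choices(classes, weights=weights, k=20_000)
counts = Counter(labels)
print(counts.most_common(3))  # "Normal" dominates at roughly 12 000 records
```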
Key generation details
- Target IP: All generated traffic is directed at 172.10.14.181, the simulated server.
- Port Scan IPs: Only 203.0.113.10 and 203.0.113.11 appear as source IPs for PORT scanner records, making scanner attribution deterministic and learnable.
- Timestamp range: Events are spread over a 7-day window (604 800 seconds of jitter) to give the model varied hour values.
- Output path: Dataset/escanerpuertos.csv (the directory is created automatically via os.makedirs).
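The details above can be sketched as a single-record generator. This is a hedged reconstruction, not the actual loop body: `make_record` is a hypothetical helper, and the non-scanner branch is simplified to one benign pattern; only the constants (target IP, scanner IPs, port range, 7-day jitter) come from the documentation.

```python
import random
from datetime import datetime, timedelta

TARGET_IP = "172.10.14.181"                      # simulated server
SCANNER_IPS = ["203.0.113.10", "203.0.113.11"]   # fixed PORT scanner sources
BASE_TIME = datetime.now() - timedelta(days=7)

def make_record(attack_type: str) -> dict:
    # Jitter timestamps across the full 7-day window (604 800 seconds)
    ts = BASE_TIME + timedelta(seconds=random.randint(0, 604_800))
    if attack_type == "PORT scanner":
        src_ip = random.choice(SCANNER_IPS)   # deterministic attribution
        dst_port = random.randint(1, 1024)    # low ports swept by the scan
        protocol, flag = "TCP", "S"
    else:
        # Simplified stand-in for the other classes' field logic
        src_ip = f"192.0.2.{random.randint(1, 254)}"
        dst_port, protocol, flag = 443, "TCP", "A"
    return {
        "src_ip": src_ip,
        "dst_ip": TARGET_IP,
        "dst_port": dst_port,
        "protocol": protocol,
        "flag": flag,
        "tipo_ataque": attack_type,
        "timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),
    }

print(make_record("PORT scanner"))
```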
Dataset columns
| Column | Type | Description |
|---|---|---|
| src_ip | string | Source IP address of the connection |
| dst_ip | string | Destination IP address (always 172.10.14.181 in synthetic data) |
| dst_port | integer | Destination port number |
| protocol | string | Transport protocol: TCP or UDP |
| flag | string | TCP control flag (S, A, PA, FA) or N/A for UDP |
| tipo_ataque | string | Ground-truth attack class label |
| timestamp | string | Event datetime in YYYY-MM-DD HH:MM:SS format |
Per-class field values
Each class has distinct field signatures that make it statistically learnable:
| Class | dst_port | protocol | flag |
|---|---|---|---|
| Normal | 80, 443, 53, 22 | TCP or UDP | A / PA / FA (TCP), N/A (UDP) |
| SYN Flood | 80, 443 | TCP | S |
| DDoS Distribuido | 80, 443, 8080 | TCP or UDP | S (TCP), N/A (UDP) |
| PORT scanner | 1–1024 (random) | TCP | S |
| Posible Exploit | 445, 3389, 21, 23 | TCP | PA |
| UDP Flood | 1024–65535 (random) | UDP | N/A |
| Inyección SQL | 3306, 5432, 1433, 80 | TCP | PA |
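Before these columns reach a model they must become numeric features. The snippet below is an illustrative sketch only; the exact encoding used in training is not shown in this document, and `featurize` is a hypothetical helper. It derives the `hour` value mentioned above from the stored timestamp and maps the categorical fields to integers.

```python
from datetime import datetime

# Illustrative categorical encodings (not necessarily the ones used in training)
PROTOCOLS = {"TCP": 0, "UDP": 1}
FLAGS = {"S": 0, "A": 1, "PA": 2, "FA": 3, "N/A": 4}

def featurize(row: dict) -> list:
    # Parse "YYYY-MM-DD HH:MM:SS" and keep only the hour of day
    hour = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S").hour
    return [int(row["dst_port"]), PROTOCOLS[row["protocol"]],
            FLAGS[row["flag"]], hour]

sample = {"dst_port": "443", "protocol": "TCP", "flag": "S",
          "timestamp": "2024-05-01 13:37:00"}
print(featurize(sample))  # → [443, 0, 0, 13]
```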
Real-time event logging — guardar_dataset.py
During live IDS operation, every analyzed packet is appended to eventos_detectados.csv by guardar_evento_en_dataset(). This gives the team a growing log of real traffic that can supplement or replace the synthetic dataset in future training runs.
The output path in guardar_dataset.py is hardcoded to C:\Users\Usuario\Desktop\IDS\IDS_unipaz\Dataset. Update the carpeta variable at the top of the function to match your deployment path before using this module.
```python
import csv
import os
import time

def guardar_evento_en_dataset(ip_src, ip_dst, puerto, protocolo, flag,
                              tipo_ataque, tipo_ataque_ml):
    carpeta = r"C:\Users\Usuario\Desktop\IDS\IDS_unipaz\Dataset"
    if not os.path.exists(carpeta):
        os.makedirs(carpeta)

    ruta_completa = os.path.join(carpeta, "eventos_detectados.csv")
    with open(ruta_completa, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([
            time.ctime(),       # Timestamp of the saved event
            ip_src,
            ip_dst,
            puerto,
            protocolo,
            flag,
            tipo_ataque,        # Heuristic classification
            tipo_ataque_ml      # ML model classification
        ])
```
The logged file stores both the heuristic label and the ML prediction side-by-side, making it straightforward to compare the two classifiers and identify disagreements for manual review.
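A disagreement scan over that file can be sketched as follows. `find_disagreements` is a hypothetical helper, and the column positions are assumed from the writer above (heuristic label at index 6, ML label at index 7).

```python
import csv

def find_disagreements(path: str) -> list:
    """Return logged rows where the heuristic and ML labels differ."""
    out = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            heuristico, ml = row[6], row[7]
            if heuristico != ml:
                out.append(row)
    return out
```

Rows returned by this scan are the natural candidates for manual review before the next retraining pass.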
Events logged by guardar_dataset.py can be merged with escanerpuertos.csv before retraining to incrementally improve the model on real-world traffic patterns.
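A minimal merge sketch, assuming the two column orders documented above; `merge_events` is a hypothetical helper, not part of the project. Logged rows are reordered to the training schema (src, dst, port, protocol, flag, label, timestamp), keeping the heuristic label as ground truth and dropping the ML prediction.

```python
import csv
import os

def merge_events(synthetic_csv: str, events_csv: str, out_csv: str) -> int:
    """Append reordered logged events to the synthetic rows; return row count."""
    rows = []
    with open(synthetic_csv, newline="", encoding="utf-8") as f:
        rows.extend(csv.reader(f))
    if os.path.exists(events_csv):
        with open(events_csv, newline="", encoding="utf-8") as f:
            for ev in csv.reader(f):
                # Logged order: timestamp, src, dst, port, proto, flag,
                # heuristic label, ML label
                ts, src, dst, port, proto, flag, tipo, _ml = ev
                rows.append([src, dst, port, proto, flag, tipo, ts])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

Note that the logged timestamps use the time.ctime() format rather than the synthetic YYYY-MM-DD HH:MM:SS format, so a timestamp-normalization step may be needed before feature extraction.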
How to run
```bash
python generar_dataset.py
```
This writes Dataset/escanerpuertos.csv and prints the actual class value counts to stdout:
```text
Generando 20000 registros de tráfico sintético...
✅ ¡Dataset generado con éxito en Dataset/escanerpuertos.csv!
Normal              12045
SYN Flood            1612
DDoS Distribuido     1598
...
```
Run this step before executing CEREBRO.py. The dataset file must exist at Dataset/escanerpuertos.csv relative to the working directory where CEREBRO.py is launched.