The IDS/IPS ML pipeline starts with a labeled dataset. Because real labeled network captures are difficult to obtain at sufficient volume, the project uses synthetic data generated by generar_dataset.py. The synthetic traffic is designed to mirror realistic distributions — mostly benign, with several attack classes at minority frequencies — giving the model a representative training signal without requiring a live network tap.
A secondary module, guardar_dataset.py, extends this dataset during live operation by appending every detected event to a CSV file, creating a feedback loop for future retraining.
Why synthetic data
Bootstrapping a supervised IDS model requires thousands of labeled examples per attack class. Capturing and hand-labeling real traffic is time-consuming and legally complex. Synthetic generation lets the team:
- Control class proportions precisely
- Reproduce the dataset deterministically (`random.choices` with fixed weights, plus a seeded RNG when exact reproduction is required)
- Iterate on feature design without waiting for real attacks
- Avoid privacy concerns associated with real network captures
Once the system is deployed, guardar_dataset.py logs real events alongside their ML predictions, allowing the dataset to grow organically over time.
generate_dataset() function
Defined in generar_dataset.py, this function builds the full training CSV in a single call.
```python
def generate_dataset(num_samples=20000):
    data = []
    base_time = datetime.now() - timedelta(days=7)

    # Simulated target
    target_ip = "172.10.14.181"

    # Persistent scanner IPs for the PORT scanner class
    scanner_ip_1 = "203.0.113.10"
    scanner_ip_2 = "203.0.113.11"

    for _ in range(num_samples):
        # Realistic class distribution: majority normal traffic
        attack_type = random.choices(
            ["Normal", "SYN Flood", "DDoS Distribuido", "PORT scanner",
             "Posible Exploit", "UDP Flood", "Inyección SQL"],
            weights=[0.60, 0.08, 0.08, 0.08, 0.05, 0.08, 0.03],
            k=1
        )[0]
        ...
```
Class distribution
| Class | Weight | Approximate records (of 20 000) |
|---|---|---|
| Normal | 60% | ~12 000 |
| SYN Flood | 8% | ~1 600 |
| DDoS Distribuido | 8% | ~1 600 |
| PORT scanner | 8% | ~1 600 |
| UDP Flood | 8% | ~1 600 |
| Posible Exploit | 5% | ~1 000 |
| Inyección SQL | 3% | ~600 |
The 60/40 normal-to-attack split reflects realistic enterprise traffic and prevents the model from over-fitting to attack patterns. SMOTE is applied later during training to compensate for the remaining minority-class imbalance.
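The weight vector can be sampled directly as a quick sanity check on the expected proportions. This is an illustrative snippet, not part of generar_dataset.py; the seed is only here so the check is reproducible.

```python
import random
from collections import Counter

# Weights copied from generate_dataset(); seed only for reproducibility here.
random.seed(42)
classes = ["Normal", "SYN Flood", "DDoS Distribuido", "PORT scanner",
           "Posible Exploit", "UDP Flood", "Inyección SQL"]
weights = [0.60, 0.08, 0.08, 0.08, 0.05, 0.08, 0.03]

labels = random.choices(classes, weights=weights, k=20_000)
counts = Counter(labels)
print(counts.most_common(3))  # "Normal" dominates at roughly 12 000 records
```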
Key generation details
- Target IP: All generated traffic is directed at 172.10.14.181, the simulated server.
- Port Scan IPs: Only 203.0.113.10 and 203.0.113.11 appear as source IPs for PORT scanner records, making scanner attribution deterministic and learnable.
- Timestamp range: Events are spread over a 7-day window (604 800 seconds of jitter) to give the model varied hour values.
- Output path: Dataset/escanerpuertos.csv (the directory is created automatically via os.makedirs).
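The details above can be sketched as a single-record generator. This is a hedged reconstruction, not the actual loop body: `make_record` is a hypothetical helper, and the non-scanner branch is simplified to one benign pattern; only the constants (target IP, scanner IPs, port range, 7-day jitter) come from the documentation.

```python
import random
from datetime import datetime, timedelta

TARGET_IP = "172.10.14.181"                      # simulated server
SCANNER_IPS = ["203.0.113.10", "203.0.113.11"]   # fixed PORT scanner sources
BASE_TIME = datetime.now() - timedelta(days=7)

def make_record(attack_type: str) -> dict:
    # Jitter timestamps across the full 7-day window (604 800 seconds)
    ts = BASE_TIME + timedelta(seconds=random.randint(0, 604_800))
    if attack_type == "PORT scanner":
        src_ip = random.choice(SCANNER_IPS)   # deterministic attribution
        dst_port = random.randint(1, 1024)    # low ports swept by the scan
        protocol, flag = "TCP", "S"
    else:
        # Simplified stand-in for the other classes' field logic
        src_ip = f"192.0.2.{random.randint(1, 254)}"
        dst_port, protocol, flag = 443, "TCP", "A"
    return {
        "src_ip": src_ip,
        "dst_ip": TARGET_IP,
        "dst_port": dst_port,
        "protocol": protocol,
        "flag": flag,
        "tipo_ataque": attack_type,
        "timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"),
    }

print(make_record("PORT scanner"))
```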
Dataset columns
| Column | Type | Description |
|---|---|---|
| src_ip | string | Source IP address of the connection |
| dst_ip | string | Destination IP address (always 172.10.14.181 in synthetic data) |
| dst_port | integer | Destination port number |
| protocol | string | Transport protocol: TCP or UDP |
| flag | string | TCP control flag (S, A, PA, FA) or N/A for UDP |
| tipo_ataque | string | Ground-truth attack class label |
| timestamp | string | Event datetime in YYYY-MM-DD HH:MM:SS format |
Per-class field values
Each class has distinct field signatures that make it statistically learnable:
| Class | dst_port | protocol | flag |
|---|---|---|---|
| Normal | 80, 443, 53, 22 | TCP or UDP | A / PA / FA (TCP), N/A (UDP) |
| SYN Flood | 80, 443 | TCP | S |
| DDoS Distribuido | 80, 443, 8080 | TCP or UDP | S (TCP), N/A (UDP) |
| PORT scanner | 1–1024 (random) | TCP | S |
| Posible Exploit | 445, 3389, 21, 23 | TCP | PA |
| UDP Flood | 1024–65535 (random) | UDP | N/A |
| Inyección SQL | 3306, 5432, 1433, 80 | TCP | PA |
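Before these columns reach a model they must become numeric features. The snippet below is an illustrative sketch only; the exact encoding used in training is not shown in this document, and `featurize` is a hypothetical helper. It derives the `hour` value mentioned above from the stored timestamp and maps the categorical fields to integers.

```python
from datetime import datetime

# Illustrative categorical encodings (not necessarily the ones used in training)
PROTOCOLS = {"TCP": 0, "UDP": 1}
FLAGS = {"S": 0, "A": 1, "PA": 2, "FA": 3, "N/A": 4}

def featurize(row: dict) -> list:
    # Parse "YYYY-MM-DD HH:MM:SS" and keep only the hour of day
    hour = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S").hour
    return [int(row["dst_port"]), PROTOCOLS[row["protocol"]],
            FLAGS[row["flag"]], hour]

sample = {"dst_port": "443", "protocol": "TCP", "flag": "S",
          "timestamp": "2024-05-01 13:37:00"}
print(featurize(sample))  # → [443, 0, 0, 13]
```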
Real-time event logging — guardar_dataset.py
During live IDS operation, every analyzed packet is appended to eventos_detectados.csv by guardar_evento_en_dataset(). This gives the team a growing log of real traffic that can supplement or replace the synthetic dataset in future training runs.
The output path in guardar_dataset.py is hardcoded to C:\Users\Usuario\Desktop\IDS\IDS_unipaz\Dataset. Update the carpeta variable at the top of the function to match your deployment path before using this module.
```python
import csv
import os
import time

def guardar_evento_en_dataset(ip_src, ip_dst, puerto, protocolo, flag,
                              tipo_ataque, tipo_ataque_ml):
    carpeta = r"C:\Users\Usuario\Desktop\IDS\IDS_unipaz\Dataset"
    if not os.path.exists(carpeta):
        os.makedirs(carpeta)

    ruta_completa = os.path.join(carpeta, "eventos_detectados.csv")
    with open(ruta_completa, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([
            time.ctime(),       # Timestamp of the saved event
            ip_src,
            ip_dst,
            puerto,
            protocolo,
            flag,
            tipo_ataque,        # Heuristic classification
            tipo_ataque_ml      # ML model classification
        ])
```
The logged file stores both the heuristic label and the ML prediction side-by-side, making it straightforward to compare the two classifiers and identify disagreements for manual review.
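A disagreement scan over that file can be sketched as follows. `find_disagreements` is a hypothetical helper, and the column positions are assumed from the writer above (heuristic label at index 6, ML label at index 7).

```python
import csv

def find_disagreements(path: str) -> list:
    """Return logged rows where the heuristic and ML labels differ."""
    out = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            heuristico, ml = row[6], row[7]
            if heuristico != ml:
                out.append(row)
    return out
```

Rows returned by this scan are the natural candidates for manual review before the next retraining pass.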
Events logged by guardar_dataset.py can be merged with escanerpuertos.csv before retraining to incrementally improve the model on real-world traffic patterns.
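A minimal merge sketch, assuming the two column orders documented above; `merge_events` is a hypothetical helper, not part of the project. Logged rows are reordered to the training schema (src, dst, port, protocol, flag, label, timestamp), keeping the heuristic label as ground truth and dropping the ML prediction.

```python
import csv
import os

def merge_events(synthetic_csv: str, events_csv: str, out_csv: str) -> int:
    """Append reordered logged events to the synthetic rows; return row count."""
    rows = []
    with open(synthetic_csv, newline="", encoding="utf-8") as f:
        rows.extend(csv.reader(f))
    if os.path.exists(events_csv):
        with open(events_csv, newline="", encoding="utf-8") as f:
            for ev in csv.reader(f):
                # Logged order: timestamp, src, dst, port, proto, flag,
                # heuristic label, ML label
                ts, src, dst, port, proto, flag, tipo, _ml = ev
                rows.append([src, dst, port, proto, flag, tipo, ts])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

Note that the logged timestamps use the time.ctime() format rather than the synthetic YYYY-MM-DD HH:MM:SS format, so a timestamp-normalization step may be needed before feature extraction.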
How to run
```bash
python generar_dataset.py
```
This writes Dataset/escanerpuertos.csv and prints the actual class value counts to stdout:
```text
Generando 20000 registros de tráfico sintético...
✅ ¡Dataset generado con éxito en Dataset/escanerpuertos.csv!
Normal              12045
SYN Flood            1612
DDoS Distribuido     1598
...
```
Run this step before executing CEREBRO.py. The dataset file must exist at Dataset/escanerpuertos.csv relative to the working directory where CEREBRO.py is launched.