The PruebaETL database uses a dual-schema approach: RAW tables for original data preservation and normalized tables for optimized queries and analysis.
RAW Tables
Store original, unprocessed data exactly as received
Normalized Tables
Structured data with relationships and constraints
The database PruebaTecnicaDNI is created automatically by the ETL process:
-- crear_base_datos.py:11-22IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'PruebaTecnicaDNI')BEGIN CREATE DATABASE PruebaTecnicaDNI;ENDGOUSE PruebaTecnicaDNI;GO
The schema includes indexes on frequently queried columns:
-- crear_base_datos.py:148-158CREATE INDEX IX_clientes_ciudad_id ON clientes(ciudad_id);CREATE INDEX IX_clientes_segmento_id ON clientes(segmento_id);CREATE INDEX IX_ventas_cliente_id ON ventas(cliente_id);CREATE INDEX IX_ventas_fecha ON ventas(fecha);CREATE INDEX IX_ventas_moneda_id ON ventas(moneda_id);CREATE INDEX IX_ventas_canal_id ON ventas(canal_id);
Why These Indexes?
Index
Purpose
Query Example
IX_clientes_ciudad_id
Filter customers by city
WHERE ciudad_id = 3
IX_clientes_segmento_id
Filter customers by segment
WHERE segmento_id = 1
IX_ventas_cliente_id
Join sales to customers
JOIN clientes ON ...
IX_ventas_fecha
Filter sales by date range
WHERE fecha BETWEEN '2024-01-01' AND '2024-12-31'
IX_ventas_moneda_id
Group sales by currency
GROUP BY moneda_id
IX_ventas_canal_id
Analyze sales by channel
GROUP BY canal_id
Indexes speed up read queries but slightly slow down write operations. This trade-off is acceptable for a data warehouse focused on analysis.
# crear_base_datos.py:186-194sql_inserts.append("SET IDENTITY_INSERT ciudades ON;")for ciudad in ciudades: nombre = ciudad['nombre'].replace("'", "''").replace('\n', ' ') sql_inserts.append( f"INSERT INTO ciudades (ciudad_id, nombre) " f"VALUES ({ciudad['ciudad_id']}, '{nombre}');" )sql_inserts.append("SET IDENTITY_INSERT ciudades OFF;")
SET IDENTITY_INSERT ON allows manual insertion of IDs, bypassing the auto-increment. This is necessary to preserve the relationships from the normalization process.
SELECT c.nombre AS ciudad, COUNT(v.venta_id) AS total_ventas, SUM(v.total) AS monto_totalFROM ventas vINNER JOIN clientes cl ON v.cliente_id = cl.cliente_idINNER JOIN ciudades c ON cl.ciudad_id = c.ciudad_idGROUP BY c.nombreORDER BY monto_total DESC;
SELECT ca.nombre AS canal, m.codigo AS moneda, COUNT(v.venta_id) AS num_ventas, AVG(v.total) AS promedioFROM ventas vINNER JOIN canales ca ON v.canal_id = ca.canal_idINNER JOIN monedas m ON v.moneda_id = m.moneda_idGROUP BY ca.nombre, m.codigoORDER BY canal, moneda;
SELECT s.nombre AS segmento, COUNT(DISTINCT c.cliente_id) AS num_clientes, COUNT(v.venta_id) AS num_ventas, SUM(v.total) AS ventas_totalesFROM clientes cINNER JOIN segmentos s ON c.segmento_id = s.segmento_idLEFT JOIN ventas v ON c.cliente_id = v.cliente_idGROUP BY s.nombreORDER BY ventas_totales DESC;
The normalized schema makes these queries fast and efficient thanks to foreign key indexes.
-- Check table countSELECT COUNT(*) FROM sys.tables;-- Should return 8 (2 RAW + 6 normalized)-- Check foreign keysSELECT fk.name AS foreign_key_name, tp.name AS parent_table, tr.name AS referenced_tableFROM sys.foreign_keys fkINNER JOIN sys.tables tp ON fk.parent_object_id = tp.object_idINNER JOIN sys.tables tr ON fk.referenced_object_id = tr.object_id;-- Check indexesSELECT i.name AS index_name, t.name AS table_name, COL_NAME(ic.object_id, ic.column_id) AS column_nameFROM sys.indexes iINNER JOIN sys.index_columns ic ON i.object_id = ic.object_id AND i.index_id = ic.index_idINNER JOIN sys.tables t ON i.object_id = t.object_idWHERE i.is_primary_key = 0;