Why Normalize?

Normalization is the process of organizing data to reduce redundancy and improve data integrity. In PruebaETL, raw CSV files contain repeated values like city names and customer segments hundreds of times.
Without normalization, “La Romana” might appear 50+ times in the data, wasting storage space and making updates error-prone.

Normalization Benefits

Reduces Redundancy

Store unique values once, reference by ID

Ensures Consistency

No typos or variations of the same value

Simplifies Updates

Change a value in one place, affects all references

Improves Performance

Smaller tables, faster queries, efficient indexes

Entities Extracted

PruebaETL normalizes four types of entities from the raw data:

1. Cities (Ciudades)

Extracted From: Customer data
Purpose: Store unique city names referenced by customer records
Implementation:
# normalizar_datos.py:63-65
clientes = []
ciudades_set = set()
segmentos_set = set()

# ... process customers ...

# normalizar_datos.py:124-126
if ciudad:
    ciudades_set.add(ciudad)

# normalizar_datos.py:138-139
ciudades = [{'ciudad_id': i+1, 'nombre': ciudad} 
            for i, ciudad in enumerate(sorted(ciudades_set))]
Result:
ciudad_id  nombre
1          La Romana
2          Santiago
3          Santo Domingo
4          Sin especificar
The set() data structure guarantees uniqueness: duplicate city names collapse into a single entry.
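The extract-and-number pattern above can be exercised end to end with sample data (illustrative values, not the real CSV contents):

```python
# Deduplicate city names with a set, then assign sequential IDs,
# mirroring the pattern in normalizar_datos.py.
raw_cities = ["Santo Domingo", "La Romana", "Santo Domingo",
              "Santiago", "La Romana", "Santo Domingo"]

ciudades_set = set()
for ciudad in raw_cities:
    if ciudad:
        ciudades_set.add(ciudad)

# Sorting before enumeration makes ID assignment deterministic across runs.
ciudades = [{'ciudad_id': i + 1, 'nombre': ciudad}
            for i, ciudad in enumerate(sorted(ciudades_set))]

print(ciudades)  # three unique cities, each stored exactly once
```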

2. Segments (Segmentos)

Extracted From: Customer data
Purpose: Categorize customers into business segments (Corporate, Consumer, Home Office)
Implementation:
# normalizar_datos.py:127-128
if segmento:
    segmentos_set.add(segmento)

# normalizar_datos.py:140-141
segmentos = [{'segmento_id': i+1, 'nombre': segmento} 
             for i, segmento in enumerate(sorted(segmentos_set))]
Result:
segmento_id  nombre
1            Consumer
2            Corporate
3            Home Office

3. Sales Channels (Canales)

Extracted From: Sales data
Purpose: Track where sales originated (Online, Retail, Direct)
Implementation:
# normalizar_ventas.py:49-51
ventas = []
canales_set = set()
monedas_set = set()

# ... process sales ...

# normalizar_ventas.py:113-114
if canal:
    canales_set.add(canal)

# normalizar_ventas.py:123-124
canales = [{'canal_id': i+1, 'nombre': canal} 
           for i, canal in enumerate(sorted(canales_set))]
Result:
canal_id  nombre
1         Direct
2         Online
3         Retail

4. Currencies (Monedas)

Extracted From: Sales data
Purpose: Support multi-currency transactions
Implementation:
# normalizar_ventas.py:111-112
if moneda:
    monedas_set.add(moneda)

# normalizar_ventas.py:125-126
monedas = [{'moneda_id': i+1, 'codigo': moneda} 
           for i, moneda in enumerate(sorted(monedas_set))]
Result:
moneda_id  codigo
1          DOP
2          EUR
3          USD

Reference Generation

After extracting unique entities, the normalization process creates mappings to replace text with IDs:

Customer References

# normalizar_datos.py:144-147
ciudad_map = {ciudad['nombre']: ciudad['ciudad_id'] 
              for ciudad in ciudades}
segmento_map = {segmento['nombre']: segmento['segmento_id'] 
                for segmento in segmentos}

# normalizar_datos.py:150-161
clientes_normalizados = []
for cliente in clientes:
    ciudad_id = ciudad_map.get(cliente['ciudad'])
    segmento_id = segmento_map.get(cliente['segmento'])
    
    clientes_normalizados.append({
        'cliente_id': cliente['cliente_id'],
        'nombre': cliente['nombre'],
        'ciudad_id': ciudad_id,  # ID instead of text
        'segmento_id': segmento_id,  # ID instead of text
        'fecha_registro': cliente['fecha_registro']
    })
Before Normalization:
cliente_id,nombre,ciudad,segmento,fecha_registro
1,Juan Pérez,Santo Domingo,Consumer,2024-01-15
2,María García,Santo Domingo,Corporate,2024-02-20
After Normalization:
cliente_id,nombre,ciudad_id,segmento_id,fecha_registro
1,Juan Pérez,3,1,2024-01-15
2,María García,3,2,2024-02-20
Notice how “Santo Domingo” (13 characters) is replaced with “3” (1 character): this substitution is where normalization's storage savings come from.
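A minimal, self-contained version of this lookup (with made-up sample rows) shows the substitution in miniature:

```python
# Entity tables as produced by the extraction step (sample values).
ciudades = [{'ciudad_id': 3, 'nombre': 'Santo Domingo'}]
segmentos = [{'segmento_id': 1, 'nombre': 'Consumer'},
             {'segmento_id': 2, 'nombre': 'Corporate'}]

# name -> ID maps, as in normalizar_datos.py
ciudad_map = {c['nombre']: c['ciudad_id'] for c in ciudades}
segmento_map = {s['nombre']: s['segmento_id'] for s in segmentos}

cliente = {'cliente_id': 1, 'nombre': 'Juan Pérez',
           'ciudad': 'Santo Domingo', 'segmento': 'Consumer'}

normalizado = {
    'cliente_id': cliente['cliente_id'],
    'nombre': cliente['nombre'],
    'ciudad_id': ciudad_map.get(cliente['ciudad']),       # 3, not "Santo Domingo"
    'segmento_id': segmento_map.get(cliente['segmento']), # 1, not "Consumer"
}
```

`dict.get` returns None for names that never made it into the entity tables, rather than raising, so a missing mapping surfaces as a NULL-like value instead of a crash.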

Sales References

# normalizar_ventas.py:128-131
canal_map = {canal['nombre']: canal['canal_id'] 
             for canal in canales}
moneda_map = {moneda['codigo']: moneda['moneda_id'] 
              for moneda in monedas}

# normalizar_ventas.py:133-145
ventas_normalizadas = []
for venta in ventas:
    canal_id = canal_map.get(venta['canal'])
    moneda_id = moneda_map.get(venta['moneda'])
    
    ventas_normalizadas.append({
        'venta_id': venta['venta_id'],
        'cliente_id': venta['cliente_id'],
        'fecha': venta['fecha'],
        'total': venta['total'],
        'moneda_id': moneda_id,  # ID instead of text
        'canal_id': canal_id  # ID instead of text
    })

Data Cleaning Integration

Normalization happens after data cleaning:
Step 1: Clean Text

# normalizar_datos.py:33-45
def limpiar_texto(texto):
    if not texto:
        return None
    
    texto = texto.strip()
    # Reemplazar múltiples espacios por uno solo
    texto = re.sub(r'\s+', ' ', texto)
    
    if texto == '':
        return None
    
    return texto
Step 2: Apply to Entity

ciudad = limpiar_texto(row.get('ciudad', ''))
segmento = limpiar_texto(row.get('segmento', ''))
Step 3: Add to Set

if ciudad:
    ciudades_set.add(ciudad)
if segmento:
    segmentos_set.add(segmento)
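The three steps combine into a small runnable pipeline; the input rows here are hypothetical, chosen to show the kinds of dirt the cleaner handles:

```python
import re

def limpiar_texto(texto):
    """Strip, collapse internal whitespace, and map empty strings to None."""
    if not texto:
        return None
    texto = re.sub(r'\s+', ' ', texto.strip())
    return texto or None

# Sample raw rows: extra spaces, an empty field, a whitespace-only field.
rows = [{'ciudad': '  Santo   Domingo '}, {'ciudad': ''},
        {'ciudad': 'Santiago'}, {'ciudad': '   '}]

ciudades_set = set()
for row in rows:
    ciudad = limpiar_texto(row.get('ciudad', ''))
    if ciudad:
        ciudades_set.add(ciudad)

print(sorted(ciudades_set))  # ['Santiago', 'Santo Domingo']
```

Cleaning before deduplication matters: without it, "Santo Domingo" and "  Santo   Domingo " would land in the set as two distinct entries.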

Handling Edge Cases

Missing Cities

When a customer has no city specified:
# normalizar_datos.py:113-115
if not ciudad:
    ciudad = "Sin especificar"
This preserves referential integrity: every customer record ends up with a valid ciudad_id.

Invalid Customer References

Sales records must reference valid customers:
# normalizar_ventas.py:87-94
cliente_id = row.get('cliente_id', '').strip()
if not cliente_id or cliente_id == '':
    continue  # Skip this sale

try:
    cliente_id = int(cliente_id)
except ValueError:
    continue  # Skip if not a valid integer
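Run against sample rows (hypothetical data, one valid ID, one blank, one non-numeric), the filter keeps only the usable sale:

```python
rows = [{'cliente_id': ' 7 ', 'total': '100'},   # valid after strip()
        {'cliente_id': '',    'total': '50'},    # blank -> skipped
        {'cliente_id': 'abc', 'total': '25'}]    # non-integer -> skipped

ventas_validas = []
for row in rows:
    cliente_id = row.get('cliente_id', '').strip()
    if not cliente_id:
        continue  # no customer reference
    try:
        cliente_id = int(cliente_id)
    except ValueError:
        continue  # not a valid integer
    ventas_validas.append({'cliente_id': cliente_id, 'total': row['total']})
```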

Output Files

The normalization process generates two types of output for each entity:

1. Normalized Format (for database)

Purpose: Optimized for database insertion with foreign keys
Example: clientes_normalizados.csv
cliente_id,nombre,ciudad_id,segmento_id,fecha_registro
1,Juan Pérez,3,1,2024-01-15

2. Complete Format (for human review)

Purpose: Human-readable with resolved references
Example: clientes_normalizados_completo.csv
cliente_id,nombre,ciudad,segmento,fecha_registro
1,Juan Pérez,Santo Domingo,Consumer,2024-01-15
# normalizar_datos.py:186-203
clientes_completos = []
for cliente in clientes_normalizados:
    ciudad_nombre = next((c['nombre'] for c in ciudades 
                         if c['ciudad_id'] == cliente['ciudad_id']), '')
    segmento_nombre = next((s['nombre'] for s in segmentos 
                           if s['segmento_id'] == cliente['segmento_id']), '')
    
    clientes_completos.append({
        'cliente_id': cliente['cliente_id'],
        'nombre': cliente['nombre'],
        'ciudad': ciudad_nombre,
        'segmento': segmento_nombre,
        'fecha_registro': cliente['fecha_registro']
    })
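The next(...) scan above rescans the entity list for every customer; that is fine for a handful of cities, but for larger entity tables a reverse map built once avoids the repeated linear search. A sketch of the alternative (not the project's code):

```python
ciudades = [{'ciudad_id': 1, 'nombre': 'La Romana'},
            {'ciudad_id': 3, 'nombre': 'Santo Domingo'}]

# Build the id -> name map once; each subsequent lookup is O(1).
ciudad_por_id = {c['ciudad_id']: c['nombre'] for c in ciudades}

nombre = ciudad_por_id.get(3, '')   # 'Santo Domingo'
desconocido = ciudad_por_id.get(99, '')  # '' for an unknown ID, matching next()'s default
```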

Storage Impact

Normalization significantly reduces storage requirements:
Metric                 Before      After   Reduction
City text repeated     50+ times   1 time  ~95%
Segment text repeated  100+ times  1 time  ~98%
Channel text repeated  200+ times  1 time  ~99%
Overall data size      100%        ~20%    ~80%
With normalization, the database uses approximately 80% less storage for repeated values.
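As a rough back-of-the-envelope check for a single field (assuming one byte per character and the repetition count from the table above):

```python
repeats = 50
text = "Santo Domingo"   # 13 bytes stored per row before normalization
ref = "3"                # 1-byte ID stored per row after

before = repeats * len(text)             # 650 bytes
after = repeats * len(ref) + len(text)   # 63 bytes: 50 IDs plus the name stored once
saving = 1 - after / before
print(f"{saving:.0%}")  # ~90% for this one field
```

Longer values and higher repeat counts push the saving toward the ~95-99% figures in the table.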

Next Steps

ETL Process

Understand the complete Extract, Transform, Load pipeline

Database Schema

See how normalized entities form the database structure
