
Overview

PruebaETL implements a complete ETL (Extract, Transform, Load) pipeline to process customer and sales data, transforming messy CSV files into a clean, normalized SQL Server database.
The ETL process preserves original data in RAW tables while creating optimized normalized tables for queries and analysis.

The Four-Stage Pipeline

The ETL process consists of four sequential stages, each handling a specific aspect of data transformation:

Stage 1: Customer Normalization

Script: normalizar_datos.py

This stage processes the customer data CSV file with the following operations:
1. Extract: Reads clientes_prueba_mas_datos.csv with multiple encoding fallback strategies (UTF-8 → Latin-1 → CP1252 → ISO-8859-1)
2. Clean: Removes extra whitespace and special characters, validates required fields (nombre, ciudad, segmento), and handles empty values with defaults
3. Normalize Dates: Converts dates from multiple formats (DD/MM/YYYY, MM/DD/YYYY, YYYY-MM-DD) to ISO format (YYYY-MM-DD)
4. Extract Entities: Creates separate master tables for cities and segments by extracting unique values
5. Generate IDs: Assigns foreign key references (ciudad_id, segmento_id) to replace redundant text
Code Example:
# normalizar_datos.py:13-31
from datetime import datetime

def normalizar_fecha(fecha_str):
    """Normalize various date formats to YYYY-MM-DD."""
    if not fecha_str or fecha_str.strip() == '':
        return None

    fecha_str = fecha_str.strip()

    formatos = [
        '%Y-%m-%d',      # ISO: 2024-01-31
        '%Y/%m/%d',      # 2024/01/31
        '%d/%m/%Y',      # 31/01/2024 (day-first tried before month-first)
        '%d-%m-%Y',      # 31-01-2024
        '%m/%d/%Y',      # 01/31/2024
    ]

    for formato in formatos:
        try:
            fecha = datetime.strptime(fecha_str, formato)
            return fecha.strftime('%Y-%m-%d')
        except ValueError:
            continue

    return None
Output Files:
  • ciudades.csv - Unique cities with IDs
  • segmentos.csv - Unique customer segments with IDs
  • clientes_normalizados.csv - Customers with foreign key references
  • clientes_normalizados_completo.csv - Human-readable version with resolved names

Stage 2: Sales Normalization

Script: normalizar_ventas.py

Processes sales transactions with similar techniques:
1. Extract: Reads ventas_prueba_mas_datos.csv with encoding detection
2. Clean Numbers: Removes formatting (commas, spaces) and handles special values (N/A → NULL)
3. Normalize Dates: Standardizes date formats to YYYY-MM-DD
4. Extract Entities: Creates master tables for sales channels and currencies
5. Validate Relationships: Ensures all sales reference valid customer IDs
Code Example:
# normalizar_ventas.py:37-46
def normalizar_numero(numero_str):
    """Normalize a formatted number string to float; return None for missing or N/A."""
    if not numero_str or numero_str.strip() == '' or numero_str.upper() == 'N/A':
        return None

    # Remove thousands separators (commas) and stray spaces
    numero_str = numero_str.replace(',', '').replace(' ', '').strip()

    try:
        return float(numero_str)
    except ValueError:
        return None
Output Files:
  • canales.csv - Sales channels with IDs
  • monedas.csv - Currencies with IDs
  • ventas_normalizadas.csv - Sales with foreign key references
  • ventas_normalizadas_completo.csv - Human-readable version
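
The relationship check in step 5 is not shown in the excerpt; a minimal sketch (function and field names `validar_referencias` and `cliente_id` are assumptions) could partition the rows like this:

```python
def validar_referencias(ventas, ids_clientes):
    """Partition sales into rows whose cliente_id exists and orphaned rows."""
    validas, huerfanas = [], []
    for venta in ventas:
        if venta['cliente_id'] in ids_clientes:
            validas.append(venta)
        else:
            huerfanas.append(venta)
    return validas, huerfanas
```

Orphaned rows can then be logged for review instead of silently breaking the foreign key constraints created in Stage 3.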

Stage 3: SQL Schema Generation

Script: crear_base_datos.py

Generates a complete SQL Server DDL script:
1. Database Creation: Creates the PruebaTecnicaDNI database if it doesn’t exist
2. RAW Tables: Defines tables for original, unprocessed data (for audit purposes)
3. Normalized Tables: Creates master tables (ciudades, segmentos, canales, monedas) and business tables (clientes, ventas)
4. Constraints: Adds primary keys, foreign keys, and unique constraints
5. Indexes: Creates indexes on foreign keys and frequently queried columns
6. Data Insertion: Generates INSERT statements for both RAW and normalized data
Output:
  • crear_base_datos.sql - Complete SQL script ready for execution
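
The master tables all share the same shape, so the generator can emit their DDL from a template. This is a sketch, not the script's actual code; the function name and the id/nombre column layout are assumptions:

```python
def generar_tabla_maestra(nombre):
    """Emit idempotent T-SQL for a simple id/nombre master table."""
    return (
        f"IF OBJECT_ID('{nombre}', 'U') IS NULL\n"
        f"CREATE TABLE {nombre} (\n"
        "    id INT NOT NULL PRIMARY KEY,\n"
        "    nombre NVARCHAR(100) NOT NULL UNIQUE\n"
        ");"
    )
```

Calling it once per entity (`ciudades`, `segmentos`, `canales`, `monedas`) keeps the four master tables structurally identical.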

Stage 4: SQL Server Execution

The final stage involves running the generated SQL script:
1. Connect: Open SQL Server Management Studio (SSMS) and connect to your server
2. Load Script: Open the crear_base_datos.sql file
3. Execute: Run the complete script to create the database, tables, and insert all data
4. Verify: Check that all tables are created and data is loaded correctly

Key Technical Decisions

Encoding Resilience

Multiple encoding fallback strategies ensure data is read correctly regardless of source encoding

Date Flexibility

The parser accepts 8+ date formats and normalizes them to the ISO 8601 standard (YYYY-MM-DD)

Data Preservation

RAW tables preserve original data for audit, debugging, and reprocessing

UTF-8 BOM Output

Output files use UTF-8 with BOM for Excel compatibility

Running the ETL Pipeline

Execute the pipeline in order:
# Step 1: Normalize customers
python normalizar_datos.py

# Step 2: Normalize sales
python normalizar_ventas.py

# Step 3: Generate SQL script
python crear_base_datos.py

# Step 4: Execute in SQL Server (via SSMS)
# Open crear_base_datos.sql and run it
Each script validates that required input files exist before processing. Make sure CSV files are in the correct directory.

Data Quality Issues Solved

The ETL pipeline addresses common data quality problems:
| Problem | Solution |
|---|---|
| Incorrect character encoding (“RaðLPA@rez”) | Multi-encoding detection + UTF-8 output |
| Mixed date formats | Flexible parser with 8 format attempts |
| Numbers as text (“1,403.70”) | Strip commas/spaces, convert to float |
| Missing values (N/A, empty strings) | Convert to NULL with validation |
| Data redundancy (“La Romana” repeated 50+ times) | Extract to master tables with IDs |
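
The last row of the table, extracting repeated values into a master table with IDs, comes down to de-duplicating while assigning sequential keys. A minimal sketch (function name `extraer_maestra` is hypothetical):

```python
def extraer_maestra(valores):
    """Map each unique value to a sequential ID, preserving first-seen order."""
    ids = {}
    for valor in valores:
        if valor not in ids:
            ids[valor] = len(ids) + 1
    return ids
```

For example, `extraer_maestra(['La Romana', 'Santiago', 'La Romana'])` returns `{'La Romana': 1, 'Santiago': 2}`; the customer rows then store `ciudad_id` instead of the repeated city name.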

Next Steps

Normalization Strategy

Learn how entities are extracted and normalized

Database Schema

Explore the complete database structure and relationships
