Overview
PruebaETL implements a complete ETL (Extract, Transform, Load) pipeline that processes customer and sales data, transforming messy CSV files into a clean, normalized SQL Server database. The ETL process preserves the original data in RAW tables while creating optimized, normalized tables for queries and analysis.
The Four-Stage Pipeline
The ETL process consists of four sequential stages, each handling a specific aspect of data transformation:
Stage 1: Customer Normalization
Script: `normalizar_datos.py`
This stage processes the customer data CSV file with the following operations:
Extract
Reads `clientes_prueba_mas_datos.csv` with a multi-encoding fallback strategy (UTF-8 → Latin-1 → CP1252 → ISO-8859-1)
Clean
- Removes extra whitespace and special characters
- Validates required fields (nombre, ciudad, segmento)
- Handles empty values with defaults
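The encoding fallback described above can be sketched as follows. This is an illustrative helper, not the code from `normalizar_datos.py`; note that Latin-1 accepts any byte sequence, so in practice the chain rarely proceeds past it:

```python
import csv

def read_csv_with_fallback(path):
    """Try each encoding in order; return (rows, encoding) for the first that decodes."""
    for encoding in ("utf-8", "latin-1", "cp1252", "iso-8859-1"):
        try:
            with open(path, newline="", encoding=encoding) as f:
                return list(csv.DictReader(f)), encoding
        except UnicodeDecodeError:
            continue  # decoding failed, try the next candidate
    raise ValueError(f"Could not decode {path} with any known encoding")
```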
Normalize Dates
Converts dates from multiple formats (DD/MM/YYYY, MM/DD/YYYY, YYYY-MM-DD) to ISO format (YYYY-MM-DD)
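A multi-format date parser of this kind can be sketched with `datetime.strptime`. The format list and the try-in-order resolution of ambiguous DD/MM vs. MM/DD values are assumptions; the real script may use a different order:

```python
from datetime import datetime

# Candidate formats tried in order; the first that parses wins.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d-%m-%Y",
                "%Y/%m/%d", "%d.%m.%Y", "%m-%d-%Y")

def to_iso(raw):
    """Return the date in ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```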
Output Files
- `ciudades.csv` - Unique cities with IDs
- `segmentos.csv` - Unique customer segments with IDs
- `clientes_normalizados.csv` - Customers with foreign key references
- `clientes_normalizados_completo.csv` - Human-readable version with resolved names
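The extraction of a master table such as `ciudades.csv` can be sketched like this (the helper name and 1-based IDs are assumptions for illustration):

```python
def build_master(values):
    """Map each unique value (in first-seen order) to a 1-based integer ID."""
    ids = {}
    for v in values:
        if v not in ids:
            ids[v] = len(ids) + 1
    return ids

cities = ["La Romana", "Santo Domingo", "La Romana", "Santiago", "La Romana"]
city_ids = build_master(cities)
# city_ids -> {"La Romana": 1, "Santo Domingo": 2, "Santiago": 3}

# Customer rows then store only the foreign key, not the repeated city name.
fks = [city_ids[c] for c in cities]
# fks -> [1, 2, 1, 3, 1]
```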
Stage 2: Sales Normalization
Script: `normalizar_ventas.py`
Processes sales transactions with similar techniques:
Code Example:
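A hedged sketch of the numeric cleanup this stage performs on amounts such as `"1,403.70"` (the helper name and the set of missing-value markers are assumptions; the real `normalizar_ventas.py` may differ):

```python
def parse_amount(raw):
    """Strip thousands separators and whitespace, then convert to float.
    Returns None for empty or N/A-style values."""
    if raw is None:
        return None
    cleaned = raw.replace(",", "").replace(" ", "").strip()
    if cleaned.upper() in ("", "N/A", "NA", "-"):
        return None
    return float(cleaned)
```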
Output Files
- `canales.csv` - Sales channels with IDs
- `monedas.csv` - Currencies with IDs
- `ventas_normalizadas.csv` - Sales with foreign key references
- `ventas_normalizadas_completo.csv` - Human-readable version
Stage 3: SQL Schema Generation
Script: `crear_base_datos.py`
Generates a complete SQL Server DDL script:
Normalized Tables
Creates master tables (ciudades, segmentos, canales, monedas) and business tables (clientes, ventas)
Output File
- `crear_base_datos.sql` - Complete SQL script ready for execution
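Generating DDL from table definitions can be sketched as below. The function name and the example column types are illustrative assumptions, not the actual output of `crear_base_datos.py`:

```python
def create_table_ddl(name, columns):
    """Render a SQL Server CREATE TABLE statement from (column, type) pairs."""
    cols = ",\n    ".join(f"[{col}] {sqltype}" for col, sqltype in columns)
    return f"CREATE TABLE [{name}] (\n    {cols}\n);"

# Example: a master table for cities, referenced by clientes via foreign key.
ddl = create_table_ddl("ciudades", [
    ("id_ciudad", "INT IDENTITY(1,1) PRIMARY KEY"),
    ("nombre", "NVARCHAR(100) NOT NULL"),
])
```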
Stage 4: SQL Server Execution
The final stage is running the generated `crear_base_datos.sql` script against a SQL Server instance.
Key Technical Decisions
Encoding Resilience
Multiple encoding fallback strategies ensure data is read correctly regardless of source encoding
Date Flexibility
Parser accepts 8+ date formats and normalizes to ISO 8601 standard
Data Preservation
RAW tables preserve original data for audit, debugging, and reprocessing
UTF-8 BOM Output
Output files use UTF-8 with BOM for Excel compatibility
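In Python, the `utf-8-sig` codec prepends the BOM on write, which is what lets Excel detect UTF-8. A minimal sketch (the file name echoes one of the pipeline's outputs but the rows are illustrative):

```python
import csv

# "utf-8-sig" writes the byte order mark EF BB BF before the content.
with open("ciudades.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["id_ciudad", "nombre"])
    writer.writerow([1, "La Romana"])
```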
Running the ETL Pipeline
Execute the scripts in order: `normalizar_datos.py`, then `normalizar_ventas.py`, then `crear_base_datos.py`, and finally run the generated SQL script. Each script validates that the required input files exist before processing, so make sure the CSV files are in the correct directory.
Data Quality Issues Solved
The ETL pipeline addresses common data quality problems:

| Problem | Solution |
|---|---|
| Incorrect character encoding (“RaðLPA@rez”) | Multi-encoding detection + UTF-8 output |
| Mixed date formats | Flexible parser with 8 format attempts |
| Numbers as text (“1,403.70”) | Strip commas/spaces, convert to float |
| Missing values (N/A, empty strings) | Convert to NULL with validation |
| Data redundancy (“La Romana” repeated 50+ times) | Extract to master tables with IDs |
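The missing-value handling in the table above can be sketched as a small normalizer. The set of markers treated as missing is an assumption; the real scripts may recognize a different list:

```python
# Values that should become None (and later SQL NULL) during loading.
MISSING_MARKERS = {"", "n/a", "na", "null", "-"}

def normalize_missing(raw):
    """Return None for recognized missing-value markers, else the stripped text."""
    if raw is None:
        return None
    text = raw.strip()
    return None if text.lower() in MISSING_MARKERS else text
```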
Next Steps
Normalization Strategy
Learn how entities are extracted and normalized
Database Schema
Explore the complete database structure and relationships