
Overview

PruebaETL is an ETL (Extract, Transform, Load) pipeline that normalizes customer and sales data from CSV files and loads them into a SQL Server database. This guide will walk you through the complete workflow.
What you’ll accomplish:
  • Normalize messy CSV files with encoding issues
  • Extract redundant data into separate reference tables
  • Generate a SQL Server database with proper relationships
  • Load data with full referential integrity

Prerequisites

Before starting, ensure you have:
  • Python 3.7 or higher installed
  • SQL Server instance (local or remote)
  • SQL Server Management Studio (SSMS) or Azure Data Studio
  • Raw CSV files: clientes_prueba_mas_datos.csv and ventas_prueba_mas_datos.csv

Step-by-Step Workflow

Step 1: Normalize Customer Data

Run the customer normalization script to clean and structure customer data:
python normalizar_datos.py
What it does:
  • Fixes character encoding issues (UTF-8, Latin-1, CP1252)
  • Normalizes date formats to YYYY-MM-DD
  • Cleans text fields (removes extra spaces, special characters)
  • Extracts cities and segments into separate tables (see the lookup-extraction sketch at the end of this step)
  • Generates foreign key references
Sample Input:
cliente_id,nombre,ciudad,segmento,fecha_registro
1,Clínica Guerrero S.A.,La Romana,Individual,2024-11-23
2,Raúl Pérez,San Pedro de Macorís,Retail,2024-12-12
3,Ferretería Guerrero Corp.,San Pedro de Macorís,Individual,2024-01-17
Output Files:
✓ Creado: clientes_normalizados.csv (350 registros)
✓ Creado: ciudades.csv (15 ciudades)
✓ Creado: segmentos.csv (3 segmentos)
✓ Creado: clientes_normalizados_completo.csv (350 registros)
clientes_normalizados.csv:
cliente_id,nombre,ciudad_id,segmento_id,fecha_registro
1,Clínica Guerrero S.A.,10,2,2024-11-23
2,Raúl Pérez,12,3,2024-12-12
ciudades.csv:
ciudad_id,nombre
1,Azua
2,Barahona
3,Higüey
segmentos.csv:
segmento_id,nombre
1,Corporativo
2,Individual
3,Retail
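The core of this step is the lookup extraction. Below is a minimal sketch of the idea; it is simplified, since the actual normalizar_datos.py also fixes encodings, cleans text, and normalizes dates, and appears to assign lookup IDs alphabetically rather than by first appearance:
import csv

ciudades, segmentos, clientes = {}, {}, []

with open('clientes_prueba_mas_datos.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        # Assign a new ID the first time each city/segment value is seen
        ciudad_id = ciudades.setdefault(row['ciudad'].strip(), len(ciudades) + 1)
        segmento_id = segmentos.setdefault(row['segmento'].strip(), len(segmentos) + 1)
        clientes.append({
            'cliente_id': row['cliente_id'],
            'nombre': row['nombre'].strip(),
            'ciudad_id': ciudad_id,
            'segmento_id': segmento_id,
            'fecha_registro': row['fecha_registro'],
        })

# Write one lookup table; segmentos.csv follows the same pattern
with open('ciudades.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['ciudad_id', 'nombre'])
    for nombre, ciudad_id in sorted(ciudades.items(), key=lambda x: x[1]):
        writer.writerow([ciudad_id, nombre])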

Step 2: Normalize Sales Data

Run the sales normalization script to process transaction data:
python normalizar_ventas.py
What it does:
  • Normalizes numeric values (removes commas, handles N/A)
  • Converts dates to standard format
  • Extracts sales channels and currencies into lookup tables
  • Validates foreign key relationships with customers
Sample Input:
venta_id,cliente_id,fecha,total,moneda,canal
1001,66,2025-03-27,1316.31,DOP,kiosko
1002,24,2025-01-09,2912.88,DOP,web
1003,208,2025-02-17,464.1,DOP,callcenter
Output Files:
✓ Creado: ventas_normalizadas.csv (500 registros)
✓ Creado: canales.csv (4 canales)
✓ Creado: monedas.csv (2 monedas)
✓ Creado: ventas_normalizadas_completo.csv (500 registros)
ventas_normalizadas.csv:
venta_id,cliente_id,fecha,total,moneda_id,canal_id
1001,66,2025-03-27,1316.31,1,2
1002,24,2025-01-09,2912.88,1,4
canales.csv:
canal_id,nombre
1,callcenter
2,kiosko
3,tienda
4,web
monedas.csv:
moneda_id,codigo
1,DOP
2,USD

Step 3: Organize Normalized Files

Move the generated CSV files into a cleanData directory for SQL script generation:
# Create the directory
mkdir cleanData

# Move normalized files
mv ciudades.csv segmentos.csv clientes_normalizados.csv cleanData/
mv canales.csv monedas.csv ventas_normalizadas.csv cleanData/
The SQL generation script expects normalized files in the cleanData/ directory. The original input files should remain in the project root.
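If you are on Windows or prefer to stay in Python, the same step can be done with the standard library (a small sketch, not part of the project's scripts):
import os
import shutil

# Create cleanData/ and move the normalized files into it,
# leaving the raw input CSVs in the project root
os.makedirs('cleanData', exist_ok=True)

archivos = ['ciudades.csv', 'segmentos.csv', 'clientes_normalizados.csv',
            'canales.csv', 'monedas.csv', 'ventas_normalizadas.csv']
for archivo in archivos:
    shutil.move(archivo, os.path.join('cleanData', archivo))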

Step 4: Generate SQL Server Script

Create the database schema and INSERT statements:
python crear_base_datos.py
What it does:
  • Reads all normalized CSV files from cleanData/ directory
  • Generates DDL for database PruebaTecnicaDNI
  • Creates both RAW tables (original data) and normalized tables
  • Defines primary keys, foreign keys, and indexes
  • Generates INSERT statements for all data (a simplified generation sketch appears at the end of this step)
Output:
✓ Script SQL generado: crear_base_datos.sql
  Tamaño: 125000 caracteres

Para ejecutar el script:
  1. Abre SQL Server Management Studio
  2. Conéctate a tu servidor SQL Server
  3. Abre el archivo 'crear_base_datos.sql'
  4. Ejecuta el script completo
The script creates the following tables:
RAW Tables (Original Data):
  • clientes_raw - Original customer records
  • ventas_raw - Original sales transactions
Normalized Tables:
  • ciudades - City lookup table
  • segmentos - Customer segment types
  • canales - Sales channel types
  • monedas - Currency codes
  • clientes - Normalized customer master data
  • ventas - Normalized sales transactions
Relationships:
clientes → ciudades (FK: ciudad_id)
clientes → segmentos (FK: segmento_id)
ventas → clientes (FK: cliente_id)
ventas → monedas (FK: moneda_id)
ventas → canales (FK: canal_id)
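The INSERT statements are generated directly from the normalized CSVs. A simplified sketch of the idea (the real crear_base_datos.py also emits the DDL, keys, and indexes, and presumably writes numeric columns without quotes):
import csv

def generar_inserts(ruta_csv, tabla):
    """Turn each row of a normalized CSV into an INSERT statement."""
    sentencias = []
    with open(ruta_csv, newline='', encoding='utf-8') as f:
        lector = csv.DictReader(f)
        columnas = lector.fieldnames
        for fila in lector:
            # For brevity every value is quoted; SQL Server converts quoted
            # numerics implicitly, but real output would emit them unquoted
            valores = ', '.join(
                'NULL' if fila[c] == '' else "'" + fila[c].replace("'", "''") + "'"
                for c in columnas
            )
            sentencias.append(
                f"INSERT INTO {tabla} ({', '.join(columnas)}) VALUES ({valores});"
            )
    return sentencias

# Example: the inserts for the ciudades lookup table
print('\n'.join(generar_inserts('cleanData/ciudades.csv', 'ciudades')))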

Step 5: Execute in SQL Server

Load the data into your SQL Server instance.
Using SQL Server Management Studio:
  1. Open SSMS and connect to your SQL Server
  2. Open crear_base_datos.sql file
  3. Execute the entire script (F5 or click Execute)
  4. Verify the database was created successfully
Using Azure Data Studio:
  1. Connect to your SQL Server instance
  2. Open crear_base_datos.sql
  3. Click Run or press F5
  4. Check the Messages pane for confirmation
Using sqlcmd (Command Line):
sqlcmd -S localhost -U sa -P YourPassword -i crear_base_datos.sql
Expected Output:
Database 'PruebaTecnicaDNI' created successfully
Tables created: 10
Raw records inserted: 850
Normalized records inserted: 850
Foreign keys established: 5
Indexes created: 6
If the database already exists, the script will drop and recreate all tables. Make sure to back up any existing data before running the script.

Step 6: Verify the Results

Run validation queries to confirm the data loaded correctly:
USE PruebaTecnicaDNI;
GO

-- Check record counts
SELECT 'Clientes' AS Tabla, COUNT(*) AS Total FROM clientes
UNION ALL
SELECT 'Ventas', COUNT(*) FROM ventas
UNION ALL
SELECT 'Ciudades', COUNT(*) FROM ciudades
UNION ALL
SELECT 'Segmentos', COUNT(*) FROM segmentos;
Sample Queries:
SELECT 
    c.nombre AS Cliente,
    COUNT(v.venta_id) AS TotalVentas,
    SUM(v.total) AS MontoTotal,
    m.codigo AS Moneda
FROM ventas v
JOIN clientes c ON v.cliente_id = c.cliente_id
JOIN monedas m ON v.moneda_id = m.moneda_id
GROUP BY c.nombre, m.codigo
ORDER BY MontoTotal DESC;
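If you would rather run the checks from Python than from SSMS, a minimal pyodbc sketch works too (assuming the pyodbc package and a SQL Server ODBC driver are installed; adjust the driver name and credentials for your environment):
import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;'
    'DATABASE=PruebaTecnicaDNI;UID=sa;PWD=YourPassword'
)
cursor = conn.cursor()
for tabla in ('clientes', 'ventas', 'ciudades', 'segmentos'):
    cursor.execute(f'SELECT COUNT(*) FROM {tabla}')
    print(tabla, cursor.fetchone()[0])
conn.close()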

Understanding the Code

Date Normalization

The ETL pipeline handles multiple date formats automatically:
from datetime import datetime

def normalizar_fecha(fecha_str):
    """Normalizes different date formats to YYYY-MM-DD"""
    if not fecha_str or fecha_str.strip() == '':
        return None
    
    fecha_str = fecha_str.strip()
    formatos = [
        '%Y-%m-%d',      # 2024-11-23
        '%Y/%m/%d',      # 2024/11/23
        '%d/%m/%Y',      # 23/11/2024
        '%d-%m-%Y',      # 23-11-2024
        '%m/%d/%Y',      # 11/23/2024
    ]
    
    for formato in formatos:
        try:
            fecha = datetime.strptime(fecha_str, formato)
            return fecha.strftime('%Y-%m-%d')
        except ValueError:
            continue
    
    return None
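Because day-first formats are listed before month-first, an ambiguous value such as 05/04/2024 is interpreted as April 5. A quick check of the behavior:
print(normalizar_fecha('23/11/2024'))   # '2024-11-23' (matches %d/%m/%Y)
print(normalizar_fecha('11/23/2024'))   # '2024-11-23' (falls through to %m/%d/%Y)
print(normalizar_fecha('sin fecha'))    # None (no format matches)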

Text Cleaning

Removes extra spaces and handles empty values:
import re

def limpiar_texto(texto):
    if not texto:
        return None
    
    texto = texto.strip()
    # Replace multiple spaces with single space
    texto = re.sub(r'\s+', ' ', texto)
    
    if texto == '':
        return None
    
    return texto

Number Normalization

Handles formatted numbers and special values:
def normalizar_numero(numero_str):
    if not numero_str or numero_str.strip() == '' or numero_str.strip().upper() == 'N/A':
        return None
    
    # Remove commas and spaces
    numero_str = numero_str.replace(',', '').replace(' ', '').strip()
    
    try:
        return float(numero_str)
    except ValueError:
        return None
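Typical inputs and results:
print(normalizar_numero('2,912.88'))  # 2912.88 (thousands separator removed)
print(normalizar_numero('N/A'))       # None
print(normalizar_numero('  '))        # None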

Troubleshooting

Problem: Characters like “Raúl Pérez” appear as “RaðLPA@rez”
Solution: The scripts automatically try multiple encodings (UTF-8, Latin-1, CP1252, ISO-8859-1):
import csv

encodings = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']
for encoding in encodings:
    try:
        with open('clientes_prueba_mas_datos.csv', 'r', encoding=encoding) as f:
            contenido = list(csv.DictReader(f))
            print(f"Archivo leído correctamente con codificación: {encoding}")
            break
    except UnicodeDecodeError:
        continue
If issues persist, check your CSV file encoding and save it as UTF-8.
Problem: FileNotFoundError: clientes_prueba_mas_datos.csv
Solution:
  1. Ensure your CSV files are in the same directory as the Python scripts
  2. Verify file names match exactly (case-sensitive on Linux/Mac)
  3. Check for hidden file extensions (.csv.txt)
# List files in current directory
ls -la *.csv

# Check if files exist
python -c "import os; print(os.listdir('.'))"
Problem: Cannot connect to SQL Server
Solution:
  1. Verify SQL Server is running:
    # Windows
    net start MSSQLSERVER
    
    # Linux
    sudo systemctl status mssql-server
    
  2. Check connection string and credentials
  3. Ensure TCP/IP is enabled in SQL Server Configuration Manager
  4. Verify firewall allows port 1433
Problem: FK_ventas_cliente constraint violation
Solution: This happens when sales reference customers that don’t exist. The normalization scripts validate this automatically:
# In normalizar_ventas.py
cliente_id = row.get('cliente_id', '').strip()
if not cliente_id or cliente_id == '':
    continue  # Skip sales without valid customer_id
If you still see errors:
  1. Run normalizar_datos.py before normalizar_ventas.py
  2. Verify all customer IDs in sales exist in the customer file (see the cross-check sketch below)
  3. Check the console output for skipped records
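To check point 2 without touching the database, you can cross-reference the two normalized files directly (a small diagnostic sketch, not part of the project's scripts):
import csv

with open('cleanData/clientes_normalizados.csv', newline='', encoding='utf-8') as f:
    ids_clientes = {row['cliente_id'] for row in csv.DictReader(f)}

with open('cleanData/ventas_normalizadas.csv', newline='', encoding='utf-8') as f:
    huerfanas = [row['venta_id'] for row in csv.DictReader(f)
                 if row['cliente_id'] not in ids_clientes]

print(f'Ventas sin cliente válido: {len(huerfanas)}')
print(huerfanas[:10])  # first few offending venta_id values, if any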
Problem: Normalized CSV files are empty or missing records
Solution: Check the console output for validation errors:
Leyendo archivo de clientes...
Archivo leído correctamente con codificación: utf-8
✓ Creado: clientes_normalizados.csv (0 registros)
If the record count is 0, the rows likely failed validation. Common issues:
  • Missing required fields (nombre, segmento)
  • Invalid data types
  • Empty rows in source CSV
Review your source CSV for data quality issues.
Problem: SQL script fails partway through execution
Solution: The script uses SET IDENTITY_INSERT for reference tables. If it fails:
  1. Manually turn off identity insert:
    SET IDENTITY_INSERT ciudades OFF;
    SET IDENTITY_INSERT segmentos OFF;
    SET IDENTITY_INSERT canales OFF;
    SET IDENTITY_INSERT monedas OFF;
    
  2. Drop and recreate the database:
    DROP DATABASE PruebaTecnicaDNI;
    
  3. Re-run the complete script from the beginning

Next Steps

Now that your database is set up, you can:
  • Query Your Data: Use the sample queries above to analyze sales patterns
  • Build Reports: Connect Power BI or Tableau to visualize trends
  • Add More Data: Run the normalization scripts with updated CSV files
  • Schedule ETL: Automate the pipeline with cron jobs or Task Scheduler
  • Extend the Schema: Add new tables for products, orders, or inventory
Pro Tip: The normalized structure reduces data redundancy by ~80% and significantly improves query performance through proper indexing and relationships.

Common Patterns

Incrementally Loading New Data

To add new records without recreating the database:
# Modify crear_base_datos.py to append instead of replace
# Remove the DROP TABLE statements
# Emit only INSERT INTO statements (the tables already exist, so skip CREATE TABLE)
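For example, if you track the highest venta_id already loaded, appending only the new rows could look like this minimal sketch (ultimo_id_cargado is a hypothetical value you would look up, not something the scripts maintain):
import csv

ultimo_id_cargado = 500  # hypothetical: highest venta_id already in the database

with open('cleanData/ventas_normalizadas.csv', newline='', encoding='utf-8') as f:
    nuevas = [row for row in csv.DictReader(f)
              if int(row['venta_id']) > ultimo_id_cargado]

# Emit INSERT statements only for the new rows (same quoting caveats as the
# generation sketch in Step 4)
for row in nuevas:
    columnas = ', '.join(row.keys())
    valores = ', '.join("'" + v.replace("'", "''") + "'" for v in row.values())
    print(f'INSERT INTO ventas ({columnas}) VALUES ({valores});')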

Handling Updates to Master Data

When cities or segments change:
-- Add a new city
INSERT INTO ciudades (nombre) VALUES ('Nueva Ciudad');

-- Update customer with new city
UPDATE clientes 
SET ciudad_id = (SELECT ciudad_id FROM ciudades WHERE nombre = 'Nueva Ciudad')
WHERE cliente_id = 123;

Performance Optimization

The scripts create indexes automatically, but for large datasets:
-- Add additional indexes for frequently queried columns
CREATE INDEX IX_ventas_fecha_total ON ventas(fecha, total);
CREATE INDEX IX_clientes_segmento_ciudad ON clientes(segmento_id, ciudad_id);

File Structure

After running all scripts, your project structure should look like:
project/
├── clientes_prueba_mas_datos.csv          # Original customer data
├── ventas_prueba_mas_datos.csv            # Original sales data
├── normalizar_datos.py                     # Customer normalization script
├── normalizar_ventas.py                    # Sales normalization script
├── crear_base_datos.py                     # SQL generation script
├── cleanData/                              # Generated normalized files
│   ├── ciudades.csv
│   ├── segmentos.csv
│   ├── clientes_normalizados.csv
│   ├── canales.csv
│   ├── monedas.csv
│   └── ventas_normalizadas.csv
└── crear_base_datos.sql                    # Generated SQL script (ready to execute)

Summary

You’ve successfully completed the PruebaETL workflow:
  1. Normalized messy CSV data with encoding and format issues
  2. Structured data into a relational schema (3NF)
  3. Generated a complete SQL Server database with referential integrity
  4. Loaded data with proper primary and foreign key relationships
The result is a clean, optimized database ready for analytics and reporting.
