Overview

PruebaETL includes utility functions for data normalization, validation, and database generation. All functions are designed to handle messy real-world data with graceful error handling.

Data Normalization Functions

normalizar_fecha()

Description: Normalizes date strings from multiple formats to ISO format (YYYY-MM-DD).

Location: normalizar_datos.py:6, normalizar_ventas.py:5

Signature:
def normalizar_fecha(fecha_str: str) -> str | None

Parameters:
  • fecha_str (string, required): Date string in any supported format

Returns: Normalized date string in YYYY-MM-DD format, or None if invalid/empty.

Supported Date Formats:
  • YYYY-MM-DD (ISO format)
  • YYYY/MM/DD
  • DD/MM/YYYY
  • DD-MM-YYYY
  • MM/DD/YYYY
  • M-D-Y variations
Examples:
# Valid dates
normalizar_fecha("2024-11-23")      # Returns: "2024-11-23"
normalizar_fecha("23/11/2024")      # Returns: "2024-11-23"
normalizar_fecha("11-23-2024")      # Returns: "2024-11-23"

# Invalid/empty dates
normalizar_fecha("")                # Returns: None
normalizar_fecha("invalid")         # Returns: None
normalizar_fecha("  ")              # Returns: None
Implementation Details:
  • Tries multiple format parsers sequentially
  • Strips whitespace before parsing
  • Returns None for empty strings or unparseable dates
  • Uses Python’s datetime.strptime() for validation
The function attempts formats in order. If a date is ambiguous (e.g., “01/02/2024”), it may be interpreted as Jan 2 or Feb 1 depending on format order.
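The sequential-parsing behavior described above can be sketched as a loop over candidate formats. This is a hypothetical re-implementation for illustration, not the exact code from normalizar_datos.py; in particular, the real format order may differ:

```python
from datetime import datetime

# Candidate formats, tried in order (assumed order; the real list may differ)
FORMATOS = ["%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y", "%d-%m-%Y", "%m/%d/%Y", "%m-%d-%Y"]

def normalizar_fecha(fecha_str):
    # Empty or whitespace-only input is treated as missing
    if not fecha_str or not fecha_str.strip():
        return None
    fecha_str = fecha_str.strip()
    for fmt in FORMATOS:
        try:
            # strptime both validates and parses; re-emit as ISO
            return datetime.strptime(fecha_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # no format matched
```

Because day-first formats come before month-first ones in this sketch, an ambiguous "01/02/2024" parses as February 1, matching the caveat above.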

limpiar_texto()

Description: Cleans and normalizes text by removing extra whitespace and handling empty values.

Location: normalizar_datos.py:33, normalizar_ventas.py:25

Signature:
def limpiar_texto(texto: str) -> str | None

Parameters:
  • texto (string, required): Text string to clean

Returns: Cleaned text string, or None if empty/invalid.

Processing Steps:
  1. Strips leading/trailing whitespace
  2. Replaces multiple spaces with single space
  3. Returns None if result is empty string
Examples:
# Valid text
limpiar_texto("  Raúl   Pérez  ")    # Returns: "Raúl Pérez"
limpiar_texto("Normal Text")        # Returns: "Normal Text"
limpiar_texto("Multiple   Spaces")  # Returns: "Multiple Spaces"

# Empty/invalid text
limpiar_texto("")                   # Returns: None
limpiar_texto("   ")                # Returns: None
limpiar_texto(None)                 # Returns: None
Use Cases:
  • Cliente names
  • City names
  • Segment names
  • Channel names
  • Any text field from CSV
This function preserves special characters (á, é, ñ) while normalizing whitespace.
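The three processing steps above can be sketched in a few lines; this is an illustrative re-implementation, not the actual code from normalizar_datos.py:

```python
def limpiar_texto(texto):
    # None and "" both count as missing input
    if not texto:
        return None
    # str.split() with no argument collapses any run of whitespace
    # (spaces, tabs, newlines) into single separators
    limpio = " ".join(texto.split())
    # An all-whitespace input collapses to "", which becomes None
    return limpio or None
```

Note that `" ".join(texto.split())` handles both the strip and the space-collapsing steps, and leaves accented characters (á, é, ñ) untouched.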

normalizar_numero()

Description: Converts number strings to float, handling formatting and special values.

Location: normalizar_datos.py:47, normalizar_ventas.py:37

Signature:
def normalizar_numero(numero_str: str) -> float | None

Parameters:
  • numero_str (string, required): Number string with possible formatting (commas, spaces)

Returns: Float value, or None if invalid/empty/N/A.

Processing Steps:
  1. Checks for empty string or “N/A”
  2. Removes commas and spaces
  3. Converts to float
  4. Returns None if conversion fails
Examples:
# Valid numbers
normalizar_numero("1403.70")        # Returns: 1403.7
normalizar_numero("1,403.70")       # Returns: 1403.7
normalizar_numero("1 403.70")       # Returns: 1403.7
normalizar_numero("100")            # Returns: 100.0

# Invalid/special values
normalizar_numero("N/A")            # Returns: None
normalizar_numero("n/a")            # Returns: None
normalizar_numero("")               # Returns: None
normalizar_numero("abc")            # Returns: None
Use Cases:
  • Sale totals
  • Monetary amounts
  • Quantity fields
  • Any numeric field from CSV
This function does not validate currency symbols. Remove currency symbols before calling.
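The four processing steps above can be sketched as follows; this is a hypothetical re-implementation for illustration, not the exact code from normalizar_datos.py:

```python
def normalizar_numero(numero_str):
    if not numero_str:
        return None
    limpio = numero_str.strip()
    # Step 1: empty or "N/A" (any case) means missing
    if not limpio or limpio.lower() == "n/a":
        return None
    # Step 2: drop thousands separators (commas) and internal spaces
    limpio = limpio.replace(",", "").replace(" ", "")
    # Steps 3-4: convert, returning None on failure
    try:
        return float(limpio)
    except ValueError:
        return None
```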

Data Processing Functions

procesar_clientes()

Description: Main function to process the clientes CSV and generate normalized output files.

Location: normalizar_datos.py:59

Signature:
def procesar_clientes() -> None

Parameters: None (reads clientes_prueba_mas_datos.csv from the current directory)

Returns: None (generates CSV files as a side effect)

Generated Files:
  • clientes_normalizados.csv - Normalized clientes with foreign keys
  • ciudades.csv - Unique cities with IDs
  • segmentos.csv - Unique segments with IDs
  • clientes_normalizados_completo.csv - Denormalized view for verification
Processing Logic:
  1. Read CSV with multiple encoding fallbacks:
    • UTF-8
    • Latin-1
    • CP1252
    • ISO-8859-1
  2. Extract and normalize data:
    • Clean all text fields
    • Normalize dates
    • Handle missing cliente_id (auto-generate)
    • Validate required fields
  3. Build master tables:
    • Extract unique cities
    • Extract unique segments
    • Assign auto-increment IDs
  4. Create normalized records:
    • Replace city names with city_id
    • Replace segment names with segmento_id
  5. Write output files with UTF-8 BOM encoding
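Steps 3 and 4 above (building master tables, then swapping names for IDs) can be sketched as below. `construir_maestro` is a hypothetical helper name for illustration, not necessarily a function in the actual scripts:

```python
def construir_maestro(valores):
    """Assign auto-increment IDs to unique non-empty values, in first-seen order."""
    maestro = {}
    for valor in valores:
        if valor and valor not in maestro:
            maestro[valor] = len(maestro) + 1
    return maestro

# Example: two unique cities get IDs 1 and 2; the normalized cliente
# record then stores maestro[nombre_ciudad] instead of the city name.
ciudades = construir_maestro(["La Romana", "Santo Domingo", "La Romana", ""])
```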
Example Usage:
if __name__ == '__main__':
    procesar_clientes()
Console Output:
Leyendo archivo de clientes...
Archivo leído correctamente con codificación: utf-8

Generando archivos normalizados...
✓ Creado: clientes_normalizados.csv (150 registros)
✓ Creado: ciudades.csv (8 ciudades)
✓ Creado: segmentos.csv (3 segmentos)
✓ Creado: clientes_normalizados_completo.csv (150 registros)

¡Normalización completada!

Resumen:
  - Clientes procesados: 150
  - Ciudades únicas: 8
  - Segmentos únicos: 3
If a cliente has no name but has a city and segment, a default placeholder name beginning with “Cliente” is generated.

procesar_ventas()

Description: Main function to process the sales CSV and generate normalized output files.

Location: normalizar_ventas.py:48

Signature:
def procesar_ventas() -> None

Parameters: None (reads ventas_prueba_mas_datos.csv from the current directory)

Returns: None (generates CSV files as a side effect)

Generated Files:
  • ventas_normalizadas.csv - Normalized sales with foreign keys
  • canales.csv - Unique channels with IDs
  • monedas.csv - Unique currencies with IDs
  • ventas_normalizadas_completo.csv - Denormalized view for verification
Processing Logic:
  1. Read CSV with multiple encoding fallbacks
  2. Extract and normalize data:
    • Clean text fields (channel, currency)
    • Normalize dates
    • Normalize numeric totals
    • Handle missing venta_id (starts at 1001)
    • Validate required fields
  3. Build master tables:
    • Extract unique channels
    • Extract unique currencies
    • Assign auto-increment IDs
  4. Create normalized records:
    • Replace channel names with canal_id
    • Replace currency codes with moneda_id
  5. Write output files with UTF-8 BOM encoding
Example Usage:
if __name__ == '__main__':
    procesar_ventas()
Console Output:
Leyendo archivo de ventas...
Archivo leído correctamente con codificación: utf-8

Generando archivos normalizados...
✓ Creado: ventas_normalizadas.csv (500 registros)
✓ Creado: canales.csv (4 canales)
✓ Creado: monedas.csv (2 monedas)
✓ Creado: ventas_normalizadas_completo.csv (500 registros)

¡Normalización completada!

Resumen:
  - Ventas procesadas: 500
  - Canales únicos: 4
  - Monedas únicas: 2
Sales records without a valid cliente_id or total are skipped entirely.
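The missing venta_id handling in step 2 of the processing logic (auto-generated IDs starting at 1001) can be sketched as follows; `asignar_venta_id` is a hypothetical helper name, not the actual code from normalizar_ventas.py:

```python
def asignar_venta_id(venta_id_str, siguiente_id):
    """Return (venta_id, next auto id); auto-generated IDs start at 1001."""
    venta_id_str = (venta_id_str or "").strip()
    if venta_id_str:
        # CSV already provides an ID; the auto counter is untouched
        return venta_id_str, siguiente_id
    # Missing ID: take the next auto-generated one and advance the counter
    return str(siguiente_id), siguiente_id + 1
```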

SQL Generation Functions

generar_sql_crear_bd()

Description: Generates DDL SQL for creating the database and all tables.

Location: crear_base_datos.py:8

Signature:
def generar_sql_crear_bd() -> str

Parameters: None

Returns: String containing the complete SQL script for database and table creation.

Generated SQL Includes:
  • Database creation (PruebaTecnicaDNI)
  • Raw tables (clientes_raw, ventas_raw)
  • Master tables (ciudades, segmentos, canales, monedas)
  • Business tables (clientes, ventas)
  • Primary key constraints
  • Foreign key constraints
  • Indexes on foreign keys and date fields
Example Usage:
sql_script = generar_sql_crear_bd()
print(sql_script)

leer_csv()

Description: Reads a CSV file from the cleanData directory.

Location: crear_base_datos.py:165

Signature:
def leer_csv(archivo: str) -> list[dict]

Parameters:
  • archivo (string, required): CSV filename (looked up in the cleanData/ directory)

Returns: List of dictionaries, one per CSV row.

Example Usage:
ciudades = leer_csv('ciudades.csv')
# Returns: [{'ciudad_id': '1', 'nombre': 'La Romana'}, ...]

clientes = leer_csv('clientes_normalizados.csv')
# Returns: [{'cliente_id': '1', 'nombre': 'Juan Pérez', ...}, ...]
Returns empty list if file doesn’t exist. Prints warning but doesn’t raise exception.
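The behavior described here can be sketched with csv.DictReader; this is a hypothetical re-implementation, and the actual path handling and encoding choice live in crear_base_datos.py:

```python
import csv
import os

def leer_csv(archivo):
    ruta = os.path.join("cleanData", archivo)
    if not os.path.exists(ruta):
        # Warn but do not raise, mirroring the documented behavior
        print(f"Advertencia: no se encontró {ruta}")
        return []
    # utf-8-sig transparently strips the UTF-8 BOM the pipeline writes
    with open(ruta, "r", encoding="utf-8-sig", newline="") as f:
        return list(csv.DictReader(f))
```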

generar_sql_insertar_datos()

Description: Generates INSERT statements for normalized data.

Location: crear_base_datos.py:180

Signature:
def generar_sql_insertar_datos() -> str

Parameters: None (reads from cleanData/*.csv files)

Returns: String containing SQL INSERT statements.

Processing:
  1. Reads all normalized CSV files
  2. Generates INSERT statements in dependency order:
    • Ciudades (referenced by clientes)
    • Segmentos (referenced by clientes)
    • Canales (referenced by ventas)
    • Monedas (referenced by ventas)
    • Clientes (references ciudades, segmentos)
    • Ventas (references clientes, monedas, canales)
  3. Handles IDENTITY columns with SET IDENTITY_INSERT
  4. Escapes single quotes in text fields
  5. Converts empty strings to NULL
Example Output:
-- Insertar Ciudades
SET IDENTITY_INSERT ciudades ON;
INSERT INTO ciudades (ciudad_id, nombre) VALUES (1, 'La Romana');
INSERT INTO ciudades (ciudad_id, nombre) VALUES (2, 'Santo Domingo');
SET IDENTITY_INSERT ciudades OFF;
GO

-- Insertar Clientes
INSERT INTO clientes (cliente_id, nombre, ciudad_id, segmento_id, fecha_registro) 
  VALUES (1, 'Raúl Pérez', 1, 2, '2024-11-23');
GO
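Steps 4 and 5 of the processing above (quote escaping and empty-to-NULL conversion) can be sketched as a small value formatter; `formatear_valor_sql` is a hypothetical helper name for illustration:

```python
def formatear_valor_sql(valor):
    """Render a CSV value as a SQL literal: escape quotes, map empty to NULL."""
    if valor is None or valor == "":
        return "NULL"
    # SQL Server escapes a single quote by doubling it
    escapado = str(valor).replace("'", "''")
    return f"'{escapado}'"
```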

generar_sql_insertar_raw()

Description: Generates INSERT statements for raw data from the original CSV files.

Location: crear_base_datos.py:255

Signature:
def generar_sql_insertar_raw() -> str

Parameters: None (reads the original CSV files from the current directory)

Returns: String containing SQL INSERT statements for raw tables.

Reads From:
  • clientes_prueba_mas_datos.csv → clientes_raw table
  • ventas_prueba_mas_datos.csv → ventas_raw table
Purpose: Preserves original data for auditing and debugging.

generar_script_completo()

Description: Combines all SQL generation functions into one complete script.

Location: crear_base_datos.py:319

Signature:
def generar_script_completo() -> str

Parameters: None

Returns: Complete SQL script with:
  1. Database and table creation
  2. Raw data inserts
  3. Normalized data inserts
Example Usage:
script = generar_script_completo()
with open('crear_base_datos.sql', 'w', encoding='utf-8') as f:
    f.write(script)

guardar_script_sql()

Description: Generates and saves the complete SQL script to a file.

Location: crear_base_datos.py:329

Signature:
def guardar_script_sql() -> None

Parameters: None

Returns: None (creates the crear_base_datos.sql file)

Example Usage:
if __name__ == '__main__':
    print("Generando script SQL para crear la base de datos...")
    guardar_script_sql()
    print("¡Proceso completado!")
Console Output:
Generando script SQL para crear la base de datos...
✓ Script SQL generado: crear_base_datos.sql
  Tamaño: 125000 caracteres

Para ejecutar el script:
  1. Abre SQL Server Management Studio
  2. Conéctate a tu servidor SQL Server
  3. Abre el archivo 'crear_base_datos.sql'
  4. Ejecuta el script completo

¡Proceso completado!

Error Handling

Encoding Fallback

All CSV reading functions try multiple encodings:
import csv

encodings = ['utf-8', 'latin-1', 'cp1252', 'iso-8859-1']
content = None
for encoding in encodings:
    try:
        with open(filename, 'r', encoding=encoding, newline='') as f:
            content = list(csv.DictReader(f))
            break
    except UnicodeDecodeError:
        continue
# Note: latin-1 accepts any byte sequence, so the later fallbacks are
# rarely reached in practice.

Missing Fields

Functions gracefully handle missing CSV fields:
cliente_id = row.get('cliente_id', '').strip()
if not cliente_id:
    # Fall back to an auto-generated ID when the field is missing or blank
    cliente_id = str(auto_increment_counter)

Validation

Required fields are validated before processing:
# Inside the row-processing loop:
if cliente_id and total is not None and moneda and canal:
    ...  # Process record
else:
    continue  # Skip invalid record

Complete Workflow Example

# Step 1: Normalize clientes
import normalizar_datos
normalizar_datos.procesar_clientes()

# Step 2: Normalize ventas
import normalizar_ventas
normalizar_ventas.procesar_ventas()

# Step 3: Generate SQL script
import crear_base_datos
crear_base_datos.guardar_script_sql()

# Step 4: Execute in SQL Server
# (Manual step using SSMS)
