
PruebaETL Documentation

Transform messy CSV data into clean, normalized SQL Server databases with automatic encoding detection, data cleaning, and schema generation

Quick start

Get up and running with PruebaETL in minutes

1. Prepare your CSV files

Place your customer and sales CSV files in the project directory. The tool supports multiple encodings (UTF-8, Latin-1, CP1252) and will automatically detect the correct one.
# Your files should be named:
clientes_prueba_mas_datos.csv
ventas_prueba_mas_datos.csv
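The encoding detection can be approximated by trying each candidate encoding in turn until one decodes cleanly. A minimal sketch, assuming a simple fallback strategy (not necessarily the exact logic inside the normalization scripts):

```python
def read_csv_any_encoding(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try each encoding in order and return (text, encoding_used).

    Assumed approach for illustration; the project's scripts may
    detect encodings differently.
    """
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any of {encodings}")
```

Note that Latin-1 never raises a decode error (every byte is valid), so it effectively acts as the last-resort fallback when placed at the end of the list.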
2. Normalize customer data

Run the customer normalization script to clean and normalize your customer data, extracting cities and segments into separate tables.
python normalizar_datos.py
The script generates four files in the current directory:
  • ciudades.csv - Unique cities with IDs
  • segmentos.csv - Unique customer segments with IDs
  • clientes_normalizados.csv - Normalized customer data with foreign keys
  • clientes_normalizados_completo.csv - Denormalized view for reference
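The extraction of cities and segments into separate tables follows a standard dimension-table pattern: assign each unique value a surrogate key, then replace the raw column with a foreign key. A generic sketch with hypothetical column names (the real scripts' fields and cleaning rules may differ):

```python
def extract_dimension(rows, column):
    """Map each unique, cleaned value in `column` to a 1-based surrogate key."""
    ids = {}
    for row in rows:
        value = row[column].strip().title()  # basic text normalization
        if value and value not in ids:
            ids[value] = len(ids) + 1
    return ids

def normalize(rows, column, ids, fk_name):
    """Replace the raw text column with a foreign key into the dimension table."""
    out = []
    for row in rows:
        new = dict(row)
        new[fk_name] = ids.get(new.pop(column).strip().title())
        out.append(new)
    return out
```

Applying `extract_dimension` to the city column would yield the contents of ciudades.csv, and `normalize` would produce the foreign-key column in clientes_normalizados.csv.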
3. Normalize sales data

Run the sales normalization script to process transaction data.
python normalizar_ventas.py
This creates normalized sales files with proper data types and reference tables for channels and currencies.
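Part of producing proper data types is parsing monetary amounts that may arrive in either '1,234.56' or '1.234,56' style. A minimal sketch of such a parser (the accepted formats are assumptions; adjust to what your source data actually contains):

```python
from decimal import Decimal, InvalidOperation

def parse_amount(raw):
    """Parse an amount string into a Decimal, or return None if unparseable.

    Handles both thousand-separator conventions by treating whichever
    separator appears last as the decimal point.
    """
    s = raw.strip().replace(" ", "")
    if "," in s and "." in s:
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")  # 1.234,56 -> 1234.56
        else:
            s = s.replace(",", "")                    # 1,234.56 -> 1234.56
    elif "," in s:
        s = s.replace(",", ".")                       # 12,5 -> 12.5
    try:
        return Decimal(s)
    except InvalidOperation:
        return None
```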
4. Organize files

Move the generated CSV files into a cleanData directory.
mkdir cleanData
mv *.csv cleanData/
# Move original input files back to root
mv cleanData/clientes_prueba_mas_datos.csv cleanData/ventas_prueba_mas_datos.csv .
5. Generate and execute SQL script

Create the SQL Server database schema and load your normalized data.
# Generate the SQL script
python crear_base_datos.py

# This creates crear_base_datos.sql
# Open it in SQL Server Management Studio and execute
The generated script creates the PruebaTecnicaDNI database with both raw and normalized tables, complete with foreign keys and indexes.
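Script generation of this kind typically assembles DDL strings from table metadata. A simplified, hypothetical sketch of how one dimension table's DDL could be emitted (the actual output of crear_base_datos.py may use different names, types, and constraints):

```python
def dimension_ddl(table, id_col, name_col):
    """Emit CREATE TABLE DDL for a simple lookup/dimension table.

    Illustrative only; table and column names here are assumptions.
    """
    return (
        f"CREATE TABLE {table} (\n"
        f"    {id_col} INT PRIMARY KEY,\n"
        f"    {name_col} NVARCHAR(100) NOT NULL UNIQUE\n"
        f");"
    )
```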

Key features

Everything you need for robust ETL processing

Multi-encoding support

Automatically detects and handles UTF-8, Latin-1 (ISO-8859-1), and CP1252 encodings

Data normalization

Extracts redundant data into separate tables following relational database design principles

Flexible date parsing

Accepts multiple date formats and normalizes to ISO standard (YYYY-MM-DD)
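Multi-format date parsing can be sketched as trying a list of candidate formats and emitting ISO 8601 on the first match. The candidate formats below are assumptions; extend the list to match your data:

```python
from datetime import datetime

# Assumed candidate formats; the project's scripts may accept others.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")

def to_iso_date(raw):
    """Normalize a date string to YYYY-MM-DD, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable; the caller decides how to flag it
```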

SQL Server integration

Generates complete SQL scripts with DDL, constraints, indexes, and data insertion

Data quality

Cleans text, handles missing values, validates data types, and removes duplicates
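The cleaning and deduplication steps can be summarized in one generic pass: trim whitespace, normalize empty strings to missing values, and drop duplicate rows by key. This is a sketch of the general technique, not the project's exact rules:

```python
def clean_rows(rows, key_columns):
    """Trim strings, convert empty strings to None, and dedupe by key columns."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {k: (v.strip() or None) if isinstance(v, str) else v
               for k, v in row.items()}
        key = tuple(row[k] for k in key_columns)
        if key not in seen:  # keep the first occurrence of each key
            seen.add(key)
            cleaned.append(row)
    return cleaned
```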

Audit trail

Preserves raw data in separate tables for compliance and debugging

Explore by topic

Dive deeper into specific aspects of the ETL pipeline

ETL process

Understand the complete data transformation workflow

Database schema

Learn about the normalized relational structure

Customer data

How customer records are processed and cleaned

Sales data

Transaction data normalization and validation

Schema generation

SQL DDL and DML generation from normalized data

Functions reference

Complete API documentation for all Python functions

Ready to transform your data?

Start normalizing your CSV files and building production-ready SQL Server databases in minutes

Get Started