Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/obedc295/proyect_dw/llms.txt

Use this file to discover all available pages before exploring further.

ETL Dinámico is a modular, runtime-configurable ETL (Extract, Transform, Load) system built entirely in Python. It bridges an operational SQL Server database (OLTP) with a Data Warehouse (OLAP) by orchestrating extraction, transformation, and incremental loading — all driven by configuration that is defined at runtime through a Streamlit interface or through Python code directly, without touching a single line of pipeline logic. Traditional ETL tools often require hardcoded mappings, vendor-specific tooling, or rigid schema assumptions. ETL Dinámico takes a different approach: every pipeline run is described by a list of column mappings and a business key passed at call time. This means the same engine can load a DimCustomers table today and a FactOrders table tomorrow, with zero changes to the core codebase.

Why ETL Dinámico Exists

Most small-to-medium data warehouse projects face a common problem: the ETL layer starts as a handful of SQL scripts or hardcoded Pandas notebooks and quickly becomes a maintenance burden as tables and transformation rules evolve. ETL Dinámico was built to solve this by separating what to load (configuration) from how to load it (the pipeline engine), giving data engineers and analysts a single, reusable system they can point at any SQL Server table pair. Key motivations for the project:
  • Dynamic runtime configuration — column mappings, transformation types, business keys, and source/destination tables are all passed as data, not code.
  • Incremental loading by business key — the loader queries existing keys in the destination table and only appends rows that are not already present, making re-runs safe and idempotent.
  • No vendor lock-in — the stack is pure Python: pyodbc for ODBC connectivity, SQLAlchemy for engine abstraction, Pandas for in-memory transformation, and Streamlit for a self-contained UI.
  • Testable by design — each layer is a plain Python class with no global side effects, making unit testing with pytest straightforward.

System Architecture: Four Layers

ETL Dinámico is structured as four distinct layers. Data and configuration flow downward through the stack; results and status flow back upward.

Layer 1 — Settings (src/config/settings.py)

The Settings class is the single source of truth for database connectivity. It reads six environment variables from a .env file using python-dotenv, assembles raw ODBC connection strings, and encodes them into fully qualified SQLAlchemy URLs using urllib.parse.quote_plus. Both the OLTP (source) and OLAP (destination) engines are configured here. No other layer touches environment variables directly.
# The Settings class produces two SQLAlchemy-ready URLs:
# settings.OLTP_URL  — points at the source transactional database
# settings.OLAP_URL  — points at the Data Warehouse destination

Layer 2 — DatabaseClient (src/infrastructure/db_client.py)

DatabaseClient wraps two SQLAlchemy engines — one for OLTP, one for OLAP — and exposes a clean interface for the rest of the application. It provides connection context managers, table name inspection, and column metadata retrieval. The Streamlit UI and the ETL services both depend on DatabaseClient as their single point of database access.
from src.infrastructure.db_client import DatabaseClient

db_client = DatabaseClient()
# Inspect available tables in the source
oltp_tables = db_client.get_oltp_tables()
# Inspect column names and types in any table
columns = db_client.get_source_columns("dbo.Customers")

Layer 3 — Services (src/services/)

The services layer contains four classes that each own one responsibility in the ETL flow:
ClassFileResponsibility
DataExtractorextractor.pyReads rows from the OLTP source via pd.read_sql. Supports full-table extraction, column selection, row limits, and raw SQL queries.
DataTransformertransformer.pyApplies in-memory Pandas transformations: upper/lower case, concat of two columns, and year/month/day date extraction.
DataLoaderloader.pyPerforms incremental upsert-by-exclusion: queries existing business keys from the OLAP table and appends only new rows using DataFrame.to_sql.
ETLPipelinepipeline.pyOrchestrates the three services above. Its run_dynamic_etl() method accepts a source table, a target table, a business key, and a list of column mapping dictionaries, and returns a result dictionary with row counts and a data sample.

Layer 4 — Streamlit UI (app.py)

The Streamlit application provides a fully interactive, browser-based control panel for the pipeline. Users can select a source table (or write a custom SQL query), inspect column types, configure column-to-column mappings with transformation rules, specify a concatenation mapping for combining two columns into one, choose the destination table in the Data Warehouse, set the business key for incremental loading, and execute the pipeline — all without writing any Python. Results are displayed inline with row-count metrics, a JSON detail view, and a data preview table.

Supported Data Sources

ETL Dinámico currently supports Microsoft SQL Server as both the OLTP source and the OLAP destination. Connectivity is provided through the ODBC Driver 17 for SQL Server, using Windows Authentication (Trusted_Connection=yes) so that no username or password needs to be stored in the .env file or in code. Both databases can live on the same SQL Server instance (common in development with a local SQLEXPRESS installation) or on separate servers in a production environment.

Key Concepts

OLTP Source vs. OLAP Destination

The OLTP (Online Transaction Processing) database is your operational system — typically normalized tables updated in real time by business applications. The OLAP (Online Analytical Processing) database is your Data Warehouse — typically a star or snowflake schema with denormalized dimension and fact tables designed for analytical queries. ETL Dinámico reads from OLTP and writes to OLAP, handling the structural mismatch between the two through its column mapping and transformation system.

Incremental Loading with Business Keys

Rather than truncating and reloading destination tables on every run, ETL Dinámico uses a business key strategy. Before writing any rows, DataLoader.load_incremental() queries the set of existing values in the business key column of the destination table. Rows whose business key already exists in the destination are filtered out; only net-new rows are appended. This makes pipeline runs repeatable and safe — re-running after a partial failure will never produce duplicate records.

Column Mappings and Transform Types

Each pipeline run is driven by a list of column mapping dictionaries. Two distinct shapes are supported: Standard mapping — applies a single-column transformation:
{
    "source_column": "CompanyName",   # column name in the source DataFrame
    "transform_type": "upper",        # one of: none, upper, lower, year, month, day
    "target_column": "CompanyName"    # column name to write in the destination DataFrame
}
Concatenation mapping — combines two source columns into one:
{
    "type": "concat",
    "column1": "FirstName",
    "column2": "LastName",
    "target_column": "FullName"
}
This design means transformation logic lives in data, not in code, making every pipeline configuration inspectable, serializable, and reproducible.

Prerequisites

Before installing ETL Dinámico, ensure the following are available on your system:
All three prerequisites must be in place before running pip install -r requirements.txt. The pyodbc package links against the ODBC Driver at install time on some platforms.
  • Python 3.10 or higher — required for the list[dict] and str | None type hint syntax used throughout the codebase.
  • Microsoft SQL Server — both the source (OLTP) and destination (OLAP/Data Warehouse) databases must already exist and be accessible. A local SQL Server Express instance is sufficient for development.
  • ODBC Driver 17 for SQL Server — the named driver ODBC Driver 17 for SQL Server must be registered on the machine running ETL Dinámico. Download it from the Microsoft documentation.

Explore the Documentation

Quickstart

Install the project, configure your .env file, and run your first ETL pipeline in five steps.

Configuration

Deep-dive into all six environment variables, connection string encoding, and multi-server setups.

ETL Pipeline

Learn how ETLPipeline.run_dynamic_etl() orchestrates extraction, transformation, and incremental loading.

UI Guide

Walk through the Streamlit interface: table selection, column mapping, transformation rules, and executing a run.

Build docs developers (and LLMs) love