Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/andresshm/fini-marketing-intelligence/llms.txt

Use this file to discover all available pages before exploring further.

Once the raw CSV files have been generated and validated, the loading stage transfers them into a PostgreSQL data warehouse structured around a simple star schema: two dimension tables (dim_products, dim_customers) and one fact table (fact_sales). The loader reads connection credentials from environment variables, truncates all tables before every run to avoid duplicates, and inserts rows in an order that satisfies foreign-key constraints — dimensions first, facts last.
clear_tables() is called automatically at the start of every load run. It issues a TRUNCATE TABLE ... CASCADE that permanently removes all existing rows from fact_sales, dim_customers, and dim_products. Never run the loader against a production database unless you intend to replace all data.

Configuration — etl/config.py

Database credentials are read from environment variables at import time using python-dotenv. Only POSTGRES_HOST has a default value (localhost); all other variables must be set explicitly.
import os
from dotenv import load_dotenv

load_dotenv()

DB_CONFIG = {
    "host":     os.getenv("POSTGRES_HOST", "localhost"),
    "port":     os.getenv("POSTGRES_PORT"),
    "database": os.getenv("POSTGRES_DB"),
    "user":     os.getenv("POSTGRES_USER"),
    "password": os.getenv("POSTGRES_PASSWORD"),
}

Environment variables

POSTGRES_HOST
string
default:"localhost"
Hostname or IP address of the PostgreSQL server. Defaults to localhost when the variable is unset.
POSTGRES_PORT
string
required
Port on which PostgreSQL is listening. Typically 5432.
POSTGRES_DB
string
required
Name of the target database.
POSTGRES_USER
string
required
PostgreSQL role / username used to authenticate.
POSTGRES_PASSWORD
string
required
Password for the PostgreSQL role.

Example .env file

Create a .env file in the project root (never commit it to version control):
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DB=fini_dw
POSTGRES_USER=fini_user
POSTGRES_PASSWORD=supersecret

Connection string

etl/load_to_postgres.py assembles a SQLAlchemy connection URL from DB_CONFIG and creates an engine:
from sqlalchemy import create_engine
from etl.config import DB_CONFIG

DATABASE_URL = (
    f"postgresql+psycopg2://"
    f"{DB_CONFIG['user']}:{DB_CONFIG['password']}"
    f"@{DB_CONFIG['host']}:{DB_CONFIG['port']}"
    f"/{DB_CONFIG['database']}"
)

engine = create_engine(DATABASE_URL)
The format follows the standard SQLAlchemy pattern:
postgresql+psycopg2://user:password@host:port/database

load_table()etl/load_utils.py

load_table is the single utility responsible for reading a CSV and appending its rows to a PostgreSQL table. It uses pandas to read the file and DataFrame.to_sql with if_exists="append" so that it never recreates the table schema — it only inserts rows.
import pandas as pd
from sqlalchemy import text

def load_table(engine, csv_path, table_name):

    df = pd.read_csv(csv_path)

    df.to_sql(
        table_name,
        engine,
        if_exists="append",
        index=False
    )

    print(f"{len(df)} registros cargados en {table_name}.")
ParameterTypeDescription
enginesqlalchemy.EngineActive SQLAlchemy engine connected to the target database.
csv_pathpathlib.PathPath to the source CSV file (relative to the project root).
table_namestrName of the destination PostgreSQL table.

clear_tables()etl/load_to_postgres.py

Before loading, all three tables are truncated in a single atomic transaction. The CASCADE clause ensures that any dependent foreign-key rows are removed in the correct order without needing to manually disable constraints:
from sqlalchemy import text

def clear_tables(engine):
    """
    Vacía el Data Warehouse respetando
    las relaciones entre tablas.
    """

    with engine.begin() as conn:

        conn.execute(
            text("""
                TRUNCATE TABLE
                    fact_sales,
                    dim_customers,
                    dim_products
                CASCADE;
            """)
        )

    print("Tablas limpiadas correctamente.")

Loading order

After truncation, the three tables are loaded in dependency order — dimension tables before the fact table — so that foreign-key constraints are satisfied at the time each row is inserted:
tables = [
    ("data/raw/products.csv", "dim_products"),
    ("data/raw/customers.csv", "dim_customers"),
    ("data/raw/sales.csv",    "fact_sales"),
]

for csv_path, table_name in tables:
    print(f"\nCargando {table_name}...")
    load_table(engine, Path(csv_path), table_name)
1

dim_products

data/raw/products.csvdim_products (20 rows). Must be loaded first because fact_sales references product_id.
2

dim_customers

data/raw/customers.csvdim_customers (5,000 rows). Must be loaded before fact_sales because of the customer_id foreign key.
3

fact_sales

data/raw/sales.csvfact_sales (100,000 rows). Loaded last after both dimension tables are populated.

Running the loader

python -m etl.load_to_postgres
Or as part of the full pipeline:
python run_pipeline.py
The loader prints progress to stdout for each table and confirms when all loads are complete.

Build docs developers (and LLMs) love