Validating Generated CSV Datasets Before Loading to Postgres

The validation layer runs at the end of the full pipeline, after data generation and database loading have completed. It provides a lightweight sanity check that catches truncated writes, partial runs, or accidental overwrites. run_pipeline.py calls the single function validate_csv once all pipeline steps, including the PostgreSQL load, have finished. If anything is wrong, the function raises immediately with a clear message, so the execution report is never written against bad data.

`validate_csv()` — `etl/validation.py`

Full source

import pandas as pd


def validate_csv(path, expected_rows=None):
    """
    Comprueba que el CSV existe,
    no está vacío y opcionalmente
    tiene el número esperado de filas.
    """

    df = pd.read_csv(path)

    if len(df) == 0:
        raise ValueError(f"{path} está vacío")

    # Compare against None so an explicit expected_rows=0 is not silently skipped
    if expected_rows is not None and len(df) != expected_rows:
        raise ValueError(
            f"{path}: se esperaban "
            f"{expected_rows} filas "
            f"y se encontraron {len(df)}"
        )

    return len(df)

Parameters

path (str | Path, required)

Path to the CSV file to validate (relative to the project root). pd.read_csv will raise a FileNotFoundError automatically if the file does not exist.

expected_rows (int, default: None)

Optional exact row count the file must contain. When provided and the actual count differs, a ValueError is raised. Pass None (the default) to skip the row-count check and only verify the file is non-empty.
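Because validate_csv passes path straight to pd.read_csv, some failures come from pandas before either check runs. The sketch below is illustrative only: read_outcome and the file names are hypothetical, and it assumes pandas is installed. It shows how pd.read_csv reacts to a missing file, a zero-byte file, and a normal file given as a Path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_outcome(path):
    """Hypothetical helper: describe how pd.read_csv reacts to a path."""
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        # Raised by pandas when the path does not exist
        return "missing file"
    except pd.errors.EmptyDataError:
        # Raised by pandas when there is no header or data to parse
        return "nothing to parse"
    return f"{len(df)} data rows"


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "empty.csv").write_text("")
    (tmp / "products.csv").write_text("id,name\n1,widget\n2,gadget\n")

    # absent.csv is deliberately never created
    outcomes = {
        name: read_outcome(tmp / name)
        for name in ("absent.csv", "empty.csv", "products.csv")
    }

for name, outcome in outcomes.items():
    print(f"{name}: {outcome}")
```

FileNotFoundError is not a ValueError, so callers that only catch ValueError will still see a traceback for a missing file.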

Return value

Returns the integer row count of the validated file. run_pipeline.py captures this value and writes it into the execution report.

Checks performed

The function performs two sequential checks:

Non-empty check

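After loading the file with pd.read_csv, the function verifies that len(df) > 0. A zero-byte file never reaches this check, because pd.read_csv raises pandas.errors.EmptyDataError (a ValueError subclass) first. A file with a header row but no data rows raises the following error (Spanish for "<path> is empty"):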

ValueError: <path> está vacío

Row count check (optional)

When expected_rows is provided and the actual row count differs, the function raises the following error (Spanish for "<path>: expected <expected_rows> rows and found <actual>"):

ValueError: <path>: se esperaban <expected_rows> filas y se encontraron <actual>

If the counts match, validation passes and the row count is returned.
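The two checks can be exercised against small temporary files. This is a minimal sketch, assuming pandas is installed: validate_csv is reproduced in condensed form from the Full source section so the snippet runs on its own, while the check helper and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def validate_csv(path, expected_rows=None):
    # Condensed copy of the documented function (messages kept in Spanish)
    df = pd.read_csv(path)
    if len(df) == 0:
        raise ValueError(f"{path} está vacío")
    if expected_rows and len(df) != expected_rows:
        raise ValueError(
            f"{path}: se esperaban {expected_rows} filas "
            f"y se encontraron {len(df)}"
        )
    return len(df)


def check(path, expected_rows=None):
    """Hypothetical helper: turn the outcome into a printable string."""
    try:
        return f"ok: {validate_csv(path, expected_rows)} rows"
    except ValueError as exc:
        return f"failed: {exc}"


with tempfile.TemporaryDirectory() as tmp:
    header_only = Path(tmp) / "header_only.csv"
    header_only.write_text("id,name\n")
    three_rows = Path(tmp) / "three_rows.csv"
    three_rows.write_text("id,name\n1,a\n2,b\n3,c\n")

    results = [
        check(header_only),                 # fails the non-empty check
        check(three_rows, expected_rows=5), # fails the row count check
        check(three_rows, expected_rows=3), # passes both checks
    ]

for line in results:
    print(line)
```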

Usage in `run_pipeline.py`

The pipeline runner calls validate_csv after all pipeline steps have completed — including the PostgreSQL load. Each call targets one of the three raw CSV files with the exact row count that the corresponding generator is designed to produce:

from etl.validation import validate_csv

products_rows = validate_csv(
    "data/raw/products.csv",
    expected_rows=20
)

customers_rows = validate_csv(
    "data/raw/customers.csv",
    expected_rows=5000
)

sales_rows = validate_csv(
    "data/raw/sales.csv",
    expected_rows=100000
)

If all three calls succeed, run_pipeline.py logs "Validaciones completadas." ("Validations completed.") and writes the row counts into the timestamped Markdown report in reports/.
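The report-writing code is not shown on this page, so the following is only a sketch of how the returned row counts might end up in a timestamped Markdown file. The write_report function, the file name pattern, and the table layout are all assumptions, not the project's actual implementation.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def write_report(row_counts, reports_dir="reports", now=None):
    """Hypothetical: write validated row counts to a timestamped Markdown file."""
    now = now or datetime.now()
    reports_dir = Path(reports_dir)
    reports_dir.mkdir(parents=True, exist_ok=True)

    # Assumed naming scheme; the real report name is not documented here
    report_path = reports_dir / f"report_{now:%Y%m%d_%H%M%S}.md"
    lines = [
        f"# Pipeline execution report ({now:%Y-%m-%d %H:%M:%S})",
        "",
        "| Dataset | Rows |",
        "| --- | ---: |",
    ]
    lines += [f"| {name} | {count} |" for name, count in row_counts.items()]
    report_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return report_path


# Demo: write into a temporary directory instead of reports/
with tempfile.TemporaryDirectory() as tmp:
    path = write_report(
        {"products": 20, "customers": 5000, "sales": 100000},
        reports_dir=tmp,
        now=datetime(2024, 1, 2, 3, 4, 5),
    )
    report_name = path.name
    report_text = path.read_text(encoding="utf-8")

print(report_name)
print(report_text)
```

In the pipeline, the dictionary values would be the products_rows, customers_rows, and sales_rows variables captured above.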

Calling it programmatically

You can import and use validate_csv in any script or notebook to check a file independently of the full pipeline:

from etl.validation import validate_csv

# Check file exists and is non-empty (no row count constraint)
row_count = validate_csv("data/raw/products.csv")
print(f"Products file has {row_count} rows")

# Check file exists, is non-empty, and has exactly 20 rows
try:
    validate_csv("data/raw/products.csv", expected_rows=20)
    print("Validation passed")
except (FileNotFoundError, ValueError) as e:  # a missing file is not a ValueError
    print(f"Validation failed: {e}")

You can call validate_csv with a custom expected_rows value at any point in your workflow — for example, after loading a filtered subset to a staging table or after a partial re-generation run. Simply pass the integer count you expect and the function will raise immediately if the file does not match, without requiring any changes to the source.
