The ETL pipeline begins by generating three interdependent synthetic datasets that model real-world candy retail behaviour. Products are created first, then customers, and finally sales, which reference both. The outputs land in data/raw/ as CSV files ready for validation and loading.
All three generators call random.seed(42) at the top of their script. With this fixed seed, every run produces the same products, customers, and sales records, which keeps the full pipeline deterministic and safe to share across development environments. That reproducibility is essential for analytics, model training, and dashboard development.
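As a minimal illustration of why the fixed seed matters, the sketch below reseeds Python's random module twice and compares the resulting draws. The helper name draw_sample is hypothetical and is not part of the project's scripts; the sketch assumes the same Python version and an unchanged order of random calls between runs.

```python
import random


def draw_sample(seed, n=5):
    """Seed the global generator, then return n uniform draws (hypothetical helper)."""
    random.seed(seed)
    return [random.uniform(0.20, 1.00) for _ in range(n)]


first_run = draw_sample(42)
second_run = draw_sample(42)

# Same seed, same sequence of calls: the two lists are identical.
print(first_run == second_run)
```

Because the seed fixes the whole sequence, adding or reordering a random call anywhere in a script would change every value drawn after it.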

Products Generator

etl/generate_products.py defines a catalogue of 20 Fini candy products, each with a category, a seasonal affinity, a randomly derived cost and price, and a randomised launch date.

Product catalogue

The 20 products are hardcoded as a list of (name, category, season) tuples, ensuring the catalogue never changes between runs:
| Product | Category | Season |
| --- | --- | --- |
| Tropical Mix | Gummies | All Year |
| Sour Cola Bottles | Gummies | All Year |
| Watermelon Slices | Gummies | Summer |
| Strawberry Belts | Belts | All Year |
| Rainbow Belts | Belts | All Year |
| Halloween Mix | Seasonal | Halloween |
| Christmas Mix | Seasonal | Christmas |
| Marshmallow Twist | Marshmallow | All Year |
| Watermelon Marshmallow | Marshmallow | Summer |
| Regaliz Twist | Licorice | All Year |
| Sour Worms | Gummies | All Year |
| Bubblegum Bottles | Gummies | Summer |
| Candy Bananas | Foam | All Year |
| Jelly Hearts | Gummies | Valentine |
| Mini Burgers | Novelty | All Year |
| Fried Eggs | Foam | All Year |
| Sharks | Gummies | Summer |
| Fruit Rings | Gummies | All Year |
| Spooky Teeth | Seasonal | Halloween |
| Snowflakes | Seasonal | Christmas |

Pricing logic

For each product, unit_cost is drawn uniformly from €0.20–€1.00, a markup multiplier is drawn from 1.8–2.8, and unit_price is their product rounded to two decimal places:
unit_cost = round(random.uniform(0.20, 1.00), 2)
markup = random.uniform(1.8, 2.8)
unit_price = round(unit_cost * markup, 2)
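The pricing rule above bounds every generated price: the cheapest possible item costs €0.20 with a 1.8 markup, and the most expensive costs €1.00 with a 2.8 markup. The sketch below checks those bounds empirically. The function draw_price, the local generator rng, and its seed are assumptions made for illustration, not code from the project.

```python
import random


def draw_price(rng):
    """Apply the documented pricing rule once (hypothetical helper)."""
    unit_cost = round(rng.uniform(0.20, 1.00), 2)
    markup = rng.uniform(1.8, 2.8)
    unit_price = round(unit_cost * markup, 2)
    return unit_cost, unit_price


# A private generator keeps this check independent of the global seed.
rng = random.Random(0)
samples = [draw_price(rng) for _ in range(1000)]

min_price = round(0.20 * 1.8, 2)  # lowest cost times lowest markup
max_price = round(1.00 * 2.8, 2)  # highest cost times highest markup

assert all(min_price <= price <= max_price for _, price in samples)
assert all(price > cost for cost, price in samples)
```

Since the markup never drops below 1.8, unit_price always exceeds unit_cost, so no product is generated at a loss.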
Launch dates are constructed from a randomly selected year (2022, 2023, 2024, or 2025) plus a random month (1–12) and day (1–28):
launch_year = random.choice([2022, 2023, 2024, 2025])
launch_month = random.randint(1, 12)
launch_day = random.randint(1, 28)
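Capping the day at 28 means every year, month, and day combination is a real calendar date, including in February; the source does not state this motivation, so treat it as a likely reading rather than a documented one. The sketch below builds dates the same way and parses each one to confirm it is valid. The helper draw_launch_date and the local generator are hypothetical.

```python
import random
from datetime import date


def draw_launch_date(rng):
    """Build an ISO date string from independently drawn parts (hypothetical helper)."""
    launch_year = rng.choice([2022, 2023, 2024, 2025])
    launch_month = rng.randint(1, 12)
    launch_day = rng.randint(1, 28)
    return f"{launch_year}-{launch_month:02d}-{launch_day:02d}"


rng = random.Random(42)
launch_dates = [draw_launch_date(rng) for _ in range(500)]

# date.fromisoformat raises ValueError on an impossible date such as 2023-02-30.
parsed = [date.fromisoformat(text) for text in launch_dates]
print(min(parsed), max(parsed))
```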

Full source

from pathlib import Path
import pandas as pd
import random

random.seed(42)

products = [
    ("Tropical Mix", "Gummies", "All Year"),
    ("Sour Cola Bottles", "Gummies", "All Year"),
    ("Watermelon Slices", "Gummies", "Summer"),
    ("Strawberry Belts", "Belts", "All Year"),
    ("Rainbow Belts", "Belts", "All Year"),
    ("Halloween Mix", "Seasonal", "Halloween"),
    ("Christmas Mix", "Seasonal", "Christmas"),
    ("Marshmallow Twist", "Marshmallow", "All Year"),
    ("Watermelon Marshmallow", "Marshmallow", "Summer"),
    ("Regaliz Twist", "Licorice", "All Year"),
    ("Sour Worms", "Gummies", "All Year"),
    ("Bubblegum Bottles", "Gummies", "Summer"),
    ("Candy Bananas", "Foam", "All Year"),
    ("Jelly Hearts", "Gummies", "Valentine"),
    ("Mini Burgers", "Novelty", "All Year"),
    ("Fried Eggs", "Foam", "All Year"),
    ("Sharks", "Gummies", "Summer"),
    ("Fruit Rings", "Gummies", "All Year"),
    ("Spooky Teeth", "Seasonal", "Halloween"),
    ("Snowflakes", "Seasonal", "Christmas"),
]

data = []

for idx, (name, category, season) in enumerate(products, start=1):

    unit_cost = round(random.uniform(0.20, 1.00), 2)
    markup = random.uniform(1.8, 2.8)
    unit_price = round(unit_cost * markup, 2)
    launch_year = random.choice([2022, 2023, 2024, 2025])
    launch_month = random.randint(1, 12)
    launch_day = random.randint(1, 28)

    data.append({
        "product_id": idx,
        "product_name": name,
        "category": category,
        "season": season,
        "launch_date": f"{launch_year}-{launch_month:02d}-{launch_day:02d}",
        "unit_cost": unit_cost,
        "unit_price": unit_price
    })

df = pd.DataFrame(data)

output_path = Path("data/raw")
output_path.mkdir(parents=True, exist_ok=True)

df.to_csv(output_path / "products.csv", index=False)
Output: data/raw/products.csv — 20 rows, columns: product_id, product_name, category, season, launch_date, unit_cost, unit_price.
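A quick sanity check on the output can catch schema drift before loading. The following is a sketch only: validate_products is a hypothetical helper, and the single sample row is made up for illustration rather than taken from the real file. It uses the standard csv module so it runs without extra dependencies.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "product_id", "product_name", "category", "season",
    "launch_date", "unit_cost", "unit_price",
]


def validate_products(handle):
    """Check the header and basic pricing sanity; return the row count."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    for row in rows:
        if float(row["unit_price"]) <= float(row["unit_cost"]):
            raise ValueError(f"price not above cost for product {row['product_id']}")
    return len(rows)


# Illustrative in-memory sample; the values are invented for this example.
sample = io.StringIO(
    "product_id,product_name,category,season,launch_date,unit_cost,unit_price\n"
    "1,Tropical Mix,Gummies,All Year,2023-05-14,0.50,1.20\n"
)
row_count = validate_products(sample)
print(row_count)
```

Against the real file, the same function could be called as validate_products(open("data/raw/products.csv", newline="")), and the returned count would be expected to equal 20.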

Running the generators

Run the scripts in order — sales depends on the product and customer CSVs already existing:
1. Generate products

python etl/generate_products.py
Writes data/raw/products.csv (20 rows).

2. Generate customers

python etl/generate_customers.py
Writes data/raw/customers.csv (5,000 rows).

3. Generate sales

python etl/generate_sales.py
Reads both CSVs, then writes data/raw/sales.csv (100,000 rows). Progress is printed every 10,000 sales.
Alternatively, use the full pipeline runner, which executes all steps automatically:
python run_pipeline.py
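The contents of run_pipeline.py are not shown here, so the following is only a sketch of how a runner could enforce the generator order, not the project's actual implementation. The names GENERATOR_SCRIPTS and run_generators are hypothetical, and the real runner may include further steps such as validation and loading.

```python
import subprocess
import sys

# Order matters: sales reads the products and customers CSVs.
GENERATOR_SCRIPTS = [
    "etl/generate_products.py",
    "etl/generate_customers.py",
    "etl/generate_sales.py",
]


def run_generators(scripts=GENERATOR_SCRIPTS, runner=subprocess.run):
    """Run each script in order with the current interpreter.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so a failed step stops the later ones.
    """
    for script in scripts:
        runner([sys.executable, script], check=True)
```

Calling run_generators() from the repository root would execute the three scripts in sequence. The runner parameter exists so the ordering can be exercised in a test without launching any processes.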
