Synthetic Data Generation for Fini Marketing Intelligence

The ETL pipeline begins by generating three interdependent synthetic datasets that model real-world candy retail behaviour. Products are created first, then customers, and finally sales, which reference both. The outputs land in data/raw/ as CSV files ready for validation and loading.

All three generators call random.seed(42) at the top of their script. With this fixed seed, every run in the same environment produces the same products, customers, and sales records. That keeps analytics, model training, and dashboard development reproducible, and makes the full pipeline safe to share across development environments.
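
The effect of a fixed seed can be checked in isolation. The sketch below is not taken from the generators: it uses a private random.Random instance (an assumption made here so the example does not touch global state), whereas the scripts seed the module-level generator. The helper name sample_run is hypothetical.

```python
import random


def sample_run(seed: int = 42, n: int = 5) -> list:
    """Return n rounded uniform draws from a private, seeded generator."""
    rng = random.Random(seed)  # isolated generator, independent of random.seed()
    return [round(rng.uniform(0.20, 1.00), 2) for _ in range(n)]


first = sample_run()
second = sample_run()

# Two runs with the same seed yield the same sequence of draws.
print(first == second)
```

Because the sequence depends on the order of calls as well as the seed, adding or reordering a random call in a generator changes every value drawn after it.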


Products Generator

etl/generate_products.py defines a catalogue of 20 Fini candy products, each with a category, a seasonal affinity, a randomly derived cost and price, and a randomised launch date.

Product catalogue

The 20 products are hardcoded as a list of (name, category, season) tuples, ensuring the catalogue never changes between runs:

Product	Category	Season
Tropical Mix	Gummies	All Year
Sour Cola Bottles	Gummies	All Year
Watermelon Slices	Gummies	Summer
Strawberry Belts	Belts	All Year
Rainbow Belts	Belts	All Year
Halloween Mix	Seasonal	Halloween
Christmas Mix	Seasonal	Christmas
Marshmallow Twist	Marshmallow	All Year
Watermelon Marshmallow	Marshmallow	Summer
Regaliz Twist	Licorice	All Year
Sour Worms	Gummies	All Year
Bubblegum Bottles	Gummies	Summer
Candy Bananas	Foam	All Year
Jelly Hearts	Gummies	Valentine
Mini Burgers	Novelty	All Year
Fried Eggs	Foam	All Year
Sharks	Gummies	Summer
Fruit Rings	Gummies	All Year
Spooky Teeth	Seasonal	Halloween
Snowflakes	Seasonal	Christmas

Pricing logic

For each product, unit_cost is drawn uniformly from €0.20–€1.00, a markup multiplier is drawn from 1.8–2.8, and unit_price is their product rounded to two decimal places:

unit_cost = round(random.uniform(0.20, 1.00), 2)
markup = random.uniform(1.8, 2.8)
unit_price = round(unit_cost * markup, 2)
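
The two ranges above imply bounds on the price and on the gross margin share, which can be worked out directly. This is a small arithmetic sketch; the helper name gross_margin_share is introduced here for illustration and does not appear in the generator.

```python
COST_RANGE = (0.20, 1.00)    # euros, from the generator
MARKUP_RANGE = (1.8, 2.8)    # multiplier, from the generator

# Cheapest product at the lowest markup, dearest product at the highest.
min_price = round(COST_RANGE[0] * MARKUP_RANGE[0], 2)
max_price = round(COST_RANGE[1] * MARKUP_RANGE[1], 2)


def gross_margin_share(markup: float) -> float:
    """Share of the undiscounted price that is margin: (price - cost) / price."""
    return 1 - 1 / markup


print(min_price, max_price)
print(round(gross_margin_share(MARKUP_RANGE[0]), 3),
      round(gross_margin_share(MARKUP_RANGE[1]), 3))
```

Before discounts, prices therefore fall between €0.36 and €2.80, and the margin share of the price lies between roughly 44% and 64%.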

Launch dates are constructed from a randomly selected year (2022, 2023, 2024, or 2025) plus a random month (1–12) and day (1–28). Days stop at 28 so that every month, including February, yields a valid date:

launch_year = random.choice([2022, 2023, 2024, 2025])
launch_month = random.randint(1, 12)
launch_day = random.randint(1, 28)
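
For later comparisons against sale dates it can be convenient to hold the launch date as a datetime.date rather than a string. A minimal sketch of the same draw, assuming a hypothetical helper random_launch_date and a private random.Random instance (the generator itself uses the module-level functions and formats the string directly):

```python
import random
from datetime import date


def random_launch_date(rng: random.Random) -> date:
    """Draw year, month, and day in the same order as the generator."""
    year = rng.choice([2022, 2023, 2024, 2025])
    month = rng.randint(1, 12)
    day = rng.randint(1, 28)
    return date(year, month, day)


rng = random.Random(42)
launch_dates = [random_launch_date(rng) for _ in range(1000)]

# isoformat() gives the same YYYY-MM-DD layout the generator writes to CSV.
print(launch_dates[0].isoformat())
```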

Full source

from pathlib import Path
import pandas as pd
import random

random.seed(42)

products = [
    ("Tropical Mix", "Gummies", "All Year"),
    ("Sour Cola Bottles", "Gummies", "All Year"),
    ("Watermelon Slices", "Gummies", "Summer"),
    ("Strawberry Belts", "Belts", "All Year"),
    ("Rainbow Belts", "Belts", "All Year"),
    ("Halloween Mix", "Seasonal", "Halloween"),
    ("Christmas Mix", "Seasonal", "Christmas"),
    ("Marshmallow Twist", "Marshmallow", "All Year"),
    ("Watermelon Marshmallow", "Marshmallow", "Summer"),
    ("Regaliz Twist", "Licorice", "All Year"),
    ("Sour Worms", "Gummies", "All Year"),
    ("Bubblegum Bottles", "Gummies", "Summer"),
    ("Candy Bananas", "Foam", "All Year"),
    ("Jelly Hearts", "Gummies", "Valentine"),
    ("Mini Burgers", "Novelty", "All Year"),
    ("Fried Eggs", "Foam", "All Year"),
    ("Sharks", "Gummies", "Summer"),
    ("Fruit Rings", "Gummies", "All Year"),
    ("Spooky Teeth", "Seasonal", "Halloween"),
    ("Snowflakes", "Seasonal", "Christmas"),
]

data = []

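# product_id starts at 1 and follows the catalogue order.
for idx, (name, category, season) in enumerate(products, start=1):

    # Cost in euros, marked up by a factor of 1.8 to 2.8 to get the price.
    unit_cost = round(random.uniform(0.20, 1.00), 2)
    markup = random.uniform(1.8, 2.8)
    unit_price = round(unit_cost * markup, 2)

    # The day is capped at 28 so the date is valid in every month.
    launch_year = random.choice([2022, 2023, 2024, 2025])
    launch_month = random.randint(1, 12)
    launch_day = random.randint(1, 28)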

    data.append({
        "product_id": idx,
        "product_name": name,
        "category": category,
        "season": season,
        "launch_date": f"{launch_year}-{launch_month:02d}-{launch_day:02d}",
        "unit_cost": unit_cost,
        "unit_price": unit_price
    })

df = pd.DataFrame(data)

output_path = Path("data/raw")
output_path.mkdir(parents=True, exist_ok=True)

df.to_csv(output_path / "products.csv", index=False)

Output: data/raw/products.csv — 20 rows, columns: product_id, product_name, category, season, launch_date, unit_cost, unit_price.
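
Before loading, the file can be sanity-checked against the rules described above. The sketch below is one possible set of checks, not the pipeline's own validation step; check_products and EXPECTED_COLUMNS are hypothetical names, and the two sample rows use made-up values purely for illustration.

```python
import csv
import io
from datetime import date

EXPECTED_COLUMNS = ["product_id", "product_name", "category", "season",
                    "launch_date", "unit_cost", "unit_price"]


def check_products(rows):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for row in rows:
        if list(row) != EXPECTED_COLUMNS:
            problems.append(f"unexpected columns: {list(row)}")
            continue
        pid = row["product_id"]
        cost = float(row["unit_cost"])
        price = float(row["unit_price"])
        if not 0.20 <= cost <= 1.00:
            problems.append(f"product {pid}: unit_cost {cost} out of range")
        if price <= cost:
            problems.append(f"product {pid}: unit_price {price} not above cost")
        try:
            date.fromisoformat(row["launch_date"])
        except ValueError:
            problems.append(f"product {pid}: invalid launch_date")
    return problems


# Made-up rows for illustration only; real values come from products.csv.
SAMPLE_CSV = """product_id,product_name,category,season,launch_date,unit_cost,unit_price
1,Tropical Mix,Gummies,All Year,2023-05-17,0.50,1.10
2,Sour Cola Bottles,Gummies,All Year,2024-02-30,0.50,0.40
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = check_products(rows)
for problem in problems:
    print(problem)
```

Against the real file, the same function could be fed csv.DictReader(open("data/raw/products.csv", newline="")).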

Customers Generator

etl/generate_customers.py synthesises 5,000 customer profiles by sampling from weighted demographic and behavioural distributions. Each customer is assigned a region, a preferred purchase channel, a purchase-frequency tier, an age group, and an average basket size that reflects their frequency tier.

Distributions

Purchase Frequency

Assigned via weighted sampling:

Low — 40%
Medium — 40%
High — 20%

Age Groups

Assigned via weighted sampling:

18–24 — 20%
25–34 — 30%
35–44 — 25%
45–54 — 15%
55+ — 10%

Regions

Equal probability across five regions:

North, South, East, West, Center

Channels

Equal probability across four channels:

Supermarket, E-commerce, Convenience Store, Hypermarket
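
Weighted sampling of this kind can be checked empirically: over many draws, the observed shares should sit close to the configured weights. A minimal sketch for the age groups, assuming a private random.Random instance and a single batched call with k=5000 (the generator draws one value per customer, interleaved with other random calls, so its exact values will differ):

```python
import random
from collections import Counter

age_groups = {
    "18-24": 0.20,
    "25-34": 0.30,
    "35-44": 0.25,
    "45-54": 0.15,
    "55+": 0.10,
}

N = 5000
rng = random.Random(42)

# random.choices samples with replacement, proportionally to the weights.
draws = rng.choices(list(age_groups), weights=list(age_groups.values()), k=N)
counts = Counter(draws)
shares = {group: counts[group] / N for group in age_groups}

for group, target in age_groups.items():
    print(f"{group}: target {target:.2f}, observed {shares[group]:.3f}")
```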

Average ticket by frequency

The avg_ticket field is drawn from a range that corresponds to the customer’s purchase-frequency tier:

Frequency	Range
Low	€3 – €8
Medium	€8 – €15
High	€15 – €30

if frequency == "Low":
    avg_ticket = round(random.uniform(3, 8), 2)
elif frequency == "Medium":
    avg_ticket = round(random.uniform(8, 15), 2)
else:
    avg_ticket = round(random.uniform(15, 30), 2)
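
The same mapping can be expressed as a lookup table, which keeps the ranges in one place. This is an alternative sketch rather than the generator's code: TICKET_RANGES and draw_avg_ticket are hypothetical names, and unlike the if/elif/else chain, which treats any value other than Low or Medium as High, the lookup raises KeyError for an unknown tier.

```python
import random

# Euro ranges per purchase-frequency tier, as listed in the table above.
TICKET_RANGES = {
    "Low": (3, 8),
    "Medium": (8, 15),
    "High": (15, 30),
}


def draw_avg_ticket(frequency: str, rng: random.Random) -> float:
    """Draw an average ticket uniformly from the tier's range, rounded to cents."""
    low, high = TICKET_RANGES[frequency]
    return round(rng.uniform(low, high), 2)


rng = random.Random(42)
for tier in TICKET_RANGES:
    print(tier, draw_avg_ticket(tier, rng))
```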

Full source

from pathlib import Path
import random
import pandas as pd

random.seed(42)

N_CUSTOMERS = 5000

regions = ["North", "South", "East", "West", "Center"]

channels = ["Supermarket", "E-commerce", "Convenience Store", "Hypermarket"]

frequencies = {"Low": 0.40, "Medium": 0.40, "High": 0.20}

age_groups = {
    "18-24": 0.20,
    "25-34": 0.30,
    "35-44": 0.25,
    "45-54": 0.15,
    "55+": 0.10
}

data = []

for customer_id in range(1, N_CUSTOMERS + 1):

    # Weighted draws: random.choices returns a list, so take its first element.
    age_group = random.choices(
        list(age_groups.keys()),
        weights=list(age_groups.values())
    )[0]

    frequency = random.choices(
        list(frequencies.keys()),
        weights=list(frequencies.values())
    )[0]

    # Region and channel are drawn with equal probability.
    region = random.choice(regions)
    channel = random.choice(channels)

    # The average ticket range (in euros) depends on the frequency tier.
    if frequency == "Low":
        avg_ticket = round(random.uniform(3, 8), 2)
    elif frequency == "Medium":
        avg_ticket = round(random.uniform(8, 15), 2)
    else:
        avg_ticket = round(random.uniform(15, 30), 2)

    data.append({
        "customer_id": customer_id,
        "age_group": age_group,
        "region": region,
        "preferred_channel": channel,
        "purchase_frequency": frequency,
        "avg_ticket": avg_ticket
    })

df = pd.DataFrame(data)

output_dir = Path("data/raw")
output_dir.mkdir(parents=True, exist_ok=True)

df.to_csv(output_dir / "customers.csv", index=False)

Output: data/raw/customers.csv — 5,000 rows, columns: customer_id, age_group, region, preferred_channel, purchase_frequency, avg_ticket.

Sales Generator

etl/generate_sales.py is the most complex generator. It reads both products.csv and customers.csv, then produces 100,000 sale records spanning 2023-01-01 to 2025-12-31. It applies weighted customer sampling, product seasonality, discount distributions, and economic metric calculations.

Weighted customer sampling

High-frequency customers are deliberately over-represented in the sales data to simulate realistic purchase behaviour. Weights are assigned per tier and then normalised:

frequency_weights = {"Low": 1, "Medium": 2, "High": 4}

customer_weights = customers["purchase_frequency"].map(frequency_weights)
customer_weights = customer_weights / customer_weights.sum()

This means an individual High-frequency customer is four times more likely to appear in any given sale than a Low-frequency customer, and twice as likely as a Medium-frequency customer.
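
Combining these per-customer weights with the tier shares from the customers generator gives the expected share of sales per tier. The arithmetic below assumes the customer base matches the nominal 40/40/20 split exactly; the sampled population will deviate slightly.

```python
# Nominal share of customers in each tier (from the customers generator).
population_share = {"Low": 0.40, "Medium": 0.40, "High": 0.20}

# Relative sampling weight per customer (from the sales generator).
frequency_weights = {"Low": 1, "Medium": 2, "High": 4}

# Total weight carried by each tier, then normalised to sum to 1.
tier_weight = {t: population_share[t] * frequency_weights[t] for t in population_share}
total = sum(tier_weight.values())
sales_share = {t: tier_weight[t] / total for t in tier_weight}

for tier, share in sales_share.items():
    print(f"{tier}: {share:.0%} of sales")
```

Under that assumption, Low customers make up 40% of the base but about 20% of sales, while High customers make up 20% of the base and about 40% of sales.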

Seasonality logic

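After picking a base sale date, the generator may move the date into the product’s seasonal window. The seasonal date replaces the base date only if it falls on or after the product’s launch date; otherwise the base date is kept: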

if season == "Halloween":
    if random.random() < 0.80:
        seasonal_date = date(sale_year, 10, random.randint(1, 31))
        if seasonal_date >= launch_date:
            sale_date = seasonal_date

elif season == "Christmas":
    if random.random() < 0.80:
        seasonal_date = date(sale_year, 12, random.randint(1, 31))
        if seasonal_date >= launch_date:
            sale_date = seasonal_date

elif season == "Summer":
    if random.random() < 0.70:
        seasonal_date = date(sale_year, random.choice([6, 7, 8]), random.randint(1, 28))
        if seasonal_date >= launch_date:
            sale_date = seasonal_date

Season	Probability	Target months
Halloween	80%	October
Christmas	80%	December
Summer	70%	June, July, August
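
The three branches share one shape, so they can be sketched as a single table-driven function. This is an illustrative refactor, not the generator's code: SEASON_RULES and apply_seasonality are hypothetical names, sale_year is assumed to equal the base date's year, and the order of random calls differs from the original, so it would not reproduce the same data from the same seed.

```python
import random
from datetime import date

# season -> (probability of snapping, target months, highest day drawn)
SEASON_RULES = {
    "Halloween": (0.80, [10], 31),
    "Christmas": (0.80, [12], 31),
    "Summer": (0.70, [6, 7, 8], 28),
}


def apply_seasonality(sale_date: date, season: str, launch_date: date,
                      rng: random.Random) -> date:
    """Return the seasonal date when the rule fires and the product has launched."""
    rule = SEASON_RULES.get(season)
    if rule is None:
        return sale_date  # seasons without a rule keep the base date
    probability, months, max_day = rule
    if rng.random() >= probability:
        return sale_date  # rule did not fire for this sale
    candidate = date(sale_date.year, rng.choice(months), rng.randint(1, max_day))
    # Never place a sale before the product's launch.
    return candidate if candidate >= launch_date else sale_date


rng = random.Random(42)
print(apply_seasonality(date(2024, 3, 15), "Halloween", date(2022, 1, 1), rng))
```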

Discounts and units

discount = random.choices(
    [0.00, 0.10, 0.20, 0.30],
    weights=[70, 15, 10, 5]
)[0]

units = random.randint(1, 3)

# Promotions increase volume
if discount > 0:
    units += random.randint(0, 2)

Discount	Probability
0%	70%
10%	15%
20%	10%
30%	5%
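
From these weights and unit ranges, the expected discount and expected units per sale follow by simple arithmetic. A short sketch of that calculation (the variable names are introduced here for illustration):

```python
from statistics import mean

discounts = [0.00, 0.10, 0.20, 0.30]
weights = [70, 15, 10, 5]
total_weight = sum(weights)

# Weighted average discount across all sales.
expected_discount = sum(d * w for d, w in zip(discounts, weights)) / total_weight

# Probability that a sale carries any discount at all.
p_discounted = 1 - weights[0] / total_weight

# randint is inclusive on both ends: base units 1..3, promotional bonus 0..2.
base_units = mean(range(1, 4))
bonus_units = mean(range(0, 3))
expected_units = base_units + p_discounted * bonus_units
max_units = 3 + 2

print(f"expected discount: {expected_discount:.2%}")
print(f"expected units per sale: {expected_units:.2f} (max {max_units})")
```

On average a sale carries a 5% discount and about 2.3 units, and no sale exceeds 5 units.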

Economic metrics

All financial fields are derived deterministically from the product’s cost/price and the sale’s units and discount:

revenue = units * product["unit_price"] * (1 - discount)
cost    = units * product["unit_cost"]
margin  = revenue - cost
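
A worked example makes the formulas concrete. The function name sale_metrics and the input values are illustrative assumptions, not records from sales.csv, and any rounding the generator may apply before writing the CSV is not shown here.

```python
def sale_metrics(units: int, unit_price: float, unit_cost: float, discount: float):
    """Return (revenue, cost, margin) using the formulas above."""
    revenue = units * unit_price * (1 - discount)
    cost = units * unit_cost
    margin = revenue - cost
    return revenue, cost, margin


# Illustrative sale: 3 units at a price of 1.50 and a cost of 0.60, 10% discount.
revenue, cost, margin = sale_metrics(3, 1.50, 0.60, 0.10)
print(round(revenue, 2), round(cost, 2), round(margin, 2))

# Worst case for margin: lowest markup (1.8) combined with the deepest discount (30%).
worst = sale_metrics(1, round(0.20 * 1.8, 2), 0.20, 0.30)
print(round(worst[2], 3))
```

The example sale yields revenue of 4.05, cost of 1.80, and margin of 2.25. Because the lowest markup (1.8) times the deepest discount factor (0.70) is 1.26, which is still above 1, margin stays positive even in the worst case.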

Output: data/raw/sales.csv — 100,000 rows, columns: sale_id, sale_date, customer_id, product_id, units, discount, revenue, cost, margin.

Running the generators

Run the scripts in order — sales depends on the product and customer CSVs already existing:

Generate products

python etl/generate_products.py

Writes data/raw/products.csv (20 rows).

Generate customers

python etl/generate_customers.py

Writes data/raw/customers.csv (5,000 rows).

Generate sales

python etl/generate_sales.py

Reads both CSVs, then writes data/raw/sales.csv (100,000 rows). Progress is printed every 10,000 sales.

Alternatively, use the full pipeline runner which executes all steps automatically:

python run_pipeline.py
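
The contents of run_pipeline.py are not shown here. As a rough idea of how the three generators could be chained in the required order, the sketch below uses a hypothetical run_generators helper; it covers only the generation steps and is not the actual pipeline runner.

```python
import subprocess
import sys

# Order matters: sales reads the CSVs written by the first two scripts.
GENERATORS = [
    "etl/generate_products.py",
    "etl/generate_customers.py",
    "etl/generate_sales.py",
]


def run_generators(run=subprocess.run):
    """Run each generator with the current interpreter, stopping on the first failure."""
    for script in GENERATORS:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        run([sys.executable, script], check=True)
```

Calling run_generators() from the project root would execute the three scripts in sequence. The run parameter exists only so the function can be exercised without launching real processes.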
