Unlike the Prophet models, which decompose a time series into trend and seasonality components, the XGBoost model treats revenue forecasting as a standard supervised learning problem. Each day becomes a row of engineered features (calendar signals, seasonal event flags, and autoregressive lags), and a gradient-boosted tree ensemble learns the non-linear mapping from those features to daily revenue. This approach is particularly useful when demand depends on interactions between features (e.g., a summer weekend close to a promotional event) that additive decomposition models cannot represent without explicit manual encoding. The trade-off is that the lag features need observed revenue from preceding days, so the pipeline produces predictions only for the 90-day test period, not a future forecast.

Feature Engineering

Two functions build the 13 features used for training and inference. Both accept a DataFrame with a ds date column and a y revenue column, and both return a copy.

Calendar Features

add_calendar_features derives eight date-based columns, including seasonal event flags consistent with the Enriched model.
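
def add_calendar_features(df):
    """Return a copy of df with eight calendar columns derived from the ds column."""
    df = df.copy()

    # Basic date parts
    df["year"]        = df["ds"].dt.year
    df["month"]       = df["ds"].dt.month
    df["quarter"]     = df["ds"].dt.quarter
    df["day_of_week"] = df["ds"].dt.dayofweek  # Monday = 0, Sunday = 6

    # Binary flags (1 = active, 0 = inactive)
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["halloween"]  = (df["month"] == 10).astype(int)          # all of October
    df["christmas"]  = (df["month"] == 12).astype(int)          # all of December
    df["summer"]     = df["month"].isin([6, 7, 8]).astype(int)  # June to August

    return df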

| Feature | Source | Description |
| --- | --- | --- |
| year | ds.dt.year | Calendar year (captures long-run trend) |
| month | ds.dt.month | Month of year (1–12) |
| quarter | ds.dt.quarter | Quarter of year (1–4) |
| day_of_week | ds.dt.dayofweek | Day index, Monday = 0, Sunday = 6 |
| is_weekend | dayofweek >= 5 | 1 on Saturday/Sunday, 0 otherwise |
| halloween | month == 10 | 1 throughout October |
| christmas | month == 12 | 1 throughout December |
| summer | month in [6, 7, 8] | 1 in June, July, August |
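
As a quick sanity check on the flag definitions, the sketch below applies the same column logic to three hand-picked dates. The function is reproduced in condensed form so the snippet runs on its own; the sample dates and the zero revenue values are made up for illustration and are not project data.

```python
import pandas as pd

def add_calendar_features(df):
    # Condensed copy of the documented function.
    df = df.copy()
    df["year"] = df["ds"].dt.year
    df["month"] = df["ds"].dt.month
    df["quarter"] = df["ds"].dt.quarter
    df["day_of_week"] = df["ds"].dt.dayofweek
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["halloween"] = (df["month"] == 10).astype(int)
    df["christmas"] = (df["month"] == 12).astype(int)
    df["summer"] = df["month"].isin([6, 7, 8]).astype(int)
    return df

# Illustrative dates only: a summer Saturday, the first day of October, Christmas Day.
sample = pd.DataFrame({
    "ds": pd.to_datetime(["2024-07-06", "2024-10-01", "2024-12-25"]),
    "y": [0.0, 0.0, 0.0],
})
calendar = add_calendar_features(sample)
print(calendar[["ds", "day_of_week", "is_weekend", "halloween", "christmas", "summer"]])
```

Note that 1 October already carries halloween = 1: the flags mark whole months, not the event dates themselves.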

Lag Features

add_lag_features creates five autoregressive columns that give the model a window into recent revenue history. All lags are shifted by at least one day to prevent data leakage — the model never sees same-day revenue when predicting.
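
def add_lag_features(df):
    """Return a copy of df with five autoregressive columns built from the y column.

    Assumes df is sorted by ds with one row per day.
    """
    df = df.copy()

    # Point lags: revenue 1, 7, and 30 days earlier
    df["lag_1"]  = df["y"].shift(1)
    df["lag_7"]  = df["y"].shift(7)
    df["lag_30"] = df["y"].shift(30)

    # Rolling means over the previous 7 and 30 days; shift(1) excludes the current day
    df["rolling_mean_7"]  = df["y"].shift(1).rolling(7).mean()
    df["rolling_mean_30"] = df["y"].shift(1).rolling(30).mean()

    return df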

| Feature | Window | Description |
| --- | --- | --- |
| lag_1 | 1 day | Revenue from the immediately preceding day |
| lag_7 | 7 days | Revenue from the same day last week |
| lag_30 | 30 days | Revenue from approximately the same day last month |
| rolling_mean_7 | 7 days | Rolling average of the prior 7 days (shifted to avoid leakage) |
| rolling_mean_30 | 30 days | Rolling average of the prior 30 days (shifted to avoid leakage) |

After add_lag_features is applied, the first 30 rows of the daily DataFrame contain NaN because there is no prior history to populate lag_30 and rolling_mean_30. The pipeline calls daily = daily.dropna() before the train/test split to discard these incomplete rows. This is expected behaviour: the dropped rows sit at the start of the series, so no data is lost from the evaluation window at the end.
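
The 30-row warm-up can be checked on synthetic data. The sketch below uses a condensed copy of add_lag_features and a made-up 200-day revenue series (random values, not project data) to count the incomplete rows and confirm that the 7-day rolling mean excludes the current day.

```python
import numpy as np
import pandas as pd

def add_lag_features(df):
    # Condensed copy of the documented function.
    df = df.copy()
    df["lag_1"] = df["y"].shift(1)
    df["lag_7"] = df["y"].shift(7)
    df["lag_30"] = df["y"].shift(30)
    df["rolling_mean_7"] = df["y"].shift(1).rolling(7).mean()
    df["rolling_mean_30"] = df["y"].shift(1).rolling(30).mean()
    return df

# Synthetic daily revenue: 200 consecutive days (illustrative values only).
rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=200, freq="D"),
    "y": rng.uniform(500, 1500, size=200),
})

with_lags = add_lag_features(daily)
rows_with_nan = int(with_lags.isna().any(axis=1).sum())  # rows dropna() will remove
clean = with_lags.dropna()

print(rows_with_nan, len(clean), clean["ds"].min().date())
```

On this series the first fully populated row is day 31, so a series of n days yields n - 30 usable rows.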

Full Feature List

The model trains and predicts on exactly 13 features in the following order:
features = [
    "year",
    "month",
    "quarter",
    "day_of_week",
    "is_weekend",
    "halloween",
    "christmas",
    "summer",
    "lag_1",
    "lag_7",
    "lag_30",
    "rolling_mean_7",
    "rolling_mean_30"
]

Model Configuration

The XGBRegressor is initialised with conservative hyperparameters that balance model capacity against overfitting on a daily revenue dataset of moderate size.
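from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# X_train and y_train come from the train/test split described below
model.fit(X_train, y_train)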

| Hyperparameter | Value | Effect |
| --- | --- | --- |
| n_estimators | 300 | Number of boosting rounds |
| learning_rate | 0.05 | Step size shrinkage; lower values require more trees but generalise better |
| max_depth | 4 | Maximum tree depth; limits model complexity |
| subsample | 0.8 | Fraction of training rows sampled per tree; reduces overfitting |
| colsample_bytree | 0.8 | Fraction of features sampled per tree; improves robustness |
| random_state | 42 | Seed for reproducibility |

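If MAPE remains above 20%, try increasing n_estimators to 500 while lowering learning_rate to 0.01, or reducing max_depth to 3 to limit overfitting. A lower learning rate generally needs more boosting rounds to reach the same fit, so compare candidate settings on held-out data rather than assuming an improvement. For cross-validation, use a time-series-aware splitter (e.g., sklearn.model_selection.TimeSeriesSplit) rather than random k-fold: the lag features depend on temporal ordering, and shuffled folds would let the model train on days that come after the ones it is validated on.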
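
A minimal sketch of time-ordered cross-validation with TimeSeriesSplit follows. The helper time_series_cv_mape is a hypothetical name, the data is synthetic, and scikit-learn's GradientBoostingRegressor is used as a stand-in estimator so the snippet runs without xgboost installed; in the pipeline the estimator would be the XGBRegressor configured above, which exposes the same fit/predict interface.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit

def time_series_cv_mape(estimator, X, y, n_splits=5):
    """Return one MAPE (%) per fold; every fold trains on the past and validates on the future."""
    scores = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = clone(estimator)  # fresh, unfitted copy for each fold
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        ape = np.abs((y[val_idx] - pred) / y[val_idx])
        scores.append(float(ape.mean() * 100))
    return scores

# Synthetic, time-ordered data (illustrative only): 300 days, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 1000 + 100 * X[:, 0] + rng.normal(scale=20, size=300)

# Stand-in estimator (assumption); swap in the XGBRegressor to tune the real model.
stand_in = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=42)
fold_mape = time_series_cv_mape(stand_in, X, y, n_splits=5)
print([round(score, 2) for score in fold_mape])
```

Each successive fold trains on a longer prefix of the series, so the per-fold scores also hint at how much the model benefits from additional history.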

Train / Test Split and Prediction

The split follows the same 90-day hold-out rule as the Prophet models. After fitting on the training set, predictions are generated for the test features only.
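import pandas as pd

# Hold out the last 90 days of the series for evaluation
cutoff_date = daily["ds"].max() - pd.Timedelta(days=90)

train = daily[daily["ds"] <= cutoff_date]
test  = daily[daily["ds"] >  cutoff_date]

X_train, y_train = train[features], train["y"]
X_test,  y_test  = test[features],  test["y"]

# model is the XGBRegressor fitted on X_train, y_train
predictions = model.predict(X_test)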
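XGBoost produces predictions only for the 90-day test set. Unlike the Prophet models, it does not generate a forward-looking forecast beyond the last observed date. This is a fundamental characteristic of lag-based supervised models: predicting day N+1 requires knowing day N’s revenue, which is not available for truly future dates without an iterative multi-step forecasting strategy. Note also that the lag features are built on the full daily DataFrame before the split, so each test-set prediction uses the actual revenue of the preceding days and is effectively a one-step-ahead forecast.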
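
For reference, an iterative (recursive) multi-step strategy could look like the sketch below. It is not part of the documented pipeline. The names recursive_forecast, build_features, and LastWeekModel are hypothetical, and the stand-in model simply returns the lag_7 value so that the feedback loop is easy to verify; a fitted XGBRegressor could be passed instead, since only a predict method is required.

```python
import numpy as np
import pandas as pd

FEATURES = [
    "year", "month", "quarter", "day_of_week", "is_weekend",
    "halloween", "christmas", "summer",
    "lag_1", "lag_7", "lag_30", "rolling_mean_7", "rolling_mean_30",
]

def build_features(df):
    """Condensed equivalent of add_calendar_features followed by add_lag_features."""
    df = df.copy()
    df["year"] = df["ds"].dt.year
    df["month"] = df["ds"].dt.month
    df["quarter"] = df["ds"].dt.quarter
    df["day_of_week"] = df["ds"].dt.dayofweek
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["halloween"] = (df["month"] == 10).astype(int)
    df["christmas"] = (df["month"] == 12).astype(int)
    df["summer"] = df["month"].isin([6, 7, 8]).astype(int)
    df["lag_1"] = df["y"].shift(1)
    df["lag_7"] = df["y"].shift(7)
    df["lag_30"] = df["y"].shift(30)
    df["rolling_mean_7"] = df["y"].shift(1).rolling(7).mean()
    df["rolling_mean_30"] = df["y"].shift(1).rolling(30).mean()
    return df

def recursive_forecast(model, history, horizon):
    """Forecast `horizon` days past the end of `history` (daily rows, sorted by ds)."""
    dates = list(history["ds"])
    values = [float(v) for v in history["y"]]
    for _ in range(horizon):
        # Add the next calendar day with unknown revenue.
        dates.append(dates[-1] + pd.Timedelta(days=1))
        values.append(np.nan)
        # 31 rows are enough to populate lag_30 and rolling_mean_30 for the last row.
        window = pd.DataFrame({"ds": dates[-31:], "y": values[-31:]})
        row = build_features(window)[FEATURES].iloc[[-1]]
        # Feed the prediction back so later steps can use it as a lag value.
        values[-1] = float(model.predict(row)[0])
    return pd.DataFrame({"ds": dates[-horizon:], "yhat": values[-horizon:]})

class LastWeekModel:
    """Hypothetical stand-in for a fitted regressor: predicts the lag_7 value."""
    def predict(self, X):
        return X["lag_7"].to_numpy()

# Synthetic history with a pure weekly pattern (illustrative only).
history = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=60, freq="D"),
    "y": 1000.0 + 50.0 * (np.arange(60) % 7),
})
forecast = recursive_forecast(LastWeekModel(), history, horizon=14)
print(forecast.head())
```

From the eighth step onward the lag_7 input is itself a prediction, which is why errors in recursive forecasts tend to compound as the horizon grows.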

Results

{
    "MAE": 173.43033307562933,
    "RMSE": 232.3515343448317,
    "MAPE": 24.525904122380208
}

| Metric | Value | Interpretation |
| --- | --- | --- |
| MAE | 173.43 €/day | Higher absolute error than Prophet; driven by cold-start sensitivity in lag features |
| RMSE | 232.35 €/day | Large squared error on peak revenue days |
| MAPE | 24.53 % | Lowest proportional error of the three models; XGBoost is relatively accurate across the revenue range |

XGBoost achieves the lowest MAPE of the three models, meaning it is the most proportionally accurate on a day-to-day basis. Its higher MAE and RMSE compared to Prophet are attributed to sensitivity to the initial lag window: early test-set rows rely on lag values from the tail of the training period, which may follow a different distribution than a warm mid-series window.
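
The three metrics can be sketched with NumPy as follows, assuming the standard definitions; the pipeline's own evaluation code is not shown here, so the function name evaluate and the two-point example are illustrative only.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return MAE, RMSE, and MAPE (%) under their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # MAPE is undefined when an actual value is 0; this sketch assumes none are.
        "MAPE": float(np.mean(np.abs(err / y_true)) * 100),
    }

# Two made-up days: errors of 10 and 20 on actuals of 100 and 200.
metrics = evaluate([100.0, 200.0], [110.0, 180.0])
print(metrics)
```

Because MAPE divides each error by the actual value, a model can show a higher MAE and still a lower MAPE when its larger absolute errors fall on high-revenue days.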

Output Files

| File | Columns / keys | Description |
| --- | --- | --- |
| outputs/forecast_xgboost.csv | ds, y, yhat | Test-period actuals vs. predictions; no future rows |
| outputs/metrics_xgboost.json | MAE, RMSE, MAPE | Scalar evaluation metrics for the 90-day test window |
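
One way to produce files with this layout is sketched below. The helper save_outputs is a hypothetical name rather than the pipeline's actual writer, and the demo writes to a temporary directory with placeholder values instead of the real outputs/ folder.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def save_outputs(test_df, predictions, metrics, out_dir="outputs"):
    """Write forecast_xgboost.csv (ds, y, yhat) and metrics_xgboost.json to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    forecast = test_df[["ds", "y"]].copy()
    forecast["yhat"] = list(predictions)  # one prediction per test row, same order
    forecast.to_csv(out / "forecast_xgboost.csv", index=False)

    with open(out / "metrics_xgboost.json", "w") as fh:
        json.dump(metrics, fh, indent=4)
    return forecast

# Placeholder inputs for illustration only.
demo_dir = tempfile.mkdtemp()
demo_test = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=3, freq="D"),
    "y": [100.0, 200.0, 300.0],
})
demo_metrics = {"MAE": 10.0, "RMSE": 10.0, "MAPE": 6.11}
saved = save_outputs(demo_test, [110.0, 190.0, 310.0], demo_metrics, out_dir=demo_dir)
print(sorted(p.name for p in Path(demo_dir).iterdir()))
```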
