Unlike the Prophet models, which decompose a time series into trend and seasonality components, the XGBoost model treats revenue forecasting as a standard supervised learning problem. Each day becomes a row of engineered features (calendar signals, seasonal event flags, and autoregressive lags), and a gradient-boosted tree ensemble learns the non-linear mapping from those features to daily revenue. This approach is particularly powerful when demand is influenced by complex interactions (e.g., a summer weekend close to a promotional event) that additive decomposition models cannot represent without explicit manual encoding. The trade-off is that the pipeline produces predictions only for the 90-day test period, not a future forecast: the lag features need observed revenue, which does not exist for dates beyond the last observation.
Feature Engineering
Two functions build the 13 features used for training and inference. Both accept a DataFrame with a ds date column and a y revenue column, and both return a copy.
Calendar Features
add_calendar_features derives eight date-based columns, including seasonal event flags consistent with the Enriched model.
def add_calendar_features(df):
    df = df.copy()
    df["year"] = df["ds"].dt.year
    df["month"] = df["ds"].dt.month
    df["quarter"] = df["ds"].dt.quarter
    df["day_of_week"] = df["ds"].dt.dayofweek
    df["is_weekend"] = (
        df["day_of_week"] >= 5
    ).astype(int)
    df["halloween"] = (
        df["month"] == 10
    ).astype(int)
    df["christmas"] = (
        df["month"] == 12
    ).astype(int)
    df["summer"] = (
        df["month"].isin([6, 7, 8])
    ).astype(int)
    return df
| Feature | Source | Description |
|---|---|---|
| year | ds.dt.year | Calendar year (captures long-run trend) |
| month | ds.dt.month | Month of year (1–12) |
| quarter | ds.dt.quarter | Quarter of year (1–4) |
| day_of_week | ds.dt.dayofweek | Day index, Monday = 0, Sunday = 6 |
| is_weekend | dayofweek >= 5 | 1 on Saturday/Sunday, 0 otherwise |
| halloween | month == 10 | 1 throughout October |
| christmas | month == 12 | 1 throughout December |
| summer | month in [6, 7, 8] | 1 in June, July, August |
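As a quick illustration of the table above, the following sketch applies the function to three hand-picked dates. The function body is repeated so the example runs on its own; the sample DataFrame and its revenue values are made up for the demonstration and are not part of the project data.

```python
import pandas as pd


def add_calendar_features(df):
    """Same logic as above, repeated so the example is self-contained."""
    df = df.copy()
    df["year"] = df["ds"].dt.year
    df["month"] = df["ds"].dt.month
    df["quarter"] = df["ds"].dt.quarter
    df["day_of_week"] = df["ds"].dt.dayofweek
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["halloween"] = (df["month"] == 10).astype(int)
    df["christmas"] = (df["month"] == 12).astype(int)
    df["summer"] = df["month"].isin([6, 7, 8]).astype(int)
    return df


# Hypothetical sample: a summer Saturday, a Tuesday in October, a Monday in December.
sample = pd.DataFrame(
    {
        "ds": pd.to_datetime(["2023-07-01", "2023-10-31", "2023-12-25"]),
        "y": [100.0, 150.0, 200.0],  # made-up revenue
    }
)

calendar = add_calendar_features(sample)
print(calendar[["ds", "day_of_week", "is_weekend", "halloween", "christmas", "summer"]])
```

Note that the event flags are month-wide: any date in October sets halloween to 1, not just 31 October.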
Lag Features
add_lag_features creates five autoregressive columns that give the model a window into recent revenue history. All lags are shifted by at least one day to prevent data leakage — the model never sees same-day revenue when predicting.
def add_lag_features(df):
    df = df.copy()
    df["lag_1"] = df["y"].shift(1)
    df["lag_7"] = df["y"].shift(7)
    df["lag_30"] = df["y"].shift(30)
    df["rolling_mean_7"] = (
        df["y"]
        .shift(1)
        .rolling(7)
        .mean()
    )
    df["rolling_mean_30"] = (
        df["y"]
        .shift(1)
        .rolling(30)
        .mean()
    )
    return df
| Feature | Window | Description |
|---|---|---|
| lag_1 | 1 day | Revenue from the immediately preceding day |
| lag_7 | 7 days | Revenue from the same day last week |
| lag_30 | 30 days | Revenue from approximately the same day last month |
| rolling_mean_7 | 7 days | Rolling average of the prior 7 days (shifted to avoid leakage) |
| rolling_mean_30 | 30 days | Rolling average of the prior 30 days (shifted to avoid leakage) |
After add_lag_features is applied, the first 30 rows of the daily DataFrame contain NaN because there is no prior history to populate lag_30 and rolling_mean_30. The pipeline calls daily = daily.dropna() before the train/test split to discard these incomplete rows. This is expected behaviour: the dropped rows sit at the start of the series, so no data is lost from the 90-day evaluation window at the end.
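The 30-row warm-up can be checked on a toy series. This is a minimal sketch, assuming a contiguous daily series sorted by date; the 120-day synthetic DataFrame is hypothetical and the function body is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def add_lag_features(df):
    """Same logic as above, repeated so the example is self-contained."""
    df = df.copy()
    df["lag_1"] = df["y"].shift(1)
    df["lag_7"] = df["y"].shift(7)
    df["lag_30"] = df["y"].shift(30)
    df["rolling_mean_7"] = df["y"].shift(1).rolling(7).mean()
    df["rolling_mean_30"] = df["y"].shift(1).rolling(30).mean()
    return df


# Hypothetical 120-day series with synthetic revenue 0, 1, 2, ...
daily = pd.DataFrame(
    {
        "ds": pd.date_range("2023-01-01", periods=120, freq="D"),
        "y": np.arange(120, dtype=float),
    }
)

with_lags = add_lag_features(daily)

# Rows where at least one lag feature could not be computed.
n_incomplete = int(with_lags.isna().any(axis=1).sum())

clean = with_lags.dropna()
print(n_incomplete, len(clean), clean["ds"].min())
```

Keep in mind that dropna() removes a row if any column is NaN, so a missing value in an unrelated column would also drop that row.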
Full Feature List
The model trains and predicts on exactly 13 features in the following order:
features = [
    "year",
    "month",
    "quarter",
    "day_of_week",
    "is_weekend",
    "halloween",
    "christmas",
    "summer",
    "lag_1",
    "lag_7",
    "lag_30",
    "rolling_mean_7",
    "rolling_mean_30",
]
Model Configuration
The XGBRegressor is initialised with conservative hyperparameters that balance model capacity against overfitting on a daily revenue dataset of moderate size.
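from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
)
# X_train and y_train come from the train/test split described below.
model.fit(X_train, y_train)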
| Hyperparameter | Value | Effect |
|---|---|---|
| n_estimators | 300 | Number of boosting rounds |
| learning_rate | 0.05 | Step size shrinkage; lower values require more trees but generalise better |
| max_depth | 4 | Maximum tree depth; limits model complexity |
| subsample | 0.8 | Fraction of training rows sampled per tree; reduces overfitting |
| colsample_bytree | 0.8 | Fraction of features sampled per tree; improves robustness |
| random_state | 42 | Seed for reproducibility |
If MAPE remains above 20 %, try increasing n_estimators to 500 while lowering learning_rate to 0.01, or reducing max_depth to 3 to limit overfitting. Use cross-validation with a time-series aware splitter (e.g., sklearn.model_selection.TimeSeriesSplit) rather than random k-fold, since the features include lags that depend on temporal ordering.
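One way to run the time-series aware cross-validation suggested above is sketched here. The helper name time_series_cv_mape is hypothetical and not part of the project; it assumes the rows are already sorted by date, that the estimator follows the scikit-learn fit/predict interface (XGBRegressor does), and that no actual revenue value is zero.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit


def time_series_cv_mape(model, X, y, n_splits=5):
    """Return the MAPE (in percent) for each expanding-window fold.

    Hypothetical helper: `model` is any scikit-learn compatible regressor,
    and the rows of X and y must already be in chronological order.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        pred = fold_model.predict(X[test_idx])
        # Assumes y has no zeros; MAPE is undefined otherwise.
        ape = np.abs((y[test_idx] - pred) / y[test_idx])
        scores.append(float(ape.mean() * 100))
    return scores
```

With the objects defined on this page, a call could look like time_series_cv_mape(XGBRegressor(n_estimators=500, learning_rate=0.01, max_depth=3), train[features], train["y"]). Every fold trains only on rows that precede its validation rows, so the lag features never see the future.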
Train / Test Split and Prediction
The split follows the same 90-day hold-out rule as the Prophet models. After fitting on the training set, predictions are generated for the test features only.
import pandas as pd

# Hold out the final 90 days of the series as the test set.
cutoff_date = daily["ds"].max() - pd.Timedelta(days=90)
train = daily[daily["ds"] <= cutoff_date]
test = daily[daily["ds"] > cutoff_date]
X_train, y_train = train[features], train["y"]
X_test, y_test = test[features], test["y"]
# Predict on the held-out features only.
predictions = model.predict(X_test)
XGBoost produces predictions only for the 90-day test set. Unlike the Prophet models, it does not generate a forward-looking forecast beyond the last observed date. This is a fundamental characteristic of lag-based supervised models: predicting day N+1 requires knowing day N’s revenue, which is not available for truly future dates without an iterative multi-step forecasting strategy.
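The iterative strategy mentioned above is not implemented in the pipeline. The following is a minimal sketch of what a recursive multi-step forecast could look like, assuming a fitted model with a predict method, a contiguous daily series sorted by date, and at least 30 days of history. The helper names build_feature_row and recursive_forecast are hypothetical.

```python
import pandas as pd

FEATURES = [
    "year", "month", "quarter", "day_of_week", "is_weekend",
    "halloween", "christmas", "summer",
    "lag_1", "lag_7", "lag_30", "rolling_mean_7", "rolling_mean_30",
]


def build_feature_row(history, date):
    """Build the 13 features for `date` from revenue up to the previous day.

    `history` is a Series of daily revenue, oldest first, with >= 30 values.
    """
    return {
        "year": date.year,
        "month": date.month,
        "quarter": date.quarter,
        "day_of_week": date.dayofweek,
        "is_weekend": int(date.dayofweek >= 5),
        "halloween": int(date.month == 10),
        "christmas": int(date.month == 12),
        "summer": int(date.month in (6, 7, 8)),
        "lag_1": history.iloc[-1],
        "lag_7": history.iloc[-7],
        "lag_30": history.iloc[-30],
        "rolling_mean_7": history.iloc[-7:].mean(),
        "rolling_mean_30": history.iloc[-30:].mean(),
    }


def recursive_forecast(model, daily, horizon):
    """Forecast `horizon` days past the last row of `daily` (columns ds, y)."""
    history = daily["y"].reset_index(drop=True)
    last_date = daily["ds"].max()
    rows = []
    for step in range(1, horizon + 1):
        date = last_date + pd.Timedelta(days=step)
        X_next = pd.DataFrame([build_feature_row(history, date)])[FEATURES]
        yhat = float(model.predict(X_next)[0])
        rows.append({"ds": date, "yhat": yhat})
        # Feed the prediction back in as if it were an observed value.
        history = pd.concat([history, pd.Series([yhat])], ignore_index=True)
    return pd.DataFrame(rows)
```

Because each prediction becomes an input to later steps, errors compound as the horizon grows, so the test-set metrics reported on this page would not describe the accuracy of such a forecast.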
Results
{
  "MAE": 173.43033307562933,
  "RMSE": 232.3515343448317,
  "MAPE": 24.525904122380208
}
| Metric | Value | Interpretation |
|---|---|---|
| MAE | 173.43 €/day | Higher absolute error than Prophet; driven by cold-start sensitivity in lag features |
| RMSE | 232.35 €/day | Large squared error on peak revenue days |
| MAPE | 24.53 % | Lowest proportional error of the three models; XGBoost is relatively accurate across the revenue range |
XGBoost achieves the lowest MAPE across the three models, meaning it is the most proportionally accurate on a day-to-day basis. Its higher MAE and RMSE compared to Prophet reflect sensitivity to the initial lag window — early test-set rows rely on lag values from the tail of training, which may carry different distributional characteristics than a warm mid-series window.
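The page does not show how the three metrics are computed. The sketch below assumes the standard definitions, with MAPE expressed in percent to match the 24.53 % figure; the function name regression_metrics and the sample numbers are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and MAPE (percent) under the standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        # Undefined when any actual value is zero.
        "MAPE": float(np.mean(np.abs(errors / y_true)) * 100),
    }


# Hypothetical values: errors of -10, 20 and 0 on actuals of 100, 200 and 400.
metrics = regression_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
print(metrics)
```

MAPE divides each error by that day's actual revenue, so the same absolute miss counts for less on a high-revenue day. That is how a model can rank best on MAPE while ranking worse on MAE and RMSE, which weight all days by the size of the error in euros.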
Output Files
| File | Columns | Description |
|---|---|---|
| outputs/forecast_xgboost.csv | ds, y, yhat | Test-period actuals vs. predictions; no future rows |
| outputs/metrics_xgboost.json | MAE, RMSE, MAPE | Scalar evaluation metrics for the 90-day test window |
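How the pipeline writes these files is not shown on this page. As a rough sketch, assuming the test DataFrame, the predictions array and a metrics dictionary from the earlier steps, files with the layout in the table could be produced as follows. The helper name save_outputs is hypothetical.

```python
import json
from pathlib import Path

import pandas as pd


def save_outputs(test, predictions, metrics, out_dir="outputs"):
    """Write the forecast CSV and metrics JSON with the columns listed above."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Keep only the date and actuals, then attach the predictions as yhat.
    forecast = test[["ds", "y"]].copy()
    forecast["yhat"] = [float(p) for p in predictions]
    forecast.to_csv(out / "forecast_xgboost.csv", index=False)

    # Cast to plain floats so NumPy scalars do not break json.dump.
    with open(out / "metrics_xgboost.json", "w") as fh:
        json.dump({k: float(v) for k, v in metrics.items()}, fh, indent=2)

    return forecast
```

Reading the CSV back with pd.read_csv(..., parse_dates=["ds"]) restores the date column as datetimes for plotting or comparison with the Prophet outputs.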