Strategy Validation: Walk-Forward Analysis and CPCV

A compelling in-sample backtest is not enough to trust a strategy with real capital. Overfitting — fitting the model’s parameters (or the rule tree’s thresholds) to the noise in historical data rather than genuine signal — is the single largest failure mode in quantitative research. The Validation Engine enforces a two-stage statistical gate between the Research Layer and the Portfolio Construction Layer: Walk-Forward Analysis measures out-of-sample performance stability across rolling time windows, and Combinatorial Purged Cross-Validation (CPCV) provides a full distribution of OOS paths and computes the Probability of Backtest Overfitting. A strategy must pass configured thresholds on both before its status can move to validated and then promoted.

Why In-Sample Backtests Aren’t Enough

Overfitting

A model or rule tree optimised on a fixed historical window can memorise idiosyncratic patterns that will not repeat. In-sample Sharpe ratios routinely overstate live performance by 2–5×.

Look-Ahead Bias

Even without intentional peeking, overlapping forward-return labels and serial correlation between adjacent bars can leak test information into training. Without a purging step, this leakage inflates CV scores.

Any strategy with an in-sample Sharpe above 2.0 should be treated with scepticism until walk-forward OOS Sharpe is confirmed. High IS/OOS ratios (the overfitting_score) are the primary signal of data snooping.

Walk-Forward Analysis

Walk-Forward Analysis simulates how a strategy would have been deployed and periodically retrained in a live environment. Instead of training once on all history, the engine creates multiple sequential folds and measures performance only on each fold’s out-of-sample window.

Window Types

Rolling

Fixed-size train window slides forward. Each fold trains on the same number of bars. Prevents over-representation of early market regimes.

[====TRAIN====][TEST] →
    [====TRAIN====][TEST] →

Expanding

Train window grows from the dataset start. Each fold uses all available history up to that point — mimicking the natural data accumulation of live operation.

[=TRAIN=][TEST]
[==TRAIN==][TEST]
[===TRAIN===][TEST]

Anchored

Identical to expanding but the start anchor is fixed explicitly. Useful when you want to control the exact lookback origin regardless of dataset length.
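
The three window types differ only in where each fold's training window starts. The following is a minimal sketch of that idea, not the engine's implementation. It assumes sizes are given directly in bars (the `train_bars`, `test_bars`, and `anchor` arguments are hypothetical and are not the engine's fractional `test_size` / `min_train_size` settings) and that index ranges are end-exclusive.

```python
from typing import List, Tuple

# (train_start, train_end, test_start, test_end), all end-exclusive bar indices
Fold = Tuple[int, int, int, int]


def walk_forward_folds(
    n_samples: int,
    n_splits: int,
    train_bars: int,
    test_bars: int,
    method: str = "rolling",
    gap_bars: int = 0,
    anchor: int = 0,
) -> List[Fold]:
    """Generate sequential, non-overlapping test windows with their train windows."""
    if method not in ("rolling", "expanding", "anchored"):
        raise ValueError(f"unknown method: {method}")

    # Only the anchored variant lets the caller choose the lookback origin.
    origin = anchor if method == "anchored" else 0

    needed = origin + train_bars + gap_bars + n_splits * test_bars
    if needed > n_samples:
        raise ValueError(f"need {needed} bars, have {n_samples}")

    folds: List[Fold] = []
    for i in range(n_splits):
        test_start = origin + train_bars + gap_bars + i * test_bars
        test_end = test_start + test_bars
        train_end = test_start - gap_bars  # leave a gap before the test window
        if method == "rolling":
            train_start = train_end - train_bars  # fixed-size window slides
        else:
            train_start = origin  # window grows from a fixed start
        folds.append((train_start, train_end, test_start, test_end))
    return folds


rolling = walk_forward_folds(100, 3, train_bars=30, test_bars=20, gap_bars=5)
expanding = walk_forward_folds(100, 3, 30, 20, method="expanding", gap_bars=5)
```

In this sketch every rolling fold trains on exactly 30 bars, while the expanding folds train on 30, 50, and 70 bars respectively.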

Walk-Forward Configuration

// POST /api/validation/walk-forward
{
  "strategy_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "plugin_key": "ml.xgboost",
  "model_params": { "max_depth": 6, "n_estimators": 200, "learning_rate": 0.05 },
  "feature_ids": ["f1000000-0000-0000-0000-000000000001"],
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2018-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z",
  "target_horizon": 1,
  "wf_config": {
    "method": "rolling",
    "n_splits": 5,
    "test_size": 0.2,
    "min_train_size": 0.3,
    "gap_bars": 5,
    "refit": true
  },
  "validation_config": {
    "min_sharpe": 0.3,
    "max_drawdown": 0.25,
    "profitable_fold_ratio": 0.5,
    "max_overfit_ratio": 3.0
  }
}

WalkForwardConfig parameters:

Parameter	Default	Description
`method`	`"rolling"`	Window type: `rolling`, `expanding`, or `anchored`
`n_splits`	`5`	Number of folds to generate
`test_size`	`0.2`	Fraction of total dataset length per test fold
`min_train_size`	`0.3`	Minimum train fraction (rolling/expanding only)
`gap_bars`	`0`	Bars of gap between train end and test start to prevent leakage
`refit`	`true`	Re-train the model from scratch on each fold’s train set

Per-Fold Results

For each fold the engine records both in-sample (IS) CV metrics and out-of-sample (OOS) metrics:

{
  "fold_idx": 2,
  "train_start": 0,
  "train_end": 756,
  "test_start": 756,
  "test_end": 1008,
  "is_metrics": {
    "mean_directional_accuracy": 0.58,
    "mean_mse": 0.0041
  },
  "oos_metrics": {
    "total_return": 0.087,
    "perf_cagr": 0.054,
    "perf_sharpe_ratio": 0.81,
    "perf_sortino_ratio": 1.12,
    "risk_max_drawdown": 0.091,
    "risk_volatility_annualised": 0.127
  }
}

Walk-Forward Aggregate

The engine aggregates OOS metrics across all folds:

{
  "mean_oos_sharpe": 0.74,
  "std_oos_sharpe": 0.22,
  "min_oos_sharpe": 0.41,
  "mean_oos_return": 0.064,
  "mean_oos_max_drawdown": 0.103,
  "worst_oos_drawdown": 0.184,
  "n_profitable_folds": 4,
  "n_folds": 5
}

Two stability flags and an overfitting score are also computed:

is_sharpe_stable — true if std_oos_sharpe / |mean_oos_sharpe| < 0.5. A high coefficient of variation suggests regime-dependent performance that will be unreliable live.
is_profitable_oos — true if mean_oos_sharpe > 0. The strategy must have positive risk-adjusted OOS returns on average.
overfitting_score — IS directional accuracy / |OOS Sharpe|. Values much greater than 1.0 indicate the model is fitting training noise.
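
The aggregate statistics and flags above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the engine's code: it assumes the population standard deviation is used for `std_oos_sharpe`, and that `overfitting_score` divides the mean IS directional accuracy by the absolute mean OOS Sharpe. The function name and the input values are hypothetical.

```python
import math
import statistics
from typing import Dict, List, Union


def aggregate_walk_forward(
    oos_sharpes: List[float],
    is_directional_accuracies: List[float],
) -> Dict[str, Union[float, bool]]:
    """Summarise per-fold OOS Sharpe ratios and derive the stability flags."""
    mean_sharpe = statistics.fmean(oos_sharpes)
    std_sharpe = statistics.pstdev(oos_sharpes)  # assumption: population std
    abs_mean = abs(mean_sharpe)

    # Coefficient of variation; treated as infinite when the mean is zero.
    cv = std_sharpe / abs_mean if abs_mean > 0 else math.inf

    mean_is_accuracy = statistics.fmean(is_directional_accuracies)
    overfitting_score = mean_is_accuracy / abs_mean if abs_mean > 0 else math.inf

    return {
        "mean_oos_sharpe": mean_sharpe,
        "std_oos_sharpe": std_sharpe,
        "min_oos_sharpe": min(oos_sharpes),
        "is_sharpe_stable": cv < 0.5,
        "is_profitable_oos": mean_sharpe > 0,
        "overfitting_score": overfitting_score,
    }


# Illustrative per-fold values, not output from a real run.
summary = aggregate_walk_forward(
    oos_sharpes=[0.5, 1.0, 1.5, 1.0],
    is_directional_accuracies=[0.6, 0.6, 0.6, 0.6],
)
```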

Combinatorial Purged Cross-Validation (CPCV)

Walk-forward produces one OOS path: a single chronological sequence of test folds. CPCV, as described by Marcos López de Prado in Advances in Financial Machine Learning (Chapter 12), generates all C(n, k) combinations of k test folds from n total folds, producing many distinct OOS paths. The distribution of Sharpe ratios across these paths enables two additional statistics:

Probability of Backtest Overfitting (PBO) — the fraction of CPCV paths whose OOS Sharpe is below the median IS Sharpe. A PBO above 0.5 means more than half the paths underperformed in OOS.
Deflated Sharpe Ratio (DSR) — the OOS Sharpe adjusted downward for the number of trials (paths) evaluated, analogous to the Bonferroni correction in hypothesis testing.
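
As a rough illustration, the PBO definition given above reduces to a counting exercise. The sketch below follows that simplified definition only; the PBO of Bailey et al. is defined differently (via rank logits over combinatorially symmetric splits), and the engine's exact computation is not shown in this document. The function name and sample values are hypothetical.

```python
import statistics
from typing import List


def simple_pbo(oos_sharpes: List[float], is_sharpes: List[float]) -> float:
    """Fraction of OOS path Sharpe ratios that fall below the median IS Sharpe."""
    if not oos_sharpes:
        raise ValueError("at least one OOS path is required")
    median_is = statistics.median(is_sharpes)
    n_below = sum(1 for sharpe in oos_sharpes if sharpe < median_is)
    return n_below / len(oos_sharpes)


# Illustrative values: median IS Sharpe is 1.0, and three of four paths fall below it.
pbo = simple_pbo(oos_sharpes=[0.2, 0.5, 0.9, 1.1], is_sharpes=[0.8, 1.0, 1.2])
```

With these inputs the result is 0.75, which would fail a `max_pbo` gate of 0.6.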

Why CPCV Catches What Walk-Forward Misses

CPCV applies two additional safeguards against label leakage that walk-forward does not enforce:

Purging

When the target y[t] is a target_horizon-bar forward return, the label at bar t overlaps with bars t+1 … t+horizon-1. Any training sample whose label window touches the test window is purged (removed from training).

Embargo

Bars immediately following the test window are embargoed from training to eliminate momentum/autocorrelation leakage. The embargo length is embargo_pct × n_samples bars after each test fold end.
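
Purging and embargoing can both be expressed as index exclusions around each test block. The sketch below is one possible reading, not the engine's implementation: it assumes a label at bar t depends on data through bar t + horizon (a conservative choice, so the engine's exact purge boundary may differ by one bar), that test blocks are end-exclusive, and that the embargo length is truncated to a whole number of bars. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def purged_train_indices(
    n_samples: int,
    test_blocks: Sequence[Tuple[int, int]],
    horizon: int,
    embargo_pct: float,
) -> List[int]:
    """Return training bar indices after removing test, purged, and embargoed bars."""
    embargo_bars = int(n_samples * embargo_pct)
    excluded = set()
    for start, end in test_blocks:
        # The test bars themselves never appear in training.
        excluded.update(range(start, end))
        # Purge: bars just before the test block whose forward-return
        # label window reaches into the test block.
        excluded.update(range(max(0, start - horizon), start))
        # Embargo: bars immediately after the test block.
        excluded.update(range(end, min(n_samples, end + embargo_bars)))
    return [i for i in range(n_samples) if i not in excluded]


# 200 bars, one test block [40, 50), 2-bar horizon, 1% embargo (2 bars).
train = purged_train_indices(200, [(40, 50)], horizon=2, embargo_pct=0.01)
```

Here bars 38 and 39 are purged, bars 40 to 49 are the test block, and bars 50 and 51 are embargoed, leaving 186 training bars.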

CPCV Configuration

// POST /api/validation/cpcv
{
  "strategy_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "plugin_key": "ml.xgboost",
  "model_params": { "max_depth": 6, "n_estimators": 200 },
  "feature_ids": ["f1000000-0000-0000-0000-000000000001"],
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2018-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z",
  "cpcv_config": {
    "n_splits": 6,
    "n_test_splits": 2,
    "embargo_pct": 0.01,
    "purge": true,
    "target_horizon": 1
  },
  "max_pbo": 0.6,
  "min_deflated_sharpe": 0.1,
  "min_sharpe": 0.2
}

CPCVConfig parameters:

Parameter	Default	Description
`n_splits`	`6`	Total folds `n`. Number of CPCV paths = `C(n, k)`
`n_test_splits`	`2`	Test folds per combination `k`. Must be `< n_splits`
`embargo_pct`	`0.01`	Fraction of total samples to embargo after each test fold
`purge`	`true`	Enable label-overlap purging
`target_horizon`	`1`	Forward-return horizon used for purge calculation

With n_splits=6 and n_test_splits=2 the engine generates C(6,2) = 15 distinct train/test splits, each producing one OOS performance path. Because the splits share folds, the 15 paths are correlated rather than statistically independent.
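
The split count can be checked by enumerating the fold combinations directly. This is a small sketch using Python's standard library; the `cpcv_splits` helper is hypothetical and works on fold numbers only, leaving out the mapping from folds to bar ranges as well as purging and embargoing.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple

Split = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (train_folds, test_folds)


def cpcv_splits(n_splits: int, n_test_splits: int) -> List[Split]:
    """Enumerate every way to choose n_test_splits test folds out of n_splits."""
    if not 0 < n_test_splits < n_splits:
        raise ValueError("n_test_splits must be between 1 and n_splits - 1")
    folds = range(n_splits)
    splits: List[Split] = []
    for test_folds in combinations(folds, n_test_splits):
        train_folds = tuple(f for f in folds if f not in test_folds)
        splits.append((train_folds, test_folds))
    return splits


splits = cpcv_splits(n_splits=6, n_test_splits=2)
assert len(splits) == comb(6, 2)  # 15 train/test splits
```

Each fold appears in the test set of 5 of the 15 splits, which is why results from different splits overlap.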

Validation Gates

Both the Walk-Forward and CPCV engines enforce configurable promotion gates. All gates must pass for the strategy status to advance to validated.

Walk-Forward Gates

Gate	Config Field	Default	Pass Condition
OOS Sharpe	`min_sharpe`	`0.3`	`mean_oos_sharpe ≥ min_sharpe`
OOS Max Drawdown	`max_drawdown`	`0.25`	`mean_oos_max_drawdown ≤ max_drawdown`
Profitable Folds	`profitable_fold_ratio`	`0.5`	`n_profitable_folds / n_folds ≥ ratio`
Overfitting Score	`max_overfit_ratio`	`3.0`	`overfitting_score ≤ max_overfit_ratio`
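
The pass conditions in the table above can be sketched as a small gate evaluator. This is an illustrative reconstruction, not the engine's code: the function name is hypothetical, and the output keys simply mirror the gate names used in the ValidationResult example elsewhere on this page.

```python
import operator
from typing import Any, Dict


def evaluate_walk_forward_gates(
    aggregate: Dict[str, float], config: Dict[str, float]
) -> Dict[str, Any]:
    """Compare aggregate OOS metrics with their thresholds, one gate at a time."""
    profitable_ratio = aggregate["n_profitable_folds"] / aggregate["n_folds"]
    # gate name -> (observed value, threshold, comparison that must hold)
    checks = {
        "oos_sharpe": (aggregate["mean_oos_sharpe"], config["min_sharpe"], operator.ge),
        "oos_max_drawdown": (
            aggregate["mean_oos_max_drawdown"], config["max_drawdown"], operator.le),
        "profitable_folds": (
            profitable_ratio, config["profitable_fold_ratio"], operator.ge),
        "overfit_ratio": (
            aggregate["overfitting_score"], config["max_overfit_ratio"], operator.le),
    }
    gate_results = {
        name: {"passed": compare(value, threshold), "value": value, "threshold": threshold}
        for name, (value, threshold, compare) in checks.items()
    }
    # The strategy passes only if every individual gate passes.
    passed = all(gate["passed"] for gate in gate_results.values())
    return {"passed": passed, "gate_results": gate_results}


default_gates = {
    "min_sharpe": 0.3,
    "max_drawdown": 0.25,
    "profitable_fold_ratio": 0.5,
    "max_overfit_ratio": 3.0,
}
aggregate = {
    "mean_oos_sharpe": 0.74,
    "mean_oos_max_drawdown": 0.103,
    "n_profitable_folds": 4,
    "n_folds": 5,
    "overfitting_score": 1.82,
}
result = evaluate_walk_forward_gates(aggregate, default_gates)
```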

CPCV Gates

Gate	Config Field	Default	Pass Condition
Mean OOS Sharpe	`min_sharpe`	`0.2`	`mean_oos_sharpe ≥ min_sharpe`
PBO	`max_pbo`	`0.6`	`pbo ≤ max_pbo`
Deflated Sharpe	`min_deflated_sharpe`	`0.1`	`deflated_sharpe ≥ min_deflated_sharpe`

ValidationResult

The response from both validation endpoints shares a common structure:

{
  "passed": true,
  "gate_results": {
    "oos_sharpe": {
      "passed": true,
      "value": 0.74,
      "threshold": 0.3
    },
    "oos_max_drawdown": {
      "passed": true,
      "value": 0.103,
      "threshold": 0.25
    },
    "profitable_folds": {
      "passed": true,
      "value": 0.8,
      "threshold": 0.5
    },
    "overfit_ratio": {
      "passed": true,
      "value": 1.82,
      "threshold": 3.0
    }
  },
  "aggregate": {
    "mean_oos_sharpe": 0.74,
    "std_oos_sharpe": 0.22,
    "n_profitable_folds": 4,
    "n_folds": 5,
    "overfitting_score": 1.82
  },
  "mlflow_run_id": "abc123def456",
  "fold_details": [ { "...": "per-fold IS and OOS metrics" } ]
}

When passed is true, the engine automatically transitions Strategy.status to validated. When passed is false, gate_results tells you exactly which thresholds were missed and by how much — enabling targeted remediation (e.g. relaxing signal thresholds, adding features, or increasing the training window).

MLflow Integration

Every validation is logged to MLflow as its own run:

Parameters: plugin_key, n_splits, test_size, gap_bars, refit, model hyperparameters
Metrics: all aggregate OOS metrics, per-gate pass/fail as floats (1.0 / 0.0), overfitting_score, validation_passed
Artifacts: per-fold metrics as a JSON file

The mlflow_run_id in the response links directly to the MLflow experiment UI for drill-down exploration.

Async Validation

Validation runs can take a long time on multi-year datasets with many folds. Set async=true on the request to dispatch the run to a Celery worker:

POST /api/validation/walk-forward?async=true

The endpoint returns a task_id immediately. Poll GET /api/validation/tasks/{task_id} for status and the final ValidationResult on completion.
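
A client-side polling loop might look like the sketch below. The shape of the task response is not specified here, so the `status` and `result` fields and the `SUCCESS` / `FAILURE` values (borrowed from Celery's task state names) are assumptions, as is the `wait_for_validation` helper itself. The HTTP call is injected as a `fetch` callable so the loop stays independent of any particular HTTP client.

```python
import time
from typing import Any, Callable, Dict


def wait_for_validation(
    task_id: str,
    fetch: Callable[[str], Dict[str, Any]],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll the task endpoint until the validation finishes, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    path = f"/api/validation/tasks/{task_id}"
    while True:
        payload = fetch(path)  # fetch performs the GET and returns parsed JSON
        status = payload.get("status")  # assumed field name
        if status == "SUCCESS":  # assumed terminal state
            return payload["result"]  # assumed to hold the ValidationResult
        if status == "FAILURE":  # assumed terminal state
            raise RuntimeError(f"validation task {task_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"validation task {task_id} did not finish in time")
        sleep(poll_seconds)
```

Any function that performs the GET request and returns the decoded JSON body can be passed as `fetch`, which also makes the loop straightforward to exercise with a stub.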

API Reference

For the complete endpoint reference — including per-fold result retrieval, gate configuration overrides, and CPCV path export — see the Validation API.
