The Full Menu: Every Out-of-Sample Test We Run to Counter Overfitting
Holdout, walk-forward, purged K-fold, PBO, Romano-Wolf, SPA, MC block-bootstrap, cluster stability, FDR. Each one catches a different overfitting failure mode. Skip any of them and the survivor is a coincidence.
There is no single out-of-sample test that catches every way a backtest can be overfit. There are at least nine, each one designed for a specific failure mode, and most retail quant platforms ship two of them. This is the full menu, named, with the failure mode each one is designed to catch and the reason skipping any of them lets that failure mode through. Names are taken directly from our open-source statistical gate engine; the math underneath each is the same math the literature has been writing about since the 1990s.
Why a Single Test Is Never Enough
Overfitting is not one phenomenon. It is at least four:
- Parameter overfitting. The chosen parameter values were tuned to in-sample noise.
- Selection overfitting. The chosen strategy was the best of many; the survivor is an order statistic, not a signal.
- Leakage. Information from the test period leaked into the training period through overlapping events, look-ahead features, or boundary effects.
- Path-dependence. The strategy survived this specific historical price path; on a path with the same statistical properties but a different ordering it would not.
Each failure mode needs its own test. Walk-forward catches parameter overfitting and, partially, leakage. It does not catch selection overfitting or path-dependence. Running only walk-forward on a search through 12,800 indicator combinations gives you a survivor that is overwhelmingly likely to be a selection artefact, regardless of how clean the walk-forward result looks. The fix is to layer the tests so each failure mode has at least one gate dedicated to catching it.
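To make the selection effect concrete, here is a minimal simulation, not taken from the engine. The strategy count, sample length, volatility, and seed are illustrative assumptions. Every "strategy" is pure noise by construction, yet the best of the batch still posts an attractive annualised Sharpe.

```python
import random
import statistics

def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of daily returns, assuming a zero risk-free rate."""
    sd = statistics.stdev(returns)
    if sd == 0:
        return 0.0
    return statistics.mean(returns) / sd * periods_per_year ** 0.5

def best_of_noise(n_strategies, n_days, seed=0):
    """Best and median Sharpe among strategies with zero true edge."""
    rng = random.Random(seed)
    sharpes = [
        annualised_sharpe([rng.gauss(0.0, 0.01) for _ in range(n_days)])
        for _ in range(n_strategies)
    ]
    return max(sharpes), statistics.median(sharpes)

# 1,000 noise strategies over one trading year (illustrative sizes).
best, median = best_of_noise(n_strategies=1000, n_days=252)
print(f"best Sharpe: {best:.2f}, median Sharpe: {median:.2f}")
```

The median sits near zero, as it should. The maximum is an order statistic, which is why a per-strategy test cannot vouch for it on its own.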
The Nine Gates
Our engine ships nine OOS / robustness gates as configurable filters. Each is a separate Rust module under src/ipc/gates/; each has a published version() string and a params_schema() so the parameters are inspectable. Below is what each one does and why you cannot skip it.
1. Holdout (holdout.rs)
Reference: Pardo (1992). Failure mode caught: the simplest form of in-sample overfitting. Reserves the last N% of the calendar (default 30%) as a single unseen test set. The strategy is parameterised on the first 70%, then evaluated on the unseen tail with no further tuning allowed. Pass criterion: positive post-cost edge on the holdout window.
Holdout is the cheapest gate (0.01× cost multiplier) and the weakest. A strategy can survive holdout and still be a selection-effect ghost — it just happens to have a winning tail. Holdout is the gate you run first because it eliminates the obvious failures fastest. It is not the gate you ship on.
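As a rough sketch of the holdout rule described above (not the code in holdout.rs): the function names are hypothetical, and the input is assumed to be a list of post-cost daily P&L values for a strategy whose parameters were fixed on the first 70%.

```python
def holdout_split(daily_pnl, holdout_frac=0.30):
    """Split a series into (train, holdout), reserving the last holdout_frac of days."""
    n_test = max(1, round(len(daily_pnl) * holdout_frac))
    cut = len(daily_pnl) - n_test
    return daily_pnl[:cut], daily_pnl[cut:]

def passes_holdout(daily_pnl, holdout_frac=0.30):
    """Pass if the post-cost edge on the unseen tail is positive."""
    _, tail = holdout_split(daily_pnl, holdout_frac)
    return sum(tail) > 0.0

# A strategy that looks good in-sample but loses on the tail fails the gate.
fades = [1.0] * 70 + [-0.5] * 30
holds = [0.2] * 70 + [0.5] * 30
print(passes_holdout(fades), passes_holdout(holds))
```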
2. Walk-Forward (walk_forward.rs)
Reference: Pardo (1992); engine v2.0.0 switched to true rolling. Failure mode caught: parameter overfitting against a single time period. Splits the calendar into K folds (default 6); at each step it trains on fold[k−1] and tests on fold[k]. Reports the OOS win rate across the K−1 rolling steps, plus the train→test sign-flip rate. A strategy whose Sharpe sign flips between training and testing in 40% or more of folds is unstable regardless of the absolute OOS number.
The pass criterion is two-sided on win rate, with min_win_rate (default 0.60) and max_win_rate (default 1.0), plus a cap on the flip rate, max_flip_rate (default 0.40). The upper bound on win rate exists because a strategy that wins 100% of folds is more often a leak than a miracle.
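The rolling steps and the pass criterion can be sketched as follows. This is a simplified illustration, not the walk_forward.rs implementation. It assumes the strategy is already parameterised and summarised as a daily P&L series, so per-fold re-fitting is omitted, and it uses the sign of the summed fold P&L, which matches the sign of the fold's Sharpe.

```python
def split_folds(series, k):
    """Split a series into k contiguous folds; the last fold absorbs the remainder."""
    size = len(series) // k
    return [
        series[i * size:(i + 1) * size] if i < k - 1 else series[i * size:]
        for i in range(k)
    ]

def walk_forward(daily_pnl, k=6, min_win_rate=0.60, max_win_rate=1.0, max_flip_rate=0.40):
    """Win rate and train-to-test sign-flip rate over the k-1 rolling steps."""
    folds = split_folds(daily_pnl, k)
    steps = k - 1
    wins = 0
    flips = 0
    for i in range(1, k):
        train_edge = sum(folds[i - 1])  # stands in for the fitted fold
        test_edge = sum(folds[i])       # the unseen next fold
        if test_edge > 0:
            wins += 1
        if (train_edge > 0) != (test_edge > 0):
            flips += 1
    win_rate = wins / steps
    flip_rate = flips / steps
    passed = min_win_rate <= win_rate <= max_win_rate and flip_rate <= max_flip_rate
    return {"win_rate": win_rate, "flip_rate": flip_rate, "passed": passed}

steady = [0.1] * 120
alternating = ([0.1] * 20 + [-0.1] * 20) * 3
print(walk_forward(steady))
print(walk_forward(alternating))
```

Lowering max_win_rate below 1.0 turns the upper bound into an active leak check: the steady series above would then be flagged rather than passed.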
3. Purged K-Fold (purged_kfold.rs)
Reference: López de Prado (2018), Advances in Financial Machine Learning, ch. 7. Failure mode caught: regime non-robustness and CV leakage. Standard K-fold with two corrections — purge of training events whose [entry, exit] window intersects the test fold, and an embargo of e days after the test fold (default 1 % of calendar). Pass criterion: positive edge in the majority of folds (win rate ≥ 0.50).
Purged K-fold complements walk-forward; it is not a substitute. The test folds are interspersed across the calendar, so a strategy that only worked in 2021 fails it; a strategy that worked uniformly across 2018–2024 passes it. Walk-forward measures forecasting; purged K-fold measures robustness.
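The purge and embargo corrections reduce to two interval checks per event. The sketch below is an assumption-laden illustration rather than the engine's code: events are hypothetical (entry_day, exit_day) pairs with inclusive bounds, the test fold is the half-open range [test_start, test_end), and the helper names are made up for this example.

```python
def fold_bounds(n_days, k, fold):
    """Half-open [start, end) day range of one of k contiguous folds."""
    size = n_days // k
    start = fold * size
    end = n_days if fold == k - 1 else start + size
    return start, end

def purged_train_events(events, test_start, test_end, embargo_days):
    """Keep only events that are safe to train on for the given test fold."""
    kept = []
    for entry, exit_ in events:
        # Purge: the event's [entry, exit] window intersects the test fold.
        overlaps = entry < test_end and exit_ >= test_start
        # Embargo: the event starts within embargo_days after the test fold.
        embargoed = test_end <= entry < test_end + embargo_days
        if not overlaps and not embargoed:
            kept.append((entry, exit_))
    return kept

n_days = 100
embargo = max(1, round(0.01 * n_days))       # 1% of the calendar
start, end = fold_bounds(n_days, k=5, fold=2)
events = [(10, 20), (30, 39), (35, 42), (45, 50), (20, 80), (60, 65), (61, 70)]
print(purged_train_events(events, start, end, embargo))
```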
4. PBO — Probability of Backtest Overfitting (pbo.rs)
Reference: Bailey, Borwein, López de Prado, Zhu (2017). Failure mode caught: selection overfitting. This is the gate that catches the failure walk-forward cannot.
The mechanism: split the calendar into S blocks (default S=14, capped at 20, where the 2^20 ≈ 1 M possible block subsets include C(20, 10) = 184,756 balanced splits). Enumerate every one of the C(S, S/2) ways to split the blocks into a training half and a test half (3,432 splits at S=14). For each split: identify the in-sample winner across the strategy search; rank that winner out-of-sample. Count the splits where the in-sample winner ranks below the median out-of-sample. PBO is that count divided by the total number of splits.
If PBO < 0.5, the search procedure is not systematically overfitting. If PBO ≥ 0.5, the in-sample winner lands in the bottom half out-of-sample at least as often as not, so it is statistically a fluke regardless of how good its individual walk-forward result was. PBO is the most expensive gate (0.45× cost multiplier) because it is combinatorial. It is also the only gate that audits the search rather than the strategy. Skipping it is the single most common form of methodological malpractice in retail quant.
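The enumeration can be sketched in a few lines. This is a simplified illustration, not the pbo.rs implementation: performance is measured as summed return per block (the paper works with a general performance statistic such as Sharpe), and the two demo panels are invented purely to show the extremes.

```python
import random
from itertools import combinations

def pbo(panel, n_blocks=14):
    """Share of balanced block splits where the in-sample winner ranks below the OOS median.

    panel[i][d] is strategy i's return on day d.
    """
    n_strategies = len(panel)
    size = len(panel[0]) // n_blocks  # days per block; trailing remainder days are dropped
    block_perf = [
        [sum(row[b * size:(b + 1) * size]) for b in range(n_blocks)]
        for row in panel
    ]
    below = 0
    total = 0
    for train in combinations(range(n_blocks), n_blocks // 2):
        test = [b for b in range(n_blocks) if b not in train]
        is_perf = [sum(p[b] for b in train) for p in block_perf]
        oos_perf = [sum(p[b] for b in test) for p in block_perf]
        winner = max(range(n_strategies), key=is_perf.__getitem__)
        # Rank 1 = worst out-of-sample, n_strategies = best.
        rank = 1 + sum(x < oos_perf[winner] for x in oos_perf)
        if rank / (n_strategies + 1) < 0.5:
            below += 1
        total += 1
    return below / total

# Each "chaser" wins exactly one block and loses everywhere else.
chasers = [[3.0 if d == i else -2.0 for d in range(4)] for i in range(4)]
pbo_chasers = pbo(chasers, n_blocks=4)

# One strategy with a steady edge among nine low-noise distractors.
rng = random.Random(0)
steady = [[0.01] * 140] + [[rng.gauss(0.0, 0.001) for _ in range(140)] for _ in range(9)]
pbo_steady = pbo(steady)  # default S=14 enumerates 3,432 splits
print(pbo_chasers, pbo_steady)
```

In the chaser panel the in-sample winner is always a strategy whose single good block fell in the training half, so it ranks last out-of-sample on every split. In the steady panel the same strategy wins in-sample and out-of-sample on every split.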
5. SPA — Superior Predictive Ability (spa.rs)
Reference: Hansen (2005), refining White (2000)’s Reality Check. Failure mode caught: "is the best strategy in this universe meaningfully better than the benchmark?" Tests the null hypothesis that no strategy in the search has a positive expected performance advantage over the benchmark, controlling for the multiplicity of the search.
SPA is the formal answer to "I tried 100 strategies and the best one had Sharpe 1.8; is that real?" The answer is a p-value that incorporates the size of the search. SPA, like Romano-Wolf below, is a multiple-comparison-with-benchmark test. It runs after the cascade has narrowed survivors but before any strategy is shipped.
6. Romano-Wolf Step-Down (romano_wolf.rs)
Reference: Romano & Wolf (2005). Failure mode caught: family-wise error rate across simultaneous strategy comparisons. A step-down procedure that controls the probability of any false rejection across a family of strategies, while being more powerful than Bonferroni and avoiding Bonferroni’s collapse for large families.
Romano-Wolf and SPA are mutually exclusive in the cascade configuration — the platform refuses to run both because they answer overlapping questions and stacking them induces a known statistical conflict. The runner’s preflight validator catches this and refuses the job at submit time rather than at execution.
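A submit-time check of this kind is small. The following is a hypothetical sketch of such a preflight rule, not the runner's actual validator: the function name, the exclusion table, and the representation of a job as a list of gate names are all assumptions.

```python
# Pairs of gates that must not be enabled together (assumed representation).
MUTUALLY_EXCLUSIVE = [("spa", "romano_wolf")]

def preflight(enabled_gates):
    """Reject a gate set containing a forbidden pair before any work is scheduled."""
    enabled = set(enabled_gates)
    for a, b in MUTUALLY_EXCLUSIVE:
        if a in enabled and b in enabled:
            raise ValueError(f"gates '{a}' and '{b}' are mutually exclusive; enable only one")
    return sorted(enabled)

print(preflight(["walk_forward", "pbo", "spa", "fdr"]))
```

Failing at submit time means a misconfigured job costs nothing; failing at execution would waste the indicator pass that precedes the gates.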
7. Monte Carlo Block-Bootstrap (mc_block_bootstrap.rs)
Reference: Politis & Romano (1994), stationary bootstrap. Failure mode caught: path-dependence. Resamples the daily-return panel using geometric block lengths to preserve serial correlation, generates B synthetic histories, evaluates the strategy on each, and reports the empirical distribution.
This is the gate that answers "would the strategy have worked on a different but statistically-equivalent path?" If the large majority of bootstrap paths (say 95%) still produce a positive Sharpe, the strategy is robust to path resampling. If the real Sharpe sits far above the bulk of the bootstrap distribution, the live equity curve was a lucky path and the same statistical setup gives middling results most of the time.
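A minimal sketch of the stationary bootstrap follows, with several simplifying assumptions: it resamples a single strategy's daily returns rather than the full panel, the mean block length is a fixed argument rather than derived from the autocorrelation function, and the function names and synthetic data are illustrative only.

```python
import random
import statistics

def stationary_bootstrap_indices(n, mean_block_len, rng):
    """Index path with geometrically distributed block lengths, wrapping circularly."""
    p_new_block = 1.0 / mean_block_len
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p_new_block:
            idx.append(rng.randrange(n))   # start a new block at a random day
        else:
            idx.append((idx[-1] + 1) % n)  # continue the current block
    return idx

def bootstrap_sharpes(daily_returns, n_paths=200, mean_block_len=10, seed=0):
    """Sorted daily Sharpe ratios of n_paths resampled histories."""
    rng = random.Random(seed)
    n = len(daily_returns)
    sharpes = []
    for _ in range(n_paths):
        path = [daily_returns[i] for i in stationary_bootstrap_indices(n, mean_block_len, rng)]
        sd = statistics.pstdev(path)
        sharpes.append(statistics.mean(path) / sd if sd > 0 else 0.0)
    return sorted(sharpes)

# Synthetic returns with a genuine positive drift (illustrative data only).
data_rng = random.Random(42)
returns = [data_rng.gauss(0.003, 0.01) for _ in range(500)]
dist = bootstrap_sharpes(returns)
frac_positive = sum(s > 0 for s in dist) / len(dist)
p05 = dist[int(0.05 * len(dist))]
print(f"share of paths with positive Sharpe: {frac_positive:.2f}, 5th percentile: {p05:.3f}")
```

Because the generator is seeded, the same inputs reproduce the same distribution on every run.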
8. Cluster Stability (cluster_stability.rs)
Failure mode caught: regime-clustered overfitting. Many "winning" strategies are winners only in 2–3 contiguous months of the calendar; the rest of their P&L is flat or slightly negative. Cluster stability identifies whether the strategy’s edge is concentrated in a small number of calendar clusters or spread across the year. A high concentration is a tell that the strategy fit a regime, not a process.
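The text does not specify the concentration metric the gate uses, so the sketch below picks one purely for illustration: the share of all positive P&L contributed by the best few months. The function name, the choice of monthly buckets, and top_k=3 are assumptions, not the cluster_stability.rs definition.

```python
def top_cluster_share(monthly_pnl, top_k=3):
    """Share of total positive P&L contributed by the top_k best months."""
    gains = sorted((x for x in monthly_pnl if x > 0), reverse=True)
    total = sum(gains)
    if total == 0:
        return 0.0
    return sum(gains[:top_k]) / total

spread = [1.0] * 12                           # edge spread evenly over the year
concentrated = [10.0, 9.0, 8.0] + [-0.5] * 9  # all gains come from three months
print(top_cluster_share(spread), top_cluster_share(concentrated))
```

A share near top_k divided by the number of months indicates an evenly spread edge; a share near 1 indicates the strategy fit a regime.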
9. FDR — Benjamini-Hochberg (fdr.rs)
Reference: Benjamini & Hochberg (1995). Failure mode caught: selection effect at the multiple-testing level. After the cascade has assigned a p-value to each survivor, BH-FDR adjusts those p-values for the size of the search. A strategy with raw p = 0.04 from a search of 12,800 candidates can end up with an FDR-adjusted q closer to 0.6 (the exact value depends on how many candidates have p-values at or below it), which is no longer significant.
FDR is the gate that converts an in-sample p-value into a publication-grade q-value. Without it, every reported p-value is a single-test number being interpreted as if it were a search-wide claim. With it, the survivor either earns the title or doesn’t.
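The Benjamini-Hochberg adjustment itself is short. The sketch below is a plain textbook version, not the fdr.rs code: each p-value is scaled by m / rank, then a running minimum from the largest p-value downward keeps the q-values monotone.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

print(bh_qvalues([0.01, 0.04, 0.03, 0.20]))
```

The scaling factor explains why the adjusted value depends on the whole search: with m = 12,800 and p = 0.04, the unadjusted product p × m is 512, so the q-value is 512 divided by the candidate's rank.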
The Cascade Order Matters
The gates are not interchangeable. They run in a specific order because each one is most informative on the population the previous gate has already filtered.
| Stage | Gates | What survives |
|---|---|---|
| 1. Cheap pre-filter | holdout, hit_rate, percentile_floor, friction | Strategies with non-pathological basic stats |
| 2. Survival-OOS | walk_forward, purged_kfold, pbo | Strategies whose edge holds on unseen time and whose search wasn’t overfit |
| 3. Path / regime robustness | mc_block_bootstrap, cluster_stability | Strategies robust to path resampling and not concentrated in a 2-month window |
| 4. Multiple-comparison | spa or romano_wolf, then fdr | Strategies that survive search-wide null hypotheses |
A strategy that exits stage 4 is not guaranteed to make money. Markets are non-stationary; a real edge can decay. But it is the closest a backtest can come to a 5σ result, and it is what every Student One job is forced through by default.
The Common Plumbing: Purge, Embargo, Performance Panel
Every gate in stages 2–3 builds on the same data structure: a performance panel of shape [n_qualifiers × n_days], where cell [i, d] is the sum of post-cost daily MFE for qualifier i’s closed events with entry-day d. The panel is built once, reused everywhere. This is what makes the cascade fast enough to ship: the expensive operation is the indicator pass, not the gates.
Two helpers run alongside the panel: build_event_block_table() pre-computes per-event block coordinates, and build_leak_table() aggregates event contributions by (entry_block, exit_block) pairs. Both are used by every CV-style gate to compute the corrections that purge and embargo require. The result is that switching purge_overlapping_events from off to on is a single boolean in the gate config; the platform handles the per-event accounting.
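The two structures can be pictured with a small sketch. It is not the engine's build_event_block_table() or build_leak_table(); the function names here, the event tuple layout (qualifier, entry_day, exit_day, post-cost value), and the fixed block size are assumptions made for illustration.

```python
from collections import defaultdict

def build_panel(events, n_qualifiers, n_days):
    """panel[i][d] = sum of post-cost values of qualifier i's events with entry day d."""
    panel = [[0.0] * n_days for _ in range(n_qualifiers)]
    for qualifier, entry_day, _exit_day, value in events:
        panel[qualifier][entry_day] += value
    return panel

def leak_table(events, block_size):
    """Sum event values by (entry_block, exit_block).

    Off-diagonal keys are events that straddle a block boundary, which is
    the contribution a purge has to subtract.
    """
    table = defaultdict(float)
    for _qualifier, entry_day, exit_day, value in events:
        table[(entry_day // block_size, exit_day // block_size)] += value
    return dict(table)

events = [(0, 1, 2, 0.5), (0, 1, 3, 0.25), (1, 4, 6, -0.1), (1, 3, 5, 1.0)]
panel = build_panel(events, n_qualifiers=2, n_days=8)
leaks = leak_table(events, block_size=4)
print(panel[0][1], leaks)
```

Building both once and reusing them is what lets a purge be applied by subtracting table entries instead of rescanning every event per fold.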
The Permutation and Bootstrap Engine
Underneath the gates sit two statistical primitives:
- Phipson-Smyth p-values (permutation.rs): the estimator p = (1 + count_passes) / (1 + k_replicates), which is valid and slightly conservative rather than unbiased. The 1 + in the numerator ensures that a strategy that beats the null on every single permutation does not collapse to p = 0 (which would be infinite-precision nonsense).
- Stationary block bootstrap (Politis-Romano 1994): geometric block-length resampling with serial-correlation preservation. The block length is computed from the autocorrelation function of the panel, not chosen by the operator.
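The permutation p-value formula translates directly into code. In this sketch the function name and argument layout are assumptions, and the count in the numerator is taken to be the number of null replicates whose statistic is at least as large as the observed one.

```python
def phipson_smyth_p(observed, null_stats):
    """p = (1 + b) / (1 + k), where b counts null replicates >= observed."""
    count = sum(s >= observed for s in null_stats)
    return (1 + count) / (1 + len(null_stats))

# Beating every one of 999 replicates gives the smallest reportable p, not zero.
print(phipson_smyth_p(2.0, [0.1] * 999))
```

The floor of 1 / (1 + k) also makes the resolution of the test explicit: a smaller p-value requires more replicates, not a better strategy.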
Both primitives are deterministic given a seed; the seed is recorded with every job. A reviewer can reproduce any cascade output bit-for-bit from the job manifest. This is the property that distinguishes a research artefact from a marketing claim.
The Modes
Not every job runs every gate. The platform ships three preset cascade modes:
- Quick: survival-OOS off. For exploratory sweeps where the operator wants to see indicator behaviour before committing to the full cascade.
- Pro: walk_forward as the per-strategy survival gate. Adequate for individual strategy validation; misses selection-process overfitting.
- Survival: pbo as the survival-OOS gate. The default for any job whose output will be promoted to position-sizing or live trading. Catches selection overfitting that Pro misses.
The modes are not opinions about which gate is "best." They are bands on cost vs rigour. The strictest mode runs the entire menu; the cheapest skips the combinatorial enumeration. The choice is a budget decision, and the cascade records which mode was used so reviewers can re-run the strict version on candidates that survived the cheap one.
What This Buys You
An ordinary backtest tells you: "this strategy made money on this history." A backtest passed through this cascade tells you: "this strategy made money on this history, and survived a permutation null, and survived rolling walk-forward with sign-flip checking, and survived purged K-fold for regime robustness, and the search procedure that produced it is not overfit at the selection level, and the equity curve is not a single lucky path, and the edge is not concentrated in a 2-month cluster, and the family-wise comparison against the search universe is significant, and the FDR-adjusted q-value remains below 0.05."
That is nine independent ways for the strategy to be wrong, each one specifically engineered to catch a failure mode the others miss. A strategy that clears the cascade is not guaranteed to make money. It is, however, the closest a quant pipeline can come to handing you something where the historical fit cannot trivially be explained by selection effect, leakage, lucky path, regime concentration, or in-sample tuning.
If your current pipeline ships strategies that have been validated by walk-forward alone, you are shipping noise more often than you think. The fix is the rest of the menu. Each gate has a name, a paper, and an open-source implementation. There is no longer an excuse to run only one of them.