Benjamini-Hochberg FDR: The Multiple-Testing Correction Every Backtester Forgets
Why testing 100,000 indicator configurations without FDR correction guarantees you will "discover" false signals — and how to fix it
If you test one trading signal that has no real edge at p < 0.05, you have a 5% chance of a false positive. If you test 100,000 such signals at p < 0.05, you are all but guaranteed thousands of false positives. The Benjamini-Hochberg False Discovery Rate (FDR) procedure is a standard statistical correction for this, and almost no retail backtesting platform applies it.
The Multiple-Testing Problem
A significance threshold of p < 0.05 means: if the null hypothesis is true (no real edge), there is a 5% probability of observing a result at least this extreme by chance. Run the test once, and that is a tolerable error rate. Run it 100,000 times, and you expect about 5,000 false positives (100,000 × 0.05) even when nothing real is happening.
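The arithmetic above can be checked with a short simulation. This is a minimal sketch that assumes every tested signal is a true null and that the tests are independent, so p-values are uniform on [0, 1]; the function names are illustrative only.

```python
import numpy as np

def expected_false_positives(m: int, alpha: float) -> float:
    """Expected count of false positives when all m nulls are true."""
    return m * alpha

def prob_at_least_one(m: int, alpha: float) -> float:
    """P(at least one false positive) across m independent null tests."""
    return 1.0 - (1.0 - alpha) ** m

rng = np.random.default_rng(0)
m, alpha = 100_000, 0.05

# Under the null hypothesis, p-values are uniformly distributed on [0, 1].
null_p = rng.uniform(size=m)
observed = int((null_p < alpha).sum())

print(expected_false_positives(m, alpha))  # about 5,000
print(prob_at_least_one(20, alpha))        # about 0.64 with only 20 tests
print(observed)                            # near 5,000; exact value depends on the seed
```

Even 20 uncorrected tests give roughly a two-in-three chance of at least one spurious "signal".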
This is not a subtle effect. It is the dominant source of "discovered" strategies that fail in live trading. Every exhaustive parameter sweep that does not correct for multiple testing is producing a list dominated by noise survivors.
What FDR Controls
The False Discovery Rate is the expected proportion of false positives among all positive results. If you call 100 configurations "significant" with FDR controlled at 5%, you expect no more than about 5 of them to be false positives on average. This is a guarantee in expectation, not a promise about any single run, and it does not tell you which of the 100 are the false ones.
FDR is the appropriate target for exploratory parameter sweeps. Stricter family-wise error rate controls (Bonferroni, Holm) bound the probability of even a single false positive, and with a large test count they become so conservative that almost nothing is called significant. FDR keeps statistical power while bounding false discoveries proportionally.
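To make the contrast concrete, the sketch below compares the single Bonferroni cutoff (α / m) with the rank-dependent Benjamini-Hochberg cutoff (k × α / m). The values of m and α are chosen for illustration, and the helper names are hypothetical.

```python
def bonferroni_cutoff(alpha: float, m: int) -> float:
    """Family-wise cutoff: every test must clear the same alpha / m bar."""
    return alpha / m

def bh_cutoff(alpha: float, m: int, k: int) -> float:
    """BH cutoff for the k-th smallest p-value: k * alpha / m."""
    return k * alpha / m

m, alpha = 100_000, 0.05
print(f"Bonferroni: p <= {bonferroni_cutoff(alpha, m):.1e}")
for k in (1, 100, 1_000):
    # The bar relaxes as more small p-values accumulate.
    print(f"BH, rank {k}: p <= {bh_cutoff(alpha, m, k):.1e}")
```

At rank 1 the two cutoffs coincide; the BH bar then loosens in proportion to the number of candidates that pass, which is where its extra power comes from.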
The Benjamini-Hochberg Procedure
The mechanics:
- Run all m hypothesis tests, collect p-values
- Sort p-values ascending: p(1) ≤ p(2) ≤ ... ≤ p(m)
- For each rank k, compute the BH threshold: k × α / m
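- Find the largest rank k* such that p(k*) ≤ k* × α / m; if no rank qualifies, reject nothing
- Reject the null for all tests with rank ≤ k*, including any whose own p-value missed its own threshold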
The result: a calibrated set of "discovered" configurations where the expected false-positive proportion is bounded by α, provided the tests are independent or positively dependent.
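The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the code of any particular platform; the function name and the example p-values are made up for the demonstration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask marking the nulls rejected by the BH step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices that sort p ascending
    ranked = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m   # k * alpha / m for k = 1..m
    passing = np.nonzero(ranked <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:
        k = passing[-1]                            # largest (0-based) rank that passes
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject

# Unsorted illustrative p-values from eight hypothetical configurations.
p_vals = [0.6, 0.021, 0.001, 0.8, 0.022, 0.5, 0.02, 0.7]
print(benjamini_hochberg(p_vals, alpha=0.05))
```

In this example the p-values 0.02 and 0.021 miss their own rank thresholds (0.0125 and 0.01875), but 0.022 clears the rank-4 threshold of 0.025, so all four smallest p-values are rejected. That step-up behavior is what separates BH from a simple per-rank filter.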
What This Looks Like in Practice
Suppose you run an exhaustive RSI sweep: periods 2 to 14,000, with oversold and overbought thresholds each in 1-point increments. That's roughly 14,000 × 100 × 100 = 140 million configurations. Without FDR, even at p < 0.01, you would expect about 1.4 million false positives if no configuration had a real edge. With BH-FDR at α = 0.05, the procedure dynamically computes a much tighter per-test threshold so that the expected fraction of false positives among called survivors stays at or below 5%.
In typical sweeps, the BH-corrected threshold ends up at p < 1e-7 or tighter. The number of "significant" configurations drops from millions to dozens or hundreds, and the expected share of noise among those survivors is bounded at 5% rather than left uncontrolled.
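The numbers in this example can be reproduced directly. The sketch below uses the rounded configuration count from the text and shows how the BH cutoff depends on how many configurations survive; the survivor counts in the loop are illustrative, not measured results.

```python
m = 14_000 * 100 * 100          # rounded configuration count from the text
alpha = 0.05

# Expected false positives from a naive p < 0.01 filter if every null is true.
naive_expected_fp = m * 0.01

def bh_cutoff(k: int) -> float:
    """Largest p-value BH accepts when exactly k configurations survive."""
    return k * alpha / m

print(f"configurations: {m:,}")
print(f"naive expected false positives: {naive_expected_fp:,.0f}")
for k in (1, 50, 280, 1_000):   # hypothetical survivor counts
    print(f"{k:>5} survivors -> cutoff p <= {bh_cutoff(k):.2e}")
```

Under these assumptions, a cutoff near 1e-7 corresponds to roughly 280 survivors, which is consistent with the "dozens or hundreds" range described above.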
Why Platforms Skip This
Retail backtesting platforms skip FDR correction for three reasons:
- Marketing — "we found 1.4 million profitable configurations" sells better than "we found 47 statistically defensible configurations"
- Workflow — single-pass optimizers produce one "best" configuration, not a corrected family of survivors, so there is no list to correct
- Methodological awareness: many platform developers come from software engineering backgrounds rather than from fields such as biostatistics, where FDR has been standard practice for two decades
The result: an "AI-discovered strategy" or "optimized indicator preset" on a retail platform has, in most cases, been found without multiple-testing correction. Without that correction, the statistical claim behind it is empty.
Romano-Wolf and Other Alternatives
For very high-dimensional parameter spaces with strong dependence structure (where individual tests are not independent), the Romano-Wolf bootstrap procedure controls the family-wise error rate less conservatively than Bonferroni or Holm because it accounts for cross-test correlation. Student One supports both BH-FDR and Romano-Wolf gates, with BH as the default and Romano-Wolf available when the configuration space exhibits high correlation (e.g., consecutive periods of the same indicator).
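For intuition, here is a simplified sketch of a step-down max-statistic adjustment in the spirit of Romano-Wolf. It is not Student One's implementation. It runs on synthetic correlated returns, and it uses a plain i.i.d. row bootstrap for brevity, whereas real return series would usually call for a block bootstrap to respect serial dependence. All names and parameters are assumptions made for the example.

```python
import numpy as np

def stepdown_maxt_pvalues(t_obs, t_boot):
    """Step-down max-|t| adjusted p-values.

    t_obs:  shape (m,)   observed test statistics
    t_boot: shape (B, m) bootstrap statistics, centered to mimic the null
    """
    t_obs = np.abs(np.asarray(t_obs, dtype=float))
    t_boot = np.abs(np.asarray(t_boot, dtype=float))
    B, m = t_boot.shape
    order = np.argsort(-t_obs)                    # most significant first
    boot_sorted = t_boot[:, order]
    # At step j, take the max over hypotheses not yet removed (order[j:]).
    max_remaining = np.maximum.accumulate(boot_sorted[:, ::-1], axis=1)[:, ::-1]
    exceed = (max_remaining >= t_obs[order]).sum(axis=0)
    p_step = (exceed + 1) / (B + 1)
    p_sorted = np.maximum.accumulate(p_step)      # enforce monotone p-values
    p_adj = np.empty(m)
    p_adj[order] = p_sorted
    return p_adj

def t_stats(x, center=0.0):
    """Column-wise t-statistics of the mean against `center`."""
    return (x.mean(axis=0) - center) / (x.std(axis=0, ddof=1) / np.sqrt(len(x)))

rng = np.random.default_rng(1)
T, m, B = 500, 20, 500
common = rng.normal(size=(T, 1))                  # shared factor: correlated tests
returns = 0.7 * common + 0.7 * rng.normal(size=(T, m))
returns[:, 0] += 0.4                              # one synthetic configuration with an edge

t_obs = t_stats(returns)
mean_obs = returns.mean(axis=0)
t_boot = np.empty((B, m))
for b in range(B):
    sample = returns[rng.integers(0, T, size=T)]  # i.i.d. resampling of rows
    t_boot[b] = t_stats(sample, center=mean_obs)  # centering imposes the null

p_adj = stepdown_maxt_pvalues(t_obs, t_boot)
print(np.round(p_adj, 3))
```

Because the maximum is taken across correlated statistics drawn from the same resampled rows, the adjustment reflects the dependence between configurations instead of assuming the worst case.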
How Student One Applies FDR
Every exhaustive sweep runs through the FDR gate automatically. The output is two lists: configurations called significant after BH correction, and configurations that did not pass the procedure. The output metadata documents:
- Total tests performed (m)
- Target FDR level (α)
- The actual corrected p-value threshold
- Per-configuration raw p-value and BH-adjusted q-value
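- Citation: Benjamini, Y. and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing", Journal of the Royal Statistical Society: Series B, 57(1), 289-300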
This is the structure expected by institutional research workflows and academic peer review.
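A report with this shape could be assembled as follows. This is a hypothetical sketch: the field names are invented for illustration and do not describe Student One's actual output format. It assumes the "corrected threshold" is reported as the largest raw p-value still called significant.

```python
import numpy as np

def bh_report(p_values, alpha=0.05):
    """Compute BH-adjusted q-values and summary metadata for a sweep."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Scale each sorted p-value by m / rank, then take a running minimum
    # from the largest rank downward so q-values are monotone in p.
    scaled = ranked * m / np.arange(1, m + 1)
    q_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    q = np.empty(m)
    q[order] = q_sorted
    significant = q <= alpha
    threshold = float(p[significant].max()) if significant.any() else None
    return {
        "total_tests": m,                  # m
        "target_fdr": alpha,               # alpha
        "p_value_threshold": threshold,    # largest raw p still significant
        "raw_p": p.tolist(),
        "q_value": q.tolist(),
        "significant": significant.tolist(),
    }

# Illustrative p-values from eight hypothetical configurations.
p_vals = [0.6, 0.021, 0.001, 0.8, 0.022, 0.5, 0.02, 0.7]
report = bh_report(p_vals, alpha=0.05)
for key in ("total_tests", "target_fdr", "p_value_threshold", "significant"):
    print(key, report[key])
```

Reporting q-values alongside raw p-values lets a reader re-apply a different FDR level without rerunning the sweep: a configuration is significant at level α exactly when its q-value is at most α.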
Summary
Without multiple-testing correction, exhaustive parameter enumeration is just an industrial-scale fishing expedition. With Benjamini-Hochberg FDR (or Romano-Wolf for high-correlation spaces), the same enumeration becomes a calibrated statistical procedure. The difference is whether your discovered signals survive live deployment or fail within weeks.