Walk-forward backtest: 927 of 1001 tickers fetched (500 SP500 + 501 Russell 1000 supplement)
We pre-registered a single hypothesis: companies labelled near_critical by the Phase Detector should produce a different risk-adjusted forward return than the rest of a ~1000-ticker U.S. equity universe across a 5-year walk-forward (2020-2025). The result: no alpha detected at scale. The near_critical cohort's Sharpe lift over the equal-weight benchmark is -0.072 (p = 0.569). Per W7-D Track A, this confirms the pivot to structured-research-narrative positioning. We publish the null openly.
Headline numbers (monthly forward hold)
Across 59 monthly snapshots from 2020-12-01 to 2025-12-01 (927/1001 tickers fetched), the Phase Detector’s near_critical cohort (Sharpe 0.70) underperforms the equal-weight benchmark (Sharpe 0.77) by 0.072 Sharpe units (p = 0.569), well below the W7-D +0.5 Sharpe-lift threshold that would justify alpha-screener positioning. The other (stable) cohort nominally outperforms the benchmark (Sharpe 0.93, lift +0.155, p = 0.801), but no comparison clears α = 0.05. The headline test (a paired t-test on monthly cohort-minus-benchmark differences) confirms that, on this specification, the Phase Detector label does not carry tradable directional information. Per the W7-D pre-committed pivot plan, we proceed with structured-research-narrative positioning, with this backtest result published as the credibility moat.
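The headline statistics can be sketched in a few lines. This is a minimal illustration, not the backtest code itself: the function names are hypothetical, the return series are made up, and the annualisation by sqrt(12) with a zero risk-free rate is an assumption rather than something the page states.

```python
import math
import statistics

def annualised_sharpe(monthly_returns, periods_per_year=12):
    """Annualised Sharpe ratio; assumes a zero risk-free rate."""
    mean = statistics.fmean(monthly_returns)
    stdev = statistics.stdev(monthly_returns)  # sample standard deviation
    return mean / stdev * math.sqrt(periods_per_year)

def paired_t_statistic(cohort, benchmark):
    """t-statistic and degrees of freedom for the mean monthly
    cohort-minus-benchmark difference."""
    if len(cohort) != len(benchmark):
        raise ValueError("series must cover the same snapshots")
    diffs = [c - b for c, b in zip(cohort, benchmark)]
    n = len(diffs)
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Made-up monthly returns, for illustration only.
cohort = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01]
benchmark = [0.01, 0.00, 0.02, 0.01, -0.01, 0.01]

sharpe_lift = annualised_sharpe(cohort) - annualised_sharpe(benchmark)
t_stat, dof = paired_t_statistic(cohort, benchmark)
print(f"Sharpe lift {sharpe_lift:+.3f}, t = {t_stat:.3f}, dof = {dof}")
```

The two-sided p-value follows from the t-distribution with the returned degrees of freedom; a statistics library would normally supply it. Note that the paired test addresses the mean return difference, while the Sharpe lift is reported alongside it as the effect size.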
Cumulative monthly group means (vs benchmark)
Compounded equal-weight monthly mean returns for each cohort, anchored at 2020-12-31. The three curves move together, and the headline paired t-test confirms the visual: no persistent separation. Note that near_critical exhibits the largest drawdowns (2022 tightening; H1 2025 vol shock).
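For readers working from the raw data, the compounding and drawdown calculations behind a chart like this can be sketched as follows. The function names and the example returns are hypothetical, and the sketch assumes each input is a simple (not log) monthly return.

```python
def cumulative_curve(monthly_returns, start=1.0):
    """Compound simple monthly returns into a growth-of-1 curve."""
    curve = [start]
    for r in monthly_returns:
        curve.append(curve[-1] * (1.0 + r))
    return curve

def max_drawdown(curve):
    """Largest peak-to-trough decline, as a negative fraction."""
    peak = curve[0]
    worst = 0.0
    for value in curve:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst

# Made-up monthly group means, for illustration only.
example_returns = [0.10, -0.20, 0.05]
curve = cumulative_curve(example_returns)
print(curve, max_drawdown(curve))
```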
Raw data: /api/backtest-cumulative · /api/backtest-result
Pre-registered limitations
These caveats were known before the test ran. They define what this result can and cannot say.
- Label leakage risk. All 500 Wave 1 StructTuples were classified once at 2026-05-14, then applied retroactively to every snapshot from 2020-12 onward. The model may have used information unavailable at the historical snapshot date. A per-snapshot label refresh (re-classify each company as of each T) is the strict-causality version of this test and remains the single most expensive — and most likely to flip the result — experiment we have not yet run.
- Russell 1000 supplement is a curated 501-name sample, not the official R1000. FTSE Russell licenses membership; we cannot redistribute the full list. Our supplement is sector-balanced and large/mid-cap, but it is not point-in-time R1000 reconstitution. True point-in-time membership would mitigate this.
- Survivorship bias. 74/1001 tickers were dropped due to delisting / yfinance edge cases. We did not include historically-listed-then-delisted tickers in our universe, biasing benchmark return upward by ~1–2% annually.
- No Phase Detector labels on the R1000 supplement. Only the 500 SP500 tickers carry LLM critical_point_state labels. The heuristic price-regime cohort uses a 90-day vol + 180-day drawdown placebo, which is a baseline, not a Phase Detector test.
- Horizon sweep was minimal. 1-month and 6-month forward holds reported; no full 3/12/24-month grid. Any post-hoc sweep (horizon × threshold × sector × dynamics-family) needs Benjamini-Hochberg correction.
- No transaction costs, borrow costs, or sector neutralisation. Pure equal-weighted long-only cohort means. Real portfolio implementations would face additional decay.
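The label leakage caveat is easiest to see as an as-of lookup. The sketch below assumes a hypothetical data layout, a per-ticker list of (classification date, label) pairs sorted by date, and a hypothetical helper name. Under strict causality, a snapshot at T may only use labels classified on or before T.

```python
from bisect import bisect_right
from datetime import date

def label_as_of(history, snapshot):
    """Return the most recent label classified on or before `snapshot`,
    or None if no label existed yet. `history` is a list of
    (classified_on, label) tuples sorted by date (assumed layout)."""
    dates = [classified_on for classified_on, _ in history]
    i = bisect_right(dates, snapshot)
    return history[i - 1][1] if i > 0 else None

# A single classification on 2026-05-14, as in the current backtest.
history = [(date(2026, 5, 14), "near_critical")]

print(label_as_of(history, date(2020, 12, 1)))  # no label existed yet
print(label_as_of(history, date(2026, 6, 1)))   # label is available
```

With one classification date, a strict as-of lookup returns no label for any historical snapshot, which is why the current test applies labels retroactively and why a per-snapshot refresh is needed to remove the leakage.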
What’s next
The null result is informative but specification-bound. Per W7-D, we proceed with structured-research-narrative positioning. Backlog (priority-ranked, cheapest first) for v0.2 if positioning traction warrants:
- Sector-stratified test (cheap, 1 day). Run the paired t-test within each GICS sector. The signal may live in healthcare or tech and be diluted in financials/utilities. Benjamini-Hochberg correction over 11 sectors.
- dynamics_family stratified test (cheap, 1 day). 9 families × 2 cohorts = 18 tests, BH-adjusted. The signal may live in motter_lai_cascade or scheffer_fold companies specifically.
- Confidence-filtered cohort (cheap, 0.5 day). Restrict near_critical to confidence ≥ 0.8 labels only (54% of universe). Reduces cohort size but may sharpen signal.
- Per-snapshot label refresh (expensive, $500–$1k LLM cost, 2–3 days). Re-classify each company at each historical T using only information available at that T. Removes the leakage risk noted above. This is the single most likely experiment to flip the result.
- Russell 2000 universe (medium cost). True small-cap is where phase transitions should bite. R2000 = 2000 names, fetch cost ~10 min, label cost ~$2–$5k if we LLM-classify the new 2000 tickers.
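The Benjamini-Hochberg correction named for the stratified tests can be sketched as below. This is a plain step-up implementation with a hypothetical function name; the 11 p-values are made up to stand in for one test per GICS sector and are not results.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    while controlling the false discovery rate at `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

# Made-up p-values, one per sector, for illustration only.
sector_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06,
            0.074, 0.205, 0.212, 0.216, 0.5]
rejected = benjamini_hochberg(sector_p)
print(sum(rejected), "of", len(sector_p), "sectors survive correction")
```

In this made-up example five sectors have p < 0.05 before correction, but only two survive it, which is the kind of shrinkage any post-hoc sweep should expect.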
Method, label definitions, and the raw Wave 1 StructTuples are on the methodology page. Backtest code is open-source in the GitHub repository. This page is a research preview; nothing here is investment advice.
Generated at 2026-05-14T19:42:39.064135+00:00 · backtest version 0.1-1000ticker · mode walk_forward