Phase Detector
Transparency report v0.1 · 2026-05-15 · Last updated 2026-05-14

Walk-forward backtest: 927 of 1001 tickers fetched (500 SP500 + 501 Russell 1000 supplement)

We pre-registered a single hypothesis: companies labelled near_critical by the Phase Detector should produce a different risk-adjusted forward return than the rest of a ~1000-ticker U.S. equity universe across a 5-year walk-forward (2020-2025). The result: alpha NOT detected at scale, with Sharpe lift = -0.072 (p = 0.569). Per W7-D Track A, this confirms the pivot to structured-research-narrative positioning. We publish the null openly.

Headline numbers (monthly forward hold)

Sharpe lift vs benchmark: -0.072 (near_critical − benchmark EW)
p-value (paired t): 0.569 (two-sided · 59 months)
benchmark Sharpe: 0.77 (annualised EW universe)
snapshots: 59 (monthly · 2020-12-01 → 2025-12-01)
near_critical cohort: 65 (SP500 tickers · LLM-labelled)
other (stable) cohort: 434 (far_from + post_critical)
Sharpe, near_critical: 0.70 (mean 1.38%/mo · σ 6.8%)
Sharpe, other (stable): 0.93 (mean 1.22%/mo · σ 4.6%)
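
As a sanity check on the cohort figures above, the annualised Sharpe ratios can be approximately recovered from the displayed monthly mean and σ. This is a minimal sketch, assuming a zero risk-free rate and √12 annualisation; neither assumption is stated in the report, and the function name is illustrative only.

```python
import math


def annualised_sharpe(mean_monthly: float, sigma_monthly: float,
                      periods_per_year: int = 12) -> float:
    """Scale a per-period Sharpe ratio to an annual one.

    Assumptions of this sketch (not confirmed by the report):
    zero risk-free rate and sqrt(periods_per_year) scaling.
    """
    return (mean_monthly / sigma_monthly) * math.sqrt(periods_per_year)


# Displayed (rounded) inputs from the headline numbers.
near_critical = annualised_sharpe(0.0138, 0.068)
other_stable = annualised_sharpe(0.0122, 0.046)

print(f"near_critical: {near_critical:.2f}")
print(f"other (stable): {other_stable:.2f}")
```

Under these assumptions the near_critical figure comes out at about 0.70 and the other (stable) figure at about 0.92, close to the reported 0.93; a gap of that size is what rounding of the displayed mean and σ could produce.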
Verdict · W7-D Track A month-3 decision gate
Sharpe lift -0.072 · p = 0.569 → NULL hypothesis NOT rejected

Across 59 monthly snapshots from 2020-12-01 to 2025-12-01 (927/1001 tickers fetched), the Phase Detector’s near_critical cohort (Sharpe 0.70) underperforms the equal-weight benchmark (Sharpe 0.77) by 0.072 Sharpe units, well below the W7-D +0.5 Sharpe-lift threshold that would justify alpha-screener positioning. The other (stable) cohort marginally outperforms the benchmark (Sharpe 0.93, lift +0.155, p = 0.801), but no comparison clears α = 0.05. The headline test (paired t-test on monthly cohort-minus-benchmark differences) does not reject the null: on this specification, we find no evidence that the Phase Detector label carries tradable directional information. Per the W7-D pre-committed pivot plan, we proceed with structured-research-narrative positioning, with this backtest result published as the credibility moat.
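
The headline test described above, a paired t-test on monthly cohort-minus-benchmark differences, can be sketched as follows. This is an illustration on made-up toy returns, not the published backtest code. To stay within the standard library the p-value uses a normal approximation to the t distribution, which is an assumption of this sketch; an exact Student-t p-value (for example via `scipy.stats.ttest_rel`) would be the usual choice.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t(cohort, benchmark):
    """Paired t-statistic on per-month (cohort - benchmark) differences.

    Returns (t_stat, degrees_of_freedom, approx_two_sided_p).
    The p-value is a normal approximation, not an exact Student-t value.
    """
    if len(cohort) != len(benchmark):
        raise ValueError("series must cover the same months")
    diffs = [c - b for c, b in zip(cohort, benchmark)]
    n = len(diffs)
    std_err = stdev(diffs) / math.sqrt(n)  # sample std (n - 1 denominator)
    t_stat = mean(diffs) / std_err
    p_approx = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, n - 1, p_approx


# Toy monthly returns (hypothetical, for illustration only).
cohort_returns = [0.02, -0.01, 0.03, 0.00, 0.01]
benchmark_returns = [0.01, 0.00, 0.02, 0.01, 0.00]

t_stat, dof, p_approx = paired_t(cohort_returns, benchmark_returns)
print(f"t = {t_stat:.3f}, df = {dof}, p (normal approx) = {p_approx:.3f}")
```

Pairing by month matters here: both series share the same market months, so differencing removes common market moves before the test is applied.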

Cumulative monthly group means (vs benchmark)

Compounded equal-weight monthly mean returns for each cohort, anchored at 2020-12-31. The three curves move together, and the headline paired t-test is consistent with the visual: no persistent separation. Note that near_critical exhibits the largest drawdowns (2022 tightening; H1 2025 vol shock).
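
For readers working from the raw data, the compounding behind these curves and the drawdown comparison can be sketched as below. The function names and toy returns are hypothetical; the shape of the API responses is not assumed here.

```python
from itertools import accumulate


def cumulative_growth(monthly_means, start=1.0):
    """Compound monthly mean returns into a growth curve anchored at `start`."""
    return list(accumulate(monthly_means,
                           lambda level, r: level * (1.0 + r),
                           initial=start))


def max_drawdown(curve):
    """Largest peak-to-trough decline of a growth curve, as a fraction of the peak."""
    peak = curve[0]
    worst = 0.0
    for level in curve:
        peak = max(peak, level)
        worst = max(worst, 1.0 - level / peak)
    return worst


# Toy monthly mean returns (hypothetical, for illustration only).
curve = cumulative_growth([0.10, -0.50, 0.20])
drawdown = max_drawdown(curve)
print(curve)
print(f"max drawdown: {drawdown:.0%}")
```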

Raw data: /api/backtest-cumulative · /api/backtest-result

Pre-registered limitations

These caveats were known before the test ran. They define what this result can and cannot say.

  1. Label leakage risk. All 500 Wave 1 StructTuples were classified once, on 2026-05-14, then applied retroactively to every snapshot from 2020-12 onward. The model may have used information unavailable at the historical snapshot date. A per-snapshot label refresh (re-classify each company as of each T) is the strict-causality version of this test. It remains the single most expensive experiment we have not yet run, and the one most likely to flip the result.
  2. Russell 1000 supplement is a curated 501-name sample, not the official R1000. FTSE Russell licenses membership; we cannot redistribute the full list. Our supplement is sector-balanced and large/mid-cap, but it is not point-in-time R1000 reconstitution. True point-in-time membership would mitigate this.
  3. Survivorship bias. 74/1001 tickers were dropped due to delisting / yfinance edge cases. We did not include historically-listed-then-delisted tickers in our universe, biasing benchmark return upward by ~1–2% annually.
  4. No LLM labels on the R1000 supplement. Only the 500 SP500 tickers carry LLM critical_point_state labels. The heuristic price-regime cohort uses a 90-day volatility + 180-day drawdown placebo, which is a baseline, not a Phase Detector test.
  5. Horizon sweep was minimal. 1-month and 6-month forward holds reported; no full 3/12/24-month grid. Any post-hoc sweep (horizon × threshold × sector × dynamics-family) needs Benjamini-Hochberg correction.
  6. No transaction costs, borrow costs, or sector neutralisation. Pure equal-weighted long-only cohort means. Real portfolio implementations would face additional decay.
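
To make the last caveat concrete, the "pure equal-weighted long-only cohort means" for a single snapshot can be sketched as below. This is an illustration under assumptions, not the published backtest code: the tickers, returns, and function name are hypothetical, and only the label values (near_critical, far_from, post_critical) come from this report. No costs or sector weights enter the calculation, which is exactly the simplification the caveat describes.

```python
from statistics import mean


def cohort_means(forward_returns, labels, target="near_critical"):
    """Equal-weight mean forward return per group for one snapshot.

    forward_returns: ticker -> forward return over the hold period
    labels:          ticker -> label (unlabelled tickers are simply absent)
    Unlabelled tickers count toward the benchmark only.
    """
    cohort = [r for t, r in forward_returns.items() if labels.get(t) == target]
    other = [r for t, r in forward_returns.items()
             if t in labels and labels[t] != target]
    everyone = list(forward_returns.values())
    return {
        "cohort": mean(cohort) if cohort else None,
        "other": mean(other) if other else None,
        "benchmark": mean(everyone) if everyone else None,
    }


# Hypothetical tickers and returns, for illustration only.
snapshot_returns = {"AAA": 0.04, "BBB": 0.00, "CCC": 0.02, "DDD": -0.02}
snapshot_labels = {"AAA": "near_critical", "BBB": "far_from",
                   "CCC": "post_critical"}  # DDD carries no label

result = cohort_means(snapshot_returns, snapshot_labels)
print(result)
```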

What’s next

The null result is informative but specification-bound. Per W7-D, we proceed with structured-research-narrative positioning. Backlog (priority-ranked, cheapest first) for v0.2 if positioning traction warrants:

  1. Sector-stratified test (cheap, 1 day). Run the paired t-test within each GICS sector. The signal may live in healthcare or tech and be diluted in financials/utilities. Benjamini-Hochberg correction over 11 sectors.
  2. dynamics_family stratified test (cheap, 1 day). 9 families × 2 cohorts = 18 tests, BH-adjusted. The signal may live in motter_lai_cascade or scheffer_fold companies specifically.
  3. Confidence-filtered cohort (cheap, 0.5 day). Restrict near_critical to confidence ≥ 0.8 labels only (54% of universe). Reduces cohort size but may sharpen signal.
  4. Per-snapshot label refresh (expensive, $500–$1k LLM cost, 2–3 days). Re-classify each company at each historical T using only information available at that T. Removes the leakage risk noted above. This is the single most likely experiment to flip the result.
  5. Russell 2000 universe (medium cost). True small-cap is where phase transitions should bite. R2000 = 2000 names, fetch cost ~10 min, label cost ~$2k–$5k if we LLM-classify the new 2000 tickers.
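
The Benjamini-Hochberg correction named in backlog items 1 and 2 can be sketched as follows. The p-values are made up to stand in for 11 sector-level tests, and the false-discovery level q = 0.05 is an assumption of this sketch rather than a pre-registered choice.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, aligned with p_values, marking which
    hypotheses are rejected at false-discovery rate q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for 11 sector-level tests (illustration only).
sector_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060,
            0.074, 0.205, 0.212, 0.216, 0.900]
rejected = benjamini_hochberg(sector_p, q=0.05)
print(sum(rejected), "of", len(sector_p), "tests survive correction")
```

In this toy input, five p-values fall below an uncorrected 0.05 but only the two smallest survive the correction, which is why any stratified follow-up needs the adjustment before a sector-level "signal" is reported.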

Method, label definitions, and the raw Wave 1 StructTuples are on the methodology page. Backtest code is open-source in the GitHub repository. This page is a research preview; nothing here is investment advice.

Generated at 2026-05-14T19:42:39.064135+00:00 · backtest version 0.1-1000ticker · mode walk_forward