# Backtesting Notes

**Companion to:** `spec.md`
**Scope:** Historical evaluation of the 10x spread screener. Separate from live scanning.

---

## 1. Why backtest separately

Live scanning is a feed-forward decision: pull current chain, compute features against current state, emit candidates. Backtesting reverses the direction: at every historical timestamp, simulate what the screener would have output and what the realized P&L of the recommended trades would have been.

The two flows have **different data requirements, different failure modes, and different validation criteria**. Mixing them in one code path is how look-ahead bias creeps in.

---

## 2. Data requirements (point-in-time)

Every adapter call MUST accept `as_of_timestamp` and return data as it existed at that timestamp. No exceptions. Concretely:

| Data | Source | Point-in-time guarantee |
|---|---|---|
| Option chains (bid/ask/IV/greeks) | ORATS daily aggregates, or your own captured Tradier snapshots | Daily close minimum; intraday if available |
| Underlying daily bars | Tradier history, Polygon, or local cache | Must be adjusted for splits + dividends consistent with as_of |
| Skew history | ORATS hist/strikes | Already point-in-time; verify trade_date matches |
| Event calendar | SEC EDGAR submissions + earnings vendor | Use filing_date <= as_of_timestamp |
| Realized post-event moves | Computed from underlying bars | Use bar[event_date] -> bar[next trading day after event_date] |

**Survivorship handling:** the universe at any historical timestamp may differ from today's universe. Pre-build a delisted-tickers calendar and include ticker X in a run only for dates where as_of < X's delisting date; exclude it from that date on.
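
The delisting-aware lookup can be sketched as below. This is a minimal illustration, not the screener's actual adapter: the `Listing` record, its `listed_on` field, and the sample symbols are assumptions introduced here (the notes only mention a delisted-tickers calendar).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Listing:
    """Hypothetical calendar row: one ticker's tradable lifetime."""
    symbol: str
    listed_on: date
    delisted_on: Optional[date] = None  # None means still trading


def universe_as_of(listings: list[Listing], as_of: date) -> list[str]:
    """Return the symbols that were tradable on `as_of`.

    A ticker is included only while as_of < its delisting date,
    which mirrors the survivorship rule stated above.
    """
    return sorted(
        row.symbol
        for row in listings
        if row.listed_on <= as_of
        and (row.delisted_on is None or as_of < row.delisted_on)
    )


# Made-up calendar for illustration only.
listings = [
    Listing("AAA", date(2015, 1, 2)),
    Listing("BBB", date(2018, 6, 1), delisted_on=date(2023, 3, 15)),
    Listing("CCC", date(2024, 2, 1)),
]
print(universe_as_of(listings, date(2022, 6, 30)))
print(universe_as_of(listings, date(2024, 6, 28)))
```

A 2022 run sees `BBB` even though it no longer trades today, and does not see `CCC`, which had not listed yet.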

---

## 3. Walk-forward protocol

```python
# Sketch only: business_days, universe_as_of, screener, paper_trade,
# and alert_threshold are supplied by the backtest harness.
for asof_ts in business_days(start_date, end_date):
    candidates = screener.run({
        "universe": universe_as_of(asof_ts),
        "modes": ["event_mode", "momentum_mode"],
        "dteWindow": [21, 90],
        "spreadTypes": ["bull_call", "bear_put"],
        "asOfTimestamp": asof_ts,
    })
    for c in candidates:
        if c.total_score >= alert_threshold:
            paper_trade(c, fill_model="conservative")
```

The screener must accept `asOfTimestamp` (already specified in `ScreenerRunRequest`) and every adapter must respect it.

### Fill model — conservative

Real fills are worse than mid. For a long debit spread, the backtest should pay above mid on entry and receive below mid on exit:

```
entry_fill = entry_mid + 0.5 * sum_half_spreads_at_entry
             // pay mid plus 50% of the summed leg half-spreads
exit_fill = exit_mid - 0.5 * sum_half_spreads_at_exit
            // sell at mid minus 50% of the summed leg half-spreads
fees = 4 * cfg.structural.fees_per_contract
       // two legs x two transactions (open and close)
realized_pnl = (exit_fill - entry_fill) - fees
               // slippage is already inside both fills; express fees in the same units as the fills
```

Track three fill-cost scenarios:
- **Optimistic:** mid + 0.25 half-spread (matches `entry_debit_est`)
- **Conservative:** mid + 0.50 half-spread (recommended for performance reporting)
- **Pessimistic:** ask (worst-case)

Report all three. Real-world P&L will typically fall between the conservative and pessimistic scenarios on most names.
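
The three entry-cost scenarios can be computed from leg quotes roughly as follows. This is a sketch under stated assumptions: the `Quote` type and `entry_debits` function are hypothetical names, "half-spread" is read as the sum of both legs' half-spreads, and the pessimistic case is read as buying the long leg at its ask and selling the short leg at its bid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Hypothetical single-leg quote, per share."""
    bid: float
    ask: float

    @property
    def mid(self) -> float:
        return (self.bid + self.ask) / 2

    @property
    def half_spread(self) -> float:
        return (self.ask - self.bid) / 2


def entry_debits(long_leg: Quote, short_leg: Quote) -> dict[str, float]:
    """Entry debit per share for a debit vertical under each fill scenario."""
    spread_mid = long_leg.mid - short_leg.mid
    sum_half = long_leg.half_spread + short_leg.half_spread
    return {
        "optimistic": round(spread_mid + 0.25 * sum_half, 4),
        "conservative": round(spread_mid + 0.50 * sum_half, 4),
        # Assumption: worst case is long leg at ask, short leg at bid.
        "pessimistic": round(long_leg.ask - short_leg.bid, 4),
    }


# Made-up quotes for illustration only.
debits = entry_debits(Quote(bid=5.00, ask=5.40), Quote(bid=2.90, ask=3.10))
print(debits)
```

Under these assumptions the pessimistic debit equals the spread mid plus the full summed half-spread, so the three scenarios sit at 25%, 50%, and 100% of the same penalty.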

---

## 4. Exit policy (for backtest)

Two exit policies, run both:

### Policy A: Hold to expiration
- Exit at the close on the spread's expiration date.
- Bull call P&L = max(0, min(width, spot_at_exp - long_strike)) - entry_debit
- Bear put P&L = max(0, min(width, long_strike - spot_at_exp)) - entry_debit
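
The hold-to-expiration payoff for both spread types can be written as one small function. This is an illustrative sketch: `expiry_pnl` and its parameter names are assumptions, values are per share, and fees are ignored.

```python
def expiry_pnl(
    spread_type: str,
    long_strike: float,
    short_strike: float,
    spot_at_exp: float,
    entry_debit: float,
) -> float:
    """Per-share P&L of a debit vertical held to expiration (fees ignored)."""
    width = abs(short_strike - long_strike)
    if spread_type == "bull_call":
        intrinsic = spot_at_exp - long_strike
    elif spread_type == "bear_put":
        # Mirror image: the long put gains as spot falls below its strike.
        intrinsic = long_strike - spot_at_exp
    else:
        raise ValueError(f"unsupported spread_type: {spread_type}")
    # Payoff is floored at zero and capped at the strike width.
    return max(0.0, min(width, intrinsic)) - entry_debit


# 920/950 bull call bought for 2.55 (illustrative numbers).
print(expiry_pnl("bull_call", 920, 950, 1000, 2.55))  # capped at width
print(expiry_pnl("bull_call", 920, 950, 930, 2.55))   # partially in the money
print(expiry_pnl("bull_call", 920, 950, 900, 2.55))   # expires worthless
```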

### Policy B: Profit-target / stop-loss / time-decay
- Profit target: exit when spread mid >= 2.0 * entry_debit (configurable)
- Stop loss: exit when spread mid <= 0.30 * entry_debit
- Time exit: 5 calendar days before expiration if neither hit
- Whichever fires first
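
A sketch of how the three Policy B triggers might be evaluated against a series of daily spread marks. The function name, the `(date, spread_mid)` input shape, and the daily granularity are assumptions for illustration; when two triggers coincide on one day, this version checks profit target, then stop loss, then time exit.

```python
from datetime import date, timedelta
from typing import Optional


def policy_b_exit(
    marks: list[tuple[date, float]],
    entry_debit: float,
    expiration: date,
    profit_mult: float = 2.0,
    stop_mult: float = 0.30,
    time_exit_days: int = 5,
) -> Optional[tuple[date, str, float]]:
    """Return (exit_date, exit_reason, spread_mid) for the first trigger hit.

    `marks` must be sorted by date ascending. Returns None if no trigger
    fires within the supplied marks.
    """
    time_exit_date = expiration - timedelta(days=time_exit_days)
    for day, mid in marks:
        if mid >= profit_mult * entry_debit:
            return day, "profit_target", mid
        if mid <= stop_mult * entry_debit:
            return day, "stop_loss", mid
        if day >= time_exit_date:
            return day, "time_exit", mid
    return None


# Made-up marks for illustration only.
marks = [
    (date(2026, 4, 16), 2.40),
    (date(2026, 4, 17), 3.10),
    (date(2026, 4, 20), 5.20),
    (date(2026, 4, 21), 6.00),
]
result = policy_b_exit(marks, entry_debit=2.55, expiration=date(2026, 5, 15))
print(result)
```

The mark returned here is the spread mid at the trigger; the fill model still applies on top of it when computing realized P&L.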

Most spread strategies in the literature use something like Policy B, because real positions are usually closed before expiration. Always report Policy A alongside it as the unmanaged hold-to-expiration benchmark.

---

## 5. Metrics to compute per run

For each (mode, exit_policy, fill_scenario) combination:

- **Trade count**, hit rate (wins/total)
- **Average P&L per trade** (in dollars and in multiples of entry_debit)
- **Median P&L per trade** (calendars + verticals are right-skewed; median tells the central tendency)
- **Hit rate of 2x+, 5x+, 10x+ outcomes** — the key benchmark for a 10x screener
- **Maximum drawdown** of equity curve
- **Sharpe ratio** of trade-level returns (annualized by sqrt(48), assuming an average of 4 trades / month)
- **Sortino ratio** (downside-only)
- **Expectancy** = hit_rate * avg_win - (1 - hit_rate) * avg_loss, with avg_loss as a positive magnitude
- **Profit factor** = sum(wins) / sum(|losses|)
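
Several of these metrics can be computed from a list of per-trade multiples as sketched below. The function name and return shape are assumptions; break-even trades are counted as losses, downside deviation is measured against a zero target, and `trades_per_year=48` encodes the 4-trades-per-month annualization.

```python
import math
from statistics import mean, median, stdev


def trade_metrics(multiples: list[float], trades_per_year: int = 48) -> dict[str, float]:
    """Summarize per-trade P&L expressed in multiples of entry debit.

    Needs at least two trades with non-identical results so that the
    sample standard deviation is defined and non-zero.
    """
    wins = [m for m in multiples if m > 0]
    losses = [-m for m in multiples if m <= 0]  # positive loss magnitudes
    hit_rate = len(wins) / len(multiples)
    avg_win = mean(wins) if wins else 0.0
    avg_loss = mean(losses) if losses else 0.0
    total_loss = sum(losses)
    annualize = math.sqrt(trades_per_year)
    # Downside deviation: root-mean-square of negative returns only.
    downside_dev = math.sqrt(mean([min(m, 0.0) ** 2 for m in multiples]))
    avg = mean(multiples)
    return {
        "hit_rate": hit_rate,
        "avg_multiple": avg,
        "median_multiple": median(multiples),
        "expectancy": hit_rate * avg_win - (1 - hit_rate) * avg_loss,
        "profit_factor": sum(wins) / total_loss if total_loss > 0 else math.inf,
        "sharpe": avg / stdev(multiples) * annualize,
        "sortino": avg / downside_dev * annualize if downside_dev > 0 else math.inf,
    }


# Made-up trade results for illustration only.
m = trade_metrics([-1.0, -1.0, -0.5, 0.8, 3.9])
print(m)
```

With avg_loss defined as a magnitude, expectancy is algebraically the same as the mean P&L per trade, which makes a convenient consistency check.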

### Distribution to report

A long debit vertical with high ten_x_room has fat-tailed returns. Median P&L is often slightly negative; the mean is positive only because a small number of trades return 5x or 10x. Always:

1. Report median + mean.
2. Report the *single best trade's* contribution to total P&L.
3. Re-compute total P&L with the top 5% of trades removed. If that flips total P&L negative, the strategy is *power-law dependent*, and confidence intervals need Glasserman or bootstrap methods rather than a normal approximation.
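
The tail-dependence checks above, plus a simple percentile bootstrap for the mean, can be sketched as follows. The function names are hypothetical, the percentile bootstrap is one common choice rather than a prescribed method, and the sample P&L list is made up.

```python
import random
from statistics import mean


def best_trade_share(pnls: list[float]) -> float:
    """Fraction of total P&L contributed by the single best trade."""
    return max(pnls) / sum(pnls)  # undefined if total P&L is exactly zero


def total_without_top(pnls: list[float], frac: float = 0.05) -> float:
    """Total P&L after dropping the top `frac` of trades (at least one)."""
    n_drop = max(1, round(frac * len(pnls)))
    return sum(sorted(pnls)[: len(pnls) - n_drop])


def bootstrap_mean_ci(
    pnls: list[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for mean P&L per trade."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(pnls, k=len(pnls))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up results: 18 losers, one small winner, one 21x winner.
pnls = [-1.0] * 15 + [-0.5] * 3 + [1.5, 21.0]
print(sum(pnls), total_without_top(pnls), best_trade_share(pnls))
lo, hi = bootstrap_mean_ci(pnls)
print(lo, hi)
```

In this sample, removing the single top trade turns a positive total into a negative one, which is exactly the power-law dependence the list warns about.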

---

## 6. Multiple-comparisons hazard

If you grid-search thresholds (min_ten_x_room, ff_threshold, skew_lookback, etc.) and report the best combination's backtest, you've over-fit. Defenses:

- Reserve a 20% out-of-sample window before tuning starts, and never look at it during tuning
- Use rolling cross-validation (train on 2022, test on 2023; train on 2023, test on 2024)
- Set thresholds from theory (10.5 is the math; 12.0 is a safety margin), not from optimization
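
The first two defenses can be made mechanical with helpers along these lines. Both function names are hypothetical, and placing the holdout at the end of the sample is an assumption (the notes only say to reserve 20%).

```python
import math
from typing import Sequence


def reserve_holdout(dates: Sequence, frac: float = 0.20) -> tuple[list, list]:
    """Split dates into (tuning, holdout); the holdout is the final `frac`."""
    n_holdout = math.ceil(len(dates) * frac)
    cut = len(dates) - n_holdout
    return list(dates[:cut]), list(dates[cut:])


def walk_forward_splits(periods: Sequence, train_size: int = 1) -> list[tuple[list, object]]:
    """Pair each training window with the single period that follows it."""
    return [
        (list(periods[i : i + train_size]), periods[i + train_size])
        for i in range(len(periods) - train_size)
    ]


tuning, holdout = reserve_holdout(list(range(10)))
print(tuning, holdout)

# Reproduces the example above: train 2022 -> test 2023, train 2023 -> test 2024.
splits = walk_forward_splits([2022, 2023, 2024])
print(splits)
```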

---

## 7. Look-ahead pitfalls to actively guard against

| Pitfall | Defense |
|---|---|
| Using same-day OI in skew_percentile lookup | OI is lagged 1 day in the data; type-enforce the field name `open_interest_lagged` |
| Computing realized vol over a window that includes future bars | Always slice to bars where `bar.date < asof_ts` |
| Event_gap using future event date that wasn't confirmed at asof_ts | Filter event calendar to `filing_date <= asof_ts` |
| Skew history including the trade-decision day's own skew | History query must be `asof_date < asof_ts`, strict |
| Using today's ticker universe when backtesting 2022 | Use delisting-aware universe-as-of |

Add an integration test that asserts no adapter returns data timestamped after `asof_ts`, and that history adapters (skew, realized vol) return nothing with `asof_date >= asof_ts`.
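
One possible shape for that assertion is sketched below. The helper name, the `LookAheadError` type, and the dict-based record shape are assumptions; the `strict` flag is added so that adapters which legitimately return the decision-time snapshot (for example, the option chain at `asof_ts`) can be checked with `>` instead of `>=`.

```python
from datetime import datetime, timezone


class LookAheadError(AssertionError):
    """Raised when an adapter returns data it could not have known at asof_ts."""


def assert_point_in_time(records, asof_ts, *, strict=True, field="asof_date"):
    """Fail if any record's timestamp is at/after (strict) or after asof_ts."""
    for rec in records:
        ts = rec[field]
        leaked = ts >= asof_ts if strict else ts > asof_ts
        if leaked:
            raise LookAheadError(
                f"{field}={ts.isoformat()} leaks past asof_ts={asof_ts.isoformat()}"
            )


asof = datetime(2024, 3, 4, 21, 0, tzinfo=timezone.utc)
history = [
    {"asof_date": datetime(2024, 2, 29, 21, 0, tzinfo=timezone.utc)},
    {"asof_date": datetime(2024, 3, 1, 21, 0, tzinfo=timezone.utc)},
]
assert_point_in_time(history, asof)  # passes: everything is strictly earlier

snapshot = [{"asof_date": asof}]
assert_point_in_time(snapshot, asof, strict=False)  # same-timestamp data allowed
try:
    assert_point_in_time(snapshot, asof)  # strict mode rejects the decision day
except LookAheadError as exc:
    print("caught:", exc)
```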

---

## 8. Performance benchmarks to compare against

The screener is useful only if it outperforms naive baselines on the same universe + window. Always compute these alongside:

- **Naive baseline 1:** buy one randomly chosen 30-DTE ATM call (or put) per week. Expect a terrible Sharpe.
- **Naive baseline 2:** buy an ATM call with 60 DTE and hold for 5 trading days. Expect slightly better results.
- **Skew-only baseline:** buy the spread the screener picks but ignore all gates except liquidity. This should be worse than the full screener if the scoring weights are right.
- **Mode-shuffled control:** apply event_mode scoring to candidates that have no event. These must score lower than candidates with real events.

If your full screener doesn't beat baselines 1 + 2 on hit rate AND on multiples-of-debit terms, the scoring is wrong, not the universe.
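
The pass/fail rule in the previous paragraph can be encoded as a small check. The function name and the metric keys (`hit_rate`, `avg_multiple`) are assumptions chosen to line up with the summary fields used elsewhere in these notes; the numbers are made up.

```python
def beats_baselines(
    screener: dict[str, float],
    baselines: dict[str, dict[str, float]],
    metrics: tuple[str, ...] = ("hit_rate", "avg_multiple"),
) -> bool:
    """True only if the screener beats every baseline on every listed metric."""
    return all(
        screener[metric] > baseline[metric]
        for baseline in baselines.values()
        for metric in metrics
    )


# Illustrative numbers only, not backtest results.
screener_metrics = {"hit_rate": 0.45, "avg_multiple": 0.30}
naive = {
    "naive_1": {"hit_rate": 0.30, "avg_multiple": -0.20},
    "naive_2": {"hit_rate": 0.38, "avg_multiple": -0.05},
}
print(beats_baselines(screener_metrics, naive))
```

Requiring a win on both metrics for every baseline matches the "AND" in the rule: a higher hit rate with a lower average multiple does not count.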

---

## 9. Backtest output schema

Per-trade record:

```json
{
  "trade_id": "uuid-v4",
  "symbol": "NVDA",
  "spread_type": "bull_call",
  "mode": "event_mode",
  "asof_ts": "2026-04-15T16:00:00Z",
  "expiration": "2026-05-15",
  "long_strike": 920,
  "short_strike": 950,
  "entry_debit_optimistic": 2.34,
  "entry_debit_conservative": 2.55,
  "entry_debit_pessimistic": 3.10,
  "exit_date": "2026-05-08",
  "exit_reason": "profit_target",
  "exit_value": 12.50,
  "realized_pnl_conservative": 9.95,
  "multiple_of_debit": 3.90,
  "total_score_at_entry": 84.1,
  "rationale_tags": [...],
  "warnings": [...]
}
```

Per-run summary:

```json
{
  "run_id": "uuid-v4",
  "config_hash": "sha256:...",
  "start_date": "2022-01-01",
  "end_date": "2025-12-31",
  "universe_size": 80,
  "trades_total": 3247,
  "trades_per_mode": { "event_mode": 1820, "momentum_mode": 1427 },
  "metrics": {
    "event_mode_conservative": {
      "hit_rate": 0.45,
      "avg_multiple": 0.42,
      "median_multiple": -0.10,
      "hit_rate_2x_plus": 0.18,
      "hit_rate_5x_plus": 0.06,
      "hit_rate_10x_plus": 0.018,
      "sharpe": 0.71,
      "max_drawdown_pct": -0.34,
      "profit_factor": 1.42
    },
    "momentum_mode_conservative": { /* same shape */ }
  },
  "baselines": { /* same shape for naive baselines */ }
}
```

---

## 10. Operational notes

- **Don't run live + backtest concurrently against the same KV namespace.** Backtests can fill cache with stale data tagged by `asof_ts` that confuses the live flow.
- **Backtest runs are CPU-heavy.** A 3-year window × 80 tickers × daily granularity is ~60,000 candidate evaluations. Budget 30-60 minutes per full run; longer if pulling ORATS history without local cache.
- **Cache the ORATS skew snapshots locally.** Each unique (symbol, date, dte_bucket, delta_bucket) only needs to be fetched once across all backtest runs.
- **Version the config.** Hash the full config into `config_hash` and store it with every `run_id` so backtest results from different parameter sets don't get conflated.
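
A deterministic config hash in the `sha256:...` form shown in the per-run summary can be produced along these lines. This is a sketch: the function name and the choice of canonical JSON (sorted keys, compact separators) are assumptions, and the config keys shown are only examples.

```python
import hashlib
import json


def config_hash(cfg: dict) -> str:
    """Hash a JSON-serializable config so key order does not affect the result."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same settings, different key order: the hashes should match.
a = config_hash({"min_ten_x_room": 12.0, "dteWindow": [21, 90]})
b = config_hash({"dteWindow": [21, 90], "min_ten_x_room": 12.0})
c = config_hash({"dteWindow": [21, 90], "min_ten_x_room": 10.5})
print(a == b, a == c)
```

Canonicalizing before hashing matters because two configs that differ only in key order would otherwise look like different parameter sets.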

---

## 11. Validation milestones

Before declaring the screener "production validated":

1. ✅ Unit test coverage > 80% on `features.ts` + `scoring.ts` + `filters.ts`
2. ✅ Integration test with fixture chain reproduces a known candidate end-to-end
3. ✅ 2022-2024 backtest on top-80 universe shows hit_rate_2x_plus > 0.15 in event_mode (conservative fills)
4. ✅ Out-of-sample 2025 H1 confirms in-sample metrics within 25% degradation
5. ✅ Live shadow-run for 8 weeks where alerts fire but no trades execute; compare to manual eyeball of the same setups

Skip step 5 and you ship with no real-world calibration.
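
Milestone 4's "within 25% degradation" test can be pinned down as a one-line check. This sketch assumes a higher-is-better metric and reads degradation as a relative drop from the in-sample value; the function name and the sample numbers are illustrative.

```python
def within_degradation(
    in_sample: float, out_of_sample: float, max_degradation: float = 0.25
) -> bool:
    """True if the out-of-sample metric kept at least (1 - max_degradation) of in-sample.

    Assumes a positive, higher-is-better metric such as hit_rate_2x_plus.
    """
    return out_of_sample >= in_sample * (1 - max_degradation)


# Illustrative values only.
print(within_degradation(0.18, 0.14))  # floor is 0.135
print(within_degradation(0.18, 0.13))
```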