Why shadow-traffic parity gates?
One-line answer: unit tests verify behaviors; parity gates verify indistinguishability from the previous version. A PR that adds a feature and keeps tests green can still subtly shift logits. Parity says: prove it's the same.
Tests vs parity
| dimension | unit tests | parity gates |
|---|---|---|
| what | "does this function return the right thing" | "does gen-2 give the same bytes as gen-1" |
| shape | fixed inputs, asserted outputs | live prompt stream, byte-compare |
| coverage | one code path | emergent end-to-end behavior |
| breaking cost | cheap fix | cannot ship until green |
Unit tests run on every PR. Parity runs before every cutover.
The two harnesses
Shadow-burnin (argmax byte-compare)
benchmarks/shadow-burnin.sh fires the same prompt at /v1 (gen-1 C++ bitnet_decode on :8080) and /v2 (gen-2 Rust 1bit-server on :8180), then diffs the replies byte-for-byte. Pace: about one round every 2-3 s.
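A minimal sketch of the comparison step in one round, assuming both replies are already in hand as raw bytes. Fetching from /v1 and /v2 is left out because the request format isn't given here; `compare_replies` and the sample reply strings are hypothetical, not taken from the harness.

```python
def compare_replies(reply_v1: bytes, reply_v2: bytes) -> dict:
    """Byte-compare two replies and report where they first diverge."""
    if reply_v1 == reply_v2:
        return {"exact": True, "first_diff": None}
    # Walk the common prefix. If one reply is a strict prefix of the other,
    # the divergence sits at the end of the shorter one.
    limit = min(len(reply_v1), len(reply_v2))
    first_diff = next(
        (i for i in range(limit) if reply_v1[i] != reply_v2[i]),
        limit,
    )
    return {"exact": False, "first_diff": first_diff}


# Made-up replies, for illustration only.
same = compare_replies(b"Au", b"Au")
drift = compare_replies(b"The symbol is Au.", b"The symbol is AU.")
truncated = compare_replies(b"Au", b"Au.")
```

Recording the first differing offset alongside the pass/fail flag makes it easier to tell an early divergence from a late one when reading the log.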
- Current: 95.55% byte-exact, 14,344 of 15,012 rounds (2026-04-20)
- Drift analysis: 74.9% of all misses trace to ONE prompt: idx=7, "chemical symbol for gold". That's a single sampler delta, not a systemic divergence. Fix it and parity jumps to ~98.9%.
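The ~98.9% projection can be checked by hand. This sketch uses the counts from the `halo burnin stats` output (14344 of 15012 rounds) and the 74.9% share quoted above; it assumes every idx=7 miss would become byte-exact after the fix.

```python
total_rounds = 15012     # from `halo burnin stats`
exact_rounds = 14344
top_miss_share = 0.749   # share of misses attributed to idx=7

misses = total_rounds - exact_rounds                 # rounds that were not byte-exact
top_prompt_misses = round(misses * top_miss_share)   # misses that trace to idx=7
projected = (exact_rounds + top_prompt_misses) / total_rounds

print(f"current:   {exact_rounds / total_rounds:.2%}")
print(f"projected: {projected:.2%}")
```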
Why shadow-burnin? covers what the non-byte-exact rounds mean in more detail. Short version: sub-ULP FP16 noise propagated through 30 layers flips argmax at logit ties. Not a bug.
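A toy illustration of how half-ULP rounding can flip an argmax between two near-tied logits. It assumes the noise comes from accumulation order, which is only one possible source and is not confirmed for gen-1 vs gen-2. The logit values are invented, and FP16 rounding is emulated with the standard library's `struct` half-precision format.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum(terms) -> float:
    """Accumulate left to right, rounding to FP16 after every add."""
    acc = 0.0
    for t in terms:
        acc = to_fp16(acc + t)
    return acc


def argmax(values) -> int:
    return max(range(len(values)), key=values.__getitem__)


HALF_ULP = 2.0 ** -10            # half the FP16 spacing just above 2.0
terms = [2.0] + [HALF_ULP] * 4   # same terms, two summation orders
rival = 2.0 + 2.0 ** -9          # competing token's logit, one ULP above 2.0

logit_big_first = fp16_sum(terms)               # each small term is rounded away
logit_small_first = fp16_sum(reversed(terms))   # small terms accumulate first

pick_a = argmax([logit_big_first, rival])
pick_b = argmax([logit_small_first, rival])
print(logit_big_first, logit_small_first, pick_a, pick_b)
```

Both orders sum the same numbers, yet the winning token differs. Neither result is wrong, which is why such misses are treated as noise rather than bugs.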
PPL (distribution-level)
benchmarks/ppl-gen2.sh runs wikitext-103 perplexity on both servers.
- gen-1 baseline: 9.1607
- gen-2 current: 9.1805
- delta: +0.02, inside the ±0.05 tolerance → PASS
PPL measures the whole predicted distribution, averaged over the corpus. Shadow-burnin measures only the argmax pick. Different failure modes; we gate on both.
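The gate arithmetic above, as a sketch. `perplexity` uses the standard definition (exp of the mean negative log-likelihood per token); whether benchmarks/ppl-gen2.sh windows or strides the corpus the same way is not stated here, so treat it as an assumption. `ppl_gate` is a hypothetical helper name.

```python
import math


def perplexity(token_logprobs) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_gate(baseline: float, candidate: float, tolerance: float = 0.05):
    """Return (delta, passed) for a symmetric tolerance around the baseline."""
    delta = candidate - baseline
    return delta, abs(delta) <= tolerance


delta, passed = ppl_gate(9.1607, 9.1805)
print(f"delta {delta:+.2f} -> {'PASS' if passed else 'FAIL'}")
```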
Why we gate the gates
From CLAUDE.md §Testing:
Parity vs gen-1 is the ultimate cutover gate.
A PR that passes cargo test --workspace --release but drops parity below threshold cannot ship. The human running cutover runs:
halo burnin stats # overall byte-exact rate
halo burnin drift # what's missing and why
halo burnin recent # last N rounds
halo burnin since 2026-04-19T00:00:00Z
If any of those trend down in the 48 hours before cutover, we don't flip. Receipts matter more than vibes.
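One way to make "trend down" mechanical, as a sketch. It assumes "trending down" means the later half of the 48-hour window has a lower byte-exact rate than the earlier half. That definition, the `(timestamp, exact)` record shape, and the cutover time below are all hypothetical, not the project's documented rule.

```python
from datetime import datetime, timedelta, timezone


def exact_rate(rounds):
    """Fraction of byte-exact rounds, or None for an empty slice."""
    return sum(exact for _, exact in rounds) / len(rounds) if rounds else None


def trending_down(rounds: list, cutover: datetime,
                  window: timedelta = timedelta(hours=48)) -> bool:
    """Compare the earlier and later halves of the pre-cutover window."""
    start, mid = cutover - window, cutover - window / 2
    early = [r for r in rounds if start <= r[0] < mid]
    late = [r for r in rounds if mid <= r[0] < cutover]
    early_rate, late_rate = exact_rate(early), exact_rate(late)
    if early_rate is None or late_rate is None:
        return True  # no data in one half: refuse to flip
    return late_rate < early_rate


cutover = datetime(2026, 4, 21, tzinfo=timezone.utc)  # made-up cutover time
rounds = [
    (cutover - timedelta(hours=40), True),
    (cutover - timedelta(hours=30), True),
    (cutover - timedelta(hours=20), True),
    (cutover - timedelta(hours=10), False),
]
print(trending_down(rounds, cutover))
```

Treating missing data as a failure keeps the check conservative: an empty log blocks the flip instead of waving it through.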
The operator interface
halo burnin {stats,drift,recent,since} reads the JSONL log at ~/claude output/shadow-burnin.jsonl. State persists at ~/.local/share/1bit systems/shadow-burnin.state. Both paths contain a space, so quote them.
halo burnin stats
# bytes_exact: 14344 / 15012 (95.55%)
# top_miss: idx=7 "chemical symbol..." (75.0% of misses)
# ppl_gen1: 9.1607
# ppl_gen2: 9.1805 (Δ +0.02) PASS
The top_miss field is the actionable output. Drift on one prompt is a sampler bug. Drift on many prompts is a model bug.
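A sketch of how a top_miss figure could be derived from a JSONL log. The schema is a guess: it assumes one JSON object per line with fields `idx` (prompt index) and `exact` (bool). The real shadow-burnin.jsonl layout isn't documented here, and `summarize` is a hypothetical helper, not part of `halo`.

```python
import json
from collections import Counter


def summarize(jsonl_text: str) -> dict:
    """Compute the byte-exact rate and the prompt with the most misses."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    if not records:
        raise ValueError("empty log")
    # Count misses per prompt index (hypothetical `idx` / `exact` fields).
    misses = Counter(r["idx"] for r in records if not r["exact"])
    total_misses = sum(misses.values())
    top_idx, top_count = misses.most_common(1)[0] if misses else (None, 0)
    return {
        "exact_rate": (len(records) - total_misses) / len(records),
        "top_miss_idx": top_idx,
        "top_miss_share": top_count / total_misses if total_misses else 0.0,
    }


# Made-up sample log, for illustration only.
sample = "\n".join([
    '{"idx": 3, "exact": true}',
    '{"idx": 7, "exact": false}',
    '{"idx": 7, "exact": false}',
    '{"idx": 7, "exact": false}',
    '{"idx": 12, "exact": false}',
    '{"idx": 5, "exact": true}',
    '{"idx": 1, "exact": true}',
    '{"idx": 2, "exact": true}',
])
summary = summarize(sample)
print(summary)
```

A high `top_miss_share` points at one prompt; a flat spread across many `idx` values points at something systemic.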
Why this rigor
This is a single-box project serving requests to a single user. There's no A/B infra, no gradual rollout, no blue-green. The only way to flip gen-1 → gen-2 safely is to prove the behaviors match before flipping the route in Caddy.
We've been burned twice:
- The KV-cache reset bug (2026-04-19 SEGV): PPL passed; only burnin caught it, at round ~200.
- The RoPE convention bug (2026-04-19): PPL reported 524 on long context. Caught before any cutover attempt.
Both would have shipped if the gate were "tests pass." Parity gates are the seatbelt.
Pointers
- Shadow-burnin harness: benchmarks/shadow-burnin.sh
- PPL harness: benchmarks/ppl-gen2.sh
- Data: ~/claude output/shadow-burnin.jsonl
- Deep dive: Why shadow-burnin?
- Formal rule: CLAUDE.md §Testing