Live tok/s + PPL

What we measure and what each number means. All numbers taken on the production box (strixhalo, 100.64.0.1) unless noted.

The table

metric	value	source	what it means
Decode throughput @ L=64	83 tok/s	`bench.sh`	Short-context user-chat speed. What you feel typing a prompt.
Decode throughput @ L=1024	33 tok/s	`bench.sh`	Long-context agent speed. KV-cache bandwidth dominates here.
PPL, wikitext-103 (gen-2 Rust)	9.1805	`benchmarks/ppl-gen2.sh`	Distribution-level quality. Lower is better.
PPL, wikitext-103 (gen-1 baseline)	9.1607	historical	Reference point; gen-2 is +0.02, inside ±0.05 tolerance.
Shadow-burnin byte-exact	95.55%	`halo burnin stats`	Argmax-level parity, gen-1 vs gen-2. 14,344 rounds.
Ternary GEMV roofline	92% of LPDDR5 peak	`rocprof`	Kernel is bandwidth-bound, not compute-bound. Bytes-read reduction (Sherry) is rank-1.
Split-KV Flash-Decoding attn	6.78× vs prior @ L=2048	`benchmarks/attn_fd.sh`	Bit-exact speedup over single-block attention. Default since 2026-04-19.
Voice mouth-to-ear first audio	1.23 s	`benchmarks/voice.sh`	End-to-end: STT + LLM + TTS first chunk. 3-5× faster than naive serial loop.
Tests across 13 crates	201 passing, 0 failing	`cargo test --workspace --release`	Workspace-wide green. CI gate.
`1bit-server` binary, stripped	2.4 MB	`size target/release/1bit-server`	Static-friendly Rust binary; ships without a runtime.
Landing live tok/s	pulled from `/metrics` via `/_live/stats` SSE	`crates/1bit-landing/src/telemetry.rs`	The number you see in the hero on `https://strixhalo.local/` is no longer a static guess — it's the same `tokps_recent` the Prom scraper sees, pushed over SSE every 1.5 s.

One-liners per number

83 @ 64 / 33 @ 1024 — same kernel; the drop is pure KV-cache bandwidth. A 70B FP16 model on the same box does ~18 tok/s. See Why ternary?.
PPL 9.1805 — we re-pack Microsoft's weights (no retraining). The +0.02 delta vs gen-1 is FP16-reordering noise, not quality loss.
95.55% byte-exact — 74.9% of the remaining 4.45% is one prompt (idx=7, "chemical symbol for gold"). Fix that sampler delta and parity jumps to ~98.9%. See Why parity gates?.
92% of peak — roofline math from rocprof confirms the ternary GEMV is memory-bound. Compute-side optimization has near-zero headroom; Sherry 1.25-bit packing (bytes-read reduction) is the next 15-25% lever.
6.78× attn — one block per head → split across blocks with an online-softmax reduction. Bit-exact means PPL didn't move when we flipped the default.
1.23 s voice — whisper.cpp partials + sentence-boundary TTS streaming. Naive serial (wait-full-STT → wait-full-LLM → wait-full-TTS) is ~4-6 s.
201 tests — includes property tests on tokenizer round-trips and parity assertions on .h1b load paths.
2.4 MB — the whole serving daemon. For comparison, gen-1 Python install was ~1.2 GB.

Reproducibility notes

Box: Ryzen AI 9 HX 370 (Strix Halo / gfx1151), 128 GB LPDDR5 @ 256 GB/s, CachyOS, kernel 7.0.
Ambient: 22-24 °C room temperature, closet airflow, headless. Thermal throttle shows up above ~28 °C — numbers above are the 22 °C set.
Power: halo power balanced profile, 150 W sustained TDP, voltage curve at default. See Why halo power? for the under-volt recipe that adds ~4% throughput without stability loss.
Model: microsoft/bitnet-b1.58-2B-4T → .h1b v3, 1.8 GB on disk.
Temperature: 0 (greedy argmax) for parity + PPL; 0.7 for voice UX only.

Re-run locally with:

cd /home/bcloud/repos/halo-workspace
./benchmarks/bench.sh           # decode throughput
./benchmarks/ppl-gen2.sh        # PPL
./benchmarks/shadow-burnin.sh   # parity (long-running)
halo burnin stats               # live summary

Numbers regenerate into ~/claude output/.

Live tok/s + PPL

The table

One-liners per number

Reproducibility notes

Links