1bit.systems

Live tok/s + PPL

What we measure and what each number means. All numbers taken on the production box (strixhalo, 100.64.0.1) unless noted.

The table

| metric | value | source | what it means |
| --- | --- | --- | --- |
| Decode throughput @ L=64 | 83 tok/s | bench.sh | Short-context user-chat speed. What you feel typing a prompt. |
| Decode throughput @ L=1024 | 33 tok/s | bench.sh | Long-context agent speed. KV-cache bandwidth dominates here. |
| PPL, wikitext-103 (gen-2 Rust) | 9.1805 | benchmarks/ppl-gen2.sh | Distribution-level quality. Lower is better. |
| PPL, wikitext-103 (gen-1 baseline) | 9.1607 | historical | Reference point; gen-2 is +0.02, inside the ±0.05 tolerance. |
| Shadow-burnin byte-exact | 95.55% | halo burnin stats | Argmax-level parity, gen-1 vs gen-2, over 14,344 rounds. |
| Ternary GEMV roofline | 92% of LPDDR5 peak | rocprof | Kernel is bandwidth-bound, not compute-bound. Bytes-read reduction (Sherry) is rank-1. |
| Split-KV Flash-Decoding attn | 6.78× vs prior @ L=2048 | benchmarks/attn_fd.sh | Bit-exact speedup over single-block attention. Default since 2026-04-19. |
| Voice mouth-to-ear first audio | 1.23 s | benchmarks/voice.sh | End-to-end: STT + LLM + TTS first chunk. 3-5× faster than a naive serial loop. |
| Tests across 13 crates | 201 passing, 0 failing | cargo test --workspace --release | Workspace-wide green. CI gate. |
| 1bit-server binary, stripped | 2.4 MB | size target/release/1bit-server | Static-friendly Rust binary; ships without a runtime. |
| Landing live tok/s | pulled from /metrics via /_live/stats SSE | crates/1bit-landing/src/telemetry.rs | The number in the hero on https://strixhalo.local/ is no longer a static guess; it is the same tokps_recent the Prom scraper sees, pushed over SSE every 1.5 s. |
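The decode-throughput rows come down to simple wall-clock timing: run N sequential decode steps and divide. A minimal sketch of that measurement (the decode_one_token stub here is hypothetical; the real loop lives behind bench.sh):

```rust
use std::time::Instant;

// Hypothetical stand-in for one autoregressive decode step.
// In the real harness this would run the model forward one token.
fn decode_one_token() {
    std::hint::black_box(0u64);
}

// Time n_tokens sequential decode steps and report tokens per second.
fn tokens_per_second(n_tokens: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..n_tokens {
        decode_one_token();
    }
    n_tokens as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    let tps = tokens_per_second(1_000);
    assert!(tps > 0.0);
    println!("{tps:.1} tok/s");
}
```

Note that decode is sequential by nature, so this measures latency-bound throughput; it is not comparable to batched prefill numbers.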
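The PPL rows are the exponential of the mean per-token negative log-likelihood over the eval set. A self-contained sketch of that reduction (sample values only; the real numbers come from benchmarks/ppl-gen2.sh over wikitext-103):

```rust
// Perplexity = exp(mean negative log-likelihood), with NLL in nats.
fn perplexity(token_nll_nats: &[f64]) -> f64 {
    let mean = token_nll_nats.iter().sum::<f64>() / token_nll_nats.len() as f64;
    mean.exp()
}

fn main() {
    // Sanity check: a uniform model over a 4-token vocabulary assigns every
    // token -ln(1/4) = ln(4) nats, so its perplexity is exactly 4.
    let nll = vec![(4.0f64).ln(); 8];
    let ppl = perplexity(&nll);
    assert!((ppl - 4.0).abs() < 1e-9);
    println!("ppl = {ppl}");
}
```

This is also why a +0.02 delta on a ~9.17 baseline is small: it corresponds to a fraction of a percent shift in mean NLL.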
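The roofline row follows from the same bandwidth-bound logic: during decode, each generated token must stream the packed ternary weights (plus the KV cache) from DRAM roughly once, so utilization is achieved bytes/s over peak bytes/s. A sketch of the arithmetic with purely illustrative inputs (none of these constants are the production model's real sizes):

```rust
// Bandwidth-bound decode roofline:
//   achieved_bw  = bytes_per_token * tok_per_s
//   utilization  = achieved_bw / peak_bw
fn roofline_utilization(bytes_per_token: f64, tok_per_s: f64, peak_bw_bytes: f64) -> f64 {
    bytes_per_token * tok_per_s / peak_bw_bytes
}

fn main() {
    let bytes_per_token = 2.0e9; // illustrative: packed ternary weights + KV read
    let tok_per_s = 33.0;        // long-context decode rate from the table
    let peak_bw = 256.0e9;       // illustrative LPDDR5 peak, bytes/s
    let util = roofline_utilization(bytes_per_token, tok_per_s, peak_bw);
    assert!(util > 0.0 && util < 1.0);
    println!("roofline utilization: {:.0}%", util * 100.0);
}
```

This framing is also why the bytes-read reduction is rank-1: with compute far from the bound, the only lever on tok/s is shrinking bytes_per_token.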

Reproducibility notes

Re-run locally with:

cd /home/bcloud/repos/halo-workspace
./benchmarks/bench.sh           # decode throughput
./benchmarks/ppl-gen2.sh        # PPL
./benchmarks/shadow-burnin.sh   # parity (long-running)
halo burnin stats               # live summary

Numbers regenerate into ~/claude output/.