1bit.systems

Peak Performance Projection

Quantitative projection of 1bit.systems end-state throughput on Strix Halo (Ryzen AI MAX+ 395) once all seven planned lanes ship. Every number has a derivation or a citation. All projections carry a ±20% band unless stated otherwise.

0. Fixed constants

| constant | value | source |
|---|---|---|
| LPDDR5-8000 peak bandwidth | 256 GB/s | vendor spec, Zen5 APU |
| iGPU FP16 peak | 60 TFLOPS | Radeon 8060S, 16 WGPs @ ~2.4 GHz |
| iGPU effective (measured on ternary GEMV) | 24 TFLOPS (40% util) | bench.sh + rocprof |
| NPU INT8 peak | 50 TOPS | xrt-smi validate, IRON #55 |
| NPU effective (42% of peak, BF16 GEMM) | 21 TOPS | IRON #55, andrej comment 3706297989 |
| NPU effective, int2-via-int8 (2× pack factor) | ~42 TOPS | two ternary weights per INT8 MAC lane |
| BitNet-b1.58-2B-4T: hidden / layers / ffn | 2560 / 30 / 6912 | Microsoft HF config |
| Ternary weight footprint (core MM only) | ~630 MB | 2.4B params × 1.58 bits + packing overhead |
| .h1b on disk (full model incl. embed/tok) | 1.8 GB | Benchmarks.md |

1. Current ceiling per surface (measured 2026-04-20)

| surface | decode tok/s today | prefill tok/s today | bound by |
|---|---|---|---|
| iGPU (gfx1151) | 83 @ L=64, 33 @ L=1024 | ~220 @ L=512 (measured) | LPDDR5 bw (92% of peak) |
| CPU (32× Zen5) | 0 (idle) | 0 | not wired |
| NPU (XDNA2, 8×4 AIE2P) | 0 | 0 | IRON #93 int8 kernel not yet ported to Peano |

Bandwidth-bound ceiling on the ternary GEMV today: 256 GB/s ÷ 630 MB/tok ≈ 406 tok/s theoretical. Measured 83 tok/s @ L=64 means the reachable roofline after amortizing KV + attn + sampler is ~85–90 tok/s, consistent with Benchmarks.md. At L=1024 the KV-cache dominates: 0.92 × 256 GB/s ÷ 33 tok/s ≈ 7.1 GB read per token, essentially all KV.
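The roofline arithmetic above can be checked in a few lines; the constants come straight from the section-0 table (a sketch of the derivation, not project code):

```python
# Roofline check for the section-1 numbers (constants from the section-0 table).
BW_GBS = 256.0   # LPDDR5-8000 peak bandwidth, GB/s
UTIL = 0.92      # measured bus utilization from the rocprof roofline
W_GB = 0.63      # ternary weight footprint read per decoded token, GB

# Weights-only theoretical ceiling on the ternary GEMV.
ceiling_tps = BW_GBS / W_GB                # ≈ 406 tok/s

# Implied DRAM traffic per token at the measured 33 tok/s (L=1024):
# achieved bytes/s divided by achieved tok/s.
read_gb_per_tok = BW_GBS * UTIL / 33.0     # ≈ 7.1 GB, essentially all KV

print(f"{ceiling_tps:.0f} tok/s ceiling, {read_gb_per_tok:.1f} GB/tok @ L=1024")
```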

2. Decode ceiling math

Formula: tok/s ≤ (BW × util) / (W_bytes × actsparse_factor) where util = 0.92 from the rocprof roofline.

| config | W_bytes | actsparse | ceiling | realistic (±20%) | source |
|---|---|---|---|---|---|
| FP16 2.4B baseline | 4.80 GB | 1.0 | 49 tok/s | 35–45 | llama.cpp fp16 on same box |
| Ternary 1.58-bit (today) | 0.63 GB | 1.0 | 374 tok/s | 83 @ L=64 | bench.sh, measured |
| +Sherry 1.25-bit 3:4 | 0.50 GB (×0.79) | 1.0 | 471 tok/s | 105–125 | 1.25/1.58 bytes ratio, spike committed 2026-04-18 |
| +Sherry +activation sparsity 30% eff | 0.35 GB | 0.70 | 673 tok/s | 140–170 | 79.91% measured sparsity, 30% usable after DRAM granularity penalty |
| +Sherry +actsparse +BitNet v2 W1.58A4 | same W | same | same @ short ctx | 140–170 @ L≤256 | 2504.18415 |
| +Sherry +actsparse +A4 @ L=2048 | KV /4 | — | ~132 tok/s (vs 22 today) | 110–160 | a4 KV is 0.25× fp16 KV; fd-attn 6.78× already landed |
| +all +Medusa 1.7× accept | — | — | — | 240–290 @ L=64; 190–270 @ L=2048 | MedusaBitNet, 1.5–1.8× accepted-token range |

The 83 → ~280 tok/s path is compounding on a bandwidth-bound stack: every lane that shrinks bytes-per-token lifts the same wall. Medusa multiplies on top because it issues ≥2 tokens per weight-read pass.
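The per-row ceilings above all fall out of the single section-2 formula; a minimal sketch (function name is illustrative, not a project API):

```python
def decode_ceiling(w_gb, actsparse=1.0, bw_gbs=256.0, util=0.92):
    """Section-2 formula: tok/s <= (BW x util) / (W_bytes x actsparse_factor)."""
    return bw_gbs * util / (w_gb * actsparse)

# Reproduces the ceiling column of the table above.
fp16    = decode_ceiling(4.80)        # ≈ 49 tok/s
ternary = decode_ceiling(0.63)        # ≈ 374 tok/s
sherry  = decode_ceiling(0.50)        # ≈ 471 tok/s
sparse  = decode_ceiling(0.50, 0.70)  # ≈ 673 tok/s (0.50 GB x 0.70 = 0.35 GB effective)
```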

3. Prefill ceiling (compute-bound, NPU wins)

Prefill per-token FLOPs (fwd, attn + FFN, BitNet-2B):

F/tok ≈ 2 × (4·h² + 3·h·ffn) × layers
      = 2 × (4·2560² + 3·2560·6912) × 30
      = 2 × (26.2M + 53.1M) × 30
      ≈ 4.76 GFLOPs/tok

| surface | effective TOPS | prefill tok/s | dispatch overhead |
|---|---|---|---|
| iGPU (FP16) | 24 TFLOPS | 24e12 / 4.76e9 ≈ 5 000 tok/s | ~0.2 ms |
| NPU (INT8 straight) | 21 TOPS | 21e12 / 4.76e9 ≈ 4 400 tok/s | 2–5 ms |
| NPU (int2 via INT8 MAC) | ~42 TOPS | 42e12 / 4.76e9 ≈ 8 800 tok/s | 2–5 ms |

Crossover L* where NPU total time beats iGPU total time, solving t_npu_oh + L/npu_tps = t_igpu_oh + L/igpu_tps:

L* = (3ms − 0.2ms) / (1/5000 − 1/8800)
   = 2.8e-3 / (2.00e-4 − 1.14e-4)
   = 2.8e-3 / 8.6e-5
   ≈ 33 tokens

NPU beats iGPU at prefill beyond ~33 tokens. For realistic chat prompts (128–2048 in) the NPU is the right surface unconditionally. Uncertainty band: 25–60 tokens depending on actual BD-list setup cost (IRON #93 has no live timing yet).
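The FLOP count and the crossover solve can be reproduced directly. A sketch: the 3 ms NPU overhead is the same midpoint assumption used in the derivation above, and the variable names are illustrative:

```python
# BitNet-b1.58-2B-4T dims from the section-0 table.
h, layers, ffn = 2560, 30, 6912
flops_per_tok = 2 * (4 * h * h + 3 * h * ffn) * layers   # ≈ 4.76e9

def prefill_tps(effective_tops):
    """Compute-bound prefill throughput at a given effective TOPS."""
    return effective_tops * 1e12 / flops_per_tok

igpu_tps = prefill_tps(24)   # ≈ 5 000 tok/s (FP16)
npu_tps  = prefill_tps(42)   # ≈ 8 800 tok/s (int2-via-int8)

# Crossover L*: solve t_npu_oh + L/npu_tps = t_igpu_oh + L/igpu_tps for L.
t_igpu_oh, t_npu_oh = 0.2e-3, 3e-3
L_star = (t_npu_oh - t_igpu_oh) / (1 / igpu_tps - 1 / npu_tps)   # ≈ 33 tokens
```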

4. All-surfaces-parallel latency (512-in + 256-out)

| stage | surface | time | note |
|---|---|---|---|
| Tokenize 512 in | CPU (1 core) | 4 ms | 1bit-core BPE, already measured |
| Prefill 512 tok | NPU | 512 / 8 800 + 3 ms ≈ 61 ms | int2-via-int8 projection |
| RoPE tables + dispatch | CPU | <1 ms | precomputed, cached |
| Decode 256 tok @ L≈768 | iGPU | 256 / 220 ≈ 1.16 s | Sherry + actsparse + a4 + Medusa 1.7× |
| Detokenize + stream out | CPU | <5 ms | overlaps decode |

TTFT ≈ 70 ms, total wall ≈ 1.23 s for 512+256. Today the same workload is ~4.1 s (prefill on iGPU, decode bottleneck). Projected wall-clock improvement: ~3.3× end-to-end.
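Summing the pipeline (a sketch of the table's arithmetic; detokenize overlaps decode, so it is excluded, and RoPE is taken at its 1 ms upper bound):

```python
tokenize_s = 4e-3                # CPU BPE, measured
prefill_s  = 512 / 8800 + 3e-3   # NPU int2-via-int8 projection + dispatch overhead
rope_s     = 1e-3                # precomputed tables, upper bound
decode_s   = 256 / 220           # iGPU with all decode lanes shipping

ttft_s = tokenize_s + rope_s + prefill_s   # ≈ 0.066 s, rounds to the ~70 ms TTFT
wall_s = ttft_s + decode_s                 # ≈ 1.23 s total for 512+256
```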

5. End-to-end decode tok/s ceiling (2B model, L=64)

| tier | shipping lanes | projection | assumptions |
|---|---|---|---|
| Conservative | 2 of 7 (Sherry + fd-attn) | 110–130 tok/s | no NPU, no Medusa, a4/actsparse deferred |
| Realistic | 5 of 7 (Sherry + fd-attn + actsparse + a4 + Medusa) | 190–240 tok/s | NPU still dark; decode is iGPU-only |
| Aspirational | 7 of 7 + clean ROCm 7.2 driver | 260–300 tok/s | NPU prefill frees iGPU for decode-only duty cycle; driver jitter <3% |

The 280 tok/s headline number assumes (a) Sherry packs to 1.25 bits with no PPL regression >0.1, (b) BitNet v2 a4 KV is bit-exact vs fp16 KV on our re-pack path, (c) Medusa 1.7× accept rate on our workload (lower than MedusaBitNet's reported 2.3× because our baseline is already faster).

6. The bandwidth wall

Even at 280 tok/s decode with every lane shipping, the final ceiling is the LPDDR5-8000 controller. Derivation of the wall:

theoretical max = 256 GB/s × 0.92 util / (0.35 GB/tok best-case)
               ≈ 673 tok/s

Medusa multiplies tokens-per-weight-fetch by 1.7×, which is the only way past ~400 tok/s on this bus. Beyond that you need:

at batch=1).

For the 2026-04 Strix Halo box, 256 GB/s is the terminal ceiling. Everything we ship between now and DDR6 is bytes-per-token reduction.
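The wall in one expression, with Medusa modeled as a multiplier on tokens per weight fetch rather than on bandwidth (illustrative upper bounds only, not shipped projections):

```python
BW_GBS, UTIL = 256.0, 0.92
W_GB_BEST = 0.35   # Sherry + actsparse best-case bytes read per token

# Terminal bandwidth wall: every decoded token still costs one weight sweep.
wall_tps = BW_GBS * UTIL / W_GB_BEST   # ≈ 673 tok/s

# Medusa accepts ~1.7 tokens per weight-read pass, so it scales tokens,
# not bytes -- the only lever past the bus. Theoretical bound, not a target.
medusa_bound_tps = wall_tps * 1.7      # ≈ 1 140 tok/s
```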

Sources

benchmarks/ppl-gen2.sh, benchmarks/attn_fd.sh.

TOPS), IRON #93 (INT8 kernel), PR #94.

BitNet v2.

project_activation_sparsity_phase1.md, project_attention_fd.md, project_npu_path_analysis.md.

One-line homepage summary

Projected ceiling: ~280 tok/s decode + NPU prefill crossover at ~33 tokens once all seven lanes ship, at the 256 GB/s LPDDR5 wall.