Peak Performance Projection

Quantitative projection of 1bit systems end-state throughput on Strix Halo (Ryzen AI MAX+ 395) once all seven planned lanes ship. Every number has a derivation or a citation. All projections carry a ±20% band unless stated otherwise.

0. Fixed constants

constant	value	source
LPDDR5-8000 peak bandwidth	256 GB/s	vendor spec, Zen5 APU
iGPU FP16 peak	60 TFLOPS	Radeon 8060S, 16 WGPs @ ~2.4 GHz
iGPU effective (measured on ternary GEMV)	24 TFLOPS (40% util)	bench.sh + rocprof
NPU INT8 peak	50 TOPS	`xrt-smi validate`, IRON #55
NPU effective (42% of peak, BF16 GEMM)	21 TOPS	IRON #55, andrej comment 3706297989
NPU effective, int2-via-int8 (2× pack factor)	~42 TOPS	two ternary weights per INT8 MAC lane
BitNet-b1.58-2B-4T: hidden / layers / ffn	2560 / 30 / 6912	Microsoft HF config
Ternary weight footprint (core MM only)	~630 MB	2.4B params × 1.58 bits + packing overhead
.h1b on disk (full model incl. embed/tok)	1.8 GB	Benchmarks.md

1. Current ceiling per surface (measured 2026-04-20)

surface	decode tok/s today	prefill tok/s today	bound by
iGPU (gfx1151)	83 @ L=64, 33 @ L=1024	~220 @ L=512 (measured)	LPDDR5 bw (92% of peak)
CPU (32× Zen5)	0 (idle)	0	not wired
NPU (XDNA2, 8×4 AIE2P)	0	0	IRON #93 int8 kernel not yet ported to Peano

Bandwidth-bound ceiling on the ternary GEMV today: 256 GB/s ÷ 630 MB/tok ≈ 406 tok/s theoretical. Measured 83 tok/s @ L=64 means the reachable roofline after amortizing KV + attn + sampler is ~85–90 tok/s — consistent with Benchmarks.md. At L=1024 the KV-cache dominates: 33 tok/s × 92% × 256 GB/s ≈ 7.1 GB read per token, essentially all KV.

2. Decode ceiling math

Formula: tok/s ≤ (BW × util) / (W_bytes × actsparse_factor) where util = 0.92 from the rocprof roofline.

config	W_bytes	actsparse	ceiling	realistic (±20%)	source
FP16 2.4B baseline	4.80 GB	1.0	49 tok/s	35–45	llama.cpp fp16 on same box
Ternary 1.58-bit (today)	0.63 GB	1.0	374 tok/s	83 @ L=64	bench.sh, measured
+Sherry 1.25-bit 3:4	0.50 GB (×0.79)	1.0	471 tok/s	105–125	1.25/1.58 bytes ratio, spike committed 2026-04-18
+Sherry +activation sparsity 30% eff	0.35 GB	0.70	673 tok/s	140–170	79.91% measured sparsity, 30% usable after DRAM granularity penalty
+Sherry +actsparse +BitNet v2 W1.58A4	same W	same	same @ short ctx	140–170 @ L≤256	2504.18415
+Sherry +actsparse +A4 @ L=2048	KV /4	—	~132 tok/s (vs 22 today)	110–160	a4 KV is 0.25× fp16 KV; fd-attn 6.78× already landed
+all +Medusa 1.7× accept	—	—	—	240–290 @ L=64; 190–270 @ L=2048	MedusaBitNet, 1.5–1.8× accepted-token range

The 83 → ~280 tok/s path is compounding on a bandwidth-bound stack: every lane that shrinks bytes-per-token lifts the same wall. Medusa multiplies on top because it issues ≥2 tokens per weight-read pass.

3. Prefill ceiling (compute-bound, NPU wins)

Prefill per-token FLOPs (fwd, attn + FFN, BitNet-2B):

F/tok ≈ 2 × (4·h² + 3·h·ffn) × layers
      = 2 × (4·2560² + 3·2560·6912) × 30
      = 2 × (26.2M + 53.1M) × 30
      ≈ 4.76 GFLOPs/tok

surface	effective TOPS	prefill tok/s	dispatch overhead
iGPU (FP16)	24 TFLOPS	24e12 / 4.76e9 ≈ 5 000 tok/s	~0.2 ms
NPU (INT8 straight)	21 TOPS	21e12 / 4.76e9 ≈ 4 400 tok/s	2–5 ms
NPU (int2 via INT8 MAC)	~42 TOPS	42e12 / 4.76e9 ≈ 8 800 tok/s	2–5 ms

Crossover L\* where NPU total time beats iGPU total time, solving t_npu_oh + L/npu_tps = t_igpu_oh + L/igpu_tps:

L* = (3ms − 0.2ms) / (1/5000 − 1/8800)
   = 2.8e-3 / (2.00e-4 − 1.14e-4)
   = 2.8e-3 / 8.6e-5
   ≈ 33 tokens

NPU beats iGPU at prefill beyond ~33 tokens. For realistic chat prompts (128–2048 in) the NPU is the right surface unconditionally. Uncertainty band: 25–60 tokens depending on actual BD-list setup cost (IRON #93 has no live timing yet).

4. All-surfaces-parallel latency (512-in + 256-out)

stage	surface	time	note
Tokenize 512 in	CPU (1 core)	4 ms	1bit-core BPE, already measured
Prefill 512 tok	NPU	512 / 8 800 + 3 ms ≈ 61 ms	int2-via-int8 projection
RoPE tables + dispatch	CPU	<1 ms	precomputed, cached
Decode 256 tok @ L≈768	iGPU	256 / 220 ≈ 1.16 s	Sherry + actsparse + a4 + Medusa 1.7×
Detokenize + stream out	CPU	<5 ms	overlaps decode

TTFT ≈ 70 ms, total wall ≈ 1.23 s for 512+256. Today the same workload is ~4.1 s (prefill on iGPU, decode bottleneck). Projected wall-clock improvement: ~3.3× end-to-end.

5. End-to-end decode tok/s ceiling (2B model, L=64)

tier	shipping lanes	projection	assumptions
Conservative	2 of 7 (Sherry + fd-attn)	110–130 tok/s	no NPU, no Medusa, a4/actsparse deferred
Realistic	5 of 7 (Sherry + fd-attn + actsparse + a4 + Medusa)	190–240 tok/s	NPU still dark; decode is iGPU-only
Aspirational	7 of 7 + clean ROCm 7.2 driver	260–300 tok/s	NPU prefill frees iGPU for decode-only duty cycle; driver jitter <3%

The 280 tok/s headline number assumes (a) Sherry packs to 1.25 bits with no PPL regression >0.1, (b) BitNet v2 a4 KV is bit-exact vs fp16 KV on our re-pack path, (c) Medusa 1.7× accept rate on our workload (lower than MedusaBitNet's reported 2.3× because our baseline is already faster).

6. The bandwidth wall

Even at 280 tok/s decode with every lane shipping, the final ceiling is the LPDDR5-8000 controller. Derivation of the wall:

theoretical max = 256 GB/s × 0.92 util / (0.35 GB/tok best-case)
               ≈ 673 tok/s

Medusa multiplies tokens-per-weight-fetch by 1.7×, which is the only way past ~400 tok/s on this bus. Beyond that you need:

DDR6 (consumer, ~2027): 512–768 GB/s → linear 2–3× lift.
HBM3 (not coming to Ryzen AI APUs — Strix Halo's LPDDR5 is soldered).
On-die SRAM residency (Cerebras territory — not applicable to 2B

at batch=1).

For the 2026-04 Strix Halo box, 256 GB/s is the terminal ceiling. Everything we ship between now and DDR6 is bytes-per-token reduction.

Sources

Live measurements: Benchmarks.md, benchmarks/bench.sh,

benchmarks/ppl-gen2.sh, benchmarks/attn_fd.sh.

NPU hardware: IRON #55 (column count,

TOPS), IRON #93 (INT8 kernel), PR #94.

Model: Microsoft BitNet b1.58,

BitNet v2.

Memory notes: project_bitnet_live_bench.md, project_sherry_spike.md,

project_activation_sparsity_phase1.md, project_attention_fd.md, project_npu_path_analysis.md.

One-line homepage summary

Projected ceiling: ~280 tok/s decode + NPU prefill crossover at ~33 tokens once all seven lanes ship, at the 256 GB/s LPDDR5 wall.