Rotorquant default: OFF on 2B, opt-in via `--kv-quant rotor-3`
PlanarQuant-3 (2D-Givens rotation + 3-bit Lloyd-Max) landed as an optional KV-cache compression path in rocm-cpp:
src/kv_cache_attn_fd_rotor.hip— split-KV Flash-Decoding with inline
dequant + inverse Givens.
kernels/rotorquant_pack.hip— fp16 → 3-bit packed requantizer.
5.33× KV compression (fp16 16 B/elem → packed 3 B/elem, exactly). Numerically clean: mean_abs_diff ~ 0.02–0.05, max_abs_diff tracks ref_max_abs. Perf is the problem.
1. Current bench numbers (BitNet-2B-4T shape: NH=20, NKV=5, HD=128)
Source: /home/bcloud/claude output/rotor_bench-20260420T133827Z.txt (post wave-shfl V-loop fix). us_pq3_total = pq3_attn + pq3_requant (the requantize cost must be amortised because a freshly-written KV row must be packed before any subsequent read).
| seq_len | fp16 attn µs | pq3 attn µs | pq3 requant µs | pq3 total µs | fp16 tok/s | pq3 tok/s | ratio |
|---|---|---|---|---|---|---|---|
| 512 | 68.1 | 81.0 | 5.1 | 86.1 | 14 675 | 11 617 | 0.79× |
| 1024 | 88.8 | 92.3 | 5.0 | 97.4 | 11 263 | 10 272 | 0.91× |
| 2048 | 95.1 | 106.4 | 5.0 | 111.4 | 10 516 | 8 977 | 0.85× |
| 4096 | 184.7 | 208.0 | 5.0 | 213.0 | 5 413 | 4 696 | 0.87× |
We do not yet have 8k or 16k numbers. The bench harness caps at 4096 today. An 8k / 16k sweep is explicitly a prerequisite for flipping the default (see §4).
Pre-fix baseline for reference (rotor_bench-20260420T121449Z.txt, before the V-loop wave-shfl): 4096 was 2271 tok/s, i.e. the shfl fix doubled it. PQ3 is now consistently ~0.8–0.9× fp16 at the sizes we can measure. It does not overtake. The slope of fp16 attn µs/tok flattens with seq_len (the grid scales with num_kv_tiles), so the gap does not obviously close at 8k+ either.
2. Why it is slower despite 5.33× fewer bytes read
At HD=128, NKV=5, the pass-1 block does TILE=128 positions per tile. Per position the fp16 path streams 5 × 128 × 2 = 1280 B of K plus 1280 B of V from the KV cache. PQ3 streams 5 × 128 × 3/8 = 240 B of K plus 240 B of V. Net savings: roughly 2080 B/pos read-bandwidth.
Against that, PQ3 pays per-position, per-pair (pairs = head_dim / 2 = 64):
- 3-byte triple load + unpack — byte-wise loads (not coalesced as
cleanly as the fp16 ld_global_u16x2), shift assemble into u32, then two 3-bit extract-and-mask operations.
- LUT dequant — one LDS read per scalar into a scaled 8-entry LUT.
Bank-conflict-free (8 entries replicated), but an extra LDS round-trip per scalar vs the fp16 path which reads K/V directly.
- Inverse Givens rotation — 2 fmas + 2 muls per pair, per position,
on both K and V. Plus an LDS read of s_pair_cs[p][{0,1}] per pair.
- V-side wave shfl — the recent fix; saves vgpr on the V accumulate
but still leaves the dequant+unrotate on the critical path.
At gfx1151's LPDDR5 bandwidth (~256 GB/s, ~77–80 GB/s realized to iGPU), the fp16 attn decode kernel is not memory-bound at these shapes — we sit well under the roofline because the per-position compute is tiny and the launch/grid/online-softmax overhead dominates below ~2k. So trading bandwidth for compute loses at HD=128/NKV=5. The 5.33× compression doesn't buy us anything we need; the inverse-rotation is pure overhead.
This would flip at large NKV × HD × seq_len where the KV read genuinely saturates LPDDR5 — i.e. 70B-class or 32k-context regimes.
3. Recommendation
Default rotor-3 OFF on BitNet-2B-4T and any ≤4B model. Keep the path fully built and wired, expose it behind --kv-quant rotor-3 on the CLI (and a matching field in the router config). Flip the default ON only when we ship either:
- A 70B-class model where fp16 KV exceeds working-set headroom in
128 GB LPDDR5 under realistic multi-session load, or
- A 32k (or longer) context variant of any model where fp16 KV
crosses the point at which the attn kernel becomes bandwidth-bound (i.e. us_attn scales linearly with seq_len and hits roofline).
This matches how the stack is used today (2B short-context on a single box) while keeping the compression path alive for the two futures that actually need it. Numeric parity is already good enough for opt-in; we are not shipping a correctness risk, only a perf trade.
4. What to measure before flipping the default
Hard gates, all on gfx1151, bench.sh-style harness, 3×20-iter rounds:
- 8k and 16k sweeps — extend
rotor_benchtoseq_len ∈ {8192, 16384}.
PQ3 needs to beat fp16 by ≥1.1× at some seq_len to justify a default flip for long-context.
- 70B shape microbench — re-run the microbench at (roughly) NH=64,
NKV=8, HD=128 to mimic a 70B LLaMA-family layer. Confirm the bandwidth crossover appears in that shape.
- Service-level bench — end-to-end tok/s via
bitnet_decode(not
just the kernel microbench) at 4k / 8k / 16k prompt + 1024 gen. The microbench excludes the K/V write-path requantize called on every new token; we need to confirm the amortisation assumption holds.
- PPL on wikitext-103 at 1024 ctx — fp16 baseline 9.1607. PQ3 must
stay within ≤0.25 absolute PPL to be default-eligible (current microbench numeric error suggests this is comfortable, but it has not been measured end-to-end).
- rocprof counters — confirm
SQ_WAVES/VALU_UTILfor PQ3 is
the bottleneck (compute-bound today) and that it inverts at the target shape (becomes bandwidth-bound).
- Working-set headroom — on the target model, verify that fp16 KV
- activations + weights actually spill out of the iGPU's 96 GB
partition under realistic concurrency. If we comfortably fit, the compression isn't forced and we should lean on capacity, not compression, for now.
Only when (1) + (2) + (3) all show PQ3 > fp16 on service tok/s and (4) confirms PPL parity do we flip the default. Until then: opt-in, behind the flag.
Contact on questions: project_bitnet_live_bench.md, project_bitnet_rocprof_plan.md.