1bit.systems

CPU Lane Plan

The 7th surface of 1bit systems's APU stack: the 16 Zen5 cores on Strix Halo, running the sampler + tokenizer + dispatcher in parallel with the iGPU's next-token grind.

This page covers what we scaffolded today (2026-04-20), why the CPU is a peer lane (not a fallback), and the three concrete steps that turn the scaffold into a measured win.

What the lane is

The CPU lane is host-side orchestration in parallel with GPU compute. On a live decode:

— and emits logits.

top-p / temperature sampler), detokenize the byte-level BPE delta, match stop strings, and hand the next id back to the GPU.

Gen-1 and current gen-2 run all of that on a single tokio task. Every microsecond the sampler spends on the critical path is a microsecond the iGPU sits idle waiting to be fed the next token id. The lane gives that work its own rayon-backed thread pool:

cores for the tokio reactor + HIP stream callback thread).

attribute it cleanly.

Surface entry point: [onebit_router::cpu_lane::CpuLane] in crates/1bit-router/src/cpu_lane.rs.

Why CPU is a peer lane, not a fallback

project_apu_thesis.md is the longer form; the short version:

NPU, and CPU at 256 GB/s peak. A CPU lane doing sampler work doesn't compete with the iGPU for bandwidth — the sampler's read pattern is a single 32k-element logit scan per token (~128 KB), fits in L2.

running something in parallel. iGPU grinds matmuls, NPU grinds prefill GEMMs, CPU grinds the sampler + tokenizer. Amdahl wins when nothing idles.

are HIP matmul, HIP attention, HIP sampler (today's path), XDNA prefill, XDNA decode, shared LPDDR5 DMA. The CPU lane replaces "HIP sampler" on the critical path so the iGPU can start the next token's matmul while the CPU is still finishing the previous token's sampler.

What we scaffolded today

2026-04-20:

holds a rayon::ThreadPool; parallel_sample(&self, logits, top_k, top_p, temp) demonstrates the rayon parallel-reduction pattern (argmax at temp=0, chunked top-k fallback at temp>0).

BackendError::CpuLaneStub("CPU sampler lane scaffolded, not yet on critical path; see docs/wiki/CPU-Lane-Plan.md") instead of unimplemented!(). Distinct error kind from NotYetWired so ops tooling can count them separately.

already in the workspace Cargo.lock transitively (via tokenizers); adding it directly doesn't grow the build graph.

argmax at temp=0 is deterministic on a 32k-element vector, HALO_CPU_THREADS=4 is respected. Env-var test takes a module-local mutex so it can't race with the HALO_BACKEND test in backend_config_tests.

What we did not do today, on purpose:

comes first — see step (a) below.

scaffolding — it proves the rayon pattern, not a drop-in replacement.

keeps orchestration in Rust; kernels stay in rocm-cpp.

What's next (three concrete steps)

(a) Wire CpuLane::parallel_sample into 1bit-server's SSE handler

1bit-server runs the sampler inline on the main axum task today. The wire-up:

  1. Construct a process-global CpuLane in 1bit-server's main via

onebit_router::cpu_lane::global_lane().

  1. In the /v1/chat/completions streaming handler, replace

sampler.sample(logits, history) with a dispatch that routes to lane.parallel_sample(logits, top_k, top_p, temp) when the CPU lane is selected. Keep the single-threaded path behind a HALO_CPU_LANE=off escape hatch.

  1. Ensure the SSE delta emission stays on the tokio task — rayon runs

the scan, tokio owns the socket.

(b) Microbenchmark single-threaded vs rayon-parallel sampler

Not wired = no measurement. Target: criterion bench in crates/1bit-router/benches/cpu_lane.rs:

128256 rounded down for easier chunking).

against CpuLane::parallel_sample at pool sizes 1, 4, 8, 14.

the iGPU overlap (~200 µs per token measured on a clean burn) wouldn't be dwarfed by rayon's per-task overhead.

If rayon loses: step (c) becomes rank #1 — a flat for loop on the tokio task beats rayon for a single 128 KB scan, and we'd need BLAS-level machinery to make the CPU lane pull its weight.

(c) Decide whether ZenDNN FFI is worth the dep weight

AMD-AI-ML-Tools-Scan.md flagged ZenDNN 5.2 as having an official llama.cpp backend. It's AMD's Zen-tuned math library — MLP / GEMM / reduction kernels AVX-512-VNNI-tuned for Zen4+. Two questions to answer before we pull it in:

  1. Is a parallel sampler + tokenizer even the bottleneck, or is the

iGPU forward pass eating all the time? (Step b's bench answers this.)

  1. Does ZenDNN offer anything for our sampler specifically? The library

is matmul-first; our sampler is a reduction. If the win is only on matmuls, the CPU lane is still Rust-native — we only FFI when the kernel class matches.

Decision output: either add ZenDNN as an optional feature-gated dep behind --features zendnn (same shape as --features real-xdna), or skip and stay pure Rust.

Cross-references

— the "why CPU is a peer lane" thesis.

target; the CPU lane closes surface #7.