1bit.systems

Why no NPU yet?

2026-04-20 RULE A CONSTRAINT — no Python in NPU path

User directive 2026-04-20: zero Python, build-time or runtime. That rules out IRON as an authoring path (it's Python+C++). IRON stays as a reference-only read — we crib its MLIR tile layouts + DMA descriptors, then reimplement the kernel in straight C++ against Peano.

Production path when we start:

| Layer | Choice | Why |
|---|---|---|
| AIE kernel source | C++ via Peano | LLVM-AIE backend, direct AIE VLIW codegen, pure C++ |
| Build orchestration | CMake + Peano | not IRON's Python build.py |
| Runtime dispatch | libxrt C++ (xrt::kernel, xrt::bo) | loads xclbin, manages DMA, no Python |
| Kernel driver | amdxdna | upstreamed in Linux 6.10+, already works |
| Rust FFI | new 1bit-xdna crate | mirrors the 1bit-hip shape |

Same discipline as rocm-cpp today: C++ kernels, Rust above, no interpreters anywhere. The IRON examples at programming_examples/basic/matrix_multiplication/ and ml/bert/ are maps, not tools.
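To make the planned dispatch shape concrete: a minimal sketch of what the 1bit-xdna crate's hot path could look like, mirroring the libxrt flow (load xclbin, allocate buffer objects, DMA in, launch, wait, DMA out). Every name here is hypothetical — the real crate would FFI into libxrt; the stub below only records the call sequence so the flow is checkable without NPU hardware.

```rust
// Hypothetical 1bit-xdna dispatch surface. The trait mirrors the libxrt
// C++ flow; StubDevice stands in for real hardware and just logs calls.

trait XdnaDispatch {
    fn load_xclbin(&mut self, path: &str);
    fn alloc_bo(&mut self, bytes: usize) -> usize; // buffer-object handle
    fn sync_to_device(&mut self, bo: usize);       // host -> device DMA
    fn launch(&mut self, kernel: &str, args: &[usize]);
    fn wait(&mut self);
    fn sync_from_device(&mut self, bo: usize);     // device -> host DMA
}

#[derive(Default)]
struct StubDevice {
    log: Vec<String>,
    next: usize,
}

impl XdnaDispatch for StubDevice {
    fn load_xclbin(&mut self, path: &str) {
        self.log.push(format!("load {path}"));
    }
    fn alloc_bo(&mut self, bytes: usize) -> usize {
        self.next += 1;
        self.log.push(format!("bo#{} {} bytes", self.next, bytes));
        self.next
    }
    fn sync_to_device(&mut self, bo: usize) {
        self.log.push(format!("h2d bo#{bo}"));
    }
    fn launch(&mut self, kernel: &str, args: &[usize]) {
        self.log.push(format!("run {kernel} with {} args", args.len()));
    }
    fn wait(&mut self) {
        self.log.push("wait".into());
    }
    fn sync_from_device(&mut self, bo: usize) {
        self.log.push(format!("d2h bo#{bo}"));
    }
}

/// One ternary-GEMV dispatch: weights resident, activations streamed.
fn gemv_dispatch<D: XdnaDispatch>(dev: &mut D, weight_bytes: usize, act_bytes: usize) {
    dev.load_xclbin("bitnet_gemv.xclbin");
    let w = dev.alloc_bo(weight_bytes);
    let x = dev.alloc_bo(act_bytes);
    let y = dev.alloc_bo(act_bytes);
    dev.sync_to_device(w);
    dev.sync_to_device(x);
    dev.launch("ternary_gemv", &[w, x, y]);
    dev.wait();
    dev.sync_from_device(y);
}

fn main() {
    let mut dev = StubDevice::default();
    gemv_dispatch(&mut dev, 512 << 20, 16 << 10);
    println!("{}", dev.log.join("\n"));
}
```

The point of the trait split is that the Rust layer stays identical whether the backend is 1bit-hip today or 1bit-xdna later — same shape as the existing crate.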

2026-04-20 UPDATE: IRON / MLIR-AIE lead from AMD Discord

Geramyl (AMD mod) pointed us at a fifth path we had not scored: IRON (and "the other one" — confirmed as mlir-aie). We now know:

Strix Halo status today: PARTIAL — working this week for at least one user.

Evidence from IRON issue tracker (searched 2026-04-20):

What this changes.

Honest framing: this is a fifth evaluation track, not a green light to ship NPU. We are thanking AMD for the pointer and spending one week to confirm the reproducibility of issue #55's 2.75 TFLOPS number on our box. Everything below remains correct for the Ryzen AI SDK / FastFlowLM / IREE / ONNX paths.


One-line answer: evaluated four stacks (ONNX-RT + Vitis EP, FastFlowLM, IREE-AIE, direct xrt). Deferred — no path runs BitNet-b1.58 on Strix Halo's XDNA 2 in Linux today, and the realistic decode ceiling is below our current iGPU. Update posture: quarterly passive monitoring, not active work. See 2026-04-20 update above for path #5 (IRON) now under active evaluation.

What's on the box

Why we haven't used it yet

  1. No Rust-native BitNet NPU runtime exists publicly. The mature options are Python-based (AMD Ryzen AI SDK), and they target generic ONNX models — BitNet's ternary matmul is a custom op that standard ONNX runtimes don't know. Someone has to write it; so far nobody has, on this hardware, in public.
  2. The iGPU is already bandwidth-bound, not compute-bound. Our ternary GEMV hits 92% of LPDDR5 peak on the weight-read path. Adding NPU compute doesn't create new bandwidth. The NPU would share the same LPDDR5; it wouldn't make decode faster, only change which silicon the math lands on.
  3. The NPU's real win is prefill + speculative, not decode. At prefill (M > 1 matmul instead of M = 1 GEMV) the NPU's int8 matrix units would be fed more efficiently. Same for Medusa-style speculative decoder heads — multiple candidate tokens in parallel is exactly what the NPU is designed for.
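To make point 2 concrete, here is the back-of-envelope that makes decode bandwidth-bound. Assumptions are mine, not measurements from this doc: ternary weights {-1, 0, +1} pack into 2 bits each (4 per byte), and every decoded token must stream the full weight set once; the 2B-parameter and 212 GB/s figures are illustrative.

```rust
// Why decode is memory-bound: bytes moved per token, not TOPS, set the
// ceiling. Assumes 2-bit ternary packing and full weight read per token.

/// Pack ternary weights into 2 bits each (00 = 0, 01 = +1, 10 = -1).
fn pack_ternary(w: &[i8]) -> Vec<u8> {
    w.chunks(4)
        .map(|c| {
            c.iter().enumerate().fold(0u8, |acc, (i, &t)| {
                let bits = match t {
                    1 => 0b01,
                    -1 => 0b10,
                    _ => 0b00,
                };
                acc | (bits << (2 * i))
            })
        })
        .collect()
}

/// Hard decode ceiling if every token streams all weights once.
fn decode_ceiling_toks(params: f64, bw_gb_s: f64) -> f64 {
    let weight_gb = params * 2.0 / 8.0 / 1e9; // 2 bits per weight
    bw_gb_s / weight_gb
}

fn main() {
    // Four weights fit in one byte after packing.
    let packed = pack_ternary(&[1, -1, 0, 1]);
    assert_eq!(packed, vec![0b0100_1001]);
    // 2B ternary params = 0.5 GB per token; at 212 GB/s the absolute
    // ceiling is 424 tok/s regardless of how much compute sits idle.
    println!("{:.0} tok/s ceiling", decode_ceiling_toks(2e9, 212.0));
}
```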

The three stacks we're evaluating

Stack A — ONNX Runtime + Vitis/AMD NPU execution provider

Shape: Export BitNet to ONNX with a custom ternary-matmul op. Use ONNX Runtime's AMD EP to route that op to XDNA 2.

Pros:

Cons:

Stack B — FastFlowLM / Ryzen AI Flow

Shape: AMD's newer model-format + compiler, supposedly targets GPU+NPU dispatch with a unified IR.

Pros:

Cons:

Stack C — IREE AMD AIE

Shape: IREE compiler pipeline with the amd-aie backend.

Pros: open source end-to-end, MLIR-based, Rust bindings via iree-compiler crate.

Cons: bleeding edge; BitNet model support unknown; perf not characterized on Strix Halo.

Final verdict (2026-04-20)

Research concluded — see project_npu_path_analysis.md memory. Defer until one of:

  1. Ryzen AI SDK ≥ 1.8 adds STX-H (Strix Halo) to its Linux-supported SKU list. Today's 1.7.1 (April 2026) lists only STX + KRK.
  2. microsoft/BitNet ships an XDNA backend. Issue #408 — "Intel & AMD NPU support?" — has been open since Feb 2026 with zero Microsoft replies.
  3. FastFlowLM open-sources its NPU kernels (currently closed-source, non-redistributable under their EULA) or adds a 1.58-bit model family.
  4. Our iGPU path saturates the 212 GB/s LPDDR5 ceiling. Today we run at ~15% utilization — plenty of Sherry / activation-sparsity / KV-compression runway.
  5. A third party publishes a working BitNet → AIE kernel in public.

The four stacks, scored

Stack A — ONNX Runtime + Vitis AI EP

Stack B — FastFlowLM (note: "FastFlowFM" was a transcription artefact; real name is FastFlowLM)

Stack C — IREE AMD-AIE (nod-ai/iree-amd-aie)

Stack D — xrt direct

The decode-tok/s math

XDNA 2 bandwidth: ~120 GB/s (dual-channel LPDDR5). iGPU bandwidth: ~212 GB/s of the 256 GB/s pool (our measured ceiling).

BitNet-2B decode is memory-bandwidth-bound on the weight-read path. Linear scaling from the T-MAN paper's 50 tok/s result on Qualcomm Hexagon (~77 GB/s) gives XDNA 2 a ~78 tok/s ceiling — below our measured 83 tok/s on the iGPU today.
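The scaling estimate is one line of arithmetic; worth pinning down since it is the whole case against decode-on-NPU. Figures are the ones quoted above (50 tok/s at ~77 GB/s, ~120 GB/s XDNA 2 pool).

```rust
// Linear bandwidth scaling of a memory-bound decode result:
// tok/s scales with bytes/s when every token streams the same weights.

fn scale_toks(base_toks: f64, base_bw: f64, target_bw: f64) -> f64 {
    base_toks * target_bw / base_bw
}

fn main() {
    // 50 tok/s at 77 GB/s -> ~78 tok/s at 120 GB/s.
    // Below the 83 tok/s already measured on the iGPU, so decode-on-NPU
    // loses even before accounting for dispatch overhead.
    let npu = scale_toks(50.0, 77.0, 120.0);
    println!("NPU decode ceiling ~{npu:.0} tok/s");
}
```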

Decode-on-NPU is negative ROI. The compute is there (50 TOPS int8), the bandwidth isn't. This is the same memory-bound story the ternary choice already exploits — see Why-Ternary.md.

What could still pay

Prefill-on-NPU (tier a) only. At prefill (many tokens at once), the matmul is dense enough that compute matters more than bandwidth. NPU's 50 TOPS int8 could roughly double prefill throughput on 128-token prompts.
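The reason prefill flips the equation is arithmetic intensity: FLOPs per byte of weight traffic grow linearly with the number of tokens processed together. A sketch under the simplifying assumption that weight reads dominate traffic (activations ignored) and weights are 2-bit packed:

```rust
// Arithmetic intensity of a ternary matmul vs. batch dimension M.
// 2*M*N*K ops over N*K/4 bytes of 2-bit packed weights = 8*M ops/byte,
// independent of N and K. Activation traffic is ignored for simplicity.

fn ops_per_weight_byte(m: u64) -> u64 {
    8 * m
}

fn main() {
    // Decode (M = 1): ~8 ops per weight byte -- hopelessly memory-bound.
    // 128-token prefill: ~1024 ops per byte -- compute starts to matter,
    // which is where 50 TOPS of int8 could actually be fed.
    println!("decode:  {} ops/byte", ops_per_weight_byte(1));
    println!("prefill: {} ops/byte", ops_per_weight_byte(128));
}
```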

That's a nice-to-have, not a cutover gate. And it still requires the AMD Linux SKU gap to close first.

Status

Memory pointer: project_npu_path_analysis.md has the full comparison table, citations, and defer-until conditions.