1bit.systems

Ternary-on-AIE: packing & MAC plan

Design doc for how BitNet-b1.58-2B-4T weights (ternary, values {-1, 0, +1}) land on the XDNA 2 AIE2P tile array. Extends NPU-Kernel-Design.md. Read that first for the 8×4 tile grid + INT8 GEMM gotchas.

Packing on the host side (Rust, offline)

  1. Each ternary weight takes 2 bits. We pack 4 weights per byte, row-major. This runs once in the requantizer (tools/requantize/ternary.rs, same shape as the existing .h1b codec).
  2. Encoding: -1 → 0b10, 0 → 0b00, +1 → 0b01, 0b11 reserved (BitNet v2 uses it as a saturation sentinel).
  3. Per-layer scale stays bf16, stored alongside the packed block (one scale per row of 4096 for decoder weights; Microsoft's layout).
  4. Pre-tile for the AIE 4D DMA reorder pattern (design.py:368-375 in IRON). A naïve row-major blob costs cycles at shim-DMA time, so we bake the reorder into the requantizer instead: one 4D transpose in advance, zero run-time cost.

Output of the packing stage → a per-layer weights.bin consumed by the tile-side unpack below.

Unpack on the tile (Peano C++)

On AIE2P the int8 MAC pipeline wants int8 inputs for both A and B. We unpack 2-bit ternary to int8 inside the tile core, amortising over the MAC latency.

// inside aie_kernels/onebit_ternary_mm.cpp
#include <aie_api/aie.hpp>

// One 32-lane int8 vector per unpack; four of these run in parallel.
// Element i sits at bits [2i+1:2i] of the packed word.
aie::vector<int8, 32> unpack_ternary(uint64_t packed) {
    aie::vector<int8, 32> out;
    #pragma unroll
    for (int i = 0; i < 32; i++) {
        uint8_t b = (packed >> (2 * i)) & 0b11;
        // 0b01 → +1, 0b10 → -1, 0b00 → 0; the 0b11 sentinel also decodes to 0.
        out.set((b == 0b01) ? 1 : (b == 0b10) ? -1 : 0, i);
    }
    return out;
}

Cost: roughly one shift plus one predicated select per element, pipelined into the MAC issue slots. Theoretical hit: ~0 cycles if unpack and MAC issue on alternating slots (the AIE2P VLIW has two vector slots); worst case: 2-3 extra cycles per 32-MAC block. The budget allows it.
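
For the bit-exact test in the checklist, a host-side scalar mirror of the same decode is useful (plain C++, no AIE intrinsics; unpack_ternary_ref is our name for it, not part of the kernel):

```cpp
#include <array>
#include <cstdint>

// Scalar reference of the tile-side unpack: 32 two-bit fields per uint64_t,
// element i at bits [2i+1:2i].
// 0b01 -> +1, 0b10 -> -1, 0b00 -> 0, 0b11 (sentinel) -> 0.
std::array<int8_t, 32> unpack_ternary_ref(uint64_t packed) {
    std::array<int8_t, 32> out{};
    for (int i = 0; i < 32; i++) {
        uint8_t b = (packed >> (2 * i)) & 0b11;
        out[i] = (b == 0b01) ? 1 : (b == 0b10) ? -1 : 0;
    }
    return out;
}
```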

MAC core — adapted from mm.cc:83-208

Mirror matmul_vectorized_2x2_mmul, but replace the A-side int8 load with our unpack.
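
A portable scalar model of that A-path swap (plain C++, no AIE intrinsics; the vectorized kernel instead follows the mmul tiling in mm.cc, and the names here are illustrative): the inner loop decodes A's 2-bit codes on the fly rather than loading int8 directly.

```cpp
#include <cstdint>
#include <vector>

// C[M x N] = A[M x K] * B[K x N], with A stored ternary-packed: 4 weights
// per byte, element k of a row at bits [2*(k%4)+1 : 2*(k%4)] of byte k/4.
// int32 accumulate, matching the accumulator width of the int8 MAC pipeline.
void ternary_matmul_ref(const std::vector<uint8_t>& a_packed,  // M*K/4 bytes
                        const std::vector<int8_t>& b,          // K*N
                        std::vector<int32_t>& c,               // M*N
                        int M, int K, int N) {
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            int32_t acc = 0;
            for (int k = 0; k < K; k++) {
                // Decode in place of the A-side int8 load.
                uint8_t code = (a_packed[(m * K + k) / 4] >> (2 * (k % 4))) & 0b11;
                int8_t a = (code == 0b01) ? 1 : (code == 0b10) ? -1 : 0;
                acc += a * b[k * N + n];
            }
            c[m * N + n] = acc;
        }
}
```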

Top 3 implementation gotchas (cribbed from the IRON analysis):

  1. 4D A-reorder BD (design.py:368-375): bake this into the packing step. The tile expects a specific sub-tile pre-order.
  2. Transpose-on-load for c_row_maj=false (mm.cc:135-145): easy to forget; skipping it yields a C that is deterministically wrong. We want row-major C for compatibility with the router's hidden-state shape.
  3. Alternating shim placement on 8-col NPU2 (design.py:385): Tile(2*i, 1), not Tile(i, 1). Linear indexing double-assigns shim DMAs.

Bandwidth-vs-compute crossover

At ternary-packed int2, each weight byte carries 4 weights, so each byte streamed on-tile feeds 4 MACs per prefill token. For BitNet-2B prefill (hidden = 2560 × hidden = 2560 matmuls across 30 layers):

Crossover: compute-bound at every prefill length we'll see. Bandwidth is not the ceiling for NPU prefill; compute is. That is why the NPU is the correct prefill surface: the iGPU is bandwidth-limited on the same problem, the NPU is compute-limited, and at large M compute throughput is what we care about.
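
The arithmetic behind that claim, in sketch form (the machine-balance figures below are placeholders for illustration, not measured XDNA 2 numbers): for an M×K×N GEMM with 4 weights per byte, each weight byte feeds 4·M MACs, so arithmetic intensity grows linearly with prefill length M.

```cpp
#include <cstdint>

// MACs executed per weight byte streamed from DRAM, for an M x K x N GEMM
// with 4 ternary weights packed per byte: (M*K*N) / (K*N/4) = 4*M.
double macs_per_weight_byte(int64_t M) { return 4.0 * M; }

// Machine balance = peak MAC rate / DRAM bandwidth: the intensity (MACs per
// byte) at which compute time equals streaming time. Illustrative inputs only.
double machine_balance(double peak_macs_per_s, double dram_bytes_per_s) {
    return peak_macs_per_s / dram_bytes_per_s;
}
```

Once 4·M exceeds the machine balance, the kernel is compute-bound; any realistic prefill length clears that bar.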

Activation path

Activations flow in as int8 from an upstream stage (a CPU-side quantiser or the previous layer's output). Today's 1bit-server keeps activations bf16 end-to-end; for NPU prefill we add two stages:

  1. Quantise: bf16 activations → int8 + scale, on the way into the NPU matmul.
  2. Dequantise: the drained int32 accumulators → bf16, on the way out.

Both run on the iGPU (HIP), not the NPU. Overhead: ~0.3 ms per layer at 2B. Absorbed by overlapping with NPU MAC time.
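
A portable sketch of the two stages (float stands in for bf16 here, and symmetric per-tensor activation scaling is an assumption; the weight side uses the per-row bf16 scales from the packed block):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantise: activations -> int8 plus one scale, symmetric around zero.
float quantize_act(const std::vector<float>& x, std::vector<int8_t>& q) {
    float amax = 0.f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    float scale = (amax > 0.f) ? amax / 127.f : 1.f;
    q.resize(x.size());
    for (size_t i = 0; i < x.size(); i++)
        q[i] = (int8_t)std::lround(std::clamp(x[i] / scale, -127.f, 127.f));
    return scale;
}

// Dequantise: one drained int32 accumulator -> float, folding in the
// activation scale and the per-row weight scale.
float dequantize_acc(int32_t acc, float act_scale, float weight_scale) {
    return (float)acc * act_scale * weight_scale;
}
```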

Checklist (reimplementer)

  1. Requantizer output: packed ternary bytes + per-row bf16 scales + pre-baked 4D reorder.
  2. Peano C++ kernel source: unpack fn + MAC loop cribbed from mm.cc:83-208, A-path swapped for unpacked int8.
  3. Per-tile memory sized to <48 KiB of L1 (see NPU-Kernel-Design.md).
  4. Shim DMA bindings: 4 A-lanes (broadcast across rows), 8 B-lanes (broadcast across cols), 8 C-drain.
  5. Alternating shim placement on NPU2 (8 cols).
  6. Host-side: quantise bf16 activations → int8 + scale on iGPU, drain int32 → dequantise to bf16.
  7. xclbin produced by Peano, loaded via 1bit-xdna::XdnaDevice::load_xclbin.
  8. Test: bit-exact match against the iGPU HIP reference kernel on a fixed prompt.

Sources