NPU-Kernel-Design
Design notes for the straight-C++/Peano reimplementation of IRON's INT8 GEMM, targeting the Strix Halo XDNA2 NPU for BitNet prefill. Source reference: amd/IRON @ devel (cloned 2026-04-20 to /tmp/IRON-read).
AIE2P hardware layout (Strix Halo)
| Property | Value | Source |
|---|---|---|
| Rows per column | 4 compute tiles | iron/operators/gemm/design.py:151 (n_aie_rows = 4) |
| Columns on Strix Halo | 8 | design.py:209-212 ("NPU2 (Strix/Strix Halo/Krackan) has 8 columns"), IRON #55 |
| Shim row | row 0 | design.py:325 (tiles indexed [col][row], shim placements use row=0) |
| Mem-tile row | row 1 | design.py:384,408,433 (all Tile(col, 1) placements) |
| Core rows | rows 2-5 | design.py:326 (core_tiles = tiles[2:]) |
| Vector store width | 512 bit | aie_kernels/aie2p/zero.cc:22 (r = 512 / (sizeof(T)*8)) |
| INT8 MAC shape | 8 x 8 x 8 (r,s,t) | aie_kernels/aie2p/mm.cc:366-378, confirmed IRON #93 body |
| Peak INT8 | ~50 TOPS array | IRON #55 xrt-smi validate + andrej comment |
| bf16 : int8 ratio | ~1:8 (not 1:2 like AIE-ML) | IRON #55 comment 3706297989 (andrej) |
The 50 TOPS figure comes straight from xrt-smi validate in IRON #55. AngryLoki in that thread notes 1.8 GHz x 512 INT8 MACs/cycle/core x 32 cores ≈ 58.8 TOPS theoretical; IRON's 64x64x64 design hits ~8% of that.
INT8 per-core tile layout
The kernel is a single template: matmul_vectorized_2x2_mmul at aie_kernels/aie2p/mm.cc:83-208. It uses aie::mmul<r,s,t,T_in,T_in,accauto> and unrolls 2x in M and 2x in N, giving 4 accumulator registers (C00, C01, C10, C11 at mm.cc:147-150) that stay resident across the K reduction.
For INT8 the wrapper is matmul_vectorized_8x8x8_i8_i32 at mm.cc:396-410 — r=s=t=8, so each outer 2x2 step lands a 16x16 block of C from two 8-wide A rows and two 8-wide B columns. The K-reduction loop is for i in colA at mm.cc:152-177; each iteration issues 4 mac ops (mm.cc:173-176).
Per-core L1 buffers (from design.py:277-279, default 64x64x64, i8 in / i32 out):
- A:
m*k = 64*64 = 4096 B, ping-pong x2 = 8 KiB - B:
k*n = 64*64 = 4096 B, ping-pong x2 = 8 KiB - C:
m*n*4 = 16384 B(i32), ping-pong x2 = 32 KiB. Total ~48 KiB of 64 KiB.
PR #94 bumps INT8 min tile to (16,8,16) (op.py diff, line ~66) because the 2x r and 2x t unroll requires m % 16 == 0, n % 16 == 0 — static_assert at mm.cc:372-374.
DMA descriptor pattern
One sentence: A broadcasts across columns and distributes 4-ways across rows from 4 shim DMAs; B broadcasts across rows and distributes 8-ways across columns from 8 shim DMAs; C is joined 4-to-1 per column and drained through 8 shim DMAs, all on a 3-level L3 -> L2 -> L1 ObjectFifo pipeline with fifo_depth=2 ping-pong.
L3->L2 fifos (design.py:360,395): one per shim-mem pair, carry a mem_tile_m_A x k = 256x64 slab of A and a k x n = 64x64 slab of B.
L2->L1 split/forward (design.py:376-388, 400-409): .split() on A with dims_to_stream = [(m/r, r*k), (k/s, s), (r, k), (s, 1)] performs the r-by-s tile reordering on the mem-tile DMA — this is the 4D strided descriptor that Peano HW BD regs need to replicate. B uses [(k/s, s*n), (n/t, t), (s, n), (t, 1)] (row-major, design.py:399) or the b_col_maj variant at design.py:397.
C drain (design.py:413-421): mem-tile runs the inverse reorder [(m/r, r*n), (r, t), (n/t, r*t), (t, 1)], joining 4 rows of 16x16 C fragments into one 64x64 tile that streams back to shim.
Runtime-side npu_dma_memcpy_nd descriptors for the full M,K,N walk are built by TensorTiler2D.group_tiler / step_tiler at design.py:517-545 and dispatched in a double-buffered ping-pong at design.py:569-577.
Stream-switch configuration
Each ObjectFifo lowers to a pair of stream-switch circuits:
- shim DMA <-> mem tile: 4 A channels + 8 B channels + 8 C channels =
20 shim->mem circuit-routed streams (design.py:360, 395, 416).
- mem tile <-> compute tile: A uses
.split()-> 4 streams/column-group
(design.py:379); B uses .forward() -> 1 stream/column (design.py:403); C uses .join() -> 4 streams/column (design.py:428).
No tile-to-tile (cascade/LDM) streams; all compute cores pull from mem tiles. This means stream-switch is bipartite shim<->mem<->core only — simpler than a cascade design, which is good news for hand-rolling the MLIR.
RTPs at design.py:339-350 are 2x int32 per core (rtp_K_div_k, rtp_n_tiles_per_core), written by runtime via use_write_rtp=True.
Ternary (int2) adaptation
Weights land in L1 as packed int2 (2 bits/elem, 4 elems/byte). Required on-tile work before the MAC pipeline:
- Decompress int2 -> int8 into a scratch B tile. AIE2P has a 512-bit
vector shuffle; aie::unpack / aie::shift_bytes on a 128-elem int2 word give 128-elem int8 in ~4-8 cycles.
k*n = 64*64 = 4096i8 outputs per B tile -> 32 vector unpacks per K
tile.
- The shuffle unit runs on the VPERM slot; MAC runs on MUL/MAC slot. On
AIE2P these are separate issue slots, so the decompress should overlap with the mac chain at mm.cc:173-176 as long as we double-buffer the decompressed B.
- Cost model: 32 unpacks/tile vs 8 mac/iter x 8 K-iters = 64 macs. Unpack is
~50% of MAC count — fits in the VPERM slot without stealing MAC cycles IF we decompress into a second B' scratch while the previous B' drives the MAC.
- Storage hit: B' scratch is another 4096 B (1 KiB) -> push per-core L1 from
~48 KiB to ~49 KiB, well inside 64 KiB.
Alternative: keep weights as int2 and build a LUT-mpGeMM kernel (vlut.cpp-style). Defer — one-shot unpack-to-int8 reuses IRON's MAC pipe unchanged.
Reimplementation checklist (>=8 items)
- Port
matmul_vectorized_2x2_mmul(mm.cc:83-208) byte-for-byte; it is
the hot loop and Peano's AIE-API headers match.
- Port
zero_vectorized(aie_kernels/aie2p/zero.cc:20-31) using
aie::zeros<int32,16>() for the i32 accumulator tile.
- Replace the ObjectFifo python with hand-written MLIR for 4 A, 8 B,
8 C circuits mirroring design.py:358-437.
- Encode the L2->L1 A reorder 4D BD pattern from
design.py:368-375
((m/r,r*k),(k/s,s),(r,k),(s,1)) directly into shim+mem BD registers.
- Encode the L2->L1 B reorder from
design.py:397-399— both
b_col_maj and row-major variants.
- Encode the C join inverse reorder from
design.py:413-415
((m/r,r*n),(r,t),(n/t,r*t),(t,1)).
- Implement the TensorTiler2D walk from
design.py:517-545as a
runtime BD list; ping-pong tb_max_n_rows=4 per design.py:514.
- Ship an int2->int8 unpack prologue ahead of the MAC loop at
mm.cc:152; keep ping-pong scratch so VPERM overlaps MAC.
- Mirror the kernel flag macros
-Di8_i32_ONLY,-DDIM_M/K/Nfrom
op.py:115-132 (PR #94 diff) as compile-time constants in Peano build.
- Replicate the 2x int32 RTP slot (
design.py:339-350) so runtime can
pass K_div_k and n_tiles_per_core to cores.
Source file map
| IRON path | Reimplementation target |
|---|---|
aie_kernels/aie2p/mm.cc:83-208 | kernels/aie2p/mm_core.cc |
aie_kernels/aie2p/mm.cc:396-410 | kernels/aie2p/mm_i8_i32.inc |
aie_kernels/aie2p/zero.cc:20-31 | kernels/aie2p/zero.inc |
iron/operators/gemm/design.py:358-437 | host/gemm_fifos.mlir |
iron/operators/gemm/design.py:517-761 | host/gemm_runtime.cc |
iron/operators/gemm/op.py:115-150 (PR#94) | host/gemm_build.cmake |
Sources:
- IRON repo @
devel: https://github.com/amd/IRON - Issue #55 (column count, TOPS): https://github.com/amd/IRON/issues/55
- Issue #93 (INT8 request): https://github.com/amd/IRON/issues/93
- PR #94 (INT8 wiring): https://github.com/amd/IRON/pull/94