1bit.systems

Why Strix Halo (gfx1151)?

One-line answer: the AMD Ryzen AI MAX+ 395 ships 128 GB of LPDDR5X shared between CPU and iGPU in a $2–3k machine, with 256 GB/s of bandwidth: enough to run a ternary 2B model at 80+ tok/s without touching discrete VRAM or the cloud. No other consumer-class box hits that price × memory × bandwidth point.
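A quick sanity check on the 80+ tok/s claim. When decode is memory-bound, tokens/s is capped by usable bandwidth divided by bytes streamed per token. The 256 GB/s figure is from the text; the ~3 GB-per-token traffic figure below is an illustrative assumption, not a measurement:

```python
# Back-of-envelope decode roofline: with a memory-bound decode loop,
# tokens/s is capped by (usable bandwidth) / (bytes touched per token).

def decode_roofline_toks(bandwidth_gbs: float, bytes_per_token_gb: float) -> float:
    """Upper bound on decode tokens/s when every token must stream
    bytes_per_token_gb (weights + KV read) from memory."""
    return bandwidth_gbs / bytes_per_token_gb

# 256 GB/s peak bandwidth (from the text); assume ~3 GB streamed per
# decode step (weights + KV + runtime overhead) -- an assumption.
bound = decode_roofline_toks(256, 3.0)
print(f"{bound:.0f} tok/s upper bound")  # -> 85 tok/s upper bound
```

Under that assumed per-token traffic, the roofline lands right around the quoted 80+ tok/s, which is what you'd expect if decode runs near peak bandwidth.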

The hardware

Why it's special for ternary inference

Unified memory. The 128 GB of LPDDR5X is addressable by both CPU and iGPU with zero PCIe copies. On a discrete-GPU box, a ternary 2B model's ~400 MB of weights lives in VRAM and the CPU needs a PCIe round trip to touch it. On Strix Halo, the GPU reads straight from the same physical memory the CPU just wrote to. At ternary bitrates we don't need more bandwidth than the LPDDR5X bus provides; we have bandwidth to spare.
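The zero-copy point is easy to quantify. The sketch below compares staging the 400 MB weight blob over PCIe (the discrete-GPU case) against reading it in place from LPDDR5X. The ~32 GB/s PCIe 4.0 x16 throughput is an assumed figure for a typical discrete-GPU box, not from the text:

```python
# Compare touching a 400 MB ternary weight blob via unified memory
# (read in place) vs staging it over PCIe to a discrete GPU.

MODEL_MB = 400    # ternary 2B weights, from the text
PCIE_GBS = 32.0   # assumption: PCIe 4.0 x16 effective throughput
LPDDR_GBS = 256.0 # Strix Halo bandwidth, from the text

pcie_copy_ms = MODEL_MB / 1000 / PCIE_GBS * 1000   # one-time staging cost
local_read_ms = MODEL_MB / 1000 / LPDDR_GBS * 1000 # read in place

print(f"PCIe staging: {pcie_copy_ms:.1f} ms, unified read: {local_read_ms:.1f} ms")
# -> PCIe staging: 12.5 ms, unified read: 1.6 ms
```

The one-time copy is small in absolute terms; the real win is that every subsequent CPU/GPU handoff (sampling, speculative decode, KV management) also skips the bus.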

The numbers we care about

| resource  | available       | ternary 2B usage                   | headroom          |
| --------- | --------------- | ---------------------------------- | ----------------- |
| memory    | 128 GB          | 4 GB model + 500 MB KV @ N=4096    | 123 GB            |
| bandwidth | 256 GB/s        | ~240 GB/s at decode (~94% of peak) | bandwidth-bound   |
| compute   | ~60 TFLOPS FP16 | ~6 TFLOPS used                     | 10× headroom      |
| power     | 150 W           | ~100 W under decode                | low-noise cooling |

The bottleneck is memory bandwidth, not compute. This validates the ternary story: make weights smaller, everything gets faster.
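The bandwidth-bound claim can be checked by comparing the time a decode step spends moving bytes against the time it spends computing. The FLOP count per token below is an assumed round number (~2 FLOPs per parameter for a 2B model), not a measurement from the text:

```python
# Sanity check for "bandwidth-bound, not compute-bound": compare time
# to stream the per-token bytes against time to do the per-token FLOPs.

BYTES_PER_TOKEN = 3.0e9  # assumption: ~3 GB streamed per decode step
FLOPS_PER_TOKEN = 4.0e9  # assumption: ~2 FLOPs/param, 2B params
BW = 256e9               # bytes/s, from the text
COMPUTE = 60e12          # FLOP/s FP16, from the text

t_mem = BYTES_PER_TOKEN / BW        # time spent moving bytes
t_flop = FLOPS_PER_TOKEN / COMPUTE  # time spent computing

print(f"memory: {t_mem*1e3:.2f} ms/token, compute: {t_flop*1e3:.3f} ms/token")
# Memory time dominates by roughly two orders of magnitude.
```

Under these assumptions memory traffic costs ~11.7 ms/token while compute costs ~0.07 ms/token, so shrinking weights (fewer bytes per token) translates almost directly into tokens per second.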

Why not a discrete RTX / Radeon GPU

Why not an Apple M4 Max / Ultra

Apple's memory bandwidth is higher (546 GB/s on the M4 Max) and the memory ceiling is similar (128 GB). The real reasons we picked AMD:

We feature-gate an mlx-apple path in 1bit-mlx so the workspace still compiles and runs on M-series for developers who work cross-platform, but AMD is the performance target.

Why gfx1151 specifically

RDNA 3.5 iGPU. The gfx1151 ISA level gives us:

Software stack on the box

Community

Early days: Strix Halo launched mid-2025, and we're one of a handful of projects publicly running native HIP ternary kernels on gfx1151. The niche is intentional; see project_halo_vision.md for the "silent-closet BYOA inference" thesis.

Citations