Why 1.58-bit ternary?
One-line answer: 1.58-bit weights (values in {-1, 0, +1}) give a ~10× memory reduction over FP16 with near-zero accuracy loss on the BitNet-b1.58-2B-4T architecture. This lifts the LPDDR5 bandwidth bottleneck that dominates LLM inference on unified-memory systems like Strix Halo.
The bit budget
| format | bits/weight | 2B model size | LPDDR5 bandwidth demand |
|---|---|---|---|
| FP16 | 16 | 4.0 GB | at peak |
| INT8 | 8 | 2.0 GB | 0.5× |
| 1.58-bit (ternary) | 1.58 | 400 MB | ~0.1× |
On a Strix Halo box with 128 GB LPDDR5 shared between CPU and GPU, memory bandwidth, not compute, is the dominant bottleneck for LLM decode. A 10× smaller model = 10× less bandwidth demand per token.
The accuracy story
Microsoft's paper (arXiv:2402.17764) showed that 1.58-bit weights can match FP16 quality on a 3B model if trained from scratch with quantization-aware training, not post-quantized. The 2B-4T model they released confirmed it.
We use their pre-trained weights directly. Converting them to our .h1b format is a deterministic re-pack — we don't retrain.
Where "1.58" comes from
log₂(3) ≈ 1.585. Three values {-1, 0, +1} take 1.585 bits of information per weight in the information-theoretic sense. Storage is usually 2 bits per weight (one bit wasted per value), but Sherry's 3:4 sparsity encoding gets it down to 1.25 bits by forcing one of every four weights to be zero.
Why ternary and not binary?
Binary ({-1, +1}) = 1 bit per weight, 50% more memory savings on paper. In practice:
- Binary loses too much information — BitNet-b1.0 prototypes showed 2-3× higher perplexity vs the 1.58-bit variant at the same model scale.
- Zero is load-bearing — activation sparsity (ReLU+squared-ReLU) produces ~50-60% zero activations. Matching ternary weights with ternary-zero means those matmuls compile down to no-ops.
- Hardware efficiency — ternary maps to
xnor+popcount+ sign-flip on GPU. Binary loses the sign-flip branch.
Sub-1.58-bit formats (LittleBit 0.1 bpw, NanoQuant sub-binary) exist and are watched — see ../../.claude/projects/-home-bcloud/memory/project_bitnet_frontier_2026_04.md. Not production-ready for our scale today.
Why this matters for 1bit systems
On a $3 000 Strix Halo mini-PC with 128 GB unified LPDDR5 at 256 GB/s, ternary BitNet gets us:
- Model fits in 4 GB, leaving 124 GB for KV cache + activations — no OOM on long contexts.
- 83 tok/s decode at N=64, 68 tok/s at N=1024 — competitive with a datacenter GPU ~5× the cost.
- Silent / low-power — 150 W sustained. Whisper-quiet in a closet.
Same hardware running FP16 Llama-3-8B: ~18 tok/s. Ternary BitNet-2B at 83 tok/s is roughly 5× faster for comparable task quality on the benchmarks we care about.
Citations
- Wang, Ma et al. "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits." arXiv
2402.17764. - Wang, Ma et al. "BitNet v2: Native 4-bit Activations with Hadamard Transform for 1-bit LLMs." arXiv
2504.18415. microsoft/bitnet-b1.58-2B-4Ton HuggingFace — the weights we run.