1bit.systems · v0.1

Local ternary inference on consumer AMD silicon.

1bit.systems is a native C++ and Rust stack for sub-2-bit language, speech, and image models on AMD Strix Halo. No Python at runtime. No discrete GPU. No cloud dependency.

This is the full documentation, the full history, and the full roadmap on one page. Scroll, Ctrl+F, or bookmark a section. No separate blog, no separate benchmark report. If a thing exists about the project, it should be here.

Heads up — the XDNA 2 NPU on Strix Halo is not yet accessible to this project. AMD has not shipped a Linux execution provider for Strix Halo. Everything here runs on the integrated GPU. See NPU status.

What it is

A serving stack built around 1.58-bit ternary weights. Kernels are hand-written HIP targeting gfx1151. Everything above the kernels — OpenAI-compatible HTTP surface, MCP bridge, session state, sampler — is Rust. A 70B-parameter model fits in 128 GB of unified memory at 1.58 bits per parameter, which is the whole reason this is possible on a mini-PC.
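The memory arithmetic behind that claim is easy to verify. A minimal sketch (illustrative Python, not project code; it assumes the 2-bit packed storage described under Ternary GEMV):

```python
# Back-of-envelope memory math for ternary weights. Illustrative only;
# assumes weights are stored packed at 2 bits each (see Ternary GEMV).
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at `bits_per_weight` bits each."""
    return params * bits_per_weight / 8

GB = 1024 ** 3
fp16_gb   = weight_bytes(70e9, 16) / GB  # FP16: overflows 128 GB on its own
packed_gb = weight_bytes(70e9, 2)  / GB  # 2-bit packed: fits with room to spare

print(f"FP16 weights:   {fp16_gb:.1f} GB")
print(f"packed ternary: {packed_gb:.1f} GB")
```

At FP16 the weights alone exceed the 128 GB pool; packed ternary leaves most of it free for activations and KV cache.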

The identity is unified as 1bit — one name for the brand, the codebase, the install path, and the binaries. Earlier references to halo-ai in git history point to the same project under its prior name.

Who it's for

  • People who want to run modern AI on hardware they own, with no cloud dependency.
  • Researchers interested in sub-2-bit inference on consumer silicon.
  • Developers who need a local LLM backend and don't want to ship a Python runtime.
  • Writers, artists, and tinkerers who need privacy by topology — weights and prompts never leave the room.
· · ·

Architecture

Hybrid C++ and Rust stack, layered from kernel to HTTP surface. Clients speak OpenAI-compatible HTTP and don't need to know what's underneath.

┌──────────────────────────────────────────────────────────┐
│  Clients · Open WebUI · MCP · CLI · Helm (planned)       │
└─────────────────────────┬────────────────────────────────┘
                          │  OpenAI-compatible HTTP
┌─────────────────────────▼────────────────────────────────┐
│  1bit-halo-server  (Rust, axum)                  :8180   │
│    Router · Sessions · Sampler · Token streamer          │
└─────────────────────────┬────────────────────────────────┘
                          │  FFI
┌─────────────────────────▼────────────────────────────────┐
│  bitnet_decode  (C++20, HIP)                     :8080   │
│    Ternary GEMV · Split-KV FD attention · RoPE           │
│    RMSNorm · SiLU · KV cache · Tokenizer · Sampler       │
└─────────────────────────┬────────────────────────────────┘
                          │  HIP
┌─────────────────────────▼────────────────────────────────┐
│  Radeon 8060S  ·  gfx1151  ·  40 CU  ·  wave32 WMMA      │
└──────────────────────────────────────────────────────────┘

Every layer is native. Rust handles orchestration (HTTP, sessions, scheduling, streaming). C++20 + HIP handles kernels and model state. No Python in the serving path at any layer.

Constraints

Rule A — no Python at runtime

Hard rule. No Python interpreter in any serving binary. Caller-side tooling can be any language you want; the serving path cannot.

Rule B — C++20 for kernels, Rust for orchestration

Default language for a new component is C++20 if it talks to HIP directly, Rust otherwise. Rust gets the safety and ownership guarantees where correctness matters most; C++ gets HIP intrinsics, wave32 WMMA, and the register-level control that the ternary GEMV depends on.

No runtime hipBLAS

Native Tensile-generated kernels are allowed; runtime hipBLAS is banned because its heuristic collapses on the skinny ternary GEMV shape the models use.

Kernels: overview

Kernel        Role
ternary_gemv  Packed ternary × FP16 GEMV · hot path
attention_fd  Split-KV Flash-Decoding · per-head parallel
rope          Rotary position embedding · HF split-half
rmsnorm       Root-mean-square normalization
silu          Activation · SwiGLU companion
kv_cache      KV-cache append + retrieval

All live in rocm-cpp/src/ and rocm-cpp/kernels/. HIP C++ with wave32 WMMA intrinsics, tuned for gfx1151. A gfx1201 port for RX 9070 XT is in flight.

Ternary GEMV

Weights are stored as packed ternary values {−1, 0, +1} with a per-tensor FP16 scale. The kernel reads 2 bits per weight, multiplies by the FP16 activation, accumulates in FP32, and writes back FP16.

Current: 92% of LPDDR5X peak bandwidth on gfx1151. The kernel is memory-bandwidth-bound, not compute-bound, so reducing bytes read per token is the #1 speedup lever; sub-1-bit formats are the research priority.
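A toy Python model of the packing and decode path (the real kernel is HIP C++; the specific 2-bit code assignment here is an assumption for illustration):

```python
# Illustrative model of the packed-ternary GEMV. The 2-bit code assignment
# is an assumption: 0b00 -> 0, 0b01 -> +1, 0b10 -> -1 (real kernel: HIP C++).
DECODE = {0b00: 0, 0b01: +1, 0b10: -1}

def pack(weights):
    """Pack ternary weights {-1, 0, +1} into bytes, 4 weights per byte."""
    ENCODE = {0: 0b00, +1: 0b01, -1: 0b10}
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= ENCODE[w] << (2 * j)
        out.append(b)
    return bytes(out)

def gemv_row(packed, scale, x):
    """One output element: decode 2-bit codes, accumulate, apply scale."""
    acc = 0.0                       # FP32 accumulator, as in the kernel
    for i, xi in enumerate(x):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        acc += DECODE[code] * xi    # multiply-accumulate vs FP16 activations
    return scale * acc              # per-tensor scale applied once at the end

w = [+1, 0, -1, +1, -1]
x = [0.5, 2.0, 1.0, 1.0, 4.0]
print(round(gemv_row(pack(w), 0.1, x), 6))  # 0.1 * (0.5 - 1.0 + 1.0 - 4.0) = -0.35
```

Ternary multiplies collapse to adds, subtracts, and skips; the cost that remains is fetching the packed bytes, which is why the kernel saturates memory bandwidth rather than ALUs.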

Attention · split-KV Flash-Decoding

Standard Flash-Decoding adapted for gfx1151 wave32. Each attention head splits its KV range across multiple thread-blocks; softmax statistics are combined with a log-sum-exp merge; output is written once per head.
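The log-sum-exp merge can be sketched in a few lines (illustrative Python with hypothetical helper names; the real kernel is HIP C++ and merges per-block register state):

```python
import math

# Each block over a KV slice produces partial softmax stats (m, s, o):
#   m = max logit seen, s = sum of exp(logit - m),
#   o = exp-weighted value sum, normalized by s.
def block_stats(logits, values):
    m = max(logits)
    s = sum(math.exp(l - m) for l in logits)
    o = sum(math.exp(l - m) * v for l, v in zip(logits, values)) / s
    return m, s, o

def merge(a, b):
    """Log-sum-exp merge of two partial softmax accumulators."""
    (ma, sa, oa), (mb, sb, ob) = a, b
    m = max(ma, mb)
    sa_, sb_ = sa * math.exp(ma - m), sb * math.exp(mb - m)  # rescale to shared max
    s = sa_ + sb_
    return m, s, (sa_ * oa + sb_ * ob) / s

logits = [1.0, 3.0, 0.5, 2.0]
values = [10.0, 20.0, 30.0, 40.0]
whole  = block_stats(logits, values)
split  = merge(block_stats(logits[:2], values[:2]),
               block_stats(logits[2:], values[2:]))
assert abs(whole[2] - split[2]) < 1e-12  # merged result matches the single pass
```

Splitting the KV range costs nothing in accuracy: the rescale-to-shared-max step makes the merged output mathematically identical to a single pass over the whole range.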

Landed 2026-04-19 as the default in bitnet_decode. Speedup: 6.78× at context length 2048, bit-exact against the reference path.

RoPE · convention fix

Rotary Position Embedding carried a convention mismatch until 2026-04-19. The implementation used interleaved rotation; Hugging Face canonical models use split-half rotation. Fix was a six-line diff.
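The two conventions differ only in which elements get paired for rotation. A side-by-side sketch (illustrative Python; head_dim=4 and base=10000 are assumed, matching common Hugging Face defaults):

```python
import math

def rope_split_half(x, pos, base=10000.0):
    """HF-canonical convention: rotate x[i] with x[i + d/2]."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + d // 2]
        out[i], out[i + d // 2] = a * c - b * s, a * s + b * c
    return out

def rope_interleaved(x, pos, base=10000.0):
    """Interleaved convention: rotate adjacent pairs (x[2i], x[2i+1])."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

x = [1.0, 2.0, 3.0, 4.0]
print(rope_split_half(x, pos=1))   # differs from the interleaved result
print(rope_interleaved(x, pos=1))  # same math, wrong pairing for HF weights
```

At position 0 both conventions are the identity, which is why the bug only shows up as context-dependent perplexity blowup rather than an immediate failure.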

metric               before  after
PPL · wikitext-103   524     ~12
PPL · repeated text  4.29    1.04

Quant formats

Format                Bits/weight   Status
BitNet 1.58           1.58          Shipped (Halo v2)
TriLM                 1.58          Shipped (experimental)
Sparse-BitNet · 3:4   1.25          Retraining
BitNet v2 · W1.58 A4  1.58 W + 4 A  Watching
LittleBit             0.1           Watching

Features

OpenAI-compatible HTTP

/v1/chat/completions, /v1/models, SSE streaming, bearer auth optional. Any OpenAI SDK works out of the box — point base_url at http://localhost:8180/v1.
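A minimal caller-side sketch, stdlib only (Python is fine here — Rule A bans it only in the serving path). The model id "halo-v2" is a placeholder; list real ids via /v1/models:

```python
import json
from urllib import request

def chat_request(prompt, base_url="http://localhost:8180/v1", stream=False):
    """Build an OpenAI-style chat completion request against the local server.
    "halo-v2" is a placeholder model id, not a name the project ships."""
    body = json.dumps({
        "model": "halo-v2",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,            # SSE token streaming when True
    }).encode()
    return request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("Why ternary weights?")
print(req.full_url)                  # http://localhost:8180/v1/chat/completions
# resp = request.urlopen(req)        # uncomment against a running server
```

Any OpenAI SDK works the same way: point its base URL at :8180/v1 and supply a bearer token only if the Caddy edge enforces one.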

Session-aware KV cache

Conversations keyed by X-1bit-Session header. KV cache is pinned per session, avoiding re-prefill on multi-turn threads.
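The reuse logic can be modeled in a few lines (a toy sketch; the real cache lives in bitnet_decode and holds attention state, not token lists):

```python
# Toy model of session-pinned KV reuse, keyed by the X-1bit-Session value.
# A repeat turn only prefills tokens beyond the cached shared prefix.
class SessionKV:
    def __init__(self):
        self.cache = {}                      # session id -> cached token list

    def tokens_to_prefill(self, session, tokens):
        cached = self.cache.get(session, [])
        n = 0                                # length of the shared prefix
        while n < min(len(cached), len(tokens)) and cached[n] == tokens[n]:
            n += 1
        self.cache[session] = list(tokens)   # pin the new state to the session
        return tokens[n:]                    # only this suffix needs prefill

kv = SessionKV()
turn1 = [1, 2, 3]
turn2 = [1, 2, 3, 4, 5]                      # same thread, two new tokens
print(len(kv.tokens_to_prefill("abc", turn1)))  # 3 — cold start, full prefill
print(len(kv.tokens_to_prefill("abc", turn2)))  # 2 — prefix served from cache
```

On long multi-turn threads this turns an O(context) prefill into an O(new tokens) one, which matters most on a bandwidth-bound part.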

MCP introspection

1bit-halo-mcp exposes model listing, health probes, KV-cache stats, sampler overrides, and kernel timing as Model Context Protocol tools. Attach from Claude Desktop, Claude Code, or any MCP client.

Local by topology

Every byte stays on the machine. No telemetry, no dial-home, no usage analytics. Caller-side clients may hit cloud APIs by choice; the serving path does not.

· · ·

Halo v2 · BitNet 1.58

  • 2B parameters · Microsoft's public BitNet release
  • 1.58-bit weights · FP16 activations
  • Served by bitnet_decode on :8080 and 1bit-halo-server on :8180
context      tok/s
64 tokens    66
1024 tokens  33

Clean burn numbers from 2026-04-18, post-RoPE fix. Memory-bandwidth-bound across the whole context range.

TriLM

3.9B parameters, Apache 2.0, from the SpectraSuite (TriLM_3.9B_Unpacked). LLaMA architecture, ternary-trained from scratch. Used as a smoke-test model and as the NPU export candidate.

Sparse-BitNet

Retrain in progress on an H200 pod. Target: 1.25 effective bits per weight via 3:4 N:M sparsity layered on 1.58-bit weights.

Run 3 bailed at step 500 on a false-positive mask-integrity check (empty mask cache due to disabled monitoring). Run 4 launched 2026-04-22 with the patch — model.enable_mask_monitoring() at init, mask_cache.clear() after each verify, --save-every 100 to avoid losing progress to future bails.
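The ordering of the patch can be sketched as follows (the trainer structure here is hypothetical; only enable_mask_monitoring, the clear-after-verify step, and --save-every 100 come from the run notes above):

```python
# Hypothetical sketch of the Run 4 patch ordering; not the real trainer.
class MaskMonitor:
    def __init__(self):
        self.enabled, self.mask_cache = False, []

    def enable(self):
        self.enabled = True

    def verify(self, step):
        # With monitoring disabled the cache stays empty, and an empty cache
        # used to read as a corrupt mask — the Run 3 false positive.
        if not self.enabled:
            raise RuntimeError(f"step {step}: empty mask cache, bailing")
        self.mask_cache.append(step)
        ok = True                    # the real integrity check goes here
        self.mask_cache.clear()      # patch: clear after each verify
        return ok

monitor = MaskMonitor()
monitor.enable()                     # patch: enable at init, before stepping
SAVE_EVERY = 100                     # patch: --save-every 100
checkpoints = [s for s in range(1, 501)
               if s % SAVE_EVERY == 0 and monitor.verify(s)]
print(checkpoints)                   # [100, 200, 300, 400, 500]
```

With checkpoints every 100 steps, a future bail costs at most ~100 steps of progress instead of the whole run.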

Pre-bail numbers on Run 3 tracked cleanly: loss 11.04 → 5.77 across steps 50 → 500, throughput steady at 49.5k tok/s. Run 4 first checkpoint landed at step 100. 10B-token gate ETA ~57 h.

· · ·

The stack

  • Host OS: CachyOS (Arch-family, rolling) on Btrfs + snapper + limine
  • Kernel: 6.18.22-lts — pinned after an amdgpu OPTC hang on 7.0
  • ROCm: 7.x built from source against gfx1151 (not on ROCm's Tier-1 list)
  • Kernels: C++20 + HIP · wave32 WMMA · zero runtime hipBLAS
  • Orchestration: Rust 2021 · Cargo workspace · axum + tokio
  • Edge: Caddy reverse-proxy for bearer auth and TLS
  • Supervision: systemd (one unit per binary)

The machine

  • CPU: Ryzen AI Max+ 395 · Zen 5 · 16 cores / 32 threads
  • GPU: Radeon 8060S · 40 CU RDNA 3.5 · gfx1151 · wave32 WMMA
  • NPU: XDNA 2 · 50 TOPS claimed · inaccessible — see NPU status
  • Memory: 128 GB LPDDR5X-8000 · unified · ~270 GB/s peak
  • Power: 45–120 W configurable TDP envelope

A secondary RDNA 4 target (RX 9070 XT on gfx1201) lives in the ryzen mesh node. A fat-binary build covering both arches is wired in rocm-cpp; the gfx1201 WMMA intrinsic port is in flight.

Serving

1bit-halo-server is Rust + axum, serving an OpenAI-compatible surface on :8180. It forwards to bitnet_decode over FFI, manages session state, streams tokens, and enforces per-session rate limits.

Port  Service            Purpose
8080  bitnet_decode      Internal · upstream for 1bit-halo-server
8180  1bit-halo-server   OpenAI-compatible HTTP · public-facing
8181  1bit-halo-mcp      MCP server · tool & introspection surface
8190  1bit-halo-whisper  Streaming STT (planned)
8191  1bit-halo-kokoro   TTS (planned)

MCP

1bit-halo-mcp exposes introspection and control as a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Claude Code, Continue, Cursor, custom agents) can attach and call tools for: model listing, health probes, KV-cache stats, active session inspection, sampler overrides, and kernel timing.

22 tests cover the surface. Canonical since 2026-04-19.

NPU status

Gatekept — Strix Halo ships with an XDNA 2 NPU, but the software path is not available to this project today. The project runs on the iGPU until one of the blockers below clears.

No Linux execution provider for STX-H

AMD's Ryzen AI 1.7 Linux stack supports Strix Point (STX) and Krackan (KRK) only. Strix Halo (STX-H) has no Linux execution provider. The Windows stack exposes the NPU through a proprietary VitisAI provider that is Windows-only.

Quant format mismatch

AMD's Ryzen AI model collections ship UINT4-AWQ weights with BFP16 activations. No ternary kernel. No 1.58-bit compile path. MatMulNBits with N=4 is the only shape the AIE control-packet graph compiler accepts today.

Kernel authoring is gated

Writing native AIE kernels requires Riallto (Phoenix-only, Ubuntu 24.04.2 + Docker + paid Xilinx license, zero GEMM kernels shipped). Custom ternary kernels on XDNA 2 would need to be authored from scratch against this toolchain.

Current verdict

Defer. Revisit when AMD ships STX-H on Ryzen AI Linux, or when a community BitNet backend for XDNA 2 lands in public. Nothing on the near- or mid-term roadmap depends on the NPU shipping.

· · ·

Roadmap

  1. Sparse-BitNet retrain completion (H200 pod · 10B-token budget)
  2. BitNet v2 implementation — Hadamard-native W1.58 A4
  3. MedusaBitNet speculative heads — expected 1.4–1.8× at batch = 1
  4. gfx1201 WMMA intrinsic port for RX 9070 XT (second target)
  5. Streaming STT via halo-whisper sentence-boundary partials
  6. Image generation via sd.cpp native-HIP port, SDXL on gfx1151
  7. Video generation port — Wan 2.2 TI2V-5B (5B DiT, Apache 2.0)
  8. Desktop shell — voice-first, plugin API via MCP, package manager
  9. NPU unblock — gated on AMD

Changelog

  • 0.1.6 · 2026-04-22 — Sparse-BitNet Run 4 launched. Kernel rolled back to 6.18.22-lts. Network topology documented. gfx1201 build variant wired.
  • 0.1.5 · 2026-04-21 — TriLM INT4 ONNX export complete. NPU placement confirmed blocked. Six-crash investigation and kernel-7.0 rollback plan.
  • 0.1.4 · 2026-04-20 — Bare-metal-first lock-in. AMD and AMDResearch org scan. XDNA 2 defer verdict.
  • 0.1.3 · 2026-04-19 — RoPE split-half fix (PPL 524 → ~12). Split-KV Flash-Decoding attention (6.78× at L=2048). PPL harness landed.
  • 0.1.2 · 2026-04-18 — Sherry 1.25-bit spike committed. sd.cpp native-HIP port promoted to core.
  • 0.1.0 · 2026-04 — bitnet_decode online. Halo v2 first responding. ROCm 7.x system build against gfx1151.

Contact

Project source lives at github.com/bong-water-water-bong/1bit-systems. Open issues, discussions, and pull requests all welcome.

A Discord server exists for live discussion; the invite link will be published once the server is ready for a wider audience. A Patreon page underwrites compute time (training runs, H200 pod hours, retrains) and will open when the public channels do.

— end of document —