Local ternary inference on AMD Strix Halo.
Native C++ and HIP kernels below, Rust orchestration above. Zero Python at runtime. Zero cloud dependency.
A 70-billion-parameter model fits in 128 GB of unified memory at 1.58 bits per weight. That is the whole reason this is possible on a mini-PC.
website · https://1bit.systems/ · source · https://github.com/bong-water-water-bong/1bit-systems
What works today
- Halo v2 — 2B BitNet 1.58 serving at 66 tok/s at 64-token context, 33 tok/s at 1024-token context. Measured on the reference Strix Halo box, no asterisks.
- Ternary GEMV at 92% of LPDDR5 peak bandwidth on gfx1151. Memory-bandwidth-bound — reducing bytes per token is where the next speedup lives.
- Split-KV Flash-Decoding attention — 6.78× at context length 2048, bit-exact vs reference.
- OpenAI-compatible HTTP on :8180. Any OpenAI SDK, Open WebUI, DSPy, or Claude Code MCP client points at it and it just works.
- MCP surface on :8181 for introspection, KV-cache stats, sampler overrides.
- Zero telemetry, zero dial-home, zero cloud. Weights and prompts never leave the machine.
What's in flight
- Sparse-BitNet retrain on an H200 pod — targeting 1.25 effective bits per weight via 3:4 N:M sparsity on top of 1.58-bit. Run 4 live. 10B-token budget, ~57 h wall-clock.
- BitNet v2 implementation (Hadamard-native W1.58 A4) planned next.
- gfx1201 build variant for RX 9070 XT — second hardware target. ROCm 7.2.2 live on the second box. WMMA intrinsic port still in progress.
- Desktop shell — voice-first, plugin API via MCP, package manager.
Glass walls — the honest part
If something broke, it is on the page.
- Sparse-BitNet Run 3 died at step 500 on a three-line trainer bug — mask monitoring was off by default so the integrity check always saw an empty cache and fired a false-positive bail. 524M tokens of H200 time lost. Full autopsy on the site. Run 4 is the patched relaunch.
- RDNA 4 WMMA port is not a flag flip — the intrinsics changed family between gfx11 and gfx12. Real work, tracked in the open.
- amdgpu OPTC CRTC hang on kernel 7.0 for gfx1151 — Wayland freezes hard under concurrent model servers. Rolled back to 6.18.22-lts. The kernel bug is documented, not swept under a rug.
- Bugs, failed runs, kernel regressions all logged in the open. The benchmark numbers are what the serving box actually did, not what was wished for.
What's blocked — XDNA 2 NPU
Gatekept. Not a performance limit. A vendor-access limit.
- AMD has not shipped a Linux execution provider for Strix Halo (STX-H). Ryzen AI 1.7 supports Strix Point and Krackan only.
- Every model in AMD's Ryzen AI Hugging Face collections ships UINT4-AWQ × BFP16. No ternary kernel. No 1.58-bit compile path.
- Native AIE kernel authoring is gated behind Riallto — Phoenix-only, requires a paid Xilinx license.
Verdict: defer. Revisit when AMD ships STX-H on Ryzen AI Linux, or when a community BitNet backend for XDNA 2 lands in public. Nothing on the near- or mid-term roadmap depends on the NPU shipping.
This is also the ship gate. No Reddit, no Hacker News, no press announce until the NPU unblocks or the project's positioning explicitly changes. Discord is for live discussion. GitHub is the record.
How to help
- Questions, hardware reports, benchmark diffs → this channel or GitHub Discussions.
- Concrete bugs → GitHub Issues with kernel version, ROCm version, commit SHA, journal excerpt.
- Code → pull requests. Highest-leverage drop is a HIP GEMM / GEMV kernel for a shape not yet covered. Kernels accepted on merit.
- Funding → a Patreon surface opens when public channels open; it underwrites compute time (training runs, H200 hours, retrains).
The goal is not another chatbot — it's a stack you can run on hardware you own, from a closet, silently, forever.
Local ternary inference on AMD Strix Halo.
1bit.systems is a set of native C++ and Rust components that runs sub-2-bit language, speech, and image models on AMD Strix Halo — without Python at runtime, without a discrete GPU, and without the cloud.
This is the full documentation, the full history, and the full roadmap on one page. Scroll, Ctrl+F, or bookmark a section. No separate blog, no separate benchmark report. If a thing exists about the project, it should be here.
What it is
A serving stack built around 1.58-bit ternary weights. Kernels are hand-written HIP targeting gfx1151. Everything above the kernels — OpenAI-compatible HTTP surface, MCP bridge, session state, sampler — is Rust. A 70B-parameter model fits in 128 GB of unified memory at 1.58 bits per parameter, which is the whole reason this is possible on a mini-PC.
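The arithmetic behind that claim is worth making explicit. A back-of-envelope sketch (ternary weights carry log2 3 ≈ 1.58 bits of information each):

```python
import math

# Ternary weights carry log2(3) ≈ 1.58 bits of information each.
bits_per_weight = math.log2(3)                      # ≈ 1.585

params = 70e9                                       # 70B-parameter model
ternary_gb = params * bits_per_weight / 8 / 1e9     # ≈ 13.9 GB
fp16_gb = params * 16 / 8 / 1e9                     # 140 GB

print(f"ternary: {ternary_gb:.1f} GB, fp16: {fp16_gb:.0f} GB")
# The ternary model fits in 128 GB of unified memory with room to spare;
# the FP16 version would not fit at all.
```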
The identity is unified as 1bit — one name for the brand, the codebase, the install path, and the binaries. Earlier references to halo-ai in git history point to the same project under its prior name.
Who it's for
- People who want to run modern AI on hardware they own, with no cloud dependency.
- Researchers interested in sub-2-bit inference on consumer silicon.
- Developers who need a local LLM backend and don't want to ship a Python runtime.
- Writers, artists, and tinkerers who need privacy by topology — weights and prompts never leave the room.
Architecture
Hybrid C++ and Rust stack, layered from kernel to HTTP surface. Clients speak OpenAI-compatible HTTP and don't need to know what's underneath.
┌──────────────────────────────────────────────────────────┐
│ Clients · Open WebUI · MCP · CLI · Helm (planned) │
└─────────────────────────┬────────────────────────────────┘
│ OpenAI-compatible HTTP
┌─────────────────────────▼────────────────────────────────┐
│ 1bit-halo-server (Rust, axum) :8180 │
│ Router · Sessions · Sampler · Token streamer │
└─────────────────────────┬────────────────────────────────┘
│ FFI
┌─────────────────────────▼────────────────────────────────┐
│ bitnet_decode (C++20, HIP) :8080 │
│ Ternary GEMV · Split-KV FD attention · RoPE │
│ RMSNorm · SiLU · KV cache · Tokenizer · Sampler │
└─────────────────────────┬────────────────────────────────┘
│ HIP
┌─────────────────────────▼────────────────────────────────┐
│ Radeon 8060S · gfx1151 · 40 CU · wave32 WMMA │
└──────────────────────────────────────────────────────────┘
Every layer is native. Rust handles orchestration (HTTP, sessions, scheduling, streaming). C++20 + HIP handles kernels and model state. No Python in the serving path at any layer.
Deep architecture
Condensed walkthrough. The full treatment lives in the wiki at Architecture-Deep — request life-cycle byte accounting, per-kernel provenance, FFI signatures, the agent registry, training pipeline math, mesh topology, and the failure surface.
Ports + surfaces
| port | binding | service | surface |
|---|---|---|---|
| 443 | public | caddy | TLS, bearer check, /v1 / /v2 split |
| 8080 | 127.0.0.1 | bitnet_decode (C++) | /v1/* (gen-1) |
| 8180 | 127.0.0.1 | 1bit-halo-server (Rust) | /v2/* (gen-2) |
| 8081 | 127.0.0.1 | sd.cpp | SDXL image-gen sidecar |
| 8181 | 127.0.0.1 | halo-whisper | STT |
| 8182 | 127.0.0.1 | halo-kokoro | TTS |
| 8190 | 127.0.0.1 | 1bit-landing | landing + wiki proxy |
| 8200 | 127.0.0.1 | 1bit-lemonade | /v1/models gateway |
| stdio | — | halo-mcp | JSON-RPC 2.0 tools |
Request life-cycle
A chat-completion walked through every layer. Client POSTs to https://halo.<host>/v2/chat/completions with a bearer token.
- TLS terminates at Caddy. Constant-time bearer compare against /etc/caddy/Caddyfile. Under 1 ms after the initial TCP+TLS setup.
- Caddy reverse_proxy localhost:8180. Plain HTTP/1.1 over loopback.
- axum deserializes Json<ChatCompletionRequest>. Shape errors return 400 before any inference work. Metrics histogram starts after the 400 check.
- 1bit-router dispatches to Backend::Hip (the only compiled backend in production). The Backend::Cpu variant exists but returns BackendError::CpuLaneStub.
- Mutex around the shared KV cache locks. pos resets to 0 per request. KV bytes per token on halo v2: 2 × 8 × 256 × 2 = 8 KiB per layer, 240 KiB total across 30 layers.
- Tokenizer (1bit-core::htok) encodes. Llama-3 <|eot_id|> (128009) is recognised as a single ID — the fix that took burn-in parity from 18% to 96.67%.
- Prefill: one forward pass per prompt token. Rust calls 1bit-hip::ternary_gemv_halo_f16, which crosses extern "C" into rocm-cpp. Per-call overhead is measured-TBD; no Rust allocations per call.
- Decode: sample (argmax on host at temperature ≤ 0, the sampler kernel above that), append, write next K/V slot, check stop tokens on token IDs before detokenization.
- Stream: if req.stream, wrap in an accounting iterator that tallies tokens per SSE frame. Else return a single JSON body.
- Caddy forwards bytes unchanged. Wall-clock: ~200 ms for a 10-token prompt + 10-token reply.
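The per-token KV figures in the walkthrough can be reproduced from the cache shape. A quick sketch using the numbers from the byte-accounting line (2 KV tensors × 8 KV heads × head_dim 256 × 2 bytes FP16, 30 layers):

```python
num_kv_heads = 8
head_dim = 256
fp16_bytes = 2
kv_tensors = 2              # one K slot + one V slot per token

per_layer = kv_tensors * num_kv_heads * head_dim * fp16_bytes
num_layers = 30
per_token = per_layer * num_layers

print(per_layer, per_token // 1024)   # 8192 bytes/layer, 240 KiB/token
```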
Kernels
- ternary_gemv_halo_f16 — Packed ternary weights (2 bits per weight, uint8[M, (K+3)/4]), FP32 row scale, INT8 activations, FP16 output. Uses v_dot4_i32_i8 and WMMA on gfx1151. 92% of LPDDR5 peak at decode @ N=64. Bandwidth-bound; compute utilization under 10%. Bytes-read reduction (Sherry 1.25-bit, TQ1 base-3) is the #1 speedup lever.
- kv_cache_attn_decode_fd — Split-KV Flash-Decoding. Per-head parallelism across thread-blocks; each head splits its KV range into B chunks, reduces to (m, l, o) triples, combines via log-sum-exp. Landed 2026-04-19, default in both servers. 6.78× speedup at L=2048, bit-exact against the reference path.
- rope_fp16 — Rotary position embedding, HF split-half convention. The pre-fix interleaved convention gave wikitext-103 PPL 524; post-fix ~12. Repeated-text PPL 4.29 → 1.04. Six-line diff.
- rmsnorm_fp16, silu_glu_fp16 — RMSNorm + SwiGLU fused into the FFN path. The relu2_glu_* variants exist for the activation-sparsity experiments (Phase 1 measured 79.91% sparsity; upper-bound speedup 10-15%, deferred behind Sherry).
- KV cache — [num_layers][2][num_kv_heads][max_seq_len][head_dim] FP16. Append-only ring; one buffer per in-flight request; 128-byte hipMalloc alignment.
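For readers who want the semantics rather than the HIP, here is an unfused NumPy reference for the RMSNorm + SwiGLU path. This is a sketch of the math only, not the fused kernel; shapes and eps are illustrative:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-5):
    # Root-mean-square norm: scale by 1/sqrt(mean(x^2) + eps), then apply gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def silu(x):
    # SiLU (swish): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU FFN: silu(x @ Wg) elementwise-gates (x @ Wu), then projects down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8)).astype(np.float32)
y = swiglu(x, rng.standard_normal((8, 16)), rng.standard_normal((8, 16)),
           rng.standard_normal((16, 8)))
print(y.shape)   # (1, 8)
```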
Memory model
Unified LPDDR5X, 128 GB total. No PCIe copy — the same DDR bank the CPU just wrote to is the memory the iGPU reads via hipMalloc'd virtual addresses. For halo v2:
| region | size | lifetime |
|---|---|---|
| weights (.h1b mmap) | ~1.1 GiB | process; shared across sessions |
| KV cache @ 4096 ctx | ~960 MiB | per session; pinned |
| activations (scratch) | ~100 MiB | per forward; reused across layers |
| HIP runtime + ROCm | ~1 GiB | process |
| OS + everything else | ~4 GiB | system |
| subtotal | ~7 GiB | out of 128 GiB |
Model formats
.h1b v2: 4-byte magic H1B\0, int32 version, 9 int32 config (hidden_size, intermediate_size, num_layers, num_heads, num_kv_heads, vocab_size, max_seq_len, tie_embeddings, reserved), 2 float32 extras (rope_theta, rms_norm_eps), then per-layer tensors. reserved is a flag word: 0x1 Hadamard-rotated (BitNet v2), 0x2 Sherry FP16, 0x4 Bonsai Q1, 0x8 Bonsai TQ2.
.htok is the tokenizer side-file — Llama-3 128k BPE + special tokens, mmap-parsed at startup.
Why not GGUF: the GGUF loader walked ~5× slower against a cold page cache (per-tensor metadata we don't need), and the halo layout is O(num_layers) offset math. Conversion tool tools/gguf-to-h1b runs one-shot at dev time, not in a serving path.
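The .h1b v2 header layout above can be sketched as a parser. This is an illustrative reconstruction from the spec text only (assuming little-endian, contiguous fields); the config values in the round-trip are made up:

```python
import struct

# Field names taken from the spec text: 4-byte magic, int32 version,
# 9 × int32 config, 2 × float32 extras.
CONFIG = ("hidden_size", "intermediate_size", "num_layers", "num_heads",
          "num_kv_heads", "vocab_size", "max_seq_len", "tie_embeddings",
          "reserved")

def parse_h1b_header(buf: bytes) -> dict:
    magic, version = struct.unpack_from("<4si", buf, 0)
    assert magic == b"H1B\0", "not an .h1b file"
    cfg = struct.unpack_from("<9i", buf, 8)
    rope_theta, rms_norm_eps = struct.unpack_from("<2f", buf, 8 + 36)
    hdr = dict(zip(CONFIG, cfg))
    hdr.update(version=version, rope_theta=rope_theta, rms_norm_eps=rms_norm_eps)
    return hdr

# Round-trip a synthetic header (values are illustrative only).
blob = struct.pack("<4si9i2f", b"H1B\0", 2,
                   2560, 6912, 30, 20, 8, 128256, 4096, 1, 0x1,
                   500000.0, 1e-5)
hdr = parse_h1b_header(blob)
print(hdr["num_layers"], hex(hdr["reserved"]))   # 30 0x1
```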
Agents
17 specialists in 1bit-agents::Name. One registry, surfaced both on the agents bus and on the MCP tool list.
| specialist | role | dispatched by |
|---|---|---|
| Anvil | kernel rebuild + bench on rocm-cpp commits | anvil.timer |
| Carpenter | file scaffolding | Planner |
| Cartograph | cross-repo changelog + topology snapshot | Librarian |
| EchoEar | STT ingress (halo-whisper) | halo-voice |
| EchoMouth | TTS egress (halo-kokoro) | EchoEar / Herald |
| Forge | PR drafts + commit messages | Planner |
| Gateway | inbound classification + routing policy | watchers |
| Herald | comms / Q&A / chat replies | Gateway |
| Librarian | CHANGELOG + wiki upkeep | librarian.timer |
| Magistrate | PR review + CC lint + secret scan | gh-trio.timer |
| Muse | long-form prose | operator |
| Planner | multi-step task decomposition | operator |
| Quartermaster | issue triage | gh-trio.timer |
| Scribe | doc edits | Librarian / ops |
| Sentinel | incident watchdog | continuous |
| Sommelier | backend / model recommendation | Planner |
| Warden | secret + credential drift | ops |
Discord + GitHub pipelines
Discord: halo listens, echo posts. halo requires the privileged MESSAGE_CONTENT gateway intent plus GUILDS and GUILD_MESSAGES. Mentions are classified (BugReport → Sentinel, FeatureRequest → Magistrate, Question → Herald, Chat → Herald). Bug reports auto-create a thread on the original message.
GitHub: 1bit-watch-github polls every DEFAULT_POLL_SECONDS = 300. Read-only fine-grained PAT. Any PR → Magistrate. Label bug or title containing error / crash / fail → Sentinel. Label enhancement / feature → Planner. Label documentation → Scribe. Fallback → Sentinel. Lookback window is poll_seconds × 2.
MCP
1bit-mcp speaks JSON-RPC 2.0 over stdio, one object per \n-delimited line (Claude Code convention). Protocol version 2024-11-05. tools/list is derived from Name::ALL; tools/call dispatches through Arc<Registry>. 22 in-crate tests cover the wire framing, registry routing, and the five JSON-RPC error codes (-32600 invalid request, -32601 method not found, -32602 invalid params, -32603 internal, -32001 unknown tool).
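The newline-delimited framing and the error-code behavior can be sketched in a few lines. This is an illustrative toy, not the 1bit-mcp implementation; the health_probe tool name is hypothetical:

```python
import json

ERRORS = {-32600: "invalid request", -32601: "method not found",
          -32602: "invalid params", -32603: "internal",
          -32001: "unknown tool"}   # the five codes listed in the text

TOOLS = {"health_probe": lambda p: {"ok": True}}   # hypothetical tool

def handle_line(line: str) -> str:
    """One JSON-RPC 2.0 object per newline-delimited line."""
    try:
        req = json.loads(line)
    except json.JSONDecodeError:
        return json.dumps({"jsonrpc": "2.0", "id": None,
                           "error": {"code": -32600, "message": ERRORS[-32600]}})
    rid = req.get("id")
    if req.get("method") == "tools/list":
        result = {"tools": sorted(TOOLS)}
    elif req.get("method") == "tools/call":
        name = req.get("params", {}).get("name")
        if name not in TOOLS:
            return json.dumps({"jsonrpc": "2.0", "id": rid,
                               "error": {"code": -32001, "message": ERRORS[-32001]}})
        result = TOOLS[name](req["params"])
    else:
        return json.dumps({"jsonrpc": "2.0", "id": rid,
                           "error": {"code": -32601, "message": ERRORS[-32601]}})
    return json.dumps({"jsonrpc": "2.0", "id": rid, "result": result})

print(handle_line('{"jsonrpc":"2.0","id":1,"method":"tools/list"}'))
```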
Training pipeline
Retrains run on a persistent RunPod H200 pod. TRL + HuggingFace streaming loader. Step cadence: batch_size=16, seq_len=2048, grad_accum=32, log_every=10, save_every=100, verify_nm_mask_every=500. Per-step tokens 16 × 2048 × 32 ≈ 1.05M. 10 B-token budget ≈ 9600 steps. Measured throughput 49.5k tok/s on H200 → full run ≈ 56 h wall-clock. Artifacts flow: pod → pi archive → requantizer → .h1b → strixhalo → shadow-burnin → cutover.
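The step count and wall-clock estimate follow directly from the cadence:

```python
batch_size, seq_len, grad_accum = 16, 2048, 32
tokens_per_step = batch_size * seq_len * grad_accum   # tokens consumed per optimizer step

budget = 10e9                # 10B-token training budget
steps = budget / tokens_per_step

throughput = 49.5e3          # measured tok/s on the H200
hours = budget / throughput / 3600

print(tokens_per_step, round(steps), round(hours, 1))
# 1048576 tokens/step → ~9537 steps, ~56.1 h wall-clock
```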
Shadow-burnin
Continuous /v1 vs /v2 argmax comparison every 30 s.
state file : ~/.local/share/1bit-halo/shadow-burnin.state
log (JSONL): ~/claude output/shadow-burnin.jsonl
cutover : ≥ 96% bit-exact argmax over a 72 h rolling window
current : 96.67% (post-special-token fix)
Cutover gates: (1) PPL parity on wikitext within ±0.05 of the gen-1 baseline 9.1607 — gen-2 currently 9.1805, delta +0.02, PASS. (2) Shadow-burnin ≥96% for 72 h continuous under real traffic, no restarts, no memory growth.
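Both gates reduce to a small predicate. A sketch of the cutover check as described:

```python
def cutover_ready(gen1_ppl, gen2_ppl, burnin_pct, burnin_hours):
    # Gate 1: PPL parity within ±0.05 of the gen-1 baseline.
    ppl_ok = abs(gen2_ppl - gen1_ppl) <= 0.05
    # Gate 2: ≥96% bit-exact argmax over a 72 h continuous window.
    burnin_ok = burnin_pct >= 96.0 and burnin_hours >= 72
    return ppl_ok and burnin_ok

# Current numbers from the text: baseline 9.1607, gen-2 9.1805 (delta +0.0198).
print(cutover_ready(9.1607, 9.1805, 96.67, 72))   # True
```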
Mesh
Private Headscale tailnet, 100.64.0.0/10 CGNAT-reserved range (RFC 6598). Coordinator on strixhalo, fronted by Caddy at :443, upstream loopback 127.0.0.1:8380. No Tailscale DERP, no third-party control plane. STUN on :3478; every pair is direct LAN.
| node | mesh IP | role |
|---|---|---|
| strixhalo | 100.64.0.1 | gfx1151, Caddy, Headscale, primary inference |
| sliger | 100.64.0.2 | NVIDIA 1080 Ti, failover candidate |
| ryzen | 100.64.0.3 | RX 9070 XT (gfx1201), second kernel target |
| pi | 100.64.0.4 | ZFS 3.6 TB, canonical archive, nightly rsync |
Deployment
All long-lived processes are user-scope systemd units installed from strixhalo/systemd/. Caddy fronts TLS at the system level. 1bit install <component> reads packages.toml — [components.<name>] with unit, binary, source, deps — and runs systemctl stop → cargo install --path → copy → systemctl start. Idempotent.
strix-server.service — 1bit-halo-server :8180 (gen-2 Rust)
1bit-halo-bitnet.service — bitnet_decode :8080 (gen-1 C++)
strix-lemonade.service — 1bit-lemonade :8200
strix-landing.service — 1bit-landing :8190
strix-echo.service — echo (Discord poster)
strix-watch-discord.service — halo (Discord listener)
strix-watch-github.timer — 300 s poll
1bit-halo-whisper.service — STT :8181
1bit-halo-kokoro.service — TTS :8182
1bit-halo-sd.service — sd.cpp :8081
1bit-halo-anvil.timer — kernel rebuild on rocm-cpp commits
1bit-halo-memory-sync.timer — GH push every 15 m
strix-burnin.service — shadow-burnin harness
Supply chain
Trusted dependencies: TheRock (ROCm), serenity-rs (Discord), octocrab (GitHub), axum + tower + hyper + reqwest, serde, nlohmann/json, cpp-httplib. Not used: hipBLAS at runtime (Rule C — banned), torch / any Python-serving library (Rule A), in-proc Python Open WebUI (caller-side only, sunsets on Helm v0.3).
Failure surface
Every service uses Restart=on-failure with a 5-10 s RestartSec. Blast radius is per-service — strix-server crashing returns 502 via Caddy but leaves the rest of the stack intact. Kernel-level issues (the OPTC CRTC hang signature REG_WAIT timeout 1us * 100000 tries - optc35_disable_crtc) are mitigated by halo-gpu-perf.service pinning SCLK high; persistent issues roll back via Btrfs + snapper to snapshot #6. Memory-sync failures usually mean an expired GH PAT and a rotation fixes them.
Full write-up including byte-accounting for a 32-token reply, FFI cheat sheet, who-calls-what graph, and the complete environment-variable surface: Architecture-Deep.
Constraints
Rule A — no Python at runtime
Hard rule. No Python interpreter in any serving binary. Caller-side tooling is any language you want; the serving path is not.
Rule B — C++20 for kernels, Rust for orchestration
Default language for a new component is C++20 if it talks to HIP directly, Rust otherwise. Rust gets the safety and ownership guarantees where correctness matters most; C++ gets HIP intrinsics, wave32 WMMA, and the register-level control that the ternary GEMV depends on.
No runtime hipBLAS
Native Tensile-generated kernels are allowed; runtime hipBLAS is banned because its heuristic collapses on the skinny ternary GEMV shape the models use.
Kernels: overview
| Kernel | Role |
|---|---|
| ternary_gemv | Packed ternary × FP16 GEMV · hot path |
| attention_fd | Split-KV Flash-Decoding · per-head parallel |
| rope | Rotary position embedding · HF split-half |
| rmsnorm | Root-mean-square normalization |
| silu | Activation · SwiGLU companion |
| kv_cache | KV-cache append + retrieval |
All live in rocm-cpp/src/ and rocm-cpp/kernels/. HIP C++ with wave32 WMMA intrinsics, tuned for gfx1151. A gfx1201 port for RX 9070 XT is in flight.
Ternary GEMV
Weights are stored as packed ternary values {−1, 0, +1} with a per-tensor FP16 scale. The kernel reads 2 bits per weight, multiplies by the FP16 activation, accumulates in FP32, and writes back FP16.
Current: 92% of LPDDR5 peak bandwidth on gfx1151. Memory-bandwidth-bound, not compute-bound. Reducing bytes-read per token is the #1 speedup lever; sub-1-bit formats are the research priority.
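A reference for the 2-bits-per-weight packing, four weights per byte. This is a sketch; the kernel's actual bit order and memory layout may differ:

```python
import numpy as np

def pack_ternary(w):
    """Pack {-1, 0, +1} weights at 2 bits each, 4 per byte (LSB-first here)."""
    codes = (w.astype(np.int8) + 1).astype(np.uint8)   # {-1,0,1} -> {0,1,2}
    pad = (-len(codes)) % 4
    codes = np.concatenate([codes, np.zeros(pad, np.uint8)])
    c = codes.reshape(-1, 4)
    return c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)

def unpack_ternary(packed, k):
    """Inverse: recover the first k ternary weights from the packed bytes."""
    b = packed[:, None] >> np.array([0, 2, 4, 6])[None, :]
    return (b & 0b11).reshape(-1)[:k].astype(np.int8) - 1

w = np.array([-1, 0, 1, 1, 0, -1, 1], dtype=np.int8)
assert np.array_equal(unpack_ternary(pack_ternary(w), len(w)), w)
print(pack_ternary(w))
```

The round-trip assert is the point: the packed form is lossless, and the kernel only ever has to read a quarter of a byte per weight plus the row scale.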
Attention · split-KV Flash-Decoding
Standard Flash-Decoding adapted for gfx1151 wave32. Each attention head splits its KV range across multiple thread-blocks; softmax statistics are combined with a log-sum-exp merge; output is written once per head.
Landed 2026-04-19 as the default in bitnet_decode. Speedup: 6.78× at context length 2048, bit-exact against the reference path.
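The log-sum-exp merge is what makes the split numerically safe. A NumPy sketch showing that per-chunk (m, l, o) partials recombine to the exact full-softmax result:

```python
import numpy as np

def merge_partials(parts):
    """Combine per-chunk (m, l, o) softmax partials via log-sum-exp:
    m = chunk max logit, l = sum of exp(logit - m), o = chunk-local output."""
    m = max(p[0] for p in parts)
    l = sum(p[1] * np.exp(p[0] - m) for p in parts)
    o = sum(p[2] * p[1] * np.exp(p[0] - m) for p in parts) / l
    return o

# Reference: full softmax-weighted sum over all positions at once.
rng = np.random.default_rng(1)
logits = rng.standard_normal(64)
values = rng.standard_normal(64)
w = np.exp(logits - logits.max())
ref = (w * values).sum() / w.sum()

# Split-KV: each chunk computes its own (m, l, o); merged at the end.
parts = []
for chunk in np.split(np.arange(64), 4):
    lm = logits[chunk].max()
    le = np.exp(logits[chunk] - lm)
    parts.append((lm, le.sum(), (le * values[chunk]).sum() / le.sum()))
print(np.isclose(merge_partials(parts), ref))   # True
```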
RoPE · convention fix
Rotary Position Embedding carried a convention mismatch until 2026-04-19. The implementation used interleaved rotation; Hugging Face canonical models use split-half rotation. Fix was a six-line diff.
| metric | before | after |
|---|---|---|
| PPL · wikitext-103 | 524 | ~12 |
| PPL · repeated text | 4.29 | 1.04 |
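The two conventions apply the same rotations to different element pairings, which is why outputs diverge as soon as position > 0. A minimal NumPy sketch of both:

```python
import numpy as np

def rope_split_half(x, pos, theta=10000.0):
    """HF split-half convention: element i pairs with i + d/2."""
    d = x.shape[-1]
    ang = pos * theta ** (-np.arange(d // 2) / (d // 2))
    x1, x2 = x[..., : d // 2], x[..., d // 2:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

def rope_interleaved(x, pos, theta=10000.0):
    """Interleaved convention: element 2i pairs with 2i+1."""
    d = x.shape[-1]
    ang = pos * theta ** (-np.arange(d // 2) / (d // 2))
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

x = np.arange(8, dtype=np.float64)
# Same rotation math, different pairing: outputs diverge for pos > 0,
# which is exactly the mismatch the six-line fix addressed.
print(np.allclose(rope_split_half(x, 3), rope_interleaved(x, 3)))   # False
```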
Quant formats
| Format | Bits/weight | Status |
|---|---|---|
| BitNet 1.58 | 1.58 | Shipped (Halo v2) |
| TriLM | 1.58 | Shipped (experimental) |
| Sparse-BitNet · 3:4 | 1.25 | Retraining |
| BitNet v2 · W1.58 A4 | 1.58 W + 4 A | Watching |
| LittleBit | 0.1 | Watching |
Features
OpenAI-compatible HTTP
/v1/chat/completions, /v1/models, SSE streaming, bearer auth optional. Any OpenAI SDK works out of the box — point base_url at http://localhost:8180/v1.
Session-aware KV cache
Conversations keyed by X-1bit-Session header. KV cache is pinned per session, avoiding re-prefill on multi-turn threads.
MCP introspection
1bit-halo-mcp exposes model listing, health probes, KV-cache stats, sampler overrides, and kernel timing as Model Context Protocol tools. Attach from Claude Desktop, Claude Code, or any MCP client.
Local by topology
Every byte stays on the machine. No telemetry, no dial-home, no usage analytics. Caller-side clients may hit cloud APIs by choice; the serving path does not.
Halo v2 · BitNet 1.58
- 2B parameters · Microsoft's public BitNet release
- 1.58-bit weights · FP16 activations
- Served by bitnet_decode on :8080 and 1bit-halo-server on :8180
| context | tok/s |
|---|---|
| 64 tokens | 66 |
| 1024 tokens | 33 |
Clean burn numbers from 2026-04-18, post-RoPE fix. Memory-bandwidth-bound across the whole context range.
TriLM
3.9B parameters, Apache 2.0, from the SpectraSuite (TriLM_3.9B_Unpacked). LLaMA architecture, ternary-trained from scratch. Used as a smoke-test model and as the NPU export candidate.
Sparse-BitNet
Retrain in progress on an H200 pod. Target: 1.25 effective bits per weight via 3:4 N:M sparsity layered on 1.58-bit weights.
Run 3 bailed at step 500 on a false-positive mask-integrity check (empty mask cache due to disabled monitoring). Run 4 launched 2026-04-22 with the patch — model.enable_mask_monitoring() at init, mask_cache.clear() after each verify, --save-every 100 to avoid losing progress to future bails.
Pre-bail numbers on Run 3 tracked cleanly: loss 11.04 → 5.77 across steps 50 → 500, throughput steady at 49.5k tok/s. Run 4 first checkpoint landed at step 100. 10B-token gate ETA ~57 h.
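The structural constraint the integrity check enforces can be sketched in a few lines. The checker below is hypothetical, not the trainer's, and the effective-bits line assumes one plausible encoding (a 2-bit zero-position index plus 3 sign bits per group of 4, which yields exactly 1.25 bits/weight):

```python
import numpy as np

def nm_mask_ok(w, n=3, m=4):
    """Check N:M structure: every group of m consecutive weights has
    at most n nonzero entries. (Hypothetical checker, for illustration.)"""
    groups = w.reshape(-1, m)
    return bool((np.count_nonzero(groups, axis=1) <= n).all())

# Effective-bits arithmetic under the assumed encoding: per group of 4,
# a 2-bit zero-position index plus 3 one-bit signs = 5 bits / 4 weights.
print((2 + 3 * 1) / 4)   # 1.25

w = np.array([1, -1, 0, 1,  0, 1, 1, -1], dtype=np.int8)
print(nm_mask_ok(w))                       # True: each group of 4 has a zero
print(nm_mask_ok(np.ones(8, np.int8)))     # False: no group has a zero
```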
The stack
- Host OS: CachyOS (Arch-family, rolling) on Btrfs + snapper + limine
- Kernel: 6.18.22-lts — pinned after an amdgpu OPTC hang on 7.0
- ROCm: 7.x built from source against gfx1151 (not on ROCm's Tier-1 list)
- Kernels: C++20 + HIP · wave32 WMMA · zero runtime hipBLAS
- Orchestration: Rust 2021 · Cargo workspace · axum + tokio
- Edge: Caddy reverse-proxy for bearer auth and TLS
- Supervision: systemd (one unit per binary)
The machine
- CPU: Ryzen AI Max+ 395 · Zen 5 · 16 cores / 32 threads
- GPU: Radeon 8060S · 40 CU RDNA 3.5 · gfx1151 · wave32 WMMA
- NPU: XDNA 2 · 50 TOPS claimed · inaccessible — see NPU status
- Memory: 128 GB LPDDR5X-8000 · unified · ~270 GB/s peak
- Power: 45–120 W configurable TDP envelope
A secondary RDNA 4 target (RX 9070 XT on gfx1201) lives in the ryzen mesh node. A fat-binary build covering both arches is wired in rocm-cpp; the gfx1201 WMMA intrinsic port is in flight.
Serving
1bit-halo-server is Rust + axum, serving an OpenAI-compatible surface on :8180. It forwards to bitnet_decode over FFI, manages session state, streams tokens, and enforces per-session rate limits.
| Port | Service | Purpose |
|---|---|---|
| 8080 | bitnet_decode | Internal · upstream for 1bit-halo-server |
| 8180 | 1bit-halo-server | OpenAI-compatible HTTP · public-facing |
| 8181 | 1bit-halo-mcp | MCP server · tool & introspection surface |
| 8190 | 1bit-halo-whisper | Streaming STT (planned) |
| 8191 | 1bit-halo-kokoro | TTS (planned) |
MCP
1bit-halo-mcp exposes introspection and control as a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Claude Code, Continue, Cursor, custom agents) can attach and call tools for: model listing, health probes, KV-cache stats, active session inspection, sampler overrides, and kernel timing.
22 tests cover the surface. Canonical since 2026-04-19.
NPU status
No Linux execution provider for STX-H
AMD's Ryzen AI 1.7 Linux stack supports Strix Point (STX) and Krackan (KRK) only. Strix Halo (STX-H) has no Linux execution provider. The Windows stack exposes the NPU through a proprietary VitisAI provider that is Windows-only.
Quant format mismatch
AMD's Ryzen AI model collections ship UINT4-AWQ weights with BFP16 activations. No ternary kernel. No 1.58-bit compile path. MatMulNBits with N=4 is the only shape the AIE control-packet graph compiler accepts today.
Kernel authoring is gated
Writing native AIE kernels requires Riallto (Phoenix-only, Ubuntu 24.04.2 + Docker + paid Xilinx license, zero GEMM kernels shipped). Custom ternary kernels on XDNA 2 would need to be authored from scratch against this toolchain.
Current verdict
Defer. Revisit when AMD ships STX-H on Ryzen AI Linux, or when a community BitNet backend for XDNA 2 lands in public. Nothing on the near- or mid-term roadmap depends on the NPU shipping.
Requirements
The stack targets Strix Halo specifically. Other gfx1100-family hardware may work with minor tweaks; only Strix Halo is tested.
Hardware — minimum
- AMD Ryzen AI Max+ 395 (or equivalent Strix Halo SKU)
- Radeon 8060S iGPU · gfx1151 · wave32 WMMA
- 64 GB unified LPDDR5X minimum · 128 GB recommended for 13B+ ternary
- 100 GB free disk for models plus build artifacts
Software — minimum
- Linux kernel 6.18.22-lts (newer kernels carry the amdgpu OPTC hang — see troubleshooting)
- ROCm 7.x — built from source against gfx1151 (not on ROCm's Tier-1 list)
- LLVM / clang 18+
- CMake 3.27+
- Rust 1.82+ (stable channel)
- Node.js or Bun only for caller-side clients. Nothing on the serving path — Rule A.
Recommended host
CachyOS with Btrfs + snapper + limine is the reference setup. Rollback-via-snapper has saved the project more than once. Fish shell is assumed in examples but not required.
Install
No binary distribution yet. Build from source. Packaging (AppImage + Flatpak) is on the near-term roadmap; 1bit-halo-pkg model package manager is long-term.
Build ROCm against gfx1151
System-package ROCm drops gfx1151 from Tier-1 in most distros. Build from source, or use the llamacpp-rocm fork's install script as a bootstrap.
git clone https://github.com/bong-water-water-bong/llamacpp-rocm ~/repos/llamacpp-rocm
cd ~/repos/llamacpp-rocm
./scripts/install-rocm.sh --target gfx1151
Build rocm-cpp kernel library
git clone https://github.com/bong-water-water-bong/rocm-cpp ~/repos/rocm-cpp
cd ~/repos/rocm-cpp
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DAMDGPU_TARGETS=gfx1151 \
-DCMAKE_HIP_ARCHITECTURES=gfx1151
cmake --build build -j$(nproc)
sudo cmake --install build --prefix /usr/local
Build 1bit-halo-core (bitnet_decode)
# private repo today; public release gated on NPU ship-gate
cd ~/1bit-halo-core
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bitnet_decode --help
Build 1bit-halo-server and 1bit-halo-mcp
cd ~/1bit-halo-workspace
cargo build --release --bin 1bit-halo-server
cargo build --release --bin 1bit-halo-mcp
ls -la target/release/1bit-halo-server target/release/1bit-halo-mcp
Fetch model weights
# 1bit-halo-pkg is not shipped yet. Manual download for now:
mkdir -p ~/1bit-halo/models
cd ~/1bit-halo/models
# 1bit-halo-v2 · BitNet 1.58 · 2B
curl -LO https://.../1bit-halo-v2.h1b # actual URL TBD
# TriLM 3.9B Unpacked (experimental)
curl -LO https://.../trilm-3.9b.h1b
bitnet_decode, 1bit-halo-server, 1bit-halo-mcp, and kernel binaries ship zero Python. Carve-outs (Open WebUI, lemonade-server) are caller-side and sunset on 1bit-helm v0.3 parity.
Second target — RX 9070 XT (gfx1201)
Radeon RX 9070 XT (Navi 48, RDNA 4) lives in the ryzen mesh host and is the secondary kernel target. The build system is already multi-arch: HIP bundles per-arch code objects into a fat binary and picks at load time. Default build covers both.
The hot intrinsics — __builtin_amdgcn_wmma_*_w32, __builtin_amdgcn_sudot4, __builtin_amdgcn_sdot4 — are retained on RDNA 4. Correctness holds out of the gate. Peak throughput is not yet tuned for gfx1201; block sizes and LDS budgets are still sized for gfx1151. A fresh K-outer tile sweep is needed for GDDR6 bandwidth (~640 GB/s on 9070 XT vs ~270 GB/s LPDDR5X on Strix Halo).
Build for gfx1201
# single-arch build, 9070 XT only
GFX=gfx1201 ./install.sh
# fat-binary build, runs on both strixhalo and ryzen
GFX="gfx1151;gfx1201" ./install.sh
# auto-detect via rocminfo (use on each host natively)
GFX=auto ./install.sh
Prereq: ROCm must be present on ryzen first. Easiest path is the same TheRock source build used on Strix Halo, re-targeted to Navi 48. System-package ROCm may also work on RDNA 4 in distros that ship it; verify with rocminfo.
ssh ryzen
ls /opt/rocm* ~/therock 2>/dev/null # confirm a ROCm dist exists
rocminfo | grep -E 'Name:|gfx' # expect gfx1201
First run
Start bitnet_decode on the dev port, then 1bit-halo-server as the OpenAI-compatible front. Verify with curl.
Start the inference core
cd ~/1bit-halo-core
./build/bitnet_decode \
--model ~/1bit-halo/models/1bit-halo-v2.h1b \
--port 8080 \
--context 4096 \
--attn split-kv-fd \
--rope-mode hf-split-half
Start the HTTP surface
cd ~/1bit-halo-workspace
./target/release/1bit-halo-server \
--upstream http://127.0.0.1:8080 \
--bind 0.0.0.0:8180
Verify
curl -s http://127.0.0.1:8180/v1/models | jq
# expect: {"data": [{"id": "1bit-halo-v2", ...}]}
curl -s http://127.0.0.1:8180/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "1bit-halo-v2",
"messages": [{"role":"user","content":"say hello"}]
}' | jq
Services & systemd
Production is systemd. One unit per binary. Units live in /etc/systemd/system/ (system scope) or ~/.config/systemd/user/ (user scope). LTS kernel needs LimitMEMLOCK=infinity for the inference core or pinning fails.
1bit-halo-bitnet.service
# /etc/systemd/system/1bit-halo-bitnet.service
[Unit]
Description=1bit bitnet_decode (HIP inference core)
After=network.target 1bit-halo-gpu-perf.service
Requires=1bit-halo-gpu-perf.service
[Service]
Type=simple
User=1bit-halo
Group=1bit-halo
ExecStart=/usr/local/bin/bitnet_decode \
--model /var/lib/1bit-halo/models/1bit-halo-v2.h1b \
--port 8080 \
--context 4096 \
--attn split-kv-fd \
--rope-mode hf-split-half
Restart=on-failure
RestartSec=5
LimitMEMLOCK=infinity
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
1bit-halo-server.service
# /etc/systemd/system/1bit-halo-server.service
[Unit]
Description=1bit OpenAI-compatible HTTP surface
After=network.target 1bit-halo-bitnet.service
Requires=1bit-halo-bitnet.service
[Service]
Type=simple
User=1bit-halo
ExecStart=/usr/local/bin/1bit-halo-server \
--upstream http://127.0.0.1:8080 \
--bind 0.0.0.0:8180
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
1bit-halo-gpu-perf.service
Pins SCLK high to avoid latency spikes under sustained load. Required on LTS 6.18.22.
# /etc/systemd/system/1bit-halo-gpu-perf.service
[Unit]
Description=1bit GPU perf pinning (SCLK high)
After=multi-user.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level'
[Install]
WantedBy=multi-user.target
Enable & check
sudo systemctl daemon-reload
sudo systemctl enable --now 1bit-halo-gpu-perf 1bit-halo-bitnet 1bit-halo-server
systemctl status 1bit-halo-bitnet 1bit-halo-server
journalctl -u 1bit-halo-bitnet -f
Default ports
| Port | Service | Purpose |
|---|---|---|
| 8080 | bitnet_decode | Internal dev · upstream for 1bit-halo-server |
| 8180 | 1bit-halo-server | OpenAI-compatible HTTP · public-facing |
| 8181 | 1bit-halo-mcp | MCP server · tool & introspection surface |
| 8190 | 1bit-halo-whisper | Streaming STT (planned) |
| 8191 | 1bit-halo-kokoro | TTS (planned) |
Connect — Open WebUI
Open WebUI is the blessed third-party client today. Carve-out under Rule A (caller-side only; sunsets on 1bit-helm v0.3 parity).
Docker path
docker run -d -p 3000:8080 \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8180/v1 \
-e OPENAI_API_KEY=none \
-v openwebui-data:/app/backend/data \
--name openwebui \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:main
Native path (pipx)
pipx install open-webui
OPENAI_API_BASE_URL=http://127.0.0.1:8180/v1 \
OPENAI_API_KEY=none \
open-webui serve --port 3000
Visit http://localhost:3000, create the first admin account (stored locally), select 1bit-halo-v2 from the model dropdown.
Connect — Raw HTTP
Everything speaks OpenAI. No special client needed. Handy for smoke tests and shell scripts.
List models
curl -s http://127.0.0.1:8180/v1/models | jq '.data[].id'
One-shot completion
curl -s http://127.0.0.1:8180/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "1bit-halo-v2",
"messages": [
{"role":"system","content":"Be concise."},
{"role":"user","content":"Explain ternary weights in one sentence."}
],
"temperature": 0.7,
"max_tokens": 256
}' | jq -r '.choices[0].message.content'
Streaming (SSE)
curl -N http://127.0.0.1:8180/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "1bit-halo-v2",
"messages": [{"role":"user","content":"count to ten"}],
"stream": true
}'
# server-sent events: each chunk is `data: {...}\n\n`
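If you are consuming the stream without an SDK, each chunk arrives as a `data: {...}` line and the stream terminates with a `data: [DONE]` sentinel. A minimal parser sketch in Python; the function name is mine, and the delta shape is the standard OpenAI streaming wire format rather than anything specific to this server:

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE text lines.

    Skips blank keep-alive lines and non-data fields;
    stops at the `[DONE]` sentinel.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        content = delta.get("content")
        if content:
            yield content
```

Feed it decoded lines from the HTTP response body (e.g. `requests.get(..., stream=True).iter_lines(decode_unicode=True)`) and join the yields to recover the full reply.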
Connect — MCP clients
1bit-halo-mcp exposes introspection and control as a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Claude Code, Continue, Cursor, custom agents) can attach.
Claude Desktop
// ~/.config/Claude/claude_desktop_config.json
{
"mcpServers": {
"1bit": {
"command": "/usr/local/bin/1bit-halo-mcp",
"args": ["--server", "http://127.0.0.1:8180"]
}
}
}
Claude Code
claude mcp add 1bit /usr/local/bin/1bit-halo-mcp -- \
--server http://127.0.0.1:8180
1bit-halo-mcp exposes tools for: model listing, health probes, KV-cache stats, active session inspection, sampler overrides (temperature, top-p, top-k), and kernel timing. 22 tests cover the surface as of 2026-04-19.
Connect — Custom / SDK
Any OpenAI SDK works. Examples below in Python (caller-side), TypeScript (caller-side), and Rust.
Python · openai-python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8180/v1",
api_key="none", # 1bit-halo-server ignores the key by default
)
stream = client.chat.completions.create(
model="1bit-halo-v2",
messages=[{"role": "user", "content": "Hello, ternary."}],
stream=True,
)
for chunk in stream:
delta = chunk.choices[0].delta.content
if delta:
print(delta, end="", flush=True)
TypeScript · openai (Bun-friendly)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8180/v1",
apiKey: "none",
});
const stream = await client.chat.completions.create({
model: "1bit-halo-v2",
messages: [{ role: "user", content: "Hello, ternary." }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
Rust · async-openai
use async_openai::{Client, config::OpenAIConfig, types::*};
use futures::StreamExt;
#[tokio::main]
async fn main() -> anyhow::Result<()> {
let config = OpenAIConfig::new()
.with_api_base("http://localhost:8180/v1")
.with_api_key("none");
let client = Client::with_config(config);
let req = CreateChatCompletionRequestArgs::default()
.model("1bit-halo-v2")
.messages([ChatCompletionRequestUserMessageArgs::default()
.content("Hello, ternary.")
.build()?
.into()])
.stream(true)
.build()?;
let mut stream = client.chat().create_stream(req).await?;
while let Some(result) = stream.next().await {
if let Ok(chunk) = result {
if let Some(content) = &chunk.choices[0].delta.content {
print!("{content}");
}
}
}
Ok(())
}
Add your own app
Third-party apps attach on the client side only. Rule A hard stop: no Python, no Node, no interpreted runtime inside 1bit-halo-server or downstream. Anything above it — UIs, agents, game bots, IDE plugins — is fair game in any language.
The shape of a caller-side app
- Speak OpenAI-compatible HTTP to `:8180`. Every SDK works.
- For richer introspection, connect to `1bit-halo-mcp` at `:8181`.
- Use `1bit-halo-server`'s session header `X-1bit-halo-Session` to pin a conversation to a KV-cache slot.
- Handle `429`/`503` with exponential back-off — the server returns `Retry-After`.
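The back-off point deserves a sketch. A hedged example in Python: prefer the server's `Retry-After` value when present, otherwise fall back to capped exponential delay with jitter. Only the 429/503 and `Retry-After` behavior comes from the list above; the function and its defaults are illustrative:

```python
import random

def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Seconds to sleep before retry number `attempt` (0-based).

    Honors a numeric Retry-After header value if given; otherwise
    uses capped exponential back-off with jitter in [0.5x, 1x).
    """
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Retry-After can also be an HTTP-date; ignored here
    delay = min(cap, base * (2 ** attempt))
    return delay * (0.5 + random.random() / 2)
```

In a retry loop, sleep for `backoff_delay(attempt, response.headers.get("Retry-After"))` and give up after a fixed attempt budget.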
Example — minimal agent harness
// minimal-agent.ts · run with `bun run minimal-agent.ts`
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8180/v1",
apiKey: "none",
});
const session = crypto.randomUUID();
const history: OpenAI.ChatCompletionMessageParam[] = [
{ role: "system", content: "You are a terse assistant." },
];
async function turn(user: string) {
history.push({ role: "user", content: user });
const res = await client.chat.completions.create(
{ model: "1bit-halo-v2", messages: history },
{ headers: { "X-1bit-halo-Session": session } },
);
const reply = res.choices[0].message.content ?? "";
history.push({ role: "assistant", content: reply });
return reply;
}
console.log(await turn("Two facts about RDNA 3.5."));
console.log(await turn("And one that contradicts a common myth."));
Where your app lives
If the app is a serving surface (game integration, Discord bot, MCP bridge, API adapter), it belongs in 1bit.services, not in 1bit.systems core. Core stays kernel + serving only.
If the app is a library meant to be embedded (SDK wrapper, client helper), keep it in your own repo. The project maintains the HTTP contract; you maintain the client surface.
The interface between `1bit-halo-server` and `bitnet_decode` is internal and changes without notice. Build on HTTP, not on FFI.
Troubleshooting
Known failure modes and their fixes. Ordered by frequency, not severity.
amdgpu OPTC CRTC hang — full Wayland freeze
Symptom: compositor freezes hard under concurrent model servers. Requires power-cycle. Kernel log:
amdgpu: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR*
REG_WAIT timeout 1us * 100000 tries - optc35_disable_crtc
Cause: gfx1151 bug on kernel 7.x.
Fix: roll back to LTS via snapper snapshot #6 ("7.00 with claude"):
sudo snapper -c root rollback 6
sudo limine-mkconfig
sudo reboot
# at limine menu, pick 6.18.22-lts
SMU / VCN / PSP hang on LTS
Symptom: journalctl -b on boot:
amdgpu: SMU: Failed to send message 0x... rv -110 (-ETIME)
amdgpu: [PSP] Failed to load IP FW — LOAD_IP_FW failed
amdgpu: VPE / VCN powergate transition failed
Cause: the Tier-3b parameters in /etc/modprobe.d/halo.conf, tuned for kernel 7.0, misfire on LTS.
Fix:
sudo mv /etc/modprobe.d/halo.conf /etc/modprobe.d/halo.conf.disabled
sudo mkinitcpio -P
sudo reboot
Long-context PPL explodes
Symptom: PPL rises monotonically with context, repetition PPL > 4.
Cause: RoPE convention drift (interleaved vs HF split-half). Fixed 2026-04-19.
Fix: confirm flag and commit:
bitnet_decode --rope-mode hf-split-half # not `interleaved`
git -C ~/1bit-halo-core log --oneline | grep -i rope
# must include the 2026-04-19 fix commit
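For intuition on what the two conventions disagree about: both rotate pairs of head dimensions by position-dependent angles, but interleaved pairs dims (0,1), (2,3), … while HF split-half pairs (0, d/2), (1, d/2+1), …. A toy sketch in Python, purely illustrative and not the project's kernel code:

```python
import math

def rope(x, pos, theta=10000.0, mode="hf-split-half"):
    """Apply rotary position embedding to one head vector.

    x: list of floats, even length d. Returns a rotated copy.
    Dimension pairing differs by convention:
      interleaved:   (x[2i], x[2i+1])
      hf-split-half: (x[i],  x[i + d//2])
    """
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        angle = pos / (theta ** (2 * i / d))
        c, s = math.cos(angle), math.sin(angle)
        a, b = (2 * i, 2 * i + 1) if mode == "interleaved" else (i, i + d // 2)
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out
```

At position 0 every angle is zero, so both conventions are the identity and short prompts look fine; the rotations diverge as position grows, which is consistent with the PPL-rises-with-context symptom above.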
Service won't start — mlock failed
Symptom: systemd unit exits immediately, journal shows:
bitnet_decode: mlock failed: Operation not permitted
Fix: add to the unit:
LimitMEMLOCK=infinity
The NPU probe path (xrt-smi) needs the same, for the day the gate opens.
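The clean way to apply that without editing the packaged unit is a systemd drop-in. The drop-in path follows the standard systemd convention; the unit name comes from earlier on this page:

```ini
# /etc/systemd/system/1bit-halo-bitnet.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity
```

Then `sudo systemctl daemon-reload && sudo systemctl restart 1bit-halo-bitnet`.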
ROCm build fails — gfx1151 not supported
Symptom: CMake reports target not supported, or linker bails on unknown arch.
Fix: pass the target explicitly everywhere:
cmake -B build \
-DAMDGPU_TARGETS=gfx1151 \
-DCMAKE_HIP_ARCHITECTURES=gfx1151 \
-DGPU_TARGETS=gfx1151
If the distro ROCm drops the arch entirely, build from source. The llamacpp-rocm fork's install script is the paved road.
Latency spikes under load
Symptom: tok/s drops 30–60% after the first minute of sustained generation.
Cause: SCLK falls out of high-perf state.
Fix: 1bit-halo-gpu-perf.service pins SCLK high. Verify:
systemctl status 1bit-halo-gpu-perf
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
# expect: high
cat /sys/class/drm/card0/device/pp_dpm_sclk
1bit-halo-memory-sync failing every 15 min
Symptom: user timer logs show:
1bit-halo-memory-sync: push failed (credentials?)
Cause: GitHub PAT expired, corrupted GH_TOKEN fish universal variable, or missing admin:public_key scope.
Fix:
# 1. clear corrupted universal env
set -e --universal GH_TOKEN
# 2. refresh auth with correct scopes
gh auth login --scopes "admin:public_key,repo,workflow"
# 3. re-run the timer
systemctl --user restart 1bit-halo-memory-sync.timer
journalctl --user -u 1bit-halo-memory-sync -f
Observability
Three levels: service logs, kernel-level profiling, and model-quality benchmarking. Everything local; no telemetry leaves the host.
Logs
# service logs
journalctl -u 1bit-halo-bitnet -f
journalctl -u 1bit-halo-server -f --since "1h ago"
# user-scope services
journalctl --user -u 1bit-halo-memory-sync.timer
Structured JSON logs with Loki + Grafana on the same host are planned; for now, journalctl -o json is the paved road.
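Until Loki lands, `journalctl -o json` already emits one JSON object per line, so a few lines of Python make a serviceable severity filter. A sketch; `MESSAGE` and `PRIORITY` are standard journald field names, and the threshold is illustrative:

```python
import json
import sys

def filter_journal(lines, max_priority=4):
    """Yield messages at severity `max_priority` or worse.

    journald priorities: 0=emerg ... 3=err, 4=warning, 6=info.
    Lower number means more severe.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        prio = int(entry.get("PRIORITY", 6))
        if prio <= max_priority:
            yield f"[{prio}] {entry.get('MESSAGE', '')}"

if __name__ == "__main__":
    # journalctl -u 1bit-halo-server -o json | python journal_filter.py
    for msg in filter_journal(sys.stdin):
        print(msg)
```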
Kernel profiling
# bandwidth-bound sanity check
rocprof --stats --timestamp on \
./build/bitnet_decode --model 1bit-halo-v2.h1b --port 8080 --bench 64
# expect: ternary GEMV at ~92% LPDDR5 peak
# if lower, the tile or packed layout regressed
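The ~92%-of-peak figure matters because decode throughput on a memory-bound model is capped by bandwidth divided by bytes touched per token. A back-of-envelope sketch in Python; the bandwidth number and the everything-read-once assumption are illustrative, not measured values from this project:

```python
def decode_ceiling_toks(bw_gbs, n_params, bits_per_weight, efficiency=0.92):
    """Upper bound on tok/s when every weight is read once per token.

    Ignores KV-cache and activation traffic, so real decode sits
    below this ceiling and falls further as context grows.
    """
    bytes_per_token = n_params * bits_per_weight / 8
    return bw_gbs * 1e9 * efficiency / bytes_per_token

if __name__ == "__main__":
    # Illustrative: 2B ternary model, assumed 256 GB/s LPDDR5X peak.
    print(f"{decode_ceiling_toks(256, 2e9, 1.58):.0f} tok/s ceiling")
```

The useful property is the scaling: halving bits per weight doubles the ceiling, which is why the head of this page calls reducing bytes per token the next speedup.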
Model quality — PPL harness
# wikitext-103 perplexity · post-RoPE-fix reference numbers:
# 1bit-halo-v2: PPL ~12 on wikitext-103, ~1.04 on repetition
./build/bitnet_decode \
--model ~/1bit-halo/models/1bit-halo-v2.h1b \
--ppl ~/datasets/wikitext-103/wiki.test.tokens \
--context 2048
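For reference, the harness's headline number is standard perplexity: the exponential of the mean negative log-likelihood per token. A minimal sketch in Python of the metric itself, not of the harness's internals:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log p(token)) over the evaluation tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

A model assigning every token probability 1/12 scores PPL 12, the scale of the post-RoPE-fix wikitext-103 number quoted above; the pre-fix PPL of 524 corresponds to average per-token probability around 1/524.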
Live benchmark
# clean-burn reference numbers (2026-04-18):
# 64-token context: 66 tok/s
# 1024-token context: 33 tok/s
./build/bitnet_decode --bench 64 --bench 256 --bench 1024
Output conventions
On the reference host, benchmark JSON lands in /home/bcloud/claude output/. Other hosts pick their own path; the convention matters for the project's internal tracking only.
Roadmap
- Sparse-BitNet retrain completion (H200 pod · 10B-token budget)
- BitNet v2 implementation — Hadamard-native W1.58 A4
- MedusaBitNet speculative heads — expected 1.4–1.8× at batch = 1
- gfx1201 WMMA intrinsic port for RX 9070 XT (second target)
- Streaming STT via halo-whisper sentence-boundary partials
- Image generation via `sd.cpp` native-HIP port, SDXL on gfx1151
- Video generation port — Wan 2.2 TI2V-5B (5B DiT, Apache 2.0)
- Desktop shell — voice-first, plugin API via MCP, package manager
- NPU unblock — gated on AMD
Changelog
- 0.1.6 · 2026-04-22 — Sparse-BitNet Run 4 launched. Kernel rolled back to 6.18.22-lts. Network topology documented. gfx1201 build variant wired.
- 0.1.5 · 2026-04-21 — TriLM INT4 ONNX export complete. NPU placement confirmed blocked. Six-crash investigation and kernel-7.0 rollback plan.
- 0.1.4 · 2026-04-20 — Bare-metal-first lock-in. AMD and AMDResearch org scan. XDNA 2 defer verdict.
- 0.1.3 · 2026-04-19 — RoPE split-half fix (PPL 524 → ~12). Split-KV Flash-Decoding attention (6.78× at L=2048). PPL harness landed.
- 0.1.2 · 2026-04-18 — Sherry 1.25-bit spike committed. `sd.cpp` native-HIP port promoted to core.
- 0.1.0 · 2026-04 — `bitnet_decode` online. Halo v2 first responding. ROCm 7.x system build against gfx1151.
Contact
Project source lives at github.com/bong-water-water-bong/1bit-systems. Open issues, discussions, and pull requests all welcome.
A Discord server exists for live discussion; the invite URL opens when the server is ready for wider links. A Patreon surface underwrites compute time (training runs, H200 pod hours, retrains) and opens when public channels open.
— end of document —