v0.1 · docs
1bit.systems · announcement

Local ternary inference on AMD Strix Halo.

Native C++ and HIP kernels below, Rust orchestration above. Zero Python at runtime. Zero cloud dependency.

A 70-billion-parameter model fits in 128 GB of unified memory at 1.58 bits per weight. That is the whole reason this is possible on a mini-PC.

website · https://1bit.systems/  ·  source · https://github.com/bong-water-water-bong/1bit-systems

What works today

  • Halo v2 — 2B BitNet 1.58 serving at 66 tok/s on 64-token context, 33 tok/s at 1024-token context. Measured on the reference Strix Halo box, no asterisks.
  • Ternary GEMV at 92% of LPDDR5 peak bandwidth on gfx1151. Memory-bandwidth-bound — reducing bytes per token is where the next speedup lives.
  • Split-KV Flash-Decoding attention — 6.78× speedup at context length 2048, bit-exact vs reference.
  • OpenAI-compatible HTTP on :8180. Any openai SDK, Open WebUI, DSPy, or Claude Code MCP client points at it and it just works.
  • MCP surface on :8181 for introspection, KV-cache stats, sampler overrides.
  • Zero telemetry, zero dial-home, zero cloud. Weights and prompts never leave the machine.

What's in flight

  • Sparse-BitNet retrain on an H200 pod — targeting 1.25 effective bits per weight via 3:4 N:M sparsity on top of 1.58-bit. Run 4 live. 10B-token budget, ~57 h wall-clock.
  • BitNet v2 implementation (Hadamard-native W1.58 A4) planned next.
  • gfx1201 build variant for RX 9070 XT — second hardware target. ROCm 7.2.2 live on the second box. WMMA intrinsic port still in progress.
  • Desktop shell — voice-first, plugin API via MCP, package manager.

Glass walls — the honest part

If something broke, it is on the page.

  • Sparse-BitNet Run 3 died at step 500 on a three-line trainer bug — mask monitoring was off by default so the integrity check always saw an empty cache and fired a false-positive bail. 524M tokens of H200 time lost. Full autopsy on the site. Run 4 is the patched relaunch.
  • RDNA 4 WMMA port is not a flag flip — the intrinsics changed family between gfx11 and gfx12. Real work, tracked in the open.
  • amdgpu OPTC CRTC hang on kernel 7.0 for gfx1151 — Wayland freezes hard under concurrent model servers. Rolled back to 6.18.22-lts. The kernel bug is documented, not swept under a rug.
  • Bugs, failed runs, kernel regressions all logged in the open. The benchmark numbers are what the serving box actually did, not what was wished for.

What's blocked — XDNA 2 NPU

Gatekept. Not a performance limit. A vendor-access limit.

  1. AMD has not shipped a Linux execution provider for Strix Halo (STX-H). Ryzen AI 1.7 supports Strix Point and Krackan only.
  2. Every model in AMD's Ryzen AI Hugging Face collections ships UINT4-AWQ × BFP16. No ternary kernel. No 1.58-bit compile path.
  3. Native AIE kernel authoring is gated behind Riallto — Phoenix-only, requires a paid Xilinx license.

Verdict: defer. Revisit when AMD ships STX-H on Ryzen AI Linux, or when a community BitNet backend for XDNA 2 lands in public. Nothing on the near- or mid-term roadmap depends on the NPU shipping.

This is also the ship gate. No Reddit, no Hacker News, no press announce until the NPU unblocks or the project's positioning explicitly changes. Discord is for live discussion. GitHub is the record.

How to help

  • Questions, hardware reports, benchmark diffs → this channel or GitHub Discussions.
  • Concrete bugs → GitHub Issues with kernel version, ROCm version, commit SHA, journal excerpt.
  • Code → pull requests. Highest-leverage drop is a HIP GEMM / GEMV kernel for a shape not yet covered. Kernels accepted on merit.
  • Funding → a Patreon surface opens when public channels open; it underwrites compute time (training runs, H200 hours, retrains).

the goal is not another chatbot — it's a stack you can run on hardware you own, from a closet, silently, forever

1bit.systems · v0.1

Local ternary inference on AMD Strix Halo.

1bit.systems is a set of native C++ and Rust components that runs sub-2-bit language, speech, and image models on AMD Strix Halo — without Python at runtime, without a discrete GPU, and without the cloud.

This is the full documentation, the full history, and the full roadmap on one page. Scroll, Ctrl+F, or bookmark a section. No separate blog, no separate benchmark report. If a thing exists about the project, it should be here.

Heads up — the XDNA 2 NPU on Strix Halo is not yet accessible to this project. AMD has not shipped a Linux execution provider for Strix Halo. Everything here runs on the integrated GPU. See NPU status.

What it is

A serving stack built around 1.58-bit ternary weights. Kernels are hand-written HIP targeting gfx1151. Everything above the kernels — OpenAI-compatible HTTP surface, MCP bridge, session state, sampler — is Rust. A 70B-parameter model fits in 128 GB of unified memory at 1.58 bits per parameter, which is the whole reason this is possible on a mini-PC.

The identity is unified as 1bit — one name for the brand, the codebase, the install path, and the binaries. Earlier references to halo-ai in git history point to the same project under its prior name.

Who it's for

  • People who want to run modern AI on hardware they own, with no cloud dependency.
  • Researchers interested in sub-2-bit inference on consumer silicon.
  • Developers who need a local LLM backend and don't want to ship a Python runtime.
  • Writers, artists, and tinkerers who need privacy by topology — weights and prompts never leave the room.
· · ·

Architecture

Hybrid C++ and Rust stack, layered from kernel to HTTP surface. Clients speak OpenAI-compatible HTTP and don't need to know what's underneath.

┌──────────────────────────────────────────────────────────┐
│  Clients · Open WebUI · MCP · CLI · Helm (planned)       │
└─────────────────────────┬────────────────────────────────┘
                          │  OpenAI-compatible HTTP
┌─────────────────────────▼────────────────────────────────┐
│  1bit-halo-server  (Rust, axum)                  :8180   │
│    Router · Sessions · Sampler · Token streamer          │
└─────────────────────────┬────────────────────────────────┘
                          │  FFI
┌─────────────────────────▼────────────────────────────────┐
│  bitnet_decode  (C++20, HIP)                     :8080   │
│    Ternary GEMV · Split-KV FD attention · RoPE           │
│    RMSNorm · SiLU · KV cache · Tokenizer · Sampler       │
└─────────────────────────┬────────────────────────────────┘
                          │  HIP
┌─────────────────────────▼────────────────────────────────┐
│  Radeon 8060S  ·  gfx1151  ·  40 CU  ·  wave32 WMMA      │
└──────────────────────────────────────────────────────────┘

Every layer is native. Rust handles orchestration (HTTP, sessions, scheduling, streaming). C++20 + HIP handles kernels and model state. No Python in the serving path at any layer.

Deep architecture

Condensed walkthrough. The full treatment lives in the wiki at Architecture-Deep — request life-cycle byte accounting, per-kernel provenance, FFI signatures, the agent registry, training pipeline math, mesh topology, and the failure surface.

Ports + surfaces

port    binding     service                   surface
443     public      caddy                     TLS, bearer check, /v1 / /v2 split
8080    127.0.0.1   bitnet_decode (C++)       /v1/* (gen-1)
8180    127.0.0.1   1bit-halo-server (Rust)   /v2/* (gen-2)
8081    127.0.0.1   sd.cpp                    SDXL image-gen sidecar
8181    127.0.0.1   halo-whisper              STT
8182    127.0.0.1   halo-kokoro               TTS
8190    127.0.0.1   1bit-landing              landing + wiki proxy
8200    127.0.0.1   1bit-lemonade             /v1/models gateway
stdio   —           halo-mcp                  JSON-RPC 2.0 tools

Request life-cycle

A chat-completion walked through every layer. Client POSTs to https://halo.<host>/v2/chat/completions with a bearer token.

  1. TLS terminates at Caddy. Constant-time bearer compare against /etc/caddy/Caddyfile. Under 1 ms after the initial TCP+TLS setup.
  2. Caddy reverse_proxy localhost:8180. Plain HTTP/1.1 over loopback.
  3. axum deserializes Json<ChatCompletionRequest>. Shape errors return 400 before any inference work. Metrics histogram starts after the 400 check.
  4. 1bit-router dispatches to Backend::Hip (the only compiled backend in production). The Backend::Cpu variant exists but returns BackendError::CpuLaneStub.
  5. Mutex around the shared KV cache locks. pos resets to 0 per request. KV bytes per token on halo v2: 2 × 8 × 256 × 2 = 8 KiB per layer, 240 KiB total across 30 layers.
  6. Tokenizer (1bit-core::htok) encodes. Llama-3 <|eot_id|> (128009) is recognised as a single ID — the fix that took burn-in parity from 18% to 96.67%.
  7. Prefill: one forward pass per prompt token. Rust calls 1bit-hip::ternary_gemv_halo_f16 which crosses extern "C" into rocm-cpp. Per-call FFI overhead has not been measured yet; there are no Rust allocations per call.
  8. Decode: sample (argmax on host at temperature ≤ 0, sampler kernel above), append, write next K/V slot, check stop tokens on token IDs before detokenization.
  9. Stream: if req.stream, wrap in an accounting iterator that tallies tokens per SSE frame. Else return a single JSON body.
  10. Caddy forwards bytes unchanged. Wall-clock: ~200 ms for a 10-token prompt + 10-token reply.
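The byte accounting in step 5 is plain arithmetic. A dev-time sketch using the halo v2 dims quoted above (K/V pair, 8 KV heads, head_dim 256, FP16, 30 layers):

```python
# Per-token KV-cache bytes for halo v2 (step 5 of the request life-cycle).
kv_pair, num_kv_heads, head_dim, fp16_bytes, num_layers = 2, 8, 256, 2, 30

per_layer = kv_pair * num_kv_heads * head_dim * fp16_bytes   # bytes per token per layer
total = per_layer * num_layers                               # bytes per token, all layers

print(per_layer // 1024, "KiB per layer")    # 8 KiB per layer
print(total // 1024, "KiB per token")        # 240 KiB per token
```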

Kernels

ternary_gemv_halo_f16
Packed ternary weights (2 bits per weight, uint8[M, (K+3)/4]), FP32 row scale, INT8 activations, FP16 output. Uses v_dot4_i32_i8 and WMMA on gfx1151. 92% of LPDDR5 peak at decode @ N=64. Bandwidth-bound; compute utilization under 10%. Bytes-read reduction (Sherry 1.25-bit, TQ1 base-3) is the #1 speedup lever.
kv_cache_attn_decode_fd
Split-KV Flash-Decoding. Per-head parallelism across thread-blocks; each head splits its KV range into B chunks, reduces to (m, l, o) triples, combines via log-sum-exp. Landed 2026-04-19, default in both servers. 6.78× speedup at L=2048, bit-exact against the reference path.
rope_fp16
Rotary position embedding, HF split-half convention. Pre-fix interleaved convention gave wikitext-103 PPL 524; post-fix ~12. Repeated-text PPL 4.29 → 1.04. Six-line diff.
rmsnorm_fp16, silu_glu_fp16
RMSNorm + SwiGLU fused into the FFN path. The relu2_glu_* variants exist for the activation-sparsity experiments (Phase 1 measured 79.91% sparsity; upper-bound speedup 10-15%, deferred behind Sherry).
KV cache
[num_layers][2][num_kv_heads][max_seq_len][head_dim] FP16. Append-only ring; one buffer per in-flight request; 128-byte hipMalloc alignment.
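A pure-Python reference for the ternary_gemv_halo_f16 packing contract above (2 bits per weight, four weights per uint8, FP32 row scale, INT8 activations) is handy for generating test vectors. The 2-bit code assignment here (0b00 for 0, 0b01 for +1, 0b10 for −1) is illustrative, not the kernel's actual encoding:

```python
# Dev-time reference for the packed-ternary GEMV contract, not the HIP kernel.
def pack_row(ternary_row):
    """Pack a list of {-1, 0, +1} into bytes, 4 weights per uint8."""
    code = {0: 0b00, 1: 0b01, -1: 0b10}          # illustrative bit assignment
    out = bytearray((len(ternary_row) + 3) // 4)
    for i, w in enumerate(ternary_row):
        out[i // 4] |= code[w] << (2 * (i % 4))
    return bytes(out)

def gemv_row(packed, scale, acts):
    """One output element: dot(unpacked row, int8 acts) * fp32 row scale."""
    decode = {0b00: 0, 0b01: 1, 0b10: -1}
    acc = 0
    for i, a in enumerate(acts):
        w = decode[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        acc += w * a          # int32 accumulate, as v_dot4_i32_i8 does 4-at-a-time
    return scale * acc        # row scale applied once per output element

row = [1, 0, -1, 1, 1, 0, -1, 0]
acts = [3, 1, 4, 1, 5, 9, 2, 6]
print(gemv_row(pack_row(row), 0.5, acts))   # 1.5
```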

Memory model

Unified LPDDR5X, 128 GB total. No PCIe copy — the same DDR bank the CPU just wrote to is the memory the iGPU reads via hipMalloc'd virtual addresses. For halo v2:

region                  size       lifetime
weights (.h1b mmap)     ~1.1 GiB   process; shared across sessions
KV cache @ 4096 ctx     ~960 MiB   per session; pinned
activations (scratch)   ~100 MiB   per forward; reused across layers
HIP runtime + ROCm      ~1 GiB     process
OS + everything else    ~4 GiB     system
subtotal                ~7 GiB     out of 128 GiB
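The ~960 MiB KV figure follows directly from the cache layout given in the kernels section; a quick check:

```python
# KV cache @ 4096 ctx, layout [num_layers][2][num_kv_heads][max_seq_len][head_dim] FP16.
num_layers, kv_pair, kv_heads, max_seq, head_dim, fp16 = 30, 2, 8, 4096, 256, 2
kv_bytes = num_layers * kv_pair * kv_heads * max_seq * head_dim * fp16
print(kv_bytes / 2**20, "MiB")   # 960.0 MiB
```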

Model formats

.h1b v2: 4-byte magic H1B\0, int32 version, 9 int32 config (hidden_size, intermediate_size, num_layers, num_heads, num_kv_heads, vocab_size, max_seq_len, tie_embeddings, reserved), 2 float32 extras (rope_theta, rms_norm_eps), then per-layer tensors. reserved is a flag word: 0x1 Hadamard-rotated (BitNet v2), 0x2 Sherry FP16, 0x4 Bonsai Q1, 0x8 Bonsai TQ2.
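A minimal header reader for the layout above might look like this (dev-time sketch; little-endian byte order is an assumption, and only the fixed header is parsed, not the per-layer tensors):

```python
import struct

# Sketch of a .h1b v2 header parser, following the field order described above.
def read_h1b_header(buf: bytes):
    if buf[:4] != b"H1B\0":
        raise ValueError("bad magic")
    version, = struct.unpack_from("<i", buf, 4)
    cfg = struct.unpack_from("<9i", buf, 8)                  # 9 int32 config words
    rope_theta, rms_norm_eps = struct.unpack_from("<2f", buf, 8 + 9 * 4)
    names = ("hidden_size", "intermediate_size", "num_layers", "num_heads",
             "num_kv_heads", "vocab_size", "max_seq_len", "tie_embeddings",
             "reserved")
    hdr = dict(zip(names, cfg))
    hdr.update(version=version, rope_theta=rope_theta, rms_norm_eps=rms_norm_eps)
    # reserved is a flag word: 0x1 Hadamard-rotated, 0x2 Sherry FP16,
    # 0x4 Bonsai Q1, 0x8 Bonsai TQ2.
    hdr["hadamard_rotated"] = bool(hdr["reserved"] & 0x1)
    return hdr
```

Usage at dev time: mmap the file, pass the first 52 bytes, then seek past the header to the per-layer tensors.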

.htok is the tokenizer side-file — Llama-3 128k BPE + special tokens, mmap-parsed at startup.

Why not GGUF: the GGUF loader ran ~5× slower against a cold page cache (it walks per-tensor metadata we don't need), while the halo layout is O(num_layers) offset math. The conversion tool tools/gguf-to-h1b runs one-shot at dev time, not in a serving path.

Agents

17 specialists in 1bit-agents::Name. One registry, surfaced both on the agents bus and on the MCP tool list.

specialist      role                                         dispatched by
Anvil           kernel rebuild + bench on rocm-cpp commits   anvil.timer
Carpenter       file scaffolding                             Planner
Cartograph      cross-repo changelog + topology snapshot     Librarian
EchoEar         STT ingress (halo-whisper)                   halo-voice
EchoMouth       TTS egress (halo-kokoro)                     EchoEar / Herald
Forge           PR drafts + commit messages                  Planner
Gateway         inbound classification + routing policy      watchers
Herald          comms / Q&A / chat replies                   Gateway
Librarian       CHANGELOG + wiki upkeep                      librarian.timer
Magistrate      PR review + CC lint + secret scan            gh-trio.timer
Muse            long-form prose                              operator
Planner         multi-step task decomposition                operator
Quartermaster   issue triage                                 gh-trio.timer
Scribe          doc edits                                    Librarian / ops
Sentinel        incident watchdog                            continuous
Sommelier       backend / model recommendation               Planner
Warden          secret + credential drift                    ops

Discord + GitHub pipelines

Discord: halo listens, echo posts. halo requires the privileged MESSAGE_CONTENT gateway intent plus GUILDS and GUILD_MESSAGES. Mentions are classified (BugReport → Sentinel, FeatureRequest → Magistrate, Question → Herald, Chat → Herald). Bug reports auto-create a thread on the original message.

GitHub: 1bit-watch-github polls every DEFAULT_POLL_SECONDS = 300. Read-only fine-grained PAT. Any PR → Magistrate. Label bug or title containing error / crash / fail → Sentinel. Label enhancement / feature → Planner. Label documentation → Scribe. Fallback → Sentinel. Lookback window is poll_seconds × 2.
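The routing policy reduces to a small pure function; this sketch mirrors the rules exactly as written above (the event field names are illustrative, not the watcher's actual schema):

```python
# GitHub event routing, per the policy above. Field names are illustrative.
def route(event):
    if event.get("is_pr"):
        return "Magistrate"                       # any PR
    labels = set(event.get("labels", []))
    title = event.get("title", "").lower()
    if "bug" in labels or any(w in title for w in ("error", "crash", "fail")):
        return "Sentinel"
    if labels & {"enhancement", "feature"}:
        return "Planner"
    if "documentation" in labels:
        return "Scribe"
    return "Sentinel"                             # fallback

print(route({"is_pr": True}))                                      # Magistrate
print(route({"title": "server crash on startup"}))                 # Sentinel
print(route({"labels": ["documentation"], "title": "fix typo"}))   # Scribe
```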

MCP

1bit-mcp speaks JSON-RPC 2.0 over stdio, one object per \n-delimited line (Claude Code convention). Protocol version 2024-11-05. tools/list is derived from Name::ALL; tools/call dispatches through Arc<Registry>. 22 in-crate tests cover the wire framing, registry routing, and the five JSON-RPC error codes (-32600 invalid request, -32601 method not found, -32602 invalid params, -32603 internal, -32001 unknown tool).
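The wire framing is simple enough to sketch in a few lines (illustrative; the helper names are not from the crate):

```python
import json

# Newline-delimited JSON-RPC 2.0 framing, one object per line, as 1bit-mcp speaks it.
def frame(method, params=None, id=1):
    msg = {"jsonrpc": "2.0", "id": id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg) + "\n"     # \n-delimited, Claude Code convention

def parse_line(line):
    msg = json.loads(line)
    if msg.get("jsonrpc") != "2.0":
        raise ValueError(-32600)       # invalid request
    return msg

req = frame("tools/call", {"name": "health", "arguments": {}})
print(parse_line(req)["method"])       # tools/call
```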

Training pipeline

Retrains run on a persistent RunPod H200 pod. TRL + HuggingFace streaming loader. Step cadence: batch_size=16, seq_len=2048, grad_accum=32, log_every=10, save_every=100, verify_nm_mask_every=500. Per-step tokens 16 × 2048 × 32 ≈ 1.05M. 10 B-token budget ≈ 9600 steps. Measured throughput 49.5k tok/s on H200 → full run ≈ 56 h wall-clock. Artifacts flow: pod → pi archive → requantizer → .h1b → strixhalo → shadow-burnin → cutover.
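The cadence arithmetic, spelled out (the section's ≈9600 rounds the exact floor up):

```python
# Step-cadence arithmetic from the training pipeline config above.
batch, seq, grad_accum = 16, 2048, 32
tokens_per_step = batch * seq * grad_accum       # 1,048,576 ≈ 1.05M
steps = 10_000_000_000 // tokens_per_step        # exact floor; the doc rounds to ≈9600
hours = 10_000_000_000 / 49_500 / 3600           # at the measured 49.5k tok/s
print(tokens_per_step, steps, round(hours, 1))   # 1048576 9536 56.1
```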

Shadow-burnin

Continuous /v1 vs /v2 argmax comparison every 30 s.

state file : ~/.local/share/1bit-halo/shadow-burnin.state
log (JSONL): ~/claude output/shadow-burnin.jsonl
cutover    : ≥ 96% bit-exact argmax over a 72 h rolling window
current    : 96.67% (post-special-token fix)

Cutover gates: (1) PPL parity on wikitext within ±0.05 of the gen-1 baseline 9.1607 — gen-2 currently 9.1805, delta +0.02, PASS. (2) Shadow-burnin ≥96% for 72 h continuous under real traffic, no restarts, no memory growth.
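The two gates as a single predicate, with the numbers from this section (the function name is illustrative):

```python
# Cutover gates: PPL parity within ±0.05 of the gen-1 baseline, plus
# ≥96% bit-exact shadow-burnin over a 72 h rolling window.
def cutover_ok(ppl_gen2, burnin_pct, burnin_hours):
    ppl_gate = abs(ppl_gen2 - 9.1607) <= 0.05    # gen-1 wikitext baseline
    burnin_gate = burnin_pct >= 96.0 and burnin_hours >= 72
    return ppl_gate and burnin_gate

print(cutover_ok(9.1805, 96.67, 72))   # True: both gates pass
print(cutover_ok(9.1805, 96.67, 24))   # False: burn-in window too short
```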

Mesh

Private Headscale tailnet, 100.64.0.0/10 CGNAT-reserved range (RFC 6598). Coordinator on strixhalo, fronted by Caddy at :443, upstream loopback 127.0.0.1:8380. No Tailscale DERP, no third-party control plane. STUN on :3478; every pair is direct LAN.

node        mesh IP      role
strixhalo   100.64.0.1   gfx1151, Caddy, Headscale, primary inference
sliger      100.64.0.2   NVIDIA 1080 Ti, failover candidate
ryzen       100.64.0.3   RX 9070 XT (gfx1201), second kernel target
pi          100.64.0.4   ZFS 3.6 TB, canonical archive, nightly rsync

Deployment

All long-lived processes are user-scope systemd units installed from strixhalo/systemd/. Caddy fronts TLS at the system level. 1bit install <component> reads packages.toml[components.<name>] with unit, binary, source, deps — and runs systemctl stop → cargo install --path → copy → systemctl start. Idempotent.

strix-server.service        — 1bit-halo-server :8180  (gen-2 Rust)
1bit-halo-bitnet.service    — bitnet_decode :8080     (gen-1 C++)
strix-lemonade.service      — 1bit-lemonade :8200
strix-landing.service       — 1bit-landing :8190
strix-echo.service          — echo (Discord poster)
strix-watch-discord.service — halo (Discord listener)
strix-watch-github.timer    — 300 s poll
1bit-halo-whisper.service   — STT :8181
1bit-halo-kokoro.service    — TTS :8182
1bit-halo-sd.service        — sd.cpp :8081
1bit-halo-anvil.timer       — kernel rebuild on rocm-cpp commits
1bit-halo-memory-sync.timer — GH push every 15 m
strix-burnin.service        — shadow-burnin harness

Supply chain

Trusted dependencies: TheRock (ROCm), serenity-rs (Discord), octocrab (GitHub), axum + tower + hyper + reqwest, serde, nlohmann/json, cpp-httplib. Not used: hipBLAS at runtime (Rule C — banned), torch / any Python-serving library (Rule A), in-proc Python Open WebUI (caller-side only, sunsets on Helm v0.3).

Failure surface

Every service uses Restart=on-failure with a 5-10 s RestartSec. Blast radius is per-service — strix-server crashing returns 502 via Caddy but leaves the rest of the stack intact. Kernel-level issues (the OPTC CRTC hang signature REG_WAIT timeout 1us * 100000 tries - optc35_disable_crtc) are mitigated by halo-gpu-perf.service pinning SCLK high; persistent issues roll back via Btrfs + snapper to snapshot #6. Memory-sync failures usually mean an expired GH PAT and a rotation fixes them.

Full write-up including byte-accounting for a 32-token reply, FFI cheat sheet, who-calls-what graph, and the complete environment-variable surface: Architecture-Deep.

Constraints

Rule A — no Python at runtime

Hard rule. No Python interpreter in any serving binary. Caller-side tooling is any language you want; the serving path is not.

Rule B — C++20 for kernels, Rust for orchestration

Default language for a new component is C++20 if it talks to HIP directly, Rust otherwise. Rust gets the safety and ownership guarantees where correctness matters most; C++ gets HIP intrinsics, wave32 WMMA, and the register-level control that the ternary GEMV depends on.

No runtime hipBLAS

Native Tensile-generated kernels are allowed; runtime hipBLAS is banned because its heuristic collapses on the skinny ternary GEMV shape the models use.

Kernels: overview

Kernel         Role
ternary_gemv   Packed ternary × FP16 GEMV · hot path
attention_fd   Split-KV Flash-Decoding · per-head parallel
rope           Rotary position embedding · HF split-half
rmsnorm        Root-mean-square normalization
silu           Activation · SwiGLU companion
kv_cache       KV-cache append + retrieval

All live in rocm-cpp/src/ and rocm-cpp/kernels/. HIP C++ with wave32 WMMA intrinsics, tuned for gfx1151. A gfx1201 port for RX 9070 XT is in flight.

Ternary GEMV

Weights are stored as packed ternary values {−1, 0, +1} with a per-tensor FP16 scale. The kernel reads 2 bits per weight, multiplies by the FP16 activation, accumulates in FP32, and writes back FP16.

Current: 92% of LPDDR5 peak bandwidth on gfx1151. Memory-bandwidth-bound, not compute-bound. Reducing bytes-read per token is the #1 speedup lever; sub-1-bit formats are the research priority.

Attention · split-KV Flash-Decoding

Standard Flash-Decoding adapted for gfx1151 wave32. Each attention head splits its KV range across multiple thread-blocks; softmax statistics are combined with a log-sum-exp merge; output is written once per head.

Landed 2026-04-19 as the default in bitnet_decode. Speedup: 6.78× at context length 2048, bit-exact against the reference path.
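The per-chunk reduction can be verified against a single-pass softmax in a few lines of reference code. Each chunk emits an (m, l, o) triple: m is the chunk's max score, l the sum of exp(score − m), o the exp-weighted value sum. Pure-Python 1-D sketch, not the HIP kernel:

```python
import math

# Split-KV log-sum-exp merge, checked against a single-pass softmax.
def partial(scores, values):
    m = max(scores)
    l = sum(math.exp(s - m) for s in scores)
    o = sum(math.exp(s - m) * v for s, v in zip(scores, values))
    return m, l, o

def merge(parts):
    m_all = max(m for m, _, _ in parts)
    l_all = sum(l * math.exp(m - m_all) for m, l, _ in parts)
    o_all = sum(o * math.exp(m - m_all) for m, _, o in parts)
    return o_all / l_all            # final softmax-weighted output

scores = [0.3, 2.1, -1.0, 0.7]
values = [1.0, 2.0, 3.0, 4.0]

split = merge([partial(scores[:2], values[:2]), partial(scores[2:], values[2:])])
m, l, o = partial(scores, values)   # single-pass reference
print(abs(split - o / l) < 1e-12)   # True
```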

RoPE · convention fix

Rotary Position Embedding carried a convention mismatch until 2026-04-19. The implementation used interleaved rotation; Hugging Face canonical models use split-half rotation. Fix was a six-line diff.

metric                before   after
PPL · wikitext-103    524      ~12
PPL · repeated text   4.29     1.04
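The two conventions differ only in which elements get paired for rotation. An illustrative reference (not the rope_fp16 kernel, and without the FP16 or caching details):

```python
import math

# HF canonical "split-half" RoPE rotates pairs (x[i], x[i + d/2]);
# the pre-fix kernel's "interleaved" convention rotated (x[2i], x[2i+1]).
def rope_split_half(x, pos, theta=10000.0):
    d, out = len(x), list(x)
    for i in range(d // 2):
        ang = pos / theta ** (2 * i / d)
        c, s = math.cos(ang), math.sin(ang)
        a, b = x[i], x[i + d // 2]
        out[i], out[i + d // 2] = a * c - b * s, a * s + b * c
    return out

def rope_interleaved(x, pos, theta=10000.0):
    d, out = len(x), list(x)
    for i in range(d // 2):
        ang = pos / theta ** (2 * i / d)
        c, s = math.cos(ang), math.sin(ang)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

x = [1.0, 2.0, 3.0, 4.0]
print(rope_split_half(x, 1) != rope_interleaved(x, 1))  # True: not interchangeable
```

Applying the wrong pairing to a model trained with the other is exactly the convention mismatch that produced the PPL blow-up above.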

Quant formats

Format                 Bits/weight    Status
BitNet 1.58            1.58           Shipped (Halo v2)
TriLM                  1.58           Shipped (experimental)
Sparse-BitNet · 3:4    1.25           Retraining
BitNet v2 · W1.58 A4   1.58 W + 4 A   Watching
LittleBit              0.1            Watching

Features

OpenAI-compatible HTTP

/v1/chat/completions, /v1/models, SSE streaming, bearer auth optional. Any OpenAI SDK works out of the box — point base_url at http://localhost:8180/v1.

Session-aware KV cache

Conversations keyed by X-1bit-Session header. KV cache is pinned per session, avoiding re-prefill on multi-turn threads.

MCP introspection

1bit-halo-mcp exposes model listing, health probes, KV-cache stats, sampler overrides, and kernel timing as Model Context Protocol tools. Attach from Claude Desktop, Claude Code, or any MCP client.

Local by topology

Every byte stays on the machine. No telemetry, no dial-home, no usage analytics. Caller-side clients may hit cloud APIs by choice; the serving path does not.

· · ·

Halo v2 · BitNet 1.58

  • 2B parameters · Microsoft's public BitNet release
  • 1.58-bit weights · FP16 activations
  • Served by bitnet_decode on :8080 and 1bit-halo-server on :8180
context       tok/s
64 tokens     66
1024 tokens   33

Clean burn numbers from 2026-04-18, post-RoPE fix. Memory-bandwidth-bound across the whole context range.

TriLM

3.9B parameters, Apache 2.0, from the SpectraSuite (TriLM_3.9B_Unpacked). LLaMA architecture, ternary-trained from scratch. Used as a smoke-test model and as the NPU export candidate.

Sparse-BitNet

Retrain in progress on an H200 pod. Target: 1.25 effective bits per weight via 3:4 N:M sparsity layered on 1.58-bit weights.

Run 3 bailed at step 500 on a false-positive mask-integrity check (empty mask cache due to disabled monitoring). Run 4 launched 2026-04-22 with the patch — model.enable_mask_monitoring() at init, mask_cache.clear() after each verify, --save-every 100 to avoid losing progress to future bails.

Pre-bail numbers on Run 3 tracked cleanly: loss 11.04 → 5.77 across steps 50 → 500, throughput steady at 49.5k tok/s. Run 4 first checkpoint landed at step 100. 10B-token gate ETA ~57 h.

· · ·

The stack

  • Host OS: CachyOS (Arch-family, rolling) on Btrfs + snapper + limine
  • Kernel: 6.18.22-lts — pinned after an amdgpu OPTC hang on 7.0
  • ROCm: 7.x built from source against gfx1151 (not on ROCm's Tier-1 list)
  • Kernels: C++20 + HIP · wave32 WMMA · zero runtime hipBLAS
  • Orchestration: Rust 2021 · Cargo workspace · axum + tokio
  • Edge: Caddy reverse-proxy for bearer auth and TLS
  • Supervision: systemd (one unit per binary)

The machine

  • CPU: Ryzen AI Max+ 395 · Zen 5 · 16 cores / 32 threads
  • GPU: Radeon 8060S · 40 CU RDNA 3.5 · gfx1151 · wave32 WMMA
  • NPU: XDNA 2 · 50 TOPS claimed · inaccessible — see NPU status
  • Memory: 128 GB LPDDR5X-8000 · unified · ~270 GB/s peak
  • Power: 45–120 W configurable TDP envelope

A secondary RDNA 4 target (RX 9070 XT on gfx1201) lives in the ryzen mesh node. A fat-binary build covering both arches is wired in rocm-cpp; the gfx1201 WMMA intrinsic port is in flight.

Serving

1bit-halo-server is Rust + axum, serving an OpenAI-compatible surface on :8180. It forwards to bitnet_decode over FFI, manages session state, streams tokens, and enforces per-session rate limits.

Port   Service             Purpose
8080   bitnet_decode       Internal · upstream for 1bit-halo-server
8180   1bit-halo-server    OpenAI-compatible HTTP · public-facing
8181   1bit-halo-mcp       MCP server · tool & introspection surface
8190   1bit-halo-whisper   Streaming STT (planned)
8191   1bit-halo-kokoro    TTS (planned)

MCP

1bit-halo-mcp exposes introspection and control as a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Claude Code, Continue, Cursor, custom agents) can attach and call tools for: model listing, health probes, KV-cache stats, active session inspection, sampler overrides, and kernel timing.

22 tests cover the surface. Canonical since 2026-04-19.

NPU status

Gatekept — Strix Halo ships with an XDNA 2 NPU, but the software path is not available to this project today. The project runs on the iGPU until one of the blockers below clears.

No Linux execution provider for STX-H

AMD's Ryzen AI 1.7 Linux stack supports Strix Point (STX) and Krackan (KRK) only. Strix Halo (STX-H) has no Linux execution provider. The Windows stack exposes the NPU through a proprietary VitisAI provider that is Windows-only.

Quant format mismatch

AMD's Ryzen AI model collections ship UINT4-AWQ weights with BFP16 activations. No ternary kernel. No 1.58-bit compile path. MatMulNBits with N=4 is the only shape the AIE control-packet graph compiler accepts today.

Kernel authoring is gated

Writing native AIE kernels requires Riallto (Phoenix-only, Ubuntu 24.04.2 + Docker + paid Xilinx license, zero GEMM kernels shipped). Custom ternary kernels on XDNA 2 would need to be authored from scratch against this toolchain.

Current verdict

Defer. Revisit when AMD ships STX-H on Ryzen AI Linux, or when a community BitNet backend for XDNA 2 lands in public. Nothing on the near- or mid-term roadmap depends on the NPU shipping.

· · ·

Requirements

The stack targets Strix Halo specifically. Other gfx1100-family hardware may work with minor tweaks; only Strix Halo is tested.

Hardware — minimum

  • AMD Ryzen AI Max+ 395 (or equivalent Strix Halo SKU)
  • Radeon 8060S iGPU · gfx1151 · wave32 WMMA
  • 64 GB unified LPDDR5X minimum · 128 GB recommended for 13B+ ternary
  • 100 GB free disk for models plus build artifacts

Software — minimum

  • Linux kernel 6.18.22-lts (newer kernels carry the amdgpu OPTC hang — see troubleshooting)
  • ROCm 7.x — built from source against gfx1151 (not on ROCm's Tier-1 list)
  • LLVM / clang 18+
  • CMake 3.27+
  • Rust 1.82+ (stable channel)
  • Node.js or Bun only for caller-side clients. Nothing on the serving path — Rule A.

Recommended host

CachyOS with Btrfs + snapper + limine is the reference setup. Rollback-via-snapper has saved the project more than once. Fish shell is assumed in examples but not required.

Install

No binary distribution yet. Build from source. Packaging (AppImage + Flatpak) is on the near-term roadmap; the 1bit-halo-pkg model package manager is long-term.

Build ROCm against gfx1151

System-package ROCm drops gfx1151 from Tier-1 in most distros. Build from source, or use the llamacpp-rocm fork's install script as a bootstrap.

git clone https://github.com/bong-water-water-bong/llamacpp-rocm ~/repos/llamacpp-rocm
cd ~/repos/llamacpp-rocm
./scripts/install-rocm.sh --target gfx1151

Build rocm-cpp kernel library

git clone https://github.com/bong-water-water-bong/rocm-cpp ~/repos/rocm-cpp
cd ~/repos/rocm-cpp

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_HIP_ARCHITECTURES=gfx1151

cmake --build build -j$(nproc)
sudo cmake --install build --prefix /usr/local

Build 1bit-halo-core (bitnet_decode)

# private repo today; public release gated on NPU ship-gate
cd ~/1bit-halo-core

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

./build/bitnet_decode --help

Build 1bit-halo-server and 1bit-halo-mcp

cd ~/1bit-halo-workspace

cargo build --release --bin 1bit-halo-server
cargo build --release --bin 1bit-halo-mcp

ls -la target/release/1bit-halo-server target/release/1bit-halo-mcp

Fetch model weights

# 1bit-halo-pkg is not shipped yet. Manual download for now:
mkdir -p ~/1bit-halo/models
cd ~/1bit-halo/models

# 1bit-halo-v2 · BitNet 1.58 · 2B
curl -LO https://.../1bit-halo-v2.h1b   # actual URL TBD

# TriLM 3.9B Unpacked (experimental)
curl -LO https://.../trilm-3.9b.h1b

Rule A reminder — Python may appear in caller-side tooling and dev-time scripts only. bitnet_decode, 1bit-halo-server, 1bit-halo-mcp, and kernel binaries ship zero Python. Carve-outs (Open WebUI, lemonade-server) are caller-side and sunset on 1bit-helm v0.3 parity.

Second target — RX 9070 XT (gfx1201)

Radeon RX 9070 XT (Navi 48, RDNA 4) lives in the ryzen mesh host and is the secondary kernel target. The build system is already multi-arch: HIP bundles per-arch code objects into a fat binary and picks at load time. Default build covers both.

The hot intrinsics — __builtin_amdgcn_wmma_*_w32, __builtin_amdgcn_sudot4, __builtin_amdgcn_sdot4 — are retained on RDNA 4. Correctness holds out of the gate. Peak throughput is not yet tuned for gfx1201; block sizes and LDS budgets are still sized for gfx1151. A fresh K-outer tile sweep is needed for GDDR6 bandwidth (~640 GB/s on 9070 XT vs ~270 GB/s LPDDR5X on Strix Halo).

Build for gfx1201

# single-arch build, 9070 XT only
GFX=gfx1201 ./install.sh

# fat-binary build, runs on both strixhalo and ryzen
GFX="gfx1151;gfx1201" ./install.sh

# auto-detect via rocminfo (use on each host natively)
GFX=auto ./install.sh

Prereq: ROCm must be present on ryzen first. Easiest path is the same TheRock source build used on Strix Halo, re-targeted to Navi 48. System-package ROCm may also work on RDNA 4 in distros that ship it; verify with rocminfo.

ssh ryzen
ls /opt/rocm* ~/therock 2>/dev/null     # confirm a ROCm dist exists
rocminfo | grep -E 'Name:|gfx'            # expect gfx1201

First run

Start bitnet_decode on the dev port, then 1bit-halo-server as the OpenAI-compatible front. Verify with curl.

Start the inference core

cd ~/1bit-halo-core
./build/bitnet_decode \
  --model ~/1bit-halo/models/1bit-halo-v2.h1b \
  --port 8080 \
  --context 4096 \
  --attn split-kv-fd \
  --rope-mode hf-split-half

Start the HTTP surface

cd ~/1bit-halo-workspace
./target/release/1bit-halo-server \
  --upstream http://127.0.0.1:8080 \
  --bind 0.0.0.0:8180

Verify

curl -s http://127.0.0.1:8180/v1/models | jq
# expect: {"data": [{"id": "1bit-halo-v2", ...}]}

curl -s http://127.0.0.1:8180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "1bit-halo-v2",
    "messages": [{"role":"user","content":"say hello"}]
  }' | jq

Services & systemd

Production is systemd. One unit per binary. Units live in /etc/systemd/system/ (system scope) or ~/.config/systemd/user/ (user scope). LTS kernel needs LimitMEMLOCK=infinity for the inference core or pinning fails.

1bit-halo-bitnet.service

# /etc/systemd/system/1bit-halo-bitnet.service
[Unit]
Description=1bit bitnet_decode (HIP inference core)
After=network.target 1bit-halo-gpu-perf.service
Requires=1bit-halo-gpu-perf.service

[Service]
Type=simple
User=1bit-halo
Group=1bit-halo
ExecStart=/usr/local/bin/bitnet_decode \
  --model /var/lib/1bit-halo/models/1bit-halo-v2.h1b \
  --port 8080 \
  --context 4096 \
  --attn split-kv-fd \
  --rope-mode hf-split-half
Restart=on-failure
RestartSec=5
LimitMEMLOCK=infinity
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

1bit-halo-server.service

# /etc/systemd/system/1bit-halo-server.service
[Unit]
Description=1bit OpenAI-compatible HTTP surface
After=network.target 1bit-halo-bitnet.service
Requires=1bit-halo-bitnet.service

[Service]
Type=simple
User=1bit-halo
ExecStart=/usr/local/bin/1bit-halo-server \
  --upstream http://127.0.0.1:8080 \
  --bind 0.0.0.0:8180
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

1bit-halo-gpu-perf.service

Pins SCLK high to avoid latency spikes under sustained load. Required on LTS 6.18.22.

# /etc/systemd/system/1bit-halo-gpu-perf.service
[Unit]
Description=1bit GPU perf pinning (SCLK high)
After=multi-user.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level'

[Install]
WantedBy=multi-user.target

Enable & check

sudo systemctl daemon-reload
sudo systemctl enable --now 1bit-halo-gpu-perf 1bit-halo-bitnet 1bit-halo-server

systemctl status 1bit-halo-bitnet 1bit-halo-server
journalctl -u 1bit-halo-bitnet -f

Default ports

Port   Service             Purpose
8080   bitnet_decode       Internal dev · upstream for 1bit-halo-server
8180   1bit-halo-server    OpenAI-compatible HTTP · public-facing
8181   1bit-halo-mcp       MCP server · tool & introspection surface
8190   1bit-halo-whisper   Streaming STT (planned)
8191   1bit-halo-kokoro    TTS (planned)

Connect — Open WebUI

Open WebUI is the blessed third-party client today. Carve-out under Rule A (caller-side only; sunsets on 1bit-helm v0.3 parity).

Docker path

docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8180/v1 \
  -e OPENAI_API_KEY=none \
  -v openwebui-data:/app/backend/data \
  --name openwebui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

Native path (pipx)

pipx install open-webui
OPENAI_API_BASE_URL=http://127.0.0.1:8180/v1 \
OPENAI_API_KEY=none \
open-webui serve --port 3000

Visit http://localhost:3000, create the first admin account (stored locally), select 1bit-halo-v2 from the model dropdown.

Connect — Raw HTTP

Everything speaks OpenAI. No special client needed. Handy for smoke tests and shell scripts.

List models

curl -s http://127.0.0.1:8180/v1/models | jq '.data[].id'

One-shot completion

curl -s http://127.0.0.1:8180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "1bit-halo-v2",
    "messages": [
      {"role":"system","content":"Be concise."},
      {"role":"user","content":"Explain ternary weights in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq -r '.choices[0].message.content'

Streaming (SSE)

curl -N http://127.0.0.1:8180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "1bit-halo-v2",
    "messages": [{"role":"user","content":"count to ten"}],
    "stream": true
  }'
# server-sent events: each chunk is `data: {...}\n\n`
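
The same stream can be consumed without an SDK. A minimal caller-side sketch of parsing the `data:` framing (field names follow the OpenAI streaming schema; the canned input below stands in for a live response body):

```python
import json

def iter_sse_deltas(raw: str):
    """Yield content deltas from an OpenAI-style SSE body.

    Each event is a line `data: {...}`; the stream ends with `data: [DONE]`.
    """
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Canned two-chunk stream, shaped like the server's SSE output:
raw = (
    'data: {"choices":[{"delta":{"content":"one "}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"two"}}]}\n\n'
    'data: [DONE]\n\n'
)
print("".join(iter_sse_deltas(raw)))  # → one two
```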

Connect — MCP clients

1bit-halo-mcp exposes introspection and control as a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Claude Code, Continue, Cursor, custom agents) can attach.

Claude Desktop

// ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "1bit": {
      "command": "/usr/local/bin/1bit-halo-mcp",
      "args": ["--server", "http://127.0.0.1:8180"]
    }
  }
}

Claude Code

claude mcp add 1bit /usr/local/bin/1bit-halo-mcp -- \
  --server http://127.0.0.1:8180

1bit-halo-mcp exposes tools for: model listing, health probes, KV-cache stats, active session inspection, sampler overrides (temperature, top-p, top-k), and kernel timing. 22 tests cover the surface as of 2026-04-19.

Connect — Custom / SDK

Any OpenAI SDK works. Examples below in Python (caller-side), TypeScript (caller-side), and Rust.

Python · openai-python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",
    api_key="none",  # 1bit-halo-server ignores the key by default
)

stream = client.chat.completions.create(
    model="1bit-halo-v2",
    messages=[{"role": "user", "content": "Hello, ternary."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

TypeScript · openai (Bun-friendly)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8180/v1",
  apiKey: "none",
});

const stream = await client.chat.completions.create({
  model: "1bit-halo-v2",
  messages: [{ role: "user", content: "Hello, ternary." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

Rust · async-openai

use async_openai::{Client, config::OpenAIConfig, types::*};
use futures::StreamExt;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = OpenAIConfig::new()
        .with_api_base("http://localhost:8180/v1")
        .with_api_key("none");
    let client = Client::with_config(config);

    let req = CreateChatCompletionRequestArgs::default()
        .model("1bit-halo-v2")
        .messages([ChatCompletionRequestUserMessageArgs::default()
            .content("Hello, ternary.")
            .build()?
            .into()])
        .stream(true)
        .build()?;

    let mut stream = client.chat().create_stream(req).await?;
    while let Some(result) = stream.next().await {
        if let Ok(chunk) = result {
            if let Some(content) = &chunk.choices[0].delta.content {
                print!("{content}");
            }
        }
    }
    Ok(())
}

Add your own app

Third-party apps attach on the client side only. Rule A hard stop: no Python, no Node, no interpreted runtime inside 1bit-halo-server or downstream. Anything above it — UIs, agents, game bots, IDE plugins — is fair game in any language.

The shape of a caller-side app

  1. Speak OpenAI-compatible HTTP to :8180. Every SDK works.
  2. For richer introspection, connect to 1bit-halo-mcp at :8181.
  3. Use 1bit-halo-server's session header X-1bit-halo-Session to pin a conversation to a KV-cache slot.
  4. Handle 429 / 503 with exponential back-off — the server returns Retry-After.
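
Step 4 above can be sketched as a small caller-side wrapper. The shape of `call` and the back-off constants here are illustrative, not part of the server contract:

```python
import random
import time

def with_backoff(call, max_tries: int = 5, base: float = 0.5):
    """Retry `call` on 429/503, honoring Retry-After when the server sends it.

    `call` must return (status, retry_after_seconds_or_None, body).
    """
    for attempt in range(max_tries):
        status, retry_after, body = call()
        if status not in (429, 503):
            return body
        # Prefer the server's hint; otherwise exponential back-off with jitter.
        delay = retry_after if retry_after is not None else base * (2 ** attempt)
        time.sleep(delay + random.uniform(0, 0.1))
    raise RuntimeError(f"gave up after {max_tries} tries")

# Demo: a fake call that fails twice with 503 (Retry-After: 0), then succeeds.
attempts = {"n": 0}
def fake_call():
    attempts["n"] += 1
    return (503, 0, None) if attempts["n"] < 3 else (200, None, "ok")

print(with_backoff(fake_call, base=0.0))  # → ok
```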

Example — minimal agent harness

// minimal-agent.ts · run with `bun run minimal-agent.ts`
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8180/v1",
  apiKey: "none",
});

const session = crypto.randomUUID();
const history: OpenAI.ChatCompletionMessageParam[] = [
  { role: "system", content: "You are a terse assistant." },
];

async function turn(user: string) {
  history.push({ role: "user", content: user });
  const res = await client.chat.completions.create(
    { model: "1bit-halo-v2", messages: history },
    { headers: { "X-1bit-halo-Session": session } },
  );
  const reply = res.choices[0].message.content ?? "";
  history.push({ role: "assistant", content: reply });
  return reply;
}

console.log(await turn("Two facts about RDNA 3.5."));
console.log(await turn("And one that contradicts a common myth."));

Where your app lives

If the app is a serving surface (game integration, Discord bot, MCP bridge, API adapter), it belongs in 1bit.services, not in 1bit.systems core. Core stays kernel + serving only.

If the app is a library meant to be embedded (SDK wrapper, client helper), keep it in your own repo. The project maintains the HTTP contract; you maintain the client surface.

API stability — the OpenAI-compatible surface is the stable contract. The FFI boundary between 1bit-halo-server and bitnet_decode is internal and changes without notice. Build on HTTP, not on FFI.

Troubleshooting

Known failure modes and their fixes. Ordered by frequency, not severity.

amdgpu OPTC CRTC hang — full Wayland freeze

Symptom: the compositor freezes hard when multiple model servers run concurrently; recovery requires a power-cycle. Kernel log:

amdgpu: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR*
  REG_WAIT timeout 1us * 100000 tries - optc35_disable_crtc

Cause: gfx1151 bug on kernel 7.x.

Fix: roll back to the LTS kernel via snapper snapshot #6 ("7.00 with claude"):

sudo snapper -c root rollback 6
sudo limine-mkconfig
sudo reboot
# at limine menu, pick 6.18.22-lts

SMU / VCN / PSP hang on LTS

Symptom: on boot, journalctl -b shows:

amdgpu: SMU: Failed to send message 0x... rv -110  (-ETIME)
amdgpu: [PSP] Failed to load IP FW — LOAD_IP_FW failed
amdgpu: VPE / VCN powergate transition failed

Cause: the Tier-3b parameters in /etc/modprobe.d/halo.conf, tuned for kernel 7.0, misfire on the LTS kernel.

Fix:

sudo mv /etc/modprobe.d/halo.conf /etc/modprobe.d/halo.conf.disabled
sudo mkinitcpio -P
sudo reboot

Long-context PPL explodes

Symptom: PPL rises monotonically with context length; repetition PPL exceeds 4.

Cause: RoPE convention drift (interleaved vs HF split-half). Fixed 2026-04-19.

Fix: confirm flag and commit:

bitnet_decode --rope-mode hf-split-half   # not `interleaved`
git -C ~/1bit-halo-core log --oneline | grep -i rope
# must include the 2026-04-19 fix commit
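
For context, the two conventions rotate the same dimensions but pair them differently, which is why weights trained under one convention decode as noise under the other. An illustrative sketch of the pairing rule only (not the project's kernel):

```python
def rope_pairs(head_dim: int, mode: str):
    """Return the (i, j) index pairs each RoPE rotation acts on.

    interleaved:   (0,1), (2,3), ...          — adjacent dims paired
    hf-split-half: (0, d/2), (1, d/2+1), ...  — first half paired with second
    """
    half = head_dim // 2
    if mode == "interleaved":
        return [(2 * k, 2 * k + 1) for k in range(half)]
    if mode == "hf-split-half":
        return [(k, half + k) for k in range(half)]
    raise ValueError(mode)

print(rope_pairs(8, "interleaved"))    # → [(0, 1), (2, 3), (4, 5), (6, 7)]
print(rope_pairs(8, "hf-split-half"))  # → [(0, 4), (1, 5), (2, 6), (3, 7)]
```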

Service won't start — mlock failed

Symptom: systemd unit exits immediately, journal shows:

bitnet_decode: mlock failed: Operation not permitted

Fix: add to the unit's [Service] section:

LimitMEMLOCK=infinity

The NPU probe path (xrt-smi) will need the same limit once the NPU gate opens.
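
To confirm the limit the service actually received, check RLIMIT_MEMLOCK from inside its environment with the stdlib; running this under the unit (e.g. from an ExecStartPre helper) shows the effective value:

```python
import resource

# Current soft/hard RLIMIT_MEMLOCK, in bytes (RLIM_INFINITY means unlimited).
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

def fmt(v: int) -> str:
    return "unlimited" if v == resource.RLIM_INFINITY else f"{v} bytes"

print("memlock soft:", fmt(soft))
print("memlock hard:", fmt(hard))
```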

ROCm build fails — gfx1151 not supported

Symptom: CMake reports target not supported, or linker bails on unknown arch.

Fix: pass the target explicitly everywhere:

cmake -B build \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_HIP_ARCHITECTURES=gfx1151 \
  -DGPU_TARGETS=gfx1151

If the distro ROCm drops the arch entirely, build from source. The llamacpp-rocm fork's install script is the paved road.

Latency spikes under load

Symptom: tok/s drops 30–60% after the first minute of sustained generation.

Cause: SCLK falls out of the high-performance state.

Fix: 1bit-halo-gpu-perf.service pins SCLK high. Verify:

systemctl status 1bit-halo-gpu-perf
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
# expect: high
cat /sys/class/drm/card0/device/pp_dpm_sclk

1bit-halo-memory-sync failing every 15 min

Symptom: user timer logs show:

1bit-halo-memory-sync: push failed (credentials?)

Cause: an expired GitHub PAT, a corrupted GH_TOKEN fish universal variable, or a missing admin:public_key scope.

Fix:

# 1. clear corrupted universal env
set -e --universal GH_TOKEN

# 2. refresh auth with correct scopes
gh auth login --scopes "admin:public_key,repo,workflow"

# 3. re-run the timer
systemctl --user restart 1bit-halo-memory-sync.timer
journalctl --user -u 1bit-halo-memory-sync -f

Observability

Three levels: service logs, kernel-level profiling, and model-quality benchmarking. Everything stays local; no telemetry leaves the host.

Logs

# service logs
journalctl -u 1bit-halo-bitnet -f
journalctl -u 1bit-halo-server -f --since "1h ago"

# user-scope services
journalctl --user -u 1bit-halo-memory-sync.timer

Structured JSON logs with Loki + Grafana on the same host are planned; for now, journalctl -o json is the paved road.
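
Each journalctl -o json line is one JSON object with standard journald fields; a minimal sketch of pulling a timestamp and message out of one (the canned line below is illustrative):

```python
import json
from datetime import datetime, timezone

def parse_journal_line(line: str) -> tuple[str, str]:
    """Return (iso_timestamp, message) from one `journalctl -o json` line.

    __REALTIME_TIMESTAMP is microseconds since the epoch, encoded as a string.
    """
    entry = json.loads(line)
    usec = int(entry["__REALTIME_TIMESTAMP"])
    ts = datetime.fromtimestamp(usec / 1e6, tz=timezone.utc).isoformat()
    return ts, entry.get("MESSAGE", "")

# Canned journald line:
line = '{"__REALTIME_TIMESTAMP": "1745000000000000", "MESSAGE": "server up"}'
print(parse_journal_line(line))
```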

Kernel profiling

# bandwidth-bound sanity check
rocprof --stats --timestamp on \
  ./build/bitnet_decode --model 1bit-halo-v2.h1b --port 8080 --bench 64

# expect: ternary GEMV at ~92% LPDDR5 peak
# if lower, the tile or packed layout regressed
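
Bandwidth-bound decode has a simple throughput ceiling: every generated token streams all weights once, so tok/s cannot exceed bandwidth divided by bytes per token. A back-of-envelope sketch (the 256 GB/s and 2B-parameter inputs below are illustrative, not measurements):

```python
def decode_ceiling_toks(params: float, bits_per_weight: float, bw_gbs: float) -> float:
    """Upper bound on single-batch decode tok/s for a bandwidth-bound model.

    tok/s <= bandwidth / (params * bits_per_weight / 8)
    """
    bytes_per_token = params * bits_per_weight / 8
    return bw_gbs * 1e9 / bytes_per_token

# Illustrative: a 2B-parameter ternary model on a 256 GB/s memory system.
print(round(decode_ceiling_toks(2e9, 1.58, 256.0)))  # → 648
```

Real decode lands well below this ceiling: KV-cache and activation traffic add bytes per token, and the formula ignores compute, attention, and launch overhead. It is still the right sanity check for whether a kernel change moved the needle.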

Model quality — PPL harness

# wikitext-103 perplexity · post-RoPE-fix reference numbers:
# 1bit-halo-v2: PPL ~12 on wikitext-103, ~1.04 on repetition
./build/bitnet_decode \
  --model ~/1bit-halo/models/1bit-halo-v2.h1b \
  --ppl ~/datasets/wikitext-103/wiki.test.tokens \
  --context 2048
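
For reference, the harness's headline number reduces to a one-line formula: perplexity is the exponential of the mean negative log-likelihood per token. A sketch of that reduction (tokenization and windowing omitted):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """PPL = exp(-mean(log p(token)))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Uniform guessing over a 12-way choice gives PPL 12 exactly:
print(round(perplexity([math.log(1 / 12)] * 5), 6))  # → 12.0
```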

Live benchmark

# clean-burn reference numbers (2026-04-18):
# 64-token context:   66 tok/s
# 1024-token context: 33 tok/s
./build/bitnet_decode --bench 64 --bench 256 --bench 1024

Output conventions

On the reference host, benchmark JSON lands in /home/bcloud/claude output/. Other hosts pick their own path; the convention matters for the project's internal tracking only.

· · ·

Roadmap

  1. Sparse-BitNet retrain completion (H200 pod · 10B-token budget)
  2. BitNet v2 implementation — Hadamard-native W1.58 A4
  3. MedusaBitNet speculative heads — expected 1.4–1.8× at batch = 1
  4. gfx1201 WMMA intrinsic port for RX 9070 XT (second target)
  5. Streaming STT via halo-whisper sentence-boundary partials
  6. Image generation via sd.cpp native-HIP port, SDXL on gfx1151
  7. Video generation port — Wan 2.2 TI2V-5B (5B DiT, Apache 2.0)
  8. Desktop shell — voice-first, plugin API via MCP, package manager
  9. NPU unblock — gated on AMD

Changelog

  • 0.1.6 · 2026-04-22 — Sparse-BitNet Run 4 launched. Kernel rolled back to 6.18.22-lts. Network topology documented. gfx1201 build variant wired.
  • 0.1.5 · 2026-04-21 — TriLM INT4 ONNX export complete. NPU placement confirmed blocked. Six-crash investigation and kernel-7.0 rollback plan.
  • 0.1.4 · 2026-04-20 — Bare-metal-first lock-in. AMD and AMDResearch org scan. XDNA 2 defer verdict.
  • 0.1.3 · 2026-04-19 — RoPE split-half fix (PPL 524 → ~12). Split-KV Flash-Decoding attention (6.78× at L=2048). PPL harness landed.
  • 0.1.2 · 2026-04-18 — Sherry 1.25-bit spike committed. sd.cpp native-HIP port promoted to core.
  • 0.1.0 · 2026-04 — bitnet_decode online. Halo v2 first responding. ROCm 7.x system build against gfx1151.

Contact

Project source lives at github.com/bong-water-water-bong/1bit-systems. Open issues, discussions, and pull requests all welcome.

A Discord server exists for live discussion; the invite URL opens when the server is ready for wider links. A Patreon surface underwrites compute time (training runs, H200 pod hours, retrains) and opens when public channels open.

— end of document —