1bit.systems · v0.1

Local ternary inference on consumer AMD silicon.

1bit.systems is a native C++ and Rust stack for sub-2-bit language, speech, and image models on AMD Strix Halo. No Python at runtime. No discrete GPU. No cloud dependency.

This is the full documentation, the full history, and the full roadmap on one page. Scroll, Ctrl+F, or bookmark a section. No separate blog, no separate benchmark report. If a thing exists about the project, it should be here.

Heads up — the XDNA 2 NPU on Strix Halo is not yet accessible to this project. AMD has not shipped a Linux execution provider for Strix Halo. Everything here runs on the integrated GPU. See NPU status.

What it is

A serving stack built around 1.58-bit ternary weights. Kernels are hand-written HIP targeting gfx1151. Everything above the kernels — OpenAI-compatible HTTP surface, MCP bridge, session state, sampler — is Rust. A 70B-parameter model fits in 128 GB of unified memory at 1.58 bits per parameter, which is the whole reason this is possible on a mini-PC.
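The arithmetic behind that memory claim is short. A sketch (assuming the packed 2-bit on-disk form; 1.58 bits is the information content of a ternary digit, and per-tensor scales add small overhead on top):

```python
# Why ternary weights make a 70B model fit in unified memory.
# Assumes 2 bits of storage per weight (the packed form); KV cache and
# activations are ignored here and need headroom of their own.
def weight_gib(params: float, bits_per_weight: float) -> float:
    """Weight footprint in GiB at a given per-weight bit width."""
    return params * bits_per_weight / 8 / 2**30

fp16    = weight_gib(70e9, 16)  # ~130.4 GiB: does not fit in 128 GB at all
ternary = weight_gib(70e9, 2)   # ~16.3 GiB: comfortable on 128 GB unified
print(f"70B @ FP16: {fp16:.1f} GiB, packed ternary: {ternary:.1f} GiB")
```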

The identity is unified as 1bit — one name for the brand, the codebase, the install path, and the binaries. Earlier references to halo-ai in git history point to the same project under its prior name.

Who it's for

  • People who want to run modern AI on hardware they own, with no cloud dependency.
  • Researchers interested in sub-2-bit inference on consumer silicon.
  • Developers who need a local LLM backend and don't want to ship a Python runtime.
  • Writers, artists, and tinkerers who need privacy by topology — weights and prompts never leave the room.

· · ·

Architecture

Hybrid C++ and Rust stack, layered from kernel to HTTP surface. Clients speak OpenAI-compatible HTTP and don't need to know what's underneath.

┌──────────────────────────────────────────────────────────┐
│  Clients · Open WebUI · MCP · CLI · Helm (planned)       │
└─────────────────────────┬────────────────────────────────┘
                          │  OpenAI-compatible HTTP
┌─────────────────────────▼────────────────────────────────┐
│  1bit-halo-server  (Rust, axum)                  :8180   │
│    Router · Sessions · Sampler · Token streamer          │
└─────────────────────────┬────────────────────────────────┘
                          │  FFI
┌─────────────────────────▼────────────────────────────────┐
│  bitnet_decode  (C++20, HIP)                     :8080   │
│    Ternary GEMV · Split-KV FD attention · RoPE           │
│    RMSNorm · SiLU · KV cache · Tokenizer · Sampler       │
└─────────────────────────┬────────────────────────────────┘
                          │  HIP
┌─────────────────────────▼────────────────────────────────┐
│  Radeon 8060S  ·  gfx1151  ·  40 CU  ·  wave32 WMMA      │
└──────────────────────────────────────────────────────────┘

Every layer is native. Rust handles orchestration (HTTP, sessions, scheduling, streaming). C++20 + HIP handles kernels and model state. No Python in the serving path at any layer.

Constraints

Rule A — no Python at runtime

Hard rule. No Python interpreter in any serving binary. Caller-side tooling can be written in any language; the serving path cannot.

Rule B — C++20 for kernels, Rust for orchestration

Default language for a new component is C++20 if it talks to HIP directly, Rust otherwise. Rust gets the safety and ownership guarantees where correctness matters most; C++ gets HIP intrinsics, wave32 WMMA, and the register-level control that the ternary GEMV depends on.

No runtime hipBLAS

Native Tensile-generated kernels are allowed; runtime hipBLAS is banned because its heuristic collapses on the skinny ternary GEMV shape the models use.

Kernels: overview

Kernel         Role
ternary_gemv   Packed ternary × FP16 GEMV · hot path
attention_fd   Split-KV Flash-Decoding · per-head parallel
rope           Rotary position embedding · HF split-half
rmsnorm        Root-mean-square normalization
silu           Activation · SwiGLU companion
kv_cache       KV-cache append + retrieval

All live in rocm-cpp/src/ and rocm-cpp/kernels/. HIP C++ with wave32 WMMA intrinsics, tuned for gfx1151. A gfx1201 port for RX 9070 XT is in flight.

Ternary GEMV

Weights are stored as packed ternary values {−1, 0, +1} with a per-tensor FP16 scale. The kernel reads 2 bits per weight, multiplies by the FP16 activation, accumulates in FP32, and writes back FP16.
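A caller-side sketch of that read-multiply-accumulate path (illustrative numpy; `pack_ternary`, `unpack_ternary`, and `ternary_gemv` are hypothetical helpers, and the byte layout below — four weights per byte, code 0 = 0, 1 = +1, 2 = −1 — is an assumption for the example, not the kernel's actual bit format):

```python
import numpy as np

# Illustrative packing: four ternary weights per byte, 2 bits each,
# one FP16 scale per tensor. Not the project's actual bit layout.
def pack_ternary(w: np.ndarray) -> np.ndarray:
    codes = np.where(w == 1, 1, np.where(w == -1, 2, 0)).astype(np.uint8)
    codes = codes.reshape(-1, 4)
    return codes[:, 0] | codes[:, 1] << 2 | codes[:, 2] << 4 | codes[:, 3] << 6

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    codes = np.stack([(packed >> s) & 3 for s in (0, 2, 4, 6)], axis=1).reshape(-1)[:n]
    return np.where(codes == 1, 1, np.where(codes == 2, -1, 0)).astype(np.float32)

def ternary_gemv(packed: np.ndarray, scale, x: np.ndarray):
    # read 2-bit weights, multiply by FP16 activations, accumulate in FP32
    w = unpack_ternary(packed, x.size)
    return np.float32(scale) * np.dot(w, x.astype(np.float32))

w = np.array([1, 0, -1, 1, -1, 0, 0, 1], dtype=np.int8)
x = np.arange(8, dtype=np.float16)
print(ternary_gemv(pack_ternary(w), np.float16(0.5), x))  # 0.5 * (w · x) = 2.0
```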

Current: 92% of LPDDR5X peak bandwidth on gfx1151. The kernel is memory-bandwidth-bound, not compute-bound, so reducing bytes read per token is the #1 speedup lever; sub-1-bit formats are the research priority.

Attention · split-KV Flash-Decoding

Standard Flash-Decoding adapted for gfx1151 wave32. Each attention head splits its KV range across multiple thread-blocks; softmax statistics are combined with a log-sum-exp merge; output is written once per head.
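The merge is exact, not approximate. A numpy model of the statistics combination (illustrative sketch, not the HIP kernel; single query vector, one head):

```python
import numpy as np

# Each KV partition returns an unnormalized output plus its softmax stats;
# a log-sum-exp rescale combines partials into the exact softmax result.
def partial_attn(q, K, V):
    s = K @ q                          # scores for this KV partition
    m = s.max()
    e = np.exp(s - m)
    return e @ V, m, e.sum()           # unnormalized out, running max, sum-exp

def merge(parts):
    m = max(p[1] for p in parts)       # global max across partitions
    num = sum(out * np.exp(pm - m) for out, pm, _ in parts)
    den = sum(se * np.exp(pm - m) for _, pm, se in parts)
    return num / den

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=4), rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V   # monolithic softmax
out = merge([partial_attn(q, K[:4], V[:4]), partial_attn(q, K[4:], V[4:])])
assert np.allclose(out, ref)           # split result is bit-for-bit the same math
```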

Landed 2026-04-19 as the default in bitnet_decode. Speedup: 6.78× at context length 2048, bit-exact against the reference path.

RoPE · convention fix

Rotary Position Embedding carried a convention mismatch until 2026-04-19: the implementation used interleaved rotation, while Hugging Face canonical models use split-half rotation. The fix was a six-line diff.

Metric                Before   After
PPL · wikitext-103    524      ~12
PPL · repeated text   4.29     1.04
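The two conventions, sketched on a single head in numpy (illustrative; only the dimension pairing differs, which is why the fix was tiny and the damage was large):

```python
import numpy as np

# Same frequencies, two different dimension pairings. Weights trained with
# one convention produce garbage when served with the other.
def rope_split_half(x, pos, theta=10000.0):
    d = x.shape[-1]
    ang = pos / theta ** (np.arange(d // 2) / (d // 2))
    cos, sin = np.cos(ang), np.sin(ang)
    a, b = x[: d // 2], x[d // 2:]            # HF canonical: pair (i, i + d/2)
    return np.concatenate([a * cos - b * sin, a * sin + b * cos])

def rope_interleaved(x, pos, theta=10000.0):
    d = x.shape[-1]
    ang = pos / theta ** (np.arange(d // 2) / (d // 2))
    cos, sin = np.cos(ang), np.sin(ang)
    a, b = x[0::2], x[1::2]                   # pair (2i, 2i + 1)
    out = np.empty_like(x)
    out[0::2], out[1::2] = a * cos - b * sin, a * sin + b * cos
    return out

x = np.arange(8, dtype=np.float64)
assert np.allclose(rope_split_half(x, 0), rope_interleaved(x, 0))   # identity at pos 0
assert not np.allclose(rope_split_half(x, 3), rope_interleaved(x, 3))
```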

Quant formats

Format                 Bits/weight    Status
BitNet 1.58            1.58           Shipped (Halo v2)
TriLM                  1.58           Shipped (experimental)
Sparse-BitNet · 3:4    1.25           Retraining
BitNet v2 · W1.58 A4   1.58 W + 4 A   Watching
LittleBit              0.1            Watching

Features

OpenAI-compatible HTTP

/v1/chat/completions, /v1/models, SSE streaming, bearer auth optional. Any OpenAI SDK works out of the box — point base_url at http://localhost:8180/v1.

Session-aware KV cache

Conversations are keyed by the X-1bit-halo-Session header. The KV cache is pinned per session, avoiding re-prefill on multi-turn threads.

MCP introspection

1bit-halo-mcp exposes model listing, health probes, KV-cache stats, sampler overrides, and kernel timing as Model Context Protocol tools. Attach from Claude Desktop, Claude Code, or any MCP client.

Local by topology

Every byte stays on the machine. No telemetry, no dial-home, no usage analytics. Caller-side clients may hit cloud APIs by choice; the serving path does not.

· · ·

Halo v2 · BitNet 1.58

  • 2B parameters · Microsoft's public BitNet release
  • 1.58-bit weights · FP16 activations
  • Served by bitnet_decode on :8080 and 1bit-halo-server on :8180
Context       tok/s
64 tokens     66
1024 tokens   33

Clean burn numbers from 2026-04-18, post-RoPE fix. Memory-bandwidth-bound across the whole context range.

TriLM

3.9B parameters, Apache 2.0, from the SpectraSuite (TriLM_3.9B_Unpacked). LLaMA architecture, ternary-trained from scratch. Used as a smoke-test model and as the NPU export candidate.

Sparse-BitNet

Retrain in progress on an H200 pod. Target: 1.25 effective bits per weight via 3:4 N:M sparsity layered on 1.58-bit weights.
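For readers unfamiliar with N:M sparsity, a generic magnitude-based 3:4 mask looks like this (illustrative numpy; `nm_mask_3_4` is a hypothetical helper, not the training run's masking code):

```python
import numpy as np

# Generic 3:4 N:M pruning: in every contiguous group of four weights,
# zero the smallest-magnitude one, keeping 3 of 4 nonzero positions.
def nm_mask_3_4(w: np.ndarray) -> np.ndarray:
    groups = w.reshape(-1, 4)
    drop = np.abs(groups).argmin(axis=1)               # one victim per group
    mask = np.ones_like(groups)
    mask[np.arange(len(groups)), drop] = 0.0
    return (groups * mask).reshape(w.shape)

w = np.array([1.0, -1.0, 0.2, -3.0, 0.0, 2.0, -2.0, 1.0])
print(nm_mask_3_4(w))  # smallest-magnitude weight in each group of 4 is zeroed
```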

Run 3 bailed at step 500 on a false-positive mask-integrity check (empty mask cache due to disabled monitoring). Run 4 launched 2026-04-22 with the patch — model.enable_mask_monitoring() at init, mask_cache.clear() after each verify, --save-every 100 to avoid losing progress to future bails.

Pre-bail numbers on Run 3 tracked cleanly: loss 11.04 → 5.77 across steps 50 → 500, throughput steady at 49.5k tok/s. Run 4 first checkpoint landed at step 100. 10B-token gate ETA ~57 h.

· · ·

The stack

  • Host OS: CachyOS (Arch-family, rolling) on Btrfs + snapper + limine
  • Kernel: 6.18.22-lts — pinned after an amdgpu OPTC hang on 7.0
  • ROCm: 7.x built from source against gfx1151 (not on ROCm's Tier-1 list)
  • Kernels: C++20 + HIP · wave32 WMMA · zero runtime hipBLAS
  • Orchestration: Rust 2021 · Cargo workspace · axum + tokio
  • Edge: Caddy reverse-proxy for bearer auth and TLS
  • Supervision: systemd (one unit per binary)

The machine

  • CPU: Ryzen AI Max+ 395 · Zen 5 · 16 cores / 32 threads
  • GPU: Radeon 8060S · 40 CU RDNA 3.5 · gfx1151 · wave32 WMMA
  • NPU: XDNA 2 · 50 TOPS claimed · inaccessible — see NPU status
  • Memory: 128 GB LPDDR5X-8000 · unified · ~270 GB/s peak
  • Power: 45–120 W configurable TDP envelope

A secondary RDNA 4 target (RX 9070 XT on gfx1201) lives in the ryzen mesh node. A fat-binary build covering both arches is wired in rocm-cpp; the gfx1201 WMMA intrinsic port is in flight.

Serving

1bit-halo-server is Rust + axum, serving an OpenAI-compatible surface on :8180. It forwards to bitnet_decode over FFI, manages session state, streams tokens, and enforces per-session rate limits.

Port   Service             Purpose
8080   bitnet_decode       Internal · upstream for 1bit-halo-server
8180   1bit-halo-server    OpenAI-compatible HTTP · public-facing
8181   1bit-halo-mcp       MCP server · tool & introspection surface
8190   1bit-halo-whisper   Streaming STT (planned)
8191   1bit-halo-kokoro    TTS (planned)

MCP

1bit-halo-mcp exposes introspection and control as a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Claude Code, Continue, Cursor, custom agents) can attach and call tools for: model listing, health probes, KV-cache stats, active session inspection, sampler overrides, and kernel timing.

22 tests cover the surface. Canonical since 2026-04-19.

NPU status

Gatekept — Strix Halo ships with an XDNA 2 NPU, but the software path is not available to this project today. The project runs on the iGPU until one of the blockers below clears.

No Linux execution provider for STX-H

AMD's Ryzen AI 1.7 Linux stack supports Strix Point (STX) and Krackan (KRK) only. Strix Halo (STX-H) has no Linux execution provider. The Windows stack exposes the NPU through a proprietary VitisAI provider that is Windows-only.

Quant format mismatch

AMD's Ryzen AI model collections ship UINT4-AWQ weights with BFP16 activations. No ternary kernel. No 1.58-bit compile path. MatMulNBits with N=4 is the only shape the AIE control-packet graph compiler accepts today.

Kernel authoring is gated

Writing native AIE kernels requires Riallto (Phoenix-only, Ubuntu 24.04.2 + Docker + paid Xilinx license, zero GEMM kernels shipped). Custom ternary kernels on XDNA 2 would need to be authored from scratch against this toolchain.

Current verdict

Defer. Revisit when AMD ships STX-H on Ryzen AI Linux, or when a community BitNet backend for XDNA 2 lands in public. Nothing on the near- or mid-term roadmap depends on the NPU shipping.

· · ·

Requirements

The stack targets Strix Halo specifically. Other gfx1100-family hardware may work with minor tweaks; only Strix Halo is tested.

Hardware — minimum

  • AMD Ryzen AI Max+ 395 (or equivalent Strix Halo SKU)
  • Radeon 8060S iGPU · gfx1151 · wave32 WMMA
  • 64 GB unified LPDDR5X minimum · 128 GB recommended for 13B+ ternary
  • 100 GB free disk for models plus build artifacts

Software — minimum

  • Linux kernel 6.18.22-lts (newer kernels carry the amdgpu OPTC hang — see troubleshooting)
  • ROCm 7.x — built from source against gfx1151 (not on ROCm's Tier-1 list)
  • LLVM / clang 18+
  • CMake 3.27+
  • Rust 1.82+ (stable channel)
  • Node.js or Bun only for caller-side clients. Nothing on the serving path — Rule A.

Recommended host

CachyOS with Btrfs + snapper + limine is the reference setup. Rollback-via-snapper has saved the project more than once. Fish shell is assumed in examples but not required.

Install

No binary distribution yet; build from source. Packaging (AppImage + Flatpak) is on the near-term roadmap; the 1bit-halo-pkg model package manager is long-term.

Build ROCm against gfx1151

Most distros' packaged ROCm ships without gfx1151 support. Build from source, or use the llamacpp-rocm fork's install script as a bootstrap.

git clone https://github.com/bong-water-water-bong/llamacpp-rocm ~/repos/llamacpp-rocm
cd ~/repos/llamacpp-rocm
./scripts/install-rocm.sh --target gfx1151

Build rocm-cpp kernel library

git clone https://github.com/bong-water-water-bong/rocm-cpp ~/repos/rocm-cpp
cd ~/repos/rocm-cpp

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_HIP_ARCHITECTURES=gfx1151

cmake --build build -j$(nproc)
sudo cmake --install build --prefix /usr/local

Build 1bit-halo-core (bitnet_decode)

# private repo today; public release gated on NPU ship-gate
cd ~/1bit-halo-core

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

./build/bitnet_decode --help

Build 1bit-halo-server and 1bit-halo-mcp

cd ~/1bit-halo-workspace

cargo build --release --bin 1bit-halo-server
cargo build --release --bin 1bit-halo-mcp

ls -la target/release/1bit-halo-server target/release/1bit-halo-mcp

Fetch model weights

# 1bit-halo-pkg is not shipped yet. Manual download for now:
mkdir -p ~/1bit-halo/models
cd ~/1bit-halo/models

# 1bit-halo-v2 · BitNet 1.58 · 2B
curl -LO https://.../1bit-halo-v2.h1b   # actual URL TBD

# TriLM 3.9B Unpacked (experimental)
curl -LO https://.../trilm-3.9b.h1b

Rule A reminder — Python may appear in caller-side tooling and dev-time scripts only. bitnet_decode, 1bit-halo-server, 1bit-halo-mcp, and kernel binaries ship zero Python. Carve-outs (Open WebUI, lemonade-server) are caller-side and sunset on 1bit-helm v0.3 parity.

Second target — RX 9070 XT (gfx1201)

Radeon RX 9070 XT (Navi 48, RDNA 4) lives in the ryzen mesh host and is the secondary kernel target. The build system is already multi-arch: HIP bundles per-arch code objects into a fat binary and picks at load time. Default build covers both.

The hot intrinsics — __builtin_amdgcn_wmma_*_w32, __builtin_amdgcn_sudot4, __builtin_amdgcn_sdot4 — are retained on RDNA 4. Correctness holds out of the gate. Peak throughput is not yet tuned for gfx1201; block sizes and LDS budgets are still sized for gfx1151. A fresh K-outer tile sweep is needed for GDDR6 bandwidth (~640 GB/s on 9070 XT vs ~270 GB/s LPDDR5X on Strix Halo).

Build for gfx1201

# single-arch build, 9070 XT only
GFX=gfx1201 ./install.sh

# fat-binary build, runs on both strixhalo and ryzen
GFX="gfx1151;gfx1201" ./install.sh

# auto-detect via rocminfo (use on each host natively)
GFX=auto ./install.sh

Prereq: ROCm must be present on ryzen first. Easiest path is the same TheRock source build used on Strix Halo, re-targeted to Navi 48. System-package ROCm may also work on RDNA 4 in distros that ship it; verify with rocminfo.

ssh ryzen
ls /opt/rocm* ~/therock 2>/dev/null     # confirm a ROCm dist exists
rocminfo | grep -E 'Name:|gfx'            # expect gfx1201

First run

Start bitnet_decode on the dev port, then 1bit-halo-server as the OpenAI-compatible front. Verify with curl.

Start the inference core

cd ~/1bit-halo-core
./build/bitnet_decode \
  --model ~/1bit-halo/models/1bit-halo-v2.h1b \
  --port 8080 \
  --context 4096 \
  --attn split-kv-fd \
  --rope-mode hf-split-half

Start the HTTP surface

cd ~/1bit-halo-workspace
./target/release/1bit-halo-server \
  --upstream http://127.0.0.1:8080 \
  --bind 0.0.0.0:8180

Verify

curl -s http://127.0.0.1:8180/v1/models | jq
# expect: {"data": [{"id": "1bit-halo-v2", ...}]}

curl -s http://127.0.0.1:8180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "1bit-halo-v2",
    "messages": [{"role":"user","content":"say hello"}]
  }' | jq

Services & systemd

Production is systemd. One unit per binary. Units live in /etc/systemd/system/ (system scope) or ~/.config/systemd/user/ (user scope). On the LTS kernel the inference core needs LimitMEMLOCK=infinity, or memory pinning fails.

1bit-halo-bitnet.service

# /etc/systemd/system/1bit-halo-bitnet.service
[Unit]
Description=1bit bitnet_decode (HIP inference core)
After=network.target 1bit-halo-gpu-perf.service
Requires=1bit-halo-gpu-perf.service

[Service]
Type=simple
User=1bit-halo
Group=1bit-halo
ExecStart=/usr/local/bin/bitnet_decode \
  --model /var/lib/1bit-halo/models/1bit-halo-v2.h1b \
  --port 8080 \
  --context 4096 \
  --attn split-kv-fd \
  --rope-mode hf-split-half
Restart=on-failure
RestartSec=5
LimitMEMLOCK=infinity
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

1bit-halo-server.service

# /etc/systemd/system/1bit-halo-server.service
[Unit]
Description=1bit OpenAI-compatible HTTP surface
After=network.target 1bit-halo-bitnet.service
Requires=1bit-halo-bitnet.service

[Service]
Type=simple
User=1bit-halo
ExecStart=/usr/local/bin/1bit-halo-server \
  --upstream http://127.0.0.1:8080 \
  --bind 0.0.0.0:8180
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

1bit-halo-gpu-perf.service

Pins SCLK high to avoid latency spikes under sustained load. Required on LTS 6.18.22.

# /etc/systemd/system/1bit-halo-gpu-perf.service
[Unit]
Description=1bit GPU perf pinning (SCLK high)
After=multi-user.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level'

[Install]
WantedBy=multi-user.target

Enable & check

sudo systemctl daemon-reload
sudo systemctl enable --now 1bit-halo-gpu-perf 1bit-halo-bitnet 1bit-halo-server

systemctl status 1bit-halo-bitnet 1bit-halo-server
journalctl -u 1bit-halo-bitnet -f

Default ports

Port   Service             Purpose
8080   bitnet_decode       Internal dev · upstream for 1bit-halo-server
8180   1bit-halo-server    OpenAI-compatible HTTP · public-facing
8181   1bit-halo-mcp       MCP server · tool & introspection surface
8190   1bit-halo-whisper   Streaming STT (planned)
8191   1bit-halo-kokoro    TTS (planned)

Connect — Open WebUI

Open WebUI is the blessed third-party client today. Carve-out under Rule A (caller-side only; sunsets on 1bit-helm v0.3 parity).

Docker path

docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8180/v1 \
  -e OPENAI_API_KEY=none \
  -v openwebui-data:/app/backend/data \
  --name openwebui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

Native path (pipx)

pipx install open-webui
OPENAI_API_BASE_URL=http://127.0.0.1:8180/v1 \
OPENAI_API_KEY=none \
open-webui serve --port 3000

Visit http://localhost:3000, create the first admin account (stored locally), select 1bit-halo-v2 from the model dropdown.

Connect — Raw HTTP

Everything speaks OpenAI. No special client needed. Handy for smoke tests and shell scripts.

List models

curl -s http://127.0.0.1:8180/v1/models | jq '.data[].id'

One-shot completion

curl -s http://127.0.0.1:8180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "1bit-halo-v2",
    "messages": [
      {"role":"system","content":"Be concise."},
      {"role":"user","content":"Explain ternary weights in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq -r '.choices[0].message.content'

Streaming (SSE)

curl -N http://127.0.0.1:8180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "1bit-halo-v2",
    "messages": [{"role":"user","content":"count to ten"}],
    "stream": true
  }'
# server-sent events: each chunk is `data: {...}\n\n`
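If you consume the stream without an SDK, the framing above is all there is to parse. A minimal caller-side sketch (`sse_deltas` is a hypothetical helper; real clients should prefer an SSE library or an OpenAI SDK):

```python
import json

# Parse the SSE framing: each event is a `data: {...}` line, blank-line
# separated, terminated by `data: [DONE]`. Yields only content deltas.
def sse_deltas(lines):
    for line in lines:
        if not line.startswith("data: "):
            continue                       # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            yield delta

stream = [
    'data: {"choices":[{"delta":{"content":"one"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":" two"}}]}',
    '',
    'data: [DONE]',
]
print("".join(sse_deltas(stream)))  # one two
```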

Connect — MCP clients

1bit-halo-mcp exposes introspection and control as a Model Context Protocol server. Any MCP-aware client (Claude Desktop, Claude Code, Continue, Cursor, custom agents) can attach.

Claude Desktop

// ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "1bit": {
      "command": "/usr/local/bin/1bit-halo-mcp",
      "args": ["--server", "http://127.0.0.1:8180"]
    }
  }
}

Claude Code

claude mcp add 1bit /usr/local/bin/1bit-halo-mcp -- \
  --server http://127.0.0.1:8180

1bit-halo-mcp exposes tools for: model listing, health probes, KV-cache stats, active session inspection, sampler overrides (temperature, top-p, top-k), and kernel timing. 22 tests cover the surface as of 2026-04-19.

Connect — Custom / SDK

Any OpenAI SDK works. Examples below in Python (caller-side), TypeScript (caller-side), and Rust.

Python · openai-python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8180/v1",
    api_key="none",  # 1bit-halo-server ignores the key by default
)

stream = client.chat.completions.create(
    model="1bit-halo-v2",
    messages=[{"role": "user", "content": "Hello, ternary."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

TypeScript · openai (Bun-friendly)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8180/v1",
  apiKey: "none",
});

const stream = await client.chat.completions.create({
  model: "1bit-halo-v2",
  messages: [{ role: "user", content: "Hello, ternary." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

Rust · async-openai

use async_openai::{Client, config::OpenAIConfig, types::*};
use futures::StreamExt;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config = OpenAIConfig::new()
        .with_api_base("http://localhost:8180/v1")
        .with_api_key("none");
    let client = Client::with_config(config);

    let req = CreateChatCompletionRequestArgs::default()
        .model("1bit-halo-v2")
        .messages([ChatCompletionRequestUserMessageArgs::default()
            .content("Hello, ternary.")
            .build()?
            .into()])
        .stream(true)
        .build()?;

    let mut stream = client.chat().create_stream(req).await?;
    while let Some(result) = stream.next().await {
        if let Ok(chunk) = result {
            if let Some(content) = &chunk.choices[0].delta.content {
                print!("{content}");
            }
        }
    }
    Ok(())
}

Add your own app

Third-party apps attach on the client side only. Rule A hard stop: no Python, no Node, no interpreted runtime inside 1bit-halo-server or downstream. Anything above it — UIs, agents, game bots, IDE plugins — is fair game in any language.

The shape of a caller-side app

  1. Speak OpenAI-compatible HTTP to :8180. Every SDK works.
  2. For richer introspection, connect to 1bit-halo-mcp at :8181.
  3. Use 1bit-halo-server's session header X-1bit-halo-Session to pin a conversation to a KV-cache slot.
  4. Handle 429 / 503 with exponential back-off — the server returns Retry-After.

Example — minimal agent harness

// minimal-agent.ts · run with `bun run minimal-agent.ts`
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8180/v1",
  apiKey: "none",
});

const session = crypto.randomUUID();
const history: OpenAI.ChatCompletionMessageParam[] = [
  { role: "system", content: "You are a terse assistant." },
];

async function turn(user: string) {
  history.push({ role: "user", content: user });
  const res = await client.chat.completions.create(
    { model: "1bit-halo-v2", messages: history },
    { headers: { "X-1bit-halo-Session": session } },
  );
  const reply = res.choices[0].message.content ?? "";
  history.push({ role: "assistant", content: reply });
  return reply;
}

console.log(await turn("Two facts about RDNA 3.5."));
console.log(await turn("And one that contradicts a common myth."));

Where your app lives

If the app is a serving surface (game integration, Discord bot, MCP bridge, API adapter), it belongs in 1bit.services, not in 1bit.systems core. Core stays kernel + serving only.

If the app is a library meant to be embedded (SDK wrapper, client helper), keep it in your own repo. The project maintains the HTTP contract; you maintain the client surface.

API stability — the OpenAI-compatible surface is the stable contract. The FFI boundary between 1bit-halo-server and bitnet_decode is internal and changes without notice. Build on HTTP, not on FFI.

Troubleshooting

Known failure modes and their fixes. Ordered by frequency, not severity.

amdgpu OPTC CRTC hang — full Wayland freeze

Symptom: compositor freezes hard under concurrent model servers. Requires power-cycle. Kernel log:

amdgpu: [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR*
  REG_WAIT timeout 1us * 100000 tries - optc35_disable_crtc

Cause: gfx1151 bug on kernel 7.x.

Fix: roll back to LTS via snapper snapshot #6 ("7.00 with claude"):

sudo snapper -c root rollback 6
sudo limine-mkconfig
sudo reboot
# at limine menu, pick 6.18.22-lts

SMU / VCN / PSP hang on LTS

Symptom: journalctl -b on boot:

amdgpu: SMU: Failed to send message 0x... rv -110  (-ETIME)
amdgpu: [PSP] Failed to load IP FW — LOAD_IP_FW failed
amdgpu: VPE / VCN powergate transition failed

Cause: /etc/modprobe.d/halo.conf Tier-3b parameters tuned for 7.0 misfire on LTS.

Fix:

sudo mv /etc/modprobe.d/halo.conf /etc/modprobe.d/halo.conf.disabled
sudo mkinitcpio -P
sudo reboot

Long-context PPL explodes

Symptom: PPL rises monotonically with context, repetition PPL > 4.

Cause: RoPE convention drift (interleaved vs HF split-half). Fixed 2026-04-19.

Fix: confirm flag and commit:

bitnet_decode --rope-mode hf-split-half   # not `interleaved`
git -C ~/1bit-halo-core log --oneline | grep -i rope
# must include the 2026-04-19 fix commit

Service won't start — mlock failed

Symptom: systemd unit exits immediately, journal shows:

bitnet_decode: mlock failed: Operation not permitted

Fix: add to the unit:

LimitMEMLOCK=infinity

The NPU probe path (xrt-smi) needs the same, for the day the gate opens.

ROCm build fails — gfx1151 not supported

Symptom: CMake reports target not supported, or linker bails on unknown arch.

Fix: pass the target explicitly everywhere:

cmake -B build \
  -DAMDGPU_TARGETS=gfx1151 \
  -DCMAKE_HIP_ARCHITECTURES=gfx1151 \
  -DGPU_TARGETS=gfx1151

If the distro ROCm drops the arch entirely, build from source. The llamacpp-rocm fork's install script is the paved road.

Latency spikes under load

Symptom: tok/s drops 30–60% after the first minute of sustained generation.

Cause: SCLK falls out of high-perf state.

Fix: 1bit-halo-gpu-perf.service pins SCLK high. Verify:

systemctl status 1bit-halo-gpu-perf
cat /sys/class/drm/card0/device/power_dpm_force_performance_level
# expect: high
cat /sys/class/drm/card0/device/pp_dpm_sclk

1bit-halo-memory-sync failing every 15 min

Symptom: user timer logs show:

1bit-halo-memory-sync: push failed (credentials?)

Cause: GitHub PAT expired, corrupted GH_TOKEN fish universal variable, or missing admin:public_key scope.

Fix:

# 1. clear corrupted universal env
set -e --universal GH_TOKEN

# 2. refresh auth with correct scopes
gh auth login --scopes "admin:public_key,repo,workflow"

# 3. re-run the timer
systemctl --user restart 1bit-halo-memory-sync.timer
journalctl --user -u 1bit-halo-memory-sync -f

Observability

Three levels: service logs, kernel-level profiling, and model-quality benchmarking. Everything local; no telemetry leaves the host.

Logs

# service logs
journalctl -u 1bit-halo-bitnet -f
journalctl -u 1bit-halo-server -f --since "1h ago"

# user-scope services
journalctl --user -u 1bit-halo-memory-sync.timer

Structured JSON logs with Loki + Grafana on the same host are planned; for now, journalctl -o json is the paved road.

Kernel profiling

# bandwidth-bound sanity check
rocprof --stats --timestamp on \
  ./build/bitnet_decode --model 1bit-halo-v2.h1b --port 8080 --bench 64

# expect: ternary GEMV at ~92% LPDDR5X peak
# if lower, the tile or packed layout regressed

Model quality — PPL harness

# wikitext-103 perplexity · post-RoPE-fix reference numbers:
# 1bit-halo-v2: PPL ~12 on wikitext-103, ~1.04 on repetition
./build/bitnet_decode \
  --model ~/1bit-halo/models/1bit-halo-v2.h1b \
  --ppl ~/datasets/wikitext-103/wiki.test.tokens \
  --context 2048

Live benchmark

# clean-burn reference numbers (2026-04-18):
# 64-token context:   66 tok/s
# 1024-token context: 33 tok/s
./build/bitnet_decode --bench 64 --bench 256 --bench 1024

Output conventions

On the reference host, benchmark JSON lands in /home/bcloud/claude output/. Other hosts pick their own path; the convention matters for the project's internal tracking only.

· · ·

Roadmap

  1. Sparse-BitNet retrain completion (H200 pod · 10B-token budget)
  2. BitNet v2 implementation — Hadamard-native W1.58 A4
  3. MedusaBitNet speculative heads — expected 1.4–1.8× at batch = 1
  4. gfx1201 WMMA intrinsic port for RX 9070 XT (second target)
  5. Streaming STT via halo-whisper sentence-boundary partials
  6. Image generation via sd.cpp native-HIP port, SDXL on gfx1151
  7. Video generation port — Wan 2.2 TI2V-5B (5B DiT, Apache 2.0)
  8. Desktop shell — voice-first, plugin API via MCP, package manager
  9. NPU unblock — gated on AMD

Changelog

  • 0.1.6 · 2026-04-22 — Sparse-BitNet Run 4 launched. Kernel rolled back to 6.18.22-lts. Network topology documented. gfx1201 build variant wired.
  • 0.1.5 · 2026-04-21 — TriLM INT4 ONNX export complete. NPU placement confirmed blocked. Six-crash investigation and kernel-7.0 rollback plan.
  • 0.1.4 · 2026-04-20 — Bare-metal-first lock-in. AMD and AMDResearch org scan. XDNA 2 defer verdict.
  • 0.1.3 · 2026-04-19 — RoPE split-half fix (PPL 524 → ~12). Split-KV Flash-Decoding attention (6.78× at L=2048). PPL harness landed.
  • 0.1.2 · 2026-04-18 — Sherry 1.25-bit spike committed. sd.cpp native-HIP port promoted to core.
  • 0.1.0 · 2026-04 — bitnet_decode online. Halo v2 first responding. ROCm 7.x system build against gfx1151.

Contact

Project source lives at github.com/bong-water-water-bong/1bit-systems. Open issues, discussions, and pull requests all welcome.

A Discord server exists for live discussion; the invite URL opens when the server is ready for wider links. A Patreon surface underwrites compute time (training runs, H200 pod hours, retrains) and opens when public channels open.

— end of document —