1bit.systems
1bit.systems

AMD Strix Halo · gfx1151 + XDNA 2 · inference engine first

Local inference, wired for Strix Halo. One OpenAI-compatible endpoint while the control plane is rebuilt.

The useful shape is simple: apps talk to one local endpoint. Today the most reliable Strix Halo repair path is a toolbox-backed llama-server on :13305, with 1bit-proxy on :13306 as the stable OpenAI-compatible app surface. Native Lemonade and FastFlowLM remain product-direction lanes, not a finished one-click control plane.

Inference engine

The engine is the product surface: OpenAI-compatible apps send requests to a backend through the union endpoint. For the repair path, use the Strix Halo toolboxes first; the single control plane is still roadmap work.

Toolbox llama.cpp

Recommended first backend on Ubuntu/Fedora: kyuz0/amd-strix-halo-toolboxes:vulkan-radv, then rocm-7.2.2 after device access is verified.

Lemonade

Native multimodal and OmniRouter lane for the Arch/CachyOS path. It remains product direction, but toolbox llama-server can occupy :13305 during repair.

FastFlowLM

Optional XDNA NPU side lane on http://127.0.0.1:52625/v1 when the host NPU stack is actually healthy.

1bit proxy

Convenience union endpoint on http://127.0.0.1:13306/api/v1 and /v1. It is the stable app surface while backend lifecycle is rebuilt.

Apps

GAIA, Open WebUI, AnythingLLM, Continue, Dify, n8n, and custom SDK clients connect by setting an OpenAI-compatible base URL.

Control plane

1bit, GAIA, Open WebUI, systemd, and toolbox lifecycle are the intended control plane pieces. They are not yet one finished operator surface.

Open WebUI

Secondary browser UI on :3000, pointed at the union endpoint by the systemd unit.

Install

On Ubuntu/Fedora, start with toolbox-backed inference. The native installer is currently Arch/CachyOS-first and should not be treated as the universal bootstrap.

# Fedora toolbox: compatibility-first backend
toolbox create llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

toolbox enter llama-vulkan-radv
llama-server --host 127.0.0.1 --port 13305 -m /path/to/model.gguf -c 8192 -ngl 999 -fa 1 --no-mmap

# host
node scripts/1bit-proxy.js
curl -s http://127.0.0.1:13306/v1/models

After vulkan-radv is stable, test rocm-7.2.2 with /dev/dri and /dev/kfd passed through. See kyuz0/amd-strix-halo-toolboxes and strix-halo-toolboxes.com.

Native Arch/CachyOS path:

git clone https://github.com/bong-water-water-bong/1bit-systems
cd 1bit-systems
./install.sh

# after first install, re-login or reboot so memlock limits apply
1bit up
1bit status

The installer writes the CLI, systemd units, Open WebUI configuration, memlock limits, and local service defaults for the native path. The backend-agnostic control plane still needs a registry, lifecycle checks, and toolbox start/stop support.

Quickstart

CheckCommandExpected
Backendllama-cli --list-devicesToolbox can see the Strix Halo GPU before server mode starts.
Stack1bit statusNative path status where installed; toolbox-backed lifecycle is still pending.
Backend APIcurl http://127.0.0.1:13305/v1/modelsToolbox llama-server or Lemonade model list.
Unioncurl http://127.0.0.1:13306/v1/modelsLemonade plus FLM model list.
GAIA1bit gaia statusAppImage path, venv CLI, and current local UI port.
Open WebUI1bit webui statusSecondary UI on http://127.0.0.1:3000.

Connect apps

This is how other apps use the inference engine: configure an OpenAI-compatible base URL and send normal SDK requests. Use any placeholder API key unless you explicitly enabled auth.

# Recommended for GAIA/Open WebUI/clients that want both lanes
http://127.0.0.1:13306/v1

# GAIA CLI style base URL
http://127.0.0.1:13306/api/v1

# Backend direct: toolbox llama-server or Lemonade
http://127.0.0.1:13305/api/v1

# FastFlowLM direct: optional NPU runtime
http://127.0.0.1:52625/v1

Use :13305 direct when testing the active backend. Use the proxy when one OpenAI-compatible client should keep the same base URL while the backend changes from toolbox llama.cpp to native Lemonade or optional FLM routing.

This follows Lemonade's app model: local apps integrate by configuring an OpenAI-compatible base URL. Lemonade's own docs cover the API surface and app guides at lemonade-server.ai/docs/api/ and /docs/server/apps/.

AppBase URLWhy
GAIA Agent UI / CLIhttp://127.0.0.1:13306/api/v1GAIA follows Lemonade-style /api/v1 while still reaching the union endpoint.
Open WebUIhttp://127.0.0.1:13306/v1Standard OpenAI-compatible UI surface.
AnythingLLM / Continue / Dify / n8nhttp://127.0.0.1:13306/v1Generic OpenAI-compatible client setup.
Custom OpenAI SDK codehttp://127.0.0.1:13306/v1Use normal OpenAI SDK calls against the local engine.
Direct Lemonade appshttp://127.0.0.1:13305/api/v1Canonical Lemonade multimodal and OmniRouter behavior.

The five rules

The current repair stack is toolbox-backed llama.cpp or native Lemonade on :13305, optional FastFlowLM on :52625, and 1bit-proxy on :13306. These rules describe the intended product boundary, not a finished one-click control plane.

Rule A

Core serving stays Python-free. Training, notebooks, build-time conversion, caller-side tools, and isolated compatibility UIs are allowed. The proxy, kernels, native runtimes, and model hot paths stay Python-free.

Rule B

C++20 for kernels. HIP code lives in rocm-cpp/; Rust is for layers above the kernel boundary.

Rule C

hipBLAS is banned in the runtime path. Port the kernel to rocm-cpp/ instead.

Rule D

Rust 1.88+, edition 2024. Bumps require a reason.

Rule E

FastFlowLM is the intended XDNA serving lane when the NPU stack is healthy. Custom NPU kernels use IRON at author-time, then MLIR-AIE, Peano, xclbin, and libxrt from C++ at runtime.

Carve-out

Open WebUI is a secondary compatibility UI behind the union endpoint. It does not become the engine.

Apps

GAIA Agent UI

Primary UI/control client. Point it at http://127.0.0.1:13306/api/v1 when it should use the full inference engine.

Open WebUI

Secondary UI. The service exports OPENAI_API_BASE_URL=http://127.0.0.1:13306/v1.

OpenAI clients

Continue, AnythingLLM, Dify, n8n, and similar tools can use :13306/v1 with any placeholder API key.

Bench results

Recent local runs on the reference Strix Halo box.

BenchmarkResult
NPU ioctl budget, qwen3:0.6b19 decoded tokens, 3879 ioctls, 204 ioctls/token, 96.3 decode tok/s. Passed threshold 250, warned above 200.
Bonsai 1.7B IQ1_S~4828 prompt tok/s, ~284.7 gen tok/s.
Bonsai 4B IQ1_S~1904 prompt tok/s, ~142.5 gen tok/s.
Bonsai 8B IQ1_S~1058 prompt tok/s, ~90.8 gen tok/s.
Gianni BitNet 3B TQ2_0~1796 prompt tok/s, ~76.1 gen tok/s.

Architecture

Apps / SDKs
  -> 1bit-proxy :13306/v1 or :13306/api/v1
       -> toolbox llama-server or Lemonade :13305/v1
       -> optional FastFlowLM :52625/v1

Open WebUI :3000 -> 1bit-proxy :13306/v1
Control plane    -> target: 1bit CLI + GAIA + systemd/toolbox lifecycle

Models

Model policy is pragmatic: use GGUF through the Strix Halo llama.cpp toolboxes first, use 1-bit and ternary GGUF where they win on the iGPU, and use FLM's q4nx/AWQ catalog only when the XDNA NPU lane is verified on the host.

Troubleshooting

1bit status
systemctl status 1bit-stack.target
systemctl status lemond.service flm.service 1bit-proxy.service open-webui.service
journalctl -u lemond.service -n 80 --no-pager
tail -80 /var/log/1bit-systems/flm.log
tail -80 /var/log/1bit-systems/1bit-proxy.log
1bit gaia logs

Probe the actual ports before changing clients: active backend :13305, optional FLM :52625, proxy :13306, Open WebUI :3000, and the dynamic GAIA UI port shown by 1bit gaia status.

FAQ

Is :13306 the new Lemonade?

No. The proxy is a client convenience layer. During repair, :13305 may be toolbox llama-server; on the native path it may be Lemonade.

Is the NPU shipping?

Not as the universal first path. FastFlowLM can run the XDNA NPU lane on :52625 on a healthy native host, but the current out-of-box repair path is GPU-backed toolbox inference first.

Should I expose this to the internet?

No, not directly. These are local developer services. Put authentication, TLS, and explicit routing in front of anything remote.

Contributing

Keep changes aligned with the repair path: inference endpoint compatibility first, toolbox-backed Strix Halo runtime, then GAIA integration, native Lemonade/OpenAI behavior, optional FastFlowLM NPU flags, 1bit-proxy routing, backend registry work, lifecycle, and static site accuracy.

Changelog

2026-05-06: Public docs now state the toolbox-first Strix Halo repair path and mark the single control plane as unfinished roadmap work.

2026-05-03: Public docs reset to the GAIA + Lemonade + FastFlowLM architecture, with the union endpoint documented as :13306.