AMD Strix Halo · gfx1151 + XDNA 2 · inference engine first

Local inference, wired for Strix Halo. One OpenAI-compatible endpoint while the control plane is rebuilt.

The useful shape is simple: apps talk to one local endpoint. Today the most reliable Strix Halo repair path is a toolbox-backed llama-server on :13305, with 1bit-proxy on :13306 as the stable OpenAI-compatible app surface. Native Lemonade and FastFlowLM remain product-direction lanes, not a finished one-click control plane.

Install GitHub

routing surface local only

union 127.0.0.1:13306/v1 OpenAI-compatible app surface

backend:13305llama.cpp / Lemonade

npu lane:52625FastFlowLM optional

web ui:3000secondary client

control1bitCLI + lifecycle

curl http://127.0.0.1:13306/v1/models

Inference engine

The engine is the product surface: OpenAI-compatible apps send requests to a backend through the union endpoint. For the repair path, use the Strix Halo toolboxes first; the single control plane is still roadmap work.

Toolbox llama.cpp

Recommended first backend on Ubuntu/Fedora: kyuz0/amd-strix-halo-toolboxes:vulkan-radv, then rocm-7.2.2 after device access is verified.

Lemonade

Native multimodal and OmniRouter lane for the Arch/CachyOS path. It remains product direction, but toolbox llama-server can occupy :13305 during repair.

FastFlowLM

Optional XDNA NPU side lane on http://127.0.0.1:52625/v1 when the host NPU stack is actually healthy.

1bit proxy

Convenience union endpoint on http://127.0.0.1:13306/api/v1 and /v1. It is the stable app surface while backend lifecycle is rebuilt.

Apps

GAIA, Open WebUI, AnythingLLM, Continue, Dify, n8n, and custom SDK clients connect by setting an OpenAI-compatible base URL.

Control plane

1bit, GAIA, Open WebUI, systemd, and toolbox lifecycle are the intended control plane pieces. They are not yet one finished operator surface.

Open WebUI

Secondary browser UI on :3000, pointed at the union endpoint by the systemd unit.

Install

On Ubuntu/Fedora, start with toolbox-backed inference. The native installer is currently Arch/CachyOS-first and should not be treated as the universal bootstrap.

# Fedora toolbox: compatibility-first backend
toolbox create llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

toolbox enter llama-vulkan-radv
llama-server --host 127.0.0.1 --port 13305 -m /path/to/model.gguf -c 8192 -ngl 999 -fa 1 --no-mmap

# host
node scripts/1bit-proxy.js
curl -s http://127.0.0.1:13306/v1/models

After vulkan-radv is stable, test rocm-7.2.2 with /dev/dri and /dev/kfd passed through. See kyuz0/amd-strix-halo-toolboxes and strix-halo-toolboxes.com.

Native Arch/CachyOS path:

git clone https://github.com/bong-water-water-bong/1bit-systems
cd 1bit-systems
./install.sh

# after first install, re-login or reboot so memlock limits apply
1bit up
1bit status

The installer writes the CLI, systemd units, Open WebUI configuration, memlock limits, and local service defaults for the native path. The backend-agnostic control plane still needs a registry, lifecycle checks, and toolbox start/stop support.

Quickstart

Check	Command	Expected
Backend	`llama-cli --list-devices`	Toolbox can see the Strix Halo GPU before server mode starts.
Stack	`1bit status`	Native path status where installed; toolbox-backed lifecycle is still pending.
Backend API	`curl http://127.0.0.1:13305/v1/models`	Toolbox `llama-server` or Lemonade model list.
Union	`curl http://127.0.0.1:13306/v1/models`	Lemonade plus FLM model list.
GAIA	`1bit gaia status`	AppImage path, venv CLI, and current local UI port.
Open WebUI	`1bit webui status`	Secondary UI on `http://127.0.0.1:3000`.

Connect apps

This is how other apps use the inference engine: configure an OpenAI-compatible base URL and send normal SDK requests. Use any placeholder API key unless you explicitly enabled auth.

# Recommended for GAIA/Open WebUI/clients that want both lanes
http://127.0.0.1:13306/v1

# GAIA CLI style base URL
http://127.0.0.1:13306/api/v1

# Backend direct: toolbox llama-server or Lemonade
http://127.0.0.1:13305/api/v1

# FastFlowLM direct: optional NPU runtime
http://127.0.0.1:52625/v1

Use :13305 direct when testing the active backend. Use the proxy when one OpenAI-compatible client should keep the same base URL while the backend changes from toolbox llama.cpp to native Lemonade or optional FLM routing.

This follows Lemonade's app model: local apps integrate by configuring an OpenAI-compatible base URL. Lemonade's own docs cover the API surface and app guides at lemonade-server.ai/docs/api/ and /docs/server/apps/.

App	Base URL	Why
GAIA Agent UI / CLI	`http://127.0.0.1:13306/api/v1`	GAIA follows Lemonade-style `/api/v1` while still reaching the union endpoint.
Open WebUI	`http://127.0.0.1:13306/v1`	Standard OpenAI-compatible UI surface.
AnythingLLM / Continue / Dify / n8n	`http://127.0.0.1:13306/v1`	Generic OpenAI-compatible client setup.
Custom OpenAI SDK code	`http://127.0.0.1:13306/v1`	Use normal OpenAI SDK calls against the local engine.
Direct Lemonade apps	`http://127.0.0.1:13305/api/v1`	Canonical Lemonade multimodal and OmniRouter behavior.

The five rules

The current repair stack is toolbox-backed llama.cpp or native Lemonade on :13305, optional FastFlowLM on :52625, and 1bit-proxy on :13306. These rules describe the intended product boundary, not a finished one-click control plane.

Rule A

Core serving stays Python-free. Training, notebooks, build-time conversion, caller-side tools, and isolated compatibility UIs are allowed. The proxy, kernels, native runtimes, and model hot paths stay Python-free.

Rule B

C++20 for kernels. HIP code lives in rocm-cpp/; Rust is for layers above the kernel boundary.

Rule C

hipBLAS is banned in the runtime path. Port the kernel to rocm-cpp/ instead.

Rule D

Rust 1.88+, edition 2024. Bumps require a reason.

Rule E

FastFlowLM is the intended XDNA serving lane when the NPU stack is healthy. Custom NPU kernels use IRON at author-time, then MLIR-AIE, Peano, xclbin, and libxrt from C++ at runtime.

Carve-out

Open WebUI is a secondary compatibility UI behind the union endpoint. It does not become the engine.

Apps

GAIA Agent UI

Primary UI/control client. Point it at http://127.0.0.1:13306/api/v1 when it should use the full inference engine.

Open WebUI

Secondary UI. The service exports OPENAI_API_BASE_URL=http://127.0.0.1:13306/v1.

OpenAI clients

Continue, AnythingLLM, Dify, n8n, and similar tools can use :13306/v1 with any placeholder API key.

Bench results

Recent local runs on the reference Strix Halo box.

Benchmark	Result
NPU ioctl budget, `qwen3:0.6b`	19 decoded tokens, 3879 ioctls, 204 ioctls/token, 96.3 decode tok/s. Passed threshold 250, warned above 200.
Bonsai 1.7B IQ1_S	~4828 prompt tok/s, ~284.7 gen tok/s.
Bonsai 4B IQ1_S	~1904 prompt tok/s, ~142.5 gen tok/s.
Bonsai 8B IQ1_S	~1058 prompt tok/s, ~90.8 gen tok/s.
Gianni BitNet 3B TQ2_0	~1796 prompt tok/s, ~76.1 gen tok/s.

Architecture

Apps / SDKs
  -> 1bit-proxy :13306/v1 or :13306/api/v1
       -> toolbox llama-server or Lemonade :13305/v1
       -> optional FastFlowLM :52625/v1

Open WebUI :3000 -> 1bit-proxy :13306/v1
Control plane    -> target: 1bit CLI + GAIA + systemd/toolbox lifecycle

Models

Model policy is pragmatic: use GGUF through the Strix Halo llama.cpp toolboxes first, use 1-bit and ternary GGUF where they win on the iGPU, and use FLM's q4nx/AWQ catalog only when the XDNA NPU lane is verified on the host.

Troubleshooting

1bit status
systemctl status 1bit-stack.target
systemctl status lemond.service flm.service 1bit-proxy.service open-webui.service
journalctl -u lemond.service -n 80 --no-pager
tail -80 /var/log/1bit-systems/flm.log
tail -80 /var/log/1bit-systems/1bit-proxy.log
1bit gaia logs

Probe the actual ports before changing clients: active backend :13305, optional FLM :52625, proxy :13306, Open WebUI :3000, and the dynamic GAIA UI port shown by 1bit gaia status.

FAQ

Is `:13306` the new Lemonade?

No. The proxy is a client convenience layer. During repair, :13305 may be toolbox llama-server; on the native path it may be Lemonade.

Is the NPU shipping?

Not as the universal first path. FastFlowLM can run the XDNA NPU lane on :52625 on a healthy native host, but the current out-of-box repair path is GPU-backed toolbox inference first.

Should I expose this to the internet?

No, not directly. These are local developer services. Put authentication, TLS, and explicit routing in front of anything remote.

Contributing

Keep changes aligned with the repair path: inference endpoint compatibility first, toolbox-backed Strix Halo runtime, then GAIA integration, native Lemonade/OpenAI behavior, optional FastFlowLM NPU flags, 1bit-proxy routing, backend registry work, lifecycle, and static site accuracy.

Changelog

2026-05-06: Public docs now state the toolbox-first Strix Halo repair path and mark the single control plane as unfinished roadmap work.

2026-05-03: Public docs reset to the GAIA + Lemonade + FastFlowLM architecture, with the union endpoint documented as :13306.

Local inference, wired for Strix Halo. One OpenAI-compatible endpoint while the control plane is rebuilt.

Inference engine

Toolbox llama.cpp

Lemonade

FastFlowLM

1bit proxy

Apps

Control plane

Open WebUI

Install

Quickstart

Connect apps

The five rules

Rule A

Rule B

Rule C

Rule D

Rule E

Carve-out

Apps

GAIA Agent UI

Open WebUI

OpenAI clients

Bench results

Architecture

Models

Troubleshooting

FAQ

Is :13306 the new Lemonade?

Is the NPU shipping?

Should I expose this to the internet?

Contributing

Changelog

Is `:13306` the new Lemonade?