Why no Python at runtime? (Rule A)
One-line answer: Python is fine as a caller or a build-time tool. It's banned inside any systemd service, any HTTP endpoint, any hot-path process. We've been burned too many times by dependency hell, silent import failures, and 300 MB interpreter images shipped to run a 4 GB model.
Scope
Banned at runtime: anything that's the target of a systemd unit, anything serving an HTTP request, anything on the startup path of the stack.
Allowed elsewhere:
- Caller-side scripts — DSPy, lemonade-python-sdk, user's jupyter notebook.
- Build-time tooling — our
.h1brequantizer runs once to convert Microsoft's weights. It's a Python + PyTorch one-shot that writes a file. Not shipped. - Research notebooks — benchmark analysis, paper reproduction. Output is numbers or plots, not services.
What we tried first
Gen-1 of 1bit systems was Python (AMD Lemonade + transformers + a custom FastAPI layer). It was the natural starting point — Microsoft's reference is Python, MLX is Python, lemonade is Python, every BitNet community project is Python.
What we hit:
- Cold-start latency — 4-12 seconds to import torch + transformers + numpy before the first inference. On a service meant to serve LAN requests, that's a restart outage. Rust
1bit-servercold-start is ~200 ms. - Dependency churn —
pip installresolvers pick different versions day to day. One morning the Lemonade server wouldn't start because a transitive dep bumped past a compatible minor. Rust cargo's lockfile prevents this class of bug. - Silent failure modes — Python typing is advisory. JSON parsing fails at the first bad field, which may be six levels deep in an error the user sees as a 500. Rust serde fails at deserialization with a precise path.
- Multiple Pythons — the dev box had
python3.12, the install script pulledpython3.11, the AUR package expectedpython3.13. Three stacks. Three venv roots. Three reasons the service won't start on a fresh box. - Binary size — the gen-1 install weighed 1.2 GB of Python + wheels. Gen-2 Rust is 15 MB of static-ish binaries.
- Global interpreter lock — Python's concurrency story is either subprocess (slow) or asyncio (complicated, doesn't parallelize CPU-bound code). Rust
tokiohandles 1000+ concurrent HTTP connections on one thread.
The immediate wins of the Rust rewrite
- Install time:
./install-strixhalo.shis 5 minutes end-to-end, most of it spent compiling kernels. - First-request latency: under 250 ms cold, effectively 0 warm.
- Memory footprint at idle: ~80 MB for 1bit-server vs ~350 MB for the gen-1 Python process.
- Error messages: deserializer-level "expected u32 at
.usage.prompt_tokens, got string" vs Python's "KeyError: 'usage'" with no line number.
But what about speed of development?
Genuinely, Python's write-time is faster. We accept ~1.5× the implementation time in exchange for ~10× the deploy reliability. For a stack that's going to run unattended in a closet for months, the trade favors Rust.
"What about the kernels?"
Kernels are C++. Rule B, not Rule A. C++ at the kernel layer is specifically blessed because:
- The GPU compiler toolchain is C++.
- The surface is small (30 extern "C" functions).
- The FFI boundary with Rust is tight and reviewed.
"What about the Python requantizer?"
Build-time, not runtime. requantize-h1b.py reads Microsoft's .safetensors, writes our .h1b format, exits. Runs once per model release. Never shipped with the binary, never called by a service.
If we ever needed a runtime requantize (which we don't — the output is cached to disk), we'd port to Rust before running it under a systemd unit.
Enforcement
halo-install-strixhalo.shchecks for any straypythonin systemd unitExecStartlines and warns.halo doctorflags any Python process consuming >100 MB on the box at runtime.- The 1bit-mcp MCP server is Rust; the DSPy / Claude Code plugin examples are Python — that's a caller, not a service.
Citations
CLAUDE.md— the formal rule in the repo conventions.feedback_no_python_runtime.md— origin of the rule, with the first Python-service-crashes-in-prod anecdote.