Native Zig inference server with a macOS menu bar app. OpenAI-compatible API. No Python. Just fast.
A complete inference stack from server to UI, built from scratch in Zig and Swift.
Written in Zig with direct MLX-C bindings. No Python runtime, no overhead. KV cache reuse across requests for instant multi-turn conversations.
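KV cache reuse across turns works because a follow-up request repeats the entire prior conversation as its prefix, so the server only has to prefill the newly appended tokens. A conceptual sketch of the prefix-matching step (illustrative only; the actual server does this in Zig against MLX's KV cache):

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Length of the shared token prefix whose KV entries can be reused.

    Illustrative sketch: tokens are plain ints here, not real tokenizer IDs.
    """
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A second turn repeats the whole first turn as its prefix,
# so only the newly appended tokens need a forward pass.
turn1 = [1, 2, 3, 4]            # system prompt + first user message
turn2 = [1, 2, 3, 4, 5, 6, 7]   # same history + reply + new question
reused = reusable_prefix_len(turn1, turn2)
print(reused, len(turn2) - reused)  # 4 tokens reused, 3 left to prefill
```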
A drop-in replacement for the OpenAI API: chat completions, streaming, tool calling, embeddings, and logprobs. Works with any OpenAI client library.
7 built-in tools: shell, file read/write/edit, search, web browse, web search. Extend with prompt-based skills — just drop a markdown file.
Native macOS app lives in your menu bar. Download models from HuggingFace with resumable transfers. Chat, browse, and manage models from one place.
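Resumable transfers generally work by checking how many bytes are already on disk and asking the server for the rest with an HTTP `Range` header, which HuggingFace's CDN supports. A minimal sketch of that resume logic (illustrative; the app's actual downloader is written in Swift):

```python
import os

def resume_range_header(path):
    """Build the Range header for resuming a partial download.

    Returns None when nothing is on disk yet, meaning a plain full
    request should be made instead. Illustrative sketch only.
    """
    if not os.path.exists(path):
        return None
    offset = os.path.getsize(path)
    if offset == 0:
        return None
    # "bytes=N-" asks the server for everything from byte N onward.
    return {"Range": f"bytes={offset}-"}

# Simulate a partially downloaded model shard.
with open("model.safetensors.part", "wb") as f:
    f.write(b"\x00" * 1024)
print(resume_range_header("model.safetensors.part"))  # {'Range': 'bytes=1024-'}
```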
Real-time SSE streaming with automatic tool call detection. The model can call functions, get results, and continue reasoning — all in one request.
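In OpenAI-style streaming, a tool call arrives as incremental deltas spread across SSE chunks: the function name in an early chunk, the JSON arguments in fragments, then `finish_reason: "tool_calls"`. A sketch of how a client accumulates those deltas (canned chunks stand in for a live stream; field names follow the standard chat-completions chunk format, not this server's internals):

```python
import json

def accumulate_tool_call(chunks):
    """Merge streamed tool-call deltas into one complete (name, args) pair."""
    name, arg_parts = None, []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        for tc in delta.get("tool_calls", []):
            fn = tc.get("function", {})
            if fn.get("name"):
                name = fn["name"]
            if fn.get("arguments"):
                arg_parts.append(fn["arguments"])
    return name, json.loads("".join(arg_parts))

# Canned SSE payloads, already JSON-decoded from their "data: {...}" lines.
stream = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"name": "shell", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '{"command": '}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '"ls -la"}'}}]}}]},
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
]
print(accumulate_tool_call(stream))  # ('shell', {'command': 'ls -la'})
```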
Teach the agent new capabilities by dropping markdown files in a folder. No code needed — just describe the workflow and the agent follows it.
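The exact file layout is defined by the app, but a prompt-based skill is just a plain-language workflow description. A hypothetical example (the skill name, headings, and steps here are illustrative, not a documented schema):

```markdown
# Skill: changelog

When the user asks to summarize recent changes:

1. Run `git log --oneline -20` with the shell tool.
2. Group the commits by area (server, app, docs).
3. Reply with a short bulleted changelog.
```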
```sh
# Download a release or build from source
git clone https://github.com/ddalcu/mlx-serve
cd mlx-serve
zig build -Doptimize=ReleaseFast

# Start the server
./zig-out/bin/mlx-serve \
  --model ~/models/gemma-4-4b-it-4bit \
  --serve --port 8080
```
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
Supports quantized MLX-format models. Download directly from HuggingFace in the app.
- Google · 4B, 12B, 27B
- Alibaba · MoE · 4B, 14B, 32B
- Meta · 8B, 70B
- Mistral AI · 7B, 8x7B