Native Zig inference server with a macOS menu bar app. OpenAI-compatible API. No Python. Just fast.
A complete inference stack from server to UI, built from scratch in Zig and Swift.
Written in Zig with direct MLX-C bindings. No Python runtime, no overhead. KV cache reuse across requests for instant multi-turn conversations.
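KV cache reuse across turns works because a follow-up request repeats the entire prior conversation as its prefix, so the server only has to prefill the newly appended tokens. A conceptual sketch of the prefix-matching step (illustrative only; the actual server does this in Zig against MLX's KV cache):

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Length of the shared token prefix whose KV entries can be reused.

    Illustrative sketch: tokens are plain ints here, not real tokenizer IDs.
    """
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# A second turn repeats the whole first turn as its prefix,
# so only the newly appended tokens need a forward pass.
turn1 = [1, 2, 3, 4]            # system prompt + first user message
turn2 = [1, 2, 3, 4, 5, 6, 7]   # same history + reply + new question
reused = reusable_prefix_len(turn1, turn2)
print(reused, len(turn2) - reused)  # 4 tokens reused, 3 left to prefill
```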
A drop-in replacement for the OpenAI API: chat completions, streaming, tool calling, embeddings, and logprobs. Works with any OpenAI client library.
7 built-in tools: shell, file read/write/edit, search, web browse, web search. Extend with prompt-based skills — just drop a markdown file.
Native macOS app lives in your menu bar. Download models from HuggingFace with resumable transfers. Chat, browse, and manage models from one place.
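Resumable transfers generally work by checking how many bytes are already on disk and asking the server for the rest with an HTTP `Range` header, which HuggingFace's CDN supports. A minimal sketch of that resume logic (illustrative; the app's actual downloader is written in Swift):

```python
import os

def resume_range_header(path):
    """Build the Range header for resuming a partial download.

    Returns None when nothing is on disk yet, meaning a plain full
    request should be made instead. Illustrative sketch only.
    """
    if not os.path.exists(path):
        return None
    offset = os.path.getsize(path)
    if offset == 0:
        return None
    # "bytes=N-" asks the server for everything from byte N onward.
    return {"Range": f"bytes={offset}-"}

# Simulate a partially downloaded model shard.
with open("model.safetensors.part", "wb") as f:
    f.write(b"\x00" * 1024)
print(resume_range_header("model.safetensors.part"))  # {'Range': 'bytes=1024-'}
```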
Real-time SSE streaming with automatic tool call detection. The model can call functions, get results, and continue reasoning — all in one request.
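In OpenAI-style streaming, a tool call arrives as incremental deltas spread across SSE chunks: the function name in an early chunk, the JSON arguments in fragments, then `finish_reason: "tool_calls"`. A sketch of how a client accumulates those deltas (canned chunks stand in for a live stream; field names follow the standard chat-completions chunk format, not this server's internals):

```python
import json

def accumulate_tool_call(chunks):
    """Merge streamed tool-call deltas into one complete (name, args) pair."""
    name, arg_parts = None, []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        for tc in delta.get("tool_calls", []):
            fn = tc.get("function", {})
            if fn.get("name"):
                name = fn["name"]
            if fn.get("arguments"):
                arg_parts.append(fn["arguments"])
    return name, json.loads("".join(arg_parts))

# Canned SSE payloads, already JSON-decoded from their "data: {...}" lines.
stream = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"name": "shell", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '{"command": '}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": '"ls -la"}'}}]}}]},
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
]
print(accumulate_tool_call(stream))  # ('shell', {'command': 'ls -la'})
```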
Teach the agent new capabilities by dropping markdown files in a folder. No code needed — just describe the workflow and the agent follows it.
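The exact file layout is defined by the app, but a prompt-based skill is just a plain-language workflow description. A hypothetical example (the skill name, headings, and steps here are illustrative, not a documented schema):

```markdown
# Skill: changelog

When the user asks to summarize recent changes:

1. Run `git log --oneline -20` with the shell tool.
2. Group the commits by area (server, app, docs).
3. Reply with a short bulleted changelog.
```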
```sh
# Download a release or build from source
git clone https://github.com/ddalcu/mlx-serve
cd mlx-serve
zig build -Doptimize=ReleaseFast

# Start the server
./zig-out/bin/mlx-serve \
  --model ~/models/gemma-4-4b-it-4bit \
  --serve --port 8080
```
```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
Supports quantized MLX-format models. Download directly from HuggingFace in the app.
- Google · 4B, 12B, 27B
- Alibaba · MoE · 4B, 14B, 32B
- Meta · 8B, 70B
- Mistral AI · 7B, 8x7B