oMLX exposes a standard OpenAI-compatible API, so any client that works with OpenAI works with oMLX — no changes required. This guide walks you through starting the server, pointing it at your models, and making your first inference request.
1. Start the server

Run omlx serve and point it at your models directory:
omlx serve --model-dir ~/models
The server starts on http://localhost:8000 by default. You should see output like:
oMLX - LLM inference, optimized for your Mac
├─ https://github.com/jundot/omlx
└─ Version: 0.3.9.dev1

Base path: /Users/you/.omlx
Model directories: /Users/you/models
Starting server at http://127.0.0.1:8000
If you installed via Homebrew, you can run oMLX as a managed background service instead: brew services start omlx. It uses ~/.omlx/models and port 8000 by default.
macOS app: Launch oMLX from your Applications folder. The Welcome screen guides you through model directory selection and server startup automatically.
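To confirm the server is reachable before wiring in clients, you can poll the /v1/models endpoint shown in step 3. A minimal readiness check in Python, assuming the default address:

import time
import urllib.request

BASE_URL = "http://localhost:8000"

def wait_for_server(timeout: float = 30.0) -> bool:
    # Poll /v1/models until the server answers or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/v1/models", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(0.5)  # server not up yet; retry shortly
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")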
2. Organize your models

oMLX discovers models from subdirectories of your --model-dir. Each subdirectory should contain a valid MLX-format model (a config.json plus .safetensors weight files). Both flat and two-level layouts are supported:
~/models/
├── Qwen3-8B-4bit/             # flat: model_id is "Qwen3-8B-4bit"
├── bge-m3/                    # flat: embedding model
└── mlx-community/             # two-level org folder
    ├── Llama-3.2-3B-Instruct-4bit/
    └── Mistral-7B-Instruct-v0.3-4bit/
oMLX auto-detects model type — LLM, VLM, embedding, or reranker — from the model’s config. You don’t need to declare types manually.
You can download MLX models directly from the admin dashboard at http://localhost:8000/admin. Search HuggingFace, check file sizes, and download with one click.
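The discovery rule is easy to mirror in a script. A hypothetical sketch — not oMLX's actual implementation — that lists the model IDs a layout like the one above would produce (whether oMLX joins two-level IDs with a slash is an assumption here):

from pathlib import Path

def looks_like_model(path: Path) -> bool:
    # A model directory needs a config.json and at least one .safetensors file.
    return (path / "config.json").exists() and any(path.glob("*.safetensors"))

def discover_models(model_dir: str) -> list[str]:
    root = Path(model_dir).expanduser()
    ids = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        if looks_like_model(entry):
            ids.append(entry.name)  # flat: "Qwen3-8B-4bit"
        else:
            # Two-level org folder, e.g. "mlx-community/Llama-3.2-3B-Instruct-4bit"
            for sub in sorted(entry.iterdir()):
                if sub.is_dir() and looks_like_model(sub):
                    ids.append(f"{entry.name}/{sub.name}")
    return ids

print(discover_models("~/models"))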
3. Make your first API call

Check which models oMLX has discovered:
curl http://localhost:8000/v1/models
Then send a chat completion request. Replace your-model-name with a model ID from the list above:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
For streaming responses, add "stream": true:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "user", "content": "Explain KV caching in one paragraph."}
    ],
    "stream": true
  }'
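If you'd rather consume the stream without any SDK, a minimal standard-library parser works too. This sketch assumes oMLX emits the standard OpenAI server-sent-events wire format ("data: {...}" lines with a final "data: [DONE]" sentinel), which follows from its OpenAI compatibility but isn't shown above:

import json
import urllib.request

# Stream tokens from a chat completion without any SDK.
# Assumes the standard OpenAI SSE format: "data: {...}" lines, "[DONE]" sentinel.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps({
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
        "stream": True,
    }).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode().strip()
        if not line.startswith("data: "):
            continue  # skip blank separators and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        print(delta.get("content", "") or "", end="", flush=True)
print()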

Using the Python OpenAI SDK

Any OpenAI-compatible client works by pointing base_url at your local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # any non-empty string works if no API key is configured
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)

print(response.choices[0].message.content)
Streaming works the same way:
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Count to ten."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
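The same client can call embedding models discovered in your models directory, such as bge-m3 above. A hedged sketch, reusing the client from the previous examples and assuming oMLX exposes the standard OpenAI /v1/embeddings endpoint (implied by its OpenAI compatibility, not shown above):

# Assumes the standard OpenAI /v1/embeddings endpoint is served by oMLX.
embeddings = client.embeddings.create(
    model="bge-m3",  # an embedding model from your models directory
    input=["What is the capital of France?", "Paris is the capital of France."],
)

print(len(embeddings.data), "vectors of dimension", len(embeddings.data[0].embedding))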
4. Explore the admin dashboard

Open http://localhost:8000/admin in your browser. The admin dashboard lets you:
  • Monitor loaded models, memory usage, and request throughput in real time
  • Load and unload models on demand using interactive status badges
  • Pin models to keep frequently used ones always in memory
  • Configure per-model settings — sampling parameters, chat template kwargs, TTL, model alias, and more
  • Chat directly with any loaded model, including image uploads for VLMs
  • Run benchmarks to measure prefill and generation speed with prefix cache testing
  • Download models from HuggingFace directly in the dashboard
The dashboard is fully offline — all CDN dependencies are vendored. It supports English, Korean, Japanese, Chinese, and Russian.
The built-in chat UI is available at http://localhost:8000/admin/chat if you want a quick conversational interface without any external client.

Connect coding tools

oMLX integrates with Claude Code, Codex, OpenCode, and Pi. You can set up any of these from the admin dashboard with a single click, or use the CLI:
omlx launch codex
omlx launch opencode
omlx launch claude
The launch command checks that oMLX is running, lets you pick a model interactively if needed, and configures the tool to use your local server automatically. See Integrations for the full list of supported tools and manual configuration options.

Common CLI options

# Change the port
omlx serve --model-dir ~/models --port 8080

# Limit total model memory (prevents loading models that exceed this)
omlx serve --model-dir ~/models --max-model-memory 32GB

# Enable SSD KV cache (persists context across restarts)
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Set in-memory hot cache size alongside SSD cache
omlx serve --model-dir ~/models \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 8GB

# Increase max concurrent requests (default: 8)
omlx serve --model-dir ~/models --max-concurrent-requests 16
All settings can also be configured from the admin panel at /admin and are persisted to ~/.omlx/settings.json. CLI flags always take precedence over saved settings.
