oMLX exposes a standard OpenAI-compatible API, so any client that works with OpenAI works with oMLX — no changes required. This guide walks you through starting the server, pointing it at your models, and making your first inference request.
1. Start the server

Run omlx serve and point it at your models directory:
omlx serve --model-dir ~/models
The server starts on http://localhost:8000 by default. You should see output like:
oMLX - LLM inference, optimized for your Mac
├─ https://github.com/jundot/omlx
└─ Version: 0.3.9.dev1

Base path: /Users/you/.omlx
Model directories: /Users/you/models
Starting server at http://127.0.0.1:8000
If you installed via Homebrew, you can run oMLX as a managed background service instead: brew services start omlx. It uses ~/.omlx/models and port 8000 by default.
macOS app: Launch oMLX from your Applications folder. The Welcome screen guides you through model directory selection and server startup automatically.
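To confirm the server is reachable before wiring in clients, you can poll the /v1/models endpoint shown in step 3. A minimal readiness check in Python, assuming the default address:

import time
import urllib.request

BASE_URL = "http://localhost:8000"

def wait_for_server(timeout: float = 30.0) -> bool:
    # Poll /v1/models until the server answers or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/v1/models", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            time.sleep(0.5)  # server not up yet; retry shortly
    return False

if __name__ == "__main__":
    print("server ready" if wait_for_server() else "server did not come up in time")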
2. Organize your models

oMLX discovers models from subdirectories of your --model-dir. Each subdirectory should contain a valid MLX-format model (a config.json plus .safetensors weight files). Both flat and two-level layouts are supported:
~/models/
├── Qwen3-8B-4bit/             # flat: model_id is "Qwen3-8B-4bit"
├── bge-m3/                    # flat: embedding model
└── mlx-community/             # two-level org folder
    ├── Llama-3.2-3B-Instruct-4bit/
    └── Mistral-7B-Instruct-v0.3-4bit/
oMLX auto-detects model type — LLM, VLM, embedding, or reranker — from the model’s config. You don’t need to declare types manually.
You can download MLX models directly from the admin dashboard at http://localhost:8000/admin. Search HuggingFace, check file sizes, and download with one click.
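The discovery rule is easy to mirror in a script. A hypothetical sketch — not oMLX's actual implementation — that lists the model IDs a layout like the one above would produce (whether oMLX joins two-level IDs with a slash is an assumption here):

from pathlib import Path

def looks_like_model(path: Path) -> bool:
    # A model directory needs a config.json and at least one .safetensors file.
    return (path / "config.json").exists() and any(path.glob("*.safetensors"))

def discover_models(model_dir: str) -> list[str]:
    root = Path(model_dir).expanduser()
    ids = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        if looks_like_model(entry):
            ids.append(entry.name)  # flat: "Qwen3-8B-4bit"
        else:
            # Two-level org folder, e.g. "mlx-community/Llama-3.2-3B-Instruct-4bit"
            for sub in sorted(entry.iterdir()):
                if sub.is_dir() and looks_like_model(sub):
                    ids.append(f"{entry.name}/{sub.name}")
    return ids

print(discover_models("~/models"))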
3. Make your first API call

Check which models oMLX has discovered:
curl http://localhost:8000/v1/models
Then send a chat completion request. Replace your-model-name with a model ID from the list above:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
For streaming responses, add "stream": true:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "user", "content": "Explain KV caching in one paragraph."}
    ],
    "stream": true
  }'
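If you'd rather consume the stream without any SDK, a minimal standard-library parser works too. This sketch assumes oMLX emits the standard OpenAI server-sent-events wire format ("data: {...}" lines with a final "data: [DONE]" sentinel), which follows from its OpenAI compatibility but isn't shown above:

import json
import urllib.request

# Stream tokens from a chat completion without any SDK.
# Assumes the standard OpenAI SSE format: "data: {...}" lines, "[DONE]" sentinel.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps({
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
        "stream": True,
    }).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode().strip()
        if not line.startswith("data: "):
            continue  # skip blank separators and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        print(delta.get("content", "") or "", end="", flush=True)
print()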

Using the Python OpenAI SDK

Any OpenAI-compatible client works by pointing base_url at your local server:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # any non-empty string works if no API key is configured
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
)

print(response.choices[0].message.content)
Streaming works the same way:
stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Count to ten."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
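The same client can call embedding models discovered in your models directory, such as bge-m3 above. A hedged sketch, reusing the client from the previous examples and assuming oMLX exposes the standard OpenAI /v1/embeddings endpoint (implied by its OpenAI compatibility, not shown above):

# Assumes the standard OpenAI /v1/embeddings endpoint is served by oMLX.
embeddings = client.embeddings.create(
    model="bge-m3",  # an embedding model from your models directory
    input=["What is the capital of France?", "Paris is the capital of France."],
)

print(len(embeddings.data), "vectors of dimension", len(embeddings.data[0].embedding))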
4. Explore the admin dashboard

Open http://localhost:8000/admin in your browser. The admin dashboard lets you:
  • Monitor loaded models, memory usage, and request throughput in real time
  • Load and unload models on demand using interactive status badges
  • Pin models to keep frequently used ones always in memory
  • Configure per-model settings — sampling parameters, chat template kwargs, TTL, model alias, and more
  • Chat directly with any loaded model, including image uploads for VLMs
  • Run benchmarks to measure prefill and generation speed with prefix cache testing
  • Download models from HuggingFace directly in the dashboard
The dashboard is fully offline — all CDN dependencies are vendored. It supports English, Korean, Japanese, Chinese, and Russian.
The built-in chat UI is available at http://localhost:8000/admin/chat if you want a quick conversational interface without any external client.

Connect coding tools

oMLX integrates with Claude Code, Codex, OpenCode, and Pi. You can set up any of these from the admin dashboard with a single click, or use the CLI:
omlx launch codex
omlx launch opencode
omlx launch claude
The launch command checks that oMLX is running, lets you pick a model interactively if needed, and configures the tool to use your local server automatically. See Integrations for the full list of supported tools and manual configuration options.

Common CLI options

# Change the port
omlx serve --model-dir ~/models --port 8080

# Limit total model memory (prevents loading models that exceed this)
omlx serve --model-dir ~/models --max-model-memory 32GB

# Enable SSD KV cache (persists context across restarts)
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

# Set in-memory hot cache size alongside SSD cache
omlx serve --model-dir ~/models \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 8GB

# Increase max concurrent requests (default: 8)
omlx serve --model-dir ~/models --max-concurrent-requests 16
All settings can also be configured from the admin panel at /admin and are persisted to ~/.omlx/settings.json. CLI flags always take precedence over saved settings.
