The /v1/messages endpoint is a drop-in replacement for the Anthropic Messages API (POST https://api.anthropic.com/v1/messages). It accepts the same request and response structure as the official Anthropic API, so the official anthropic Python SDK and any other Anthropic-compatible client work by changing only the base_url. oMLX supports adaptive thinking/reasoning, tool use, multi-image vision inputs, and SSE streaming using Anthropic’s event schema.
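The base_url swap described above can be sketched with only the Python standard library; the server address, API key, and model name below are placeholders, and build_request/send are hypothetical helpers, not part of the API:

```python
import json
import urllib.request

OMLX_URL = "http://localhost:8000/v1/messages"  # assumed local server address


def build_request(model, prompt, max_tokens=256):
    """Build an Anthropic-style /v1/messages request body."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


def send(body, api_key="your-secret-key"):
    """POST the body to the oMLX server and return the parsed response."""
    req = urllib.request.Request(
        OMLX_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


body = build_request("your-model", "Explain quantum entanglement simply.")
```

With the official anthropic SDK, the equivalent change is passing base_url="http://localhost:8000" (plus your key as api_key) when constructing the client.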

Request

POST /v1/messages

Parameters

model
string
required
The model name or alias to use. Accepts any model discovered in your model directory. Use GET /v1/models to list available models.
messages
object[]
required
The conversation history. Each message has a role of "user" or "assistant", and content as either a plain string or an array of content blocks.
max_tokens
number
required
Maximum number of output tokens to generate.
system
string | object[]
System prompt. Accepts a plain string or an array of SystemContent blocks (each with type: "text", text, and optional cache_control).
stream
boolean
default:"false"
If true, the server streams response events using Anthropic’s SSE schema: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop.
temperature
number
Sampling temperature for generation.
top_p
number
Nucleus sampling cutoff probability.
top_k
number
Top-k sampling limit.
stop_sequences
string[]
List of stop strings. Generation halts when any sequence is produced.
tools
object[]
Tool definitions available to the model. Each tool has name, optional description, and input_schema (JSON Schema). Anthropic server-side tool types (e.g., web_search_20250305) are accepted for client compatibility but dropped before inference since oMLX cannot execute them locally.
tool_choice
object
Controls tool selection. {"type": "auto"} lets the model decide, {"type": "any"} forces a tool call, or {"type": "tool", "name": "my_tool"} forces a specific tool.
thinking
object
Configure adaptive thinking/reasoning.
chat_template_kwargs
object
Extra keyword arguments passed directly to the model’s chat template (e.g., {"enable_thinking": true}).
metadata
object
Optional metadata for the request. Accepted for API compatibility; not used by the server.
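The tools, tool_choice, and thinking parameters above can be combined in one request body. A sketch follows; the get_weather tool is illustrative, and the thinking shape shown assumes Anthropic's {"type": "enabled", "budget_tokens": ...} convention, so check your oMLX version for the exact schema:

```python
# Illustrative request body combining tools, tool_choice, and thinking.
request_body = {
    "model": "your-model",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
        {
            "name": "get_weather",  # hypothetical tool, defined client-side
            "description": "Look up current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    "tool_choice": {"type": "auto"},  # let the model decide whether to call
    "thinking": {"type": "enabled", "budget_tokens": 1024},  # assumed shape
}
```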

Token counting

To count tokens without generating a response, send a request to POST /v1/messages/count_tokens. The request body accepts the same fields as /v1/messages (minus stream, temperature, etc.) and returns:
{
  "input_tokens": 142
}
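A small sketch of building a count_tokens body and reading the result; parse_count is a hypothetical helper, not part of the API:

```python
import json

# The count_tokens body mirrors a /v1/messages body, minus
# generation-only fields such as stream and temperature.
count_request = {
    "model": "your-model",
    "messages": [
        {"role": "user", "content": "Explain quantum entanglement simply."}
    ],
}


def parse_count(raw):
    """Extract input_tokens from a count_tokens response body."""
    return json.loads(raw)["input_tokens"]
```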

Examples

curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: your-secret-key" \
  -d '{
    "model": "your-model",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement simply."}
    ]
  }'
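With stream set to true, the events listed under the stream parameter arrive as SSE lines. A minimal parsing sketch that accumulates visible text from content_block_delta events, assuming Anthropic's text_delta payload shape; extract_text_deltas is a hypothetical helper:

```python
import json


def extract_text_deltas(sse_lines):
    """Collect text from content_block_delta events in an SSE stream."""
    text = []
    event = None
    for line in sse_lines:
        if line.startswith("event: "):
            # Remember which event the next data line belongs to.
            event = line[len("event: "):].strip()
        elif line.startswith("data: ") and event == "content_block_delta":
            payload = json.loads(line[len("data: "):])
            delta = payload.get("delta", {})
            if delta.get("type") == "text_delta":
                text.append(delta.get("text", ""))
    return "".join(text)
```

In a real client you would read these lines incrementally from the HTTP response rather than from a list.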

Response

id
string
Unique identifier for the message, prefixed with msg_.
type
string
Always "message".
role
string
Always "assistant".
model
string
The model that generated the response.
content
object[]
Array of content blocks. Each block has a type field:
  • text: contains text string
  • tool_use: contains id, name, and input
  • thinking: contains thinking string (reasoning models only)
stop_reason
string
Why generation stopped: "end_turn", "max_tokens", "stop_sequence", or "tool_use".
stop_sequence
string
The stop sequence that triggered the halt, if applicable.
usage
object
Token usage for the request: input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens.

Example response

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "model": "your-model",
  "content": [
    {
      "type": "text",
      "text": "Quantum entanglement is a phenomenon where two particles become linked..."
    }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 18,
    "output_tokens": 64,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0
  }
}
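Reading a response like the one above means dispatching on each content block's type, as listed under the content field. A small sketch; summarize_content is a hypothetical helper:

```python
def summarize_content(message):
    """Split a response's content blocks into (text, tool_calls, thinking)."""
    text, tool_calls, thinking = [], [], []
    for block in message["content"]:
        if block["type"] == "text":
            text.append(block["text"])
        elif block["type"] == "tool_use":
            tool_calls.append((block["name"], block["input"]))
        elif block["type"] == "thinking":
            thinking.append(block["thinking"])
    return "".join(text), tool_calls, thinking
```

When stop_reason is "tool_use", the tool_calls list tells you which tools to run before sending the results back in the next user message.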
