The /v1/messages endpoint is a drop-in replacement for the Anthropic Messages API (POST https://api.anthropic.com/v1/messages). It accepts the same request and response structure as the official Anthropic API, so the official anthropic Python SDK and any other Anthropic-compatible client work by changing only the base_url. oMLX supports adaptive thinking/reasoning, tool use, multi-image vision inputs, and SSE streaming using Anthropic’s event schema.
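The base_url swap described above can be sketched with only the Python standard library; the server address, API key, and model name below are placeholders, and build_request/send are hypothetical helpers, not part of the API:

```python
import json
import urllib.request

OMLX_URL = "http://localhost:8000/v1/messages"  # assumed local server address


def build_request(model, prompt, max_tokens=256):
    """Build an Anthropic-style /v1/messages request body."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }


def send(body, api_key="your-secret-key"):
    """POST the body to the oMLX server and return the parsed response."""
    req = urllib.request.Request(
        OMLX_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


body = build_request("your-model", "Explain quantum entanglement simply.")
```

With the official anthropic SDK, the equivalent change is passing base_url="http://localhost:8000" (plus your key as api_key) when constructing the client.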

Request

POST /v1/messages

Parameters

model
string
required
The model name or alias to use. Accepts any model discovered in your model directory. Use GET /v1/models to list available models.
messages
object[]
required
The conversation history. Each message has a role of "user" or "assistant", and content as either a plain string or an array of content blocks.
max_tokens
number
required
Maximum number of output tokens to generate.
system
string | object[]
System prompt. Accepts a plain string or an array of SystemContent blocks (each with type: "text", text, and optional cache_control).
stream
boolean
default:"false"
If true, the server streams response events using Anthropic’s SSE schema: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop.
temperature
number
Sampling temperature for generation.
top_p
number
Nucleus sampling cutoff probability.
top_k
number
Top-k sampling limit.
stop_sequences
string[]
List of stop strings. Generation halts when any sequence is produced.
tools
object[]
Tool definitions available to the model. Each tool has name, optional description, and input_schema (JSON Schema). Anthropic server-side tool types (e.g., web_search_20250305) are accepted for client compatibility but dropped before inference since oMLX cannot execute them locally.
tool_choice
object
Controls tool selection. {"type": "auto"} lets the model decide, {"type": "any"} forces a tool call, or {"type": "tool", "name": "my_tool"} forces a specific tool.
thinking
object
Configure adaptive thinking/reasoning.
chat_template_kwargs
object
Extra keyword arguments passed directly to the model’s chat template (e.g., {"enable_thinking": true}).
metadata
object
Optional metadata for the request. Accepted for API compatibility; not used by the server.
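The tools, tool_choice, and thinking parameters above can be combined in one request body. A sketch follows; the get_weather tool is illustrative, and the thinking shape shown assumes Anthropic's {"type": "enabled", "budget_tokens": ...} convention, so check your oMLX version for the exact schema:

```python
# Illustrative request body combining tools, tool_choice, and thinking.
request_body = {
    "model": "your-model",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
        {
            "name": "get_weather",  # hypothetical tool, defined client-side
            "description": "Look up current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    "tool_choice": {"type": "auto"},  # let the model decide whether to call
    "thinking": {"type": "enabled", "budget_tokens": 1024},  # assumed shape
}
```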

Token counting

To count tokens without generating a response, send a request to POST /v1/messages/count_tokens. The request body accepts the same fields as /v1/messages (minus stream, temperature, etc.) and returns:
{
  "input_tokens": 142
}
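A small sketch of building a count_tokens body and reading the result; parse_count is a hypothetical helper, not part of the API:

```python
import json

# The count_tokens body mirrors a /v1/messages body, minus
# generation-only fields such as stream and temperature.
count_request = {
    "model": "your-model",
    "messages": [
        {"role": "user", "content": "Explain quantum entanglement simply."}
    ],
}


def parse_count(raw):
    """Extract input_tokens from a count_tokens response body."""
    return json.loads(raw)["input_tokens"]
```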

Examples

curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: your-secret-key" \
  -d '{
    "model": "your-model",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Explain quantum entanglement simply."}
    ]
  }'
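With stream set to true, the events listed under the stream parameter arrive as SSE lines. A minimal parsing sketch that accumulates visible text from content_block_delta events, assuming Anthropic's text_delta payload shape; extract_text_deltas is a hypothetical helper:

```python
import json


def extract_text_deltas(sse_lines):
    """Collect text from content_block_delta events in an SSE stream."""
    text = []
    event = None
    for line in sse_lines:
        if line.startswith("event: "):
            # Remember which event the next data line belongs to.
            event = line[len("event: "):].strip()
        elif line.startswith("data: ") and event == "content_block_delta":
            payload = json.loads(line[len("data: "):])
            delta = payload.get("delta", {})
            if delta.get("type") == "text_delta":
                text.append(delta.get("text", ""))
    return "".join(text)
```

In a real client you would read these lines incrementally from the HTTP response rather than from a list.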

Response

id
string
Unique identifier for the message, prefixed with msg_.
type
string
Always "message".
role
string
Always "assistant".
model
string
The model that generated the response.
content
object[]
Array of content blocks. Each block has a type field:
  • text: contains text string
  • tool_use: contains id, name, and input
  • thinking: contains thinking string (reasoning models only)
stop_reason
string
Why generation stopped: "end_turn", "max_tokens", "stop_sequence", or "tool_use".
stop_sequence
string
The stop sequence that triggered the halt, if applicable.
usage
object
Token usage for the request: input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens.

Example response

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "model": "your-model",
  "content": [
    {
      "type": "text",
      "text": "Quantum entanglement is a phenomenon where two particles become linked..."
    }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 18,
    "output_tokens": 64,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 0
  }
}
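Reading a response like the one above means dispatching on each content block's type, as listed under the content field. A small sketch; summarize_content is a hypothetical helper:

```python
def summarize_content(message):
    """Split a response's content blocks into (text, tool_calls, thinking)."""
    text, tool_calls, thinking = [], [], []
    for block in message["content"]:
        if block["type"] == "text":
            text.append(block["text"])
        elif block["type"] == "tool_use":
            tool_calls.append((block["name"], block["input"]))
        elif block["type"] == "thinking":
            thinking.append(block["thinking"])
    return "".join(text), tool_calls, thinking
```

When stop_reason is "tool_use", the tool_calls list tells you which tools to run before sending the results back in the next user message.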
