

The /v1/chat/completions endpoint is a drop-in replacement for the OpenAI Chat Completions API. It accepts the same request shape and returns the same response structure, so any client or library built for OpenAI works without modification — just change the base URL. The endpoint supports streaming via SSE, function/tool calling, JSON structured output, reasoning model output, and vision inputs for vision-language models (VLMs).
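Because the request shape matches OpenAI's, a request can be assembled with nothing but the standard library. A minimal sketch, assuming the server runs at http://localhost:8000 as in the curl example below (the request is constructed here but not sent):

```python
import json
import urllib.request

# Assumed local server address; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model: str, messages: list, **params) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request; send with urlopen()."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "your-model",
    [{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
)
```

Sending the request is then `urllib.request.urlopen(req)`; equally, any OpenAI SDK pointed at the same base URL works without further changes.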

Request

POST /v1/chat/completions

Parameters

model
string
required
The model name or alias to use. Must match a model discovered in your model directory. You can use either the directory name or the alias configured in per-model settings. Use GET /v1/models to list available models.
messages
object[]
required
The conversation history as an array of message objects.
stream
boolean
default:"false"
If true, the server streams partial message deltas as SSE events (one data: line per token). The stream ends with data: [DONE].
stream_options
object
Options for streaming responses.
temperature
number
Sampling temperature. Higher values produce more random output. Overrides the per-model default when set.
top_p
number
Nucleus sampling: only the smallest set of most-probable tokens whose cumulative probability mass reaches top_p is considered at each step.
top_k
number
Top-k sampling: restricts sampling to the top_k most probable tokens at each step.
min_p
number
Minimum probability threshold for token sampling: tokens whose probability is below min_p times the probability of the most likely token are excluded.
max_tokens
number
Maximum number of tokens to generate. Defaults to the server-level max_tokens setting (32768 unless configured otherwise).
stop
string | string[]
Stop sequence(s). Generation halts when any of these strings is produced. Accepts a single string or an array.
seed
number
Random seed for reproducible outputs. Best-effort: identical seeds on the same hardware produce identical outputs.
tools
object[]
List of tools the model may call. Each tool is an object with type: "function" and a function object containing name, description, and parameters (JSON Schema).
tool_choice
string | object
default:"auto"
Controls when the model calls a tool. "auto" lets the model decide, "none" disables tools, or pass {"type": "function", "function": {"name": "my_func"}} to force a specific tool.
response_format
object
Enforce structured output format. Set type to "json_object" to require valid JSON, or "json_schema" with a json_schema definition to enforce a specific schema.
structured_outputs
object
vLLM-compatible structured output options. Supports json (JSON schema), regex, choice, and grammar fields. Pass via extra_body in the OpenAI Python client.
chat_template_kwargs
object
Extra keyword arguments passed directly to the model’s chat template (e.g., {"enable_thinking": true}, {"reasoning_effort": "low"}).
thinking_budget
number
Maximum number of thinking tokens for reasoning models. null means unlimited. Applies when the model supports adaptive thinking.
presence_penalty
number
Penalizes tokens that have already appeared in the generated text so far. Positive values reduce repetition.
frequency_penalty
number
Penalty proportional to token frequency in the generated text so far.
xtc_probability
number
Probability that XTC (exclude top choices) sampling is applied at each sampling step.
xtc_threshold
number
Probability threshold for XTC: when XTC is applied, tokens at or above this threshold count as top choices, and all but the least probable of them are excluded.
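To show how several of these parameters combine in one request body, here is a sketch of a tool-calling request. The tool name get_weather and its JSON Schema are invented for illustration; the surrounding field names follow the parameter list above:

```python
import json

# Hypothetical tool definition for illustration only.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "your-model",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "temperature": 0.2,
    "max_tokens": 256,
    "tools": [weather_tool],
    "tool_choice": "auto",  # or {"type": "function", "function": {"name": "get_weather"}}
}

payload = json.dumps(request_body)
```

With tool_choice set to "auto", the model decides whether to answer directly or emit a tool_calls entry in the response message.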

Vision inputs

For vision-language models, pass image content as a content part array in the user message. Both remote image URLs and base64-encoded data URIs are accepted:
{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is in this image?"},
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,/9j/4AAQSk...",
        "detail": "auto"
      }
    }
  ]
}
The detail field accepts "auto", "low", or "high".
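A data URI like the one above can be produced from raw image bytes with the standard library. A sketch (the bytes below are a placeholder, not a real JPEG):

```python
import base64

def image_content_part(image_bytes: bytes, mime: str = "image/jpeg",
                       detail: str = "auto") -> dict:
    """Encode raw image bytes as an image_url content part with a data URI."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{b64}", "detail": detail},
    }

part = image_content_part(b"\xff\xd8\xff\xe0fake-jpeg-bytes")
```

The returned dict slots directly into the content array alongside the text part.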

Examples

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
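When stream is true, the server emits one data: line per token and terminates the stream with data: [DONE]. A minimal parser sketch for that framing, assuming the OpenAI streaming chunk shape (choices[0].delta.content) and shown here on a canned stream rather than a live connection:

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from SSE 'data:' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content") is not None:
            yield delta["content"]

# Canned example of the wire format, for illustration.
sample = [
    'data: {"choices":[{"delta":{"content":"Pa"}}]}',
    'data: {"choices":[{"delta":{"content":"ris"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_content(sample))
```

Against a live connection, the same generator can consume the decoded lines of the HTTP response body.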

Response

id
string
Unique identifier for the completion, prefixed with chatcmpl-.
object
string
Always "chat.completion".
created
number
Unix timestamp of when the completion was created.
model
string
The model that generated the response.
choices
object[]
Array of completion choices.
usage
object
Token usage and optional timing metrics.

Example response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746835200,
  "model": "your-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "reasoning_content": null,
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 9,
    "total_tokens": 33
  }
}
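Extracting the assistant message and usage counters from a response like the one above is a few dictionary lookups. A sketch using the example payload:

```python
import json

# The example response from the documentation above.
response_json = """{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746835200,
  "model": "your-model",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "The capital of France is Paris.",
                 "reasoning_content": null,
                 "tool_calls": null},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 9, "total_tokens": 33}
}"""

resp = json.loads(response_json)
content = resp["choices"][0]["message"]["content"]
finish = resp["choices"][0]["finish_reason"]
total_tokens = resp["usage"]["total_tokens"]
```

Before reading content, check finish_reason: "stop" means a normal completion, while a tool-calling response instead populates message.tool_calls.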
