

The /v1/chat/completions endpoint is a drop-in replacement for the OpenAI Chat Completions API. It accepts the same request shape and returns the same response structure, so any client or library built for OpenAI works without modification — just change the base URL. The endpoint supports streaming via SSE, function/tool calling, JSON structured output, reasoning model output, and vision inputs for vision-language models (VLMs).
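Because the request shape matches OpenAI's, a request can be assembled with nothing but the standard library. A minimal sketch, assuming the server runs at http://localhost:8000 as in the curl example below (the request is constructed here but not sent):

```python
import json
import urllib.request

# Assumed local server address; adjust to your deployment.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model: str, messages: list, **params) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request; send with urlopen()."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "your-model",
    [{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
)
```

Sending the request is then `urllib.request.urlopen(req)`; equally, any OpenAI SDK pointed at the same base URL works without further changes.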

Request

POST /v1/chat/completions

Parameters

model
string
required
The model name or alias to use. Must match a model discovered in your model directory. You can use either the directory name or the alias configured in per-model settings. Use GET /v1/models to list available models.
messages
object[]
required
The conversation history as an array of message objects.
stream
boolean
default:"false"
If true, the server streams partial message deltas as SSE events (one data: line per token). The stream ends with data: [DONE].
stream_options
object
Options for streaming responses.
temperature
number
Sampling temperature. Higher values produce more random output. Overrides the per-model default when set.
top_p
number
Nucleus sampling: only the smallest set of most-probable tokens whose cumulative probability mass reaches top_p is considered at each step.
top_k
number
Top-k sampling: restricts sampling to the top_k most probable tokens at each step.
min_p
number
Minimum probability threshold for token sampling: tokens whose probability is below min_p times the probability of the most likely token are excluded.
max_tokens
number
Maximum number of tokens to generate. Defaults to the server-level max_tokens setting (32768 unless configured otherwise).
stop
string | string[]
Stop sequence(s). Generation halts when any of these strings is produced. Accepts a single string or an array.
seed
number
Random seed for reproducible outputs. Best-effort: identical seeds on the same hardware produce identical outputs.
tools
object[]
List of tools the model may call. Each tool is an object with type: "function" and a function object containing name, description, and parameters (JSON Schema).
tool_choice
string | object
default:"auto"
Controls when the model calls a tool. "auto" lets the model decide, "none" disables tools, or pass {"type": "function", "function": {"name": "my_func"}} to force a specific tool.
response_format
object
Enforce structured output format. Set type to "json_object" to require valid JSON, or "json_schema" with a json_schema definition to enforce a specific schema.
structured_outputs
object
vLLM-compatible structured output options. Supports json (JSON schema), regex, choice, and grammar fields. Pass via extra_body in the OpenAI Python client.
chat_template_kwargs
object
Extra keyword arguments passed directly to the model’s chat template (e.g., {"enable_thinking": true}, {"reasoning_effort": "low"}).
thinking_budget
number
Maximum number of thinking tokens for reasoning models. null means unlimited. Applies when the model supports adaptive thinking.
presence_penalty
number
Penalizes tokens that have already appeared in the generated text so far. Positive values reduce repetition.
frequency_penalty
number
Penalty proportional to token frequency in the generated text so far.
xtc_probability
number
Probability that XTC (exclude top choices) sampling is applied at each sampling step.
xtc_threshold
number
Probability threshold for XTC: when XTC is applied, tokens at or above this threshold count as top choices, and all but the least probable of them are excluded.
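To show how several of these parameters combine in one request body, here is a sketch of a tool-calling request. The tool name get_weather and its JSON Schema are invented for illustration; the surrounding field names follow the parameter list above:

```python
import json

# Hypothetical tool definition for illustration only.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

request_body = {
    "model": "your-model",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "temperature": 0.2,
    "max_tokens": 256,
    "tools": [weather_tool],
    "tool_choice": "auto",  # or {"type": "function", "function": {"name": "get_weather"}}
}

payload = json.dumps(request_body)
```

With tool_choice set to "auto", the model decides whether to answer directly or emit a tool_calls entry in the response message.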

Vision inputs

For vision-language models, pass image content as a content part array in the user message. Both remote image URLs and base64-encoded data URIs are accepted:
{
  "role": "user",
  "content": [
    {"type": "text", "text": "What is in this image?"},
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,/9j/4AAQSk...",
        "detail": "auto"
      }
    }
  ]
}
The detail field accepts "auto", "low", or "high".
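A data URI like the one above can be produced from raw image bytes with the standard library. A sketch (the bytes below are a placeholder, not a real JPEG):

```python
import base64

def image_content_part(image_bytes: bytes, mime: str = "image/jpeg",
                       detail: str = "auto") -> dict:
    """Encode raw image bytes as an image_url content part with a data URI."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{b64}", "detail": detail},
    }

part = image_content_part(b"\xff\xd8\xff\xe0fake-jpeg-bytes")
```

The returned dict slots directly into the content array alongside the text part.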

Examples

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
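When stream is true, the server emits one data: line per token and terminates the stream with data: [DONE]. A minimal parser sketch for that framing, assuming the OpenAI streaming chunk shape (choices[0].delta.content) and shown here on a canned stream rather than a live connection:

```python
import json

def iter_stream_content(lines):
    """Yield content deltas from SSE 'data:' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content") is not None:
            yield delta["content"]

# Canned example of the wire format, for illustration.
sample = [
    'data: {"choices":[{"delta":{"content":"Pa"}}]}',
    'data: {"choices":[{"delta":{"content":"ris"}}]}',
    "data: [DONE]",
]
text = "".join(iter_stream_content(sample))
```

Against a live connection, the same generator can consume the decoded lines of the HTTP response body.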

Response

id
string
Unique identifier for the completion, prefixed with chatcmpl-.
object
string
Always "chat.completion".
created
number
Unix timestamp of when the completion was created.
model
string
The model that generated the response.
choices
object[]
Array of completion choices.
usage
object
Token usage and optional timing metrics.

Example response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746835200,
  "model": "your-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "reasoning_content": null,
        "tool_calls": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 9,
    "total_tokens": 33
  }
}
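Extracting the assistant message and usage counters from a response like the one above is a few dictionary lookups. A sketch using the example payload:

```python
import json

# The example response from the documentation above.
response_json = """{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1746835200,
  "model": "your-model",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "The capital of France is Paris.",
                 "reasoning_content": null,
                 "tool_calls": null},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 9, "total_tokens": 33}
}"""

resp = json.loads(response_json)
content = resp["choices"][0]["message"]["content"]
finish = resp["choices"][0]["finish_reason"]
total_tokens = resp["usage"]["total_tokens"]
```

Before reading content, check finish_reason: "stop" means a normal completion, while a tool-calling response instead populates message.tool_calls.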
