Send a chat completion request to the Draft Thinker gateway. The gateway evaluates each response from the drafter model using token-level entropy and either accepts the draft or escalates to the heavyweight model — all transparently to the caller.

Endpoint

POST /v1/chat/completions

Authentication

The gateway does not authenticate callers. It reads OPENAI_API_KEY from its environment at startup and uses it directly for all upstream model and embedding API calls, so no Authorization header is required from clients.
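Because no credentials are required, any OpenAI-compatible client can be pointed at the gateway directly. A minimal sketch using the openai Python package, assuming the gateway is running on localhost:8080 as in the examples below (the api_key value is a placeholder the client library requires; the gateway ignores it):

from openai import OpenAI

# The gateway ignores client credentials and uses its own OPENAI_API_KEY
# for upstream calls, so any placeholder api_key works here.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="auto",  # ignored for routing; see the model field below
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)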

Request body

model
string
required
Model name. The gateway ignores this field for routing purposes — routing is decided by entropy analysis of the drafter’s output, not the model name. Set to "auto" or any value accepted by your upstream provider.
messages
object[]
required
Array of message objects that make up the conversation.
stream
boolean
default:"false"
When true, the response is streamed as Server-Sent Events (SSE). Each event carries a StreamChunk JSON object. The stream ends with data: [DONE].
temperature
number
Sampling temperature, typically between 0 and 2. Passed through to the upstream model.
max_tokens
integer
Maximum number of tokens to generate. Passed through to the upstream model.
top_p
number
Nucleus sampling probability mass. Passed through to the upstream model.
stop
string[]
Up to 4 sequences where the model will stop generating further tokens.
user
string
A stable identifier representing the end user. Passed through to the upstream model.
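Putting these fields together, an illustrative request body that sets every optional parameter might look like the following (the values are examples, not gateway defaults):

{
  "model": "auto",
  "messages": [
    { "role": "user", "content": "Summarize the plot of Hamlet in two sentences." }
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 256,
  "top_p": 0.9,
  "stop": ["\n\n"],
  "user": "user-1234"
}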

Response headers

Every response includes these headers set by the gateway middleware and handler.
X-Routing-Decision
accept | escalate | cache_hit
How the request was fulfilled: accept means the drafter output was used directly; escalate means the drafter entropy was too high and the heavyweight model was called; cache_hit means the response was served from the semantic cache.
X-Request-Duration-Ms
integer string
Total duration of the request in milliseconds, from receipt to final byte written.
X-Request-ID
hex string
Unique request identifier. Echoed from the incoming X-Request-ID header if provided; otherwise generated as a random 16-character hex string.
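The routing headers are useful for verifying gateway behavior from a client. A short sketch using the requests Python package against the local deployment from the examples below:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    },
)
resp.raise_for_status()

# accept, escalate, or cache_hit, as documented above
print("routing decision:", resp.headers["X-Routing-Decision"])
print("duration (ms):", resp.headers["X-Request-Duration-Ms"])
print("request id:", resp.headers["X-Request-ID"])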

Responses

Non-streaming response

When stream is false (the default), the gateway returns a single JSON object.
id
string
required
Unique identifier for this completion, e.g. "chatcmpl-abc123".
object
string
required
Always "chat.completion".
created
integer
required
Unix timestamp (seconds) when the completion was created.
model
string
required
Name of the model that produced the response.
choices
object[]
required
Array of completion choices. Typically contains one element.
usage
object
Token usage statistics. Present when available from the upstream model.

Streaming response (SSE)

When stream is true, the response uses Content-Type: text/event-stream. Each event has the form:
data: <JSON>

and the stream terminates with a final event:
data: [DONE]

Each <JSON> payload is a StreamChunk object:
id
string
required
Completion identifier, consistent across all chunks in a stream.
object
string
required
Always "chat.completion.chunk".
created
integer
required
Unix timestamp of when the stream started.
model
string
required
Name of the model that produced this chunk.
choices
object[]
required
Array of delta choices.
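For callers that do not use an SDK, the SSE framing can be parsed by hand. A sketch with the requests Python package, assuming each event's JSON fits on a single data: line as in the gateway output shown in the examples below:

import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": True,
    },
    stream=True,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank separators between events
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # stream terminator
    chunk = json.loads(payload)  # one StreamChunk object
    content = chunk["choices"][0]["delta"].get("content")
    if content:
        print(content, end="", flush=True)
print()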

Error responses

All errors follow the OpenAI error format:
{
  "error": {
    "message": "<human-readable message>",
    "type": "<error_type>"
  }
}
400 Bad Request
invalid_request_error
Malformed JSON body, or the messages array is missing or empty.
500 Internal Server Error
server_error
Internal routing error or other unexpected failure.
502 Bad Gateway
upstream_error
The upstream model returned a non-success response.
504 Gateway Timeout
timeout_error
The upstream model did not respond within the configured timeout.
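A client can branch on the HTTP status and the error.type field. A sketch with the requests Python package; the empty messages array here is just a convenient way to trigger the 400 case:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "auto", "messages": []},  # empty messages: expect 400
)

if not resp.ok:
    err = resp.json()["error"]
    if resp.status_code == 400:
        print("bad request:", err["message"])  # invalid_request_error
    elif resp.status_code in (502, 504):
        print("upstream problem, worth retrying:", err["type"])
    else:
        print("server error:", err["type"], err["message"])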

Examples

curl http://localhost:8080/v1/chat/completions \
  --request POST \
  --header "Content-Type: application/json" \
  --data '{
    "model": "auto",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'

Non-streaming response example

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711670400,
  "model": "gpt-4.1-nano",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 9,
    "total_tokens": 24
  }
}

Streaming response example

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{"content":" capital of France is Paris."},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
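To consume the stream with an SDK instead of raw SSE, an OpenAI-compatible client handles the event framing and the [DONE] terminator automatically. A minimal sketch with the openai Python package, reusing the local gateway URL from the curl example (the api_key placeholder is ignored by the gateway):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)

# Each chunk corresponds to one StreamChunk event above.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()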
The gateway always requests logprobs from the drafter model internally to compute token entropy. This does not affect the response you receive — logprob fields are not forwarded to clients.
