Send a chat completion request to the Draft Thinker gateway. The gateway evaluates each response from the drafter model using token-level entropy and either accepts the draft or escalates to the heavyweight model — all transparently to the caller.

Endpoint

POST /v1/chat/completions

Authentication

The gateway does not authenticate callers. It reads OPENAI_API_KEY from its environment at startup and uses it directly for all upstream model and embedding API calls, so no Authorization header is required from clients.
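Because no credentials are required, any OpenAI-compatible client can be pointed at the gateway directly. A minimal sketch using the openai Python package, assuming the gateway is running on localhost:8080 as in the examples below (the api_key value is a placeholder the client library requires; the gateway ignores it):

from openai import OpenAI

# The gateway ignores client credentials and uses its own OPENAI_API_KEY
# for upstream calls, so any placeholder api_key works here.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="auto",  # ignored for routing; see the model field below
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)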

Request body

model
string
required
Model name. The gateway ignores this field for routing purposes — routing is decided by entropy analysis of the drafter’s output, not the model name. Set to "auto" or any value accepted by your upstream provider.
messages
object[]
required
Array of message objects that make up the conversation.
stream
boolean
default:"false"
When true, the response is streamed as Server-Sent Events (SSE). Each event carries a StreamChunk JSON object. The stream ends with data: [DONE].
temperature
number
Sampling temperature, typically between 0 and 2. Passed through to the upstream model.
max_tokens
integer
Maximum number of tokens to generate. Passed through to the upstream model.
top_p
number
Nucleus sampling probability mass. Passed through to the upstream model.
stop
string[]
Up to 4 sequences where the model will stop generating further tokens.
user
string
A stable identifier representing the end user. Passed through to the upstream model.
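Putting these fields together, an illustrative request body that sets every optional parameter might look like the following (the values are examples, not gateway defaults):

{
  "model": "auto",
  "messages": [
    { "role": "user", "content": "Summarize the plot of Hamlet in two sentences." }
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 256,
  "top_p": 0.9,
  "stop": ["\n\n"],
  "user": "user-1234"
}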

Response headers

Every response includes these headers set by the gateway middleware and handler.
X-Routing-Decision
accept | escalate | cache_hit
How the request was fulfilled: accept means the drafter output was used directly; escalate means the drafter entropy was too high and the heavyweight model was called; cache_hit means the response was served from the semantic cache.
X-Request-Duration-Ms
integer string
Total duration of the request in milliseconds, from receipt to final byte written.
X-Request-ID
hex string
Unique request identifier. Echoed from the incoming X-Request-ID header if provided; otherwise generated as a random 16-character hex string.
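The routing headers are useful for verifying gateway behavior from a client. A short sketch using the requests Python package against the local deployment from the examples below:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    },
)
resp.raise_for_status()

# accept, escalate, or cache_hit, as documented above
print("routing decision:", resp.headers["X-Routing-Decision"])
print("duration (ms):", resp.headers["X-Request-Duration-Ms"])
print("request id:", resp.headers["X-Request-ID"])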

Responses

Non-streaming response

When stream is false (the default), the gateway returns a single JSON object.
id
string
required
Unique identifier for this completion, e.g. "chatcmpl-abc123".
object
string
required
Always "chat.completion".
created
integer
required
Unix timestamp (seconds) when the completion was created.
model
string
required
Name of the model that produced the response.
choices
object[]
required
Array of completion choices. Typically contains one element.
usage
object
Token usage statistics. Present when available from the upstream model.

Streaming response (SSE)

When stream is true, the response uses Content-Type: text/event-stream. Each event has the form:
data: <JSON>

and the stream terminates with a final event:
data: [DONE]

Each <JSON> payload is a StreamChunk object:
id
string
required
Completion identifier, consistent across all chunks in a stream.
object
string
required
Always "chat.completion.chunk".
created
integer
required
Unix timestamp of when the stream started.
model
string
required
Name of the model that produced this chunk.
choices
object[]
required
Array of delta choices.
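For callers that do not use an SDK, the SSE framing can be parsed by hand. A sketch with the requests Python package, assuming each event's JSON fits on a single data: line as in the gateway output shown in the examples below:

import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "stream": True,
    },
    stream=True,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank separators between events
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break  # stream terminator
    chunk = json.loads(payload)  # one StreamChunk object
    content = chunk["choices"][0]["delta"].get("content")
    if content:
        print(content, end="", flush=True)
print()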

Error responses

All errors follow the OpenAI error format:
{
  "error": {
    "message": "<human-readable message>",
    "type": "<error_type>"
  }
}
400 Bad Request
invalid_request_error
Malformed JSON body, or the messages array is missing or empty.
500 Internal Server Error
server_error
Internal routing error or other unexpected failure.
502 Bad Gateway
upstream_error
The upstream model returned a non-success response.
504 Gateway Timeout
timeout_error
The upstream model did not respond within the configured timeout.
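A client can branch on the HTTP status and the error.type field. A sketch with the requests Python package; the empty messages array here is just a convenient way to trigger the 400 case:

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "auto", "messages": []},  # empty messages: expect 400
)

if not resp.ok:
    err = resp.json()["error"]
    if resp.status_code == 400:
        print("bad request:", err["message"])  # invalid_request_error
    elif resp.status_code in (502, 504):
        print("upstream problem, worth retrying:", err["type"])
    else:
        print("server error:", err["type"], err["message"])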

Examples

curl http://localhost:8080/v1/chat/completions \
  --request POST \
  --header "Content-Type: application/json" \
  --data '{
    "model": "auto",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'

Non-streaming response example

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711670400,
  "model": "gpt-4.1-nano",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 9,
    "total_tokens": 24
  }
}

Streaming response example

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{"content":" capital of France is Paris."},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1711670400,"model":"gpt-4.1-nano","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
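To consume the stream with an SDK instead of raw SSE, an OpenAI-compatible client handles the event framing and the [DONE] terminator automatically. A minimal sketch with the openai Python package, reusing the local gateway URL from the curl example (the api_key placeholder is ignored by the gateway):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
)

# Each chunk corresponds to one StreamChunk event above.
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()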
The gateway always requests logprobs from the drafter model internally to compute token entropy. This does not affect the response you receive — logprob fields are not forwarded to clients.
