## Endpoint
## Authentication

The gateway performs no client-side authentication. It reads `OPENAI_API_KEY` from its environment at startup and uses it directly for all upstream model and embedding API calls. No `Authorization` header is required from callers.
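A minimal sketch of a client request, showing that no `Authorization` header is sent. The gateway address and path here are assumptions for illustration; substitute your actual deployment.

```python
import json
import urllib.request

# Hypothetical gateway address; adjust to your deployment.
GATEWAY_URL = "http://localhost:8080/v1/chat/completions"

body = {"model": "auto", "messages": [{"role": "user", "content": "Hi"}]}

# Note: no Authorization header -- the gateway authenticates upstream
# with its own OPENAI_API_KEY, so callers send none.
req = urllib.request.Request(
    GATEWAY_URL,
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
assert "Authorization" not in req.headers
```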
## Request body

The request body is a JSON object with the following fields.

| Field | Type | Description |
|---|---|---|
| `model` | string | Ignored for routing: routing is decided by entropy analysis of the drafter's output, not by the model name. Set to `"auto"` or any value accepted by your upstream provider. |
| `messages` | array | Array of message objects that make up the conversation. |
| `stream` | boolean | When `true`, the response is streamed as Server-Sent Events (SSE). Each event carries a `StreamChunk` JSON object, and the stream ends with `data: [DONE]`. |
| `temperature` | number | Sampling temperature, typically between 0 and 2. Passed through to the upstream model. |
| `max_tokens` | integer | Maximum number of tokens to generate. Passed through to the upstream model. |
| `top_p` | number | Nucleus sampling probability mass. Passed through to the upstream model. |
| `stop` | string or array | Up to 4 sequences where the model will stop generating further tokens. |
| `user` | string | A stable identifier representing the end user. Passed through to the upstream model. |
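A sketch of a complete request body using the fields above; the message content and parameter values are placeholders, not recommendations.

```python
import json

# Illustrative request body; all values below are placeholders.
body = {
    "model": "auto",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize SSE in one sentence."},
    ],
    "stream": False,
    "temperature": 0.7,
    "max_tokens": 256,
    "top_p": 1.0,
    "stop": ["\n\n"],
    "user": "user-1234",
}
payload = json.dumps(body)  # serialized JSON sent as the POST body
```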
## Response headers

Every response includes the following headers, set by the gateway middleware and handler.

| Header | Values | Description |
|---|---|---|
| `X-Routing-Decision` | `accept`, `escalate`, `cache_hit` | How the request was fulfilled: `accept` means the drafter output was used directly; `escalate` means the drafter entropy was too high and the heavyweight model was called; `cache_hit` means the response was served from the semantic cache. |
| `X-Request-Duration-Ms` | integer string | Total duration of the request in milliseconds, from receipt to final byte written. |
| `X-Request-ID` | hex string | Unique request identifier. Echoed from the incoming `X-Request-ID` header if provided; otherwise generated as a random 16-character hex string. |
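A small sketch of reading these headers on the client side. The header names come from the table above; the sample values are invented for illustration.

```python
# Summarize how a request was fulfilled from the gateway's headers.
def describe_routing(headers: dict) -> str:
    decision = headers.get("X-Routing-Decision", "unknown")
    ms = int(headers.get("X-Request-Duration-Ms", "0"))
    return f"{decision} in {ms} ms"

# Illustrative header values (a real client would read these from
# the HTTP response object).
sample = {
    "X-Routing-Decision": "escalate",
    "X-Request-Duration-Ms": "1342",
    "X-Request-ID": "9f8a6c2d1b4e7a30",
}
print(describe_routing(sample))  # prints: escalate in 1342 ms
```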
## Responses

### Non-streaming response
When `stream` is `false` (the default), the gateway returns a single JSON object with the following fields.

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier for this completion, e.g. `"chatcmpl-abc123"`. |
| `object` | string | Always `"chat.completion"`. |
| `created` | integer | Unix timestamp (seconds) when the completion was created. |
| `model` | string | Name of the model that produced the response. |
| `choices` | array | Array of completion choices. Typically contains one element. |
| `usage` | object | Token usage statistics. Present when available from the upstream model. |
### Streaming response (SSE)

When `stream` is `true`, the response uses `Content-Type: text/event-stream`. Each event has the form `data: <JSON>`, where the `<JSON>` payload is a `StreamChunk` object with the following fields.

| Field | Type | Description |
|---|---|---|
| `id` | string | Completion identifier, consistent across all chunks in a stream. |
| `object` | string | Always `"chat.completion.chunk"`. |
| `created` | integer | Unix timestamp of when the stream started. |
| `model` | string | Name of the model that produced this chunk. |
| `choices` | array | Array of delta choices. |
## Error responses

All errors follow the OpenAI error format:

| Status | `type` | Cause |
|---|---|---|
| 400 Bad Request | `invalid_request_error` | Malformed JSON body, or the `messages` array is missing or empty. |
| 500 Internal Server Error | `server_error` | Internal routing error or unexpected failure. |
| 502 Bad Gateway | `upstream_error` | The upstream model returned a non-success response. |
| 504 Gateway Timeout | `timeout_error` | The upstream model did not respond within the configured timeout. |
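One way a client might react to these statuses. Treating 502 and 504 as retryable is an assumption on the client's part, not a documented gateway policy.

```python
# Statuses from the error table that plausibly succeed on retry:
# upstream_error (502) and timeout_error (504). Client-side assumption.
RETRYABLE_STATUSES = {502, 504}

def should_retry(status: int) -> bool:
    """Return True if the client should retry the request."""
    return status in RETRYABLE_STATUSES

assert should_retry(504)
assert not should_retry(400)  # a bad request will never succeed on retry
```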
## Examples
### Non-streaming response example
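An illustrative non-streaming response, parsed and inspected in Python. Every value here (the id, timestamp, model name, message text, and token counts) is invented for demonstration and will differ in a real deployment; the choice-object layout follows the usual OpenAI shape and is an assumption.

```python
import json

# Invented sample response matching the field table above.
response_json = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "drafter-small",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 7, "total_tokens": 16}
}
"""
completion = json.loads(response_json)
print(completion["choices"][0]["message"]["content"])
# prints: Hello! How can I help?
```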
### Streaming response example
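A sketch of consuming the SSE stream: parse each `data:` line as a `StreamChunk`, accumulate the delta content, and stop at `[DONE]`. The sample events below are invented; a real client would read lines from the HTTP response body.

```python
import json

# Invented sample events; delta layout follows the usual OpenAI
# chunk shape and is an assumption.
sample_lines = [
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk",'
    ' "created": 1700000000, "model": "auto",'
    ' "choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk",'
    ' "created": 1700000000, "model": "auto",'
    ' "choices": [{"index": 0, "delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]

def collect_content(lines):
    """Concatenate delta content from a sequence of SSE lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk["choices"]:
            parts.append(choice["delta"].get("content", ""))
    return "".join(parts)

print(collect_content(sample_lines))  # prints: Hello!
```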
The gateway always requests logprobs from the drafter model internally to compute token entropy. This does not affect the response you receive — logprob fields are not forwarded to clients.