POST /v1/chat/completions — chat completions endpoint

The /v1/chat/completions endpoint is the primary interface for conversational AI in MonoRelay. It accepts a list of messages and returns a model-generated reply, routing the request to the appropriate upstream provider based on your model routing configuration. The endpoint is compatible with the OpenAI Chat Completions API, so a client or SDK that targets OpenAI can be pointed at MonoRelay by changing only its base URL and API key.

Method and path

POST /v1/chat/completions

Authentication

All requests must include a valid Bearer token in the Authorization header. MonoRelay validates this against your configured access key or a JWT issued after login.

Authorization: Bearer <your-access-token>
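
As a minimal sketch, the header can be attached using only Python's standard library. The host relay.example.com, the token value, and the helper name build_request are placeholders for illustration, not part of MonoRelay. The request is only constructed here; nothing is sent until it is passed to urllib.request.urlopen.

```python
import json
import urllib.request


def build_request(host: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated chat completions request."""
    return urllib.request.Request(
        url=f"https://{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "relay.example.com",  # placeholder host
    "my-access-token",    # placeholder token
    {"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]},
)
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```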

Request body

model

string

required

The model to use. Accepts a plain model name (e.g. gpt-4o), a configured alias, or model@provider syntax to target a specific provider explicitly (e.g. gpt-4o@openai).
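
To illustrate the model@provider syntax, the sketch below splits a model string on the client side. The helper split_model is hypothetical. MonoRelay resolves names and aliases on the server, and its actual resolution logic is not shown here.

```python
from typing import Optional, Tuple


def split_model(model: str) -> Tuple[str, Optional[str]]:
    """Split 'name@provider' into (name, provider); provider is None if absent."""
    name, sep, provider = model.partition("@")
    return (name, provider if sep else None)


print(split_model("gpt-4o@openai"))
print(split_model("gpt-4o"))
```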

messages

object[]

required

An ordered list of messages representing the conversation history. Each object must have a role (system, user, assistant, or tool) and a content field (string or content-part array for vision inputs).
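
A short sketch of a messages list that mixes a plain string with a content-part array. The part shapes (type, text, image_url) follow OpenAI's published format and are assumed to pass through MonoRelay unchanged. The image URL and the check_messages helper are illustrative only, and the check covers just the role and content rules described above.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [  # content-part array for a vision input
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    },
]


def check_messages(msgs: list) -> None:
    """Raise ValueError if a message has an unknown role or a bad content type."""
    for i, msg in enumerate(msgs):
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), (str, list)):
            raise ValueError(f"message {i}: content must be a string or a list of parts")


check_messages(messages)  # passes silently for the list above
```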

stream

boolean

default:"false"

When true, the response is sent as a series of Server-Sent Events (SSE) ending with data: [DONE]. When false (the default), the response is a single JSON object.

temperature

number

Sampling temperature between 0 and 2. Higher values produce more random output. Cannot be used together with top_p.

top_p

number

Nucleus sampling probability mass. Only the tokens comprising the top top_p probability are considered. Cannot be used together with temperature.

n

number

default:"1"

Number of chat completion choices to generate for each message. Generating more than one choice increases token consumption.

stop

string | string[]

Up to four sequences where the model will stop generating further tokens. The stop sequence itself is not included in the output.

max_tokens

integer

The maximum number of tokens to generate. When omitted, the model’s default limit applies.

presence_penalty

number

Number between -2.0 and 2.0. Positive values penalize tokens that have already appeared, encouraging the model to discuss new topics.

frequency_penalty

number

Number between -2.0 and 2.0. Positive values penalize tokens proportional to how often they have appeared, reducing verbatim repetition.

tools

object[]

A list of tools the model may call. Each entry follows the OpenAI function definition schema with type, function.name, function.description, and function.parameters.

tool_choice

string | object

Controls which tool (if any) the model calls. Use "none" to disable tools, "auto" to let the model decide, or {"type": "function", "function": {"name": "..."}} to force a specific function.
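
As a sketch of how tools and tool_choice fit together in a request body: the function get_weather, its parameters, and the helper force_tool are hypothetical and exist only to show the shape described above.

```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function name
            "description": "Look up the current weather for a city.",
            "parameters": {  # JSON Schema describing the arguments
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def force_tool(name: str) -> dict:
    """Build a tool_choice value that forces one specific function."""
    return {"type": "function", "function": {"name": name}}


payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Weather in Oslo?"}],
    "tools": tools,
    "tool_choice": force_tool("get_weather"),  # or "auto" / "none"
}
print(json.dumps(payload, indent=2))
```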

response_format

object

An object specifying the output format. Set {"type": "json_object"} to enable JSON mode and guarantee the response is valid JSON. Not all providers support this field.
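
Because not every provider honors response_format, a client may still want to parse the reply defensively. The sketch below shows a request body with JSON mode enabled and a tolerant parser. The helper parse_reply and the prompt text are illustrative assumptions.

```python
import json

payload = {
    "model": "gpt-4o",
    "response_format": {"type": "json_object"},
    "messages": [
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": "List three primary colors under the key 'colors'."},
    ],
}


def parse_reply(content: str):
    """Return the decoded object, or None if the reply is not valid JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None


print(parse_reply('{"colors": ["red", "yellow", "blue"]}'))
print(parse_reply("Sure! Here are the colors..."))
```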

seed

integer

If specified, MonoRelay passes this seed to the upstream provider to encourage deterministic sampling. Identical seeds with identical parameters should produce the same output, though this is best-effort.
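
The constraints listed for the sampling parameters can be checked on the client before a request is sent. The following is a minimal sketch of such a check, based only on the limits stated above. The function validate_sampling is hypothetical, and MonoRelay or the upstream provider may enforce these rules differently.

```python
def validate_sampling(params: dict) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if "temperature" in params and "top_p" in params:
        problems.append("temperature and top_p cannot be used together")
    temperature = params.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")
    for key in ("presence_penalty", "frequency_penalty"):
        value = params.get(key)
        if value is not None and not -2.0 <= value <= 2.0:
            problems.append(f"{key} must be between -2.0 and 2.0")
    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most four sequences")
    return problems


print(validate_sampling({"temperature": 0.7, "stop": ["\n"]}))
print(validate_sampling({"temperature": 3, "top_p": 0.9}))
```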

Streaming

When stream: true is set, MonoRelay returns an SSE stream. Each event is a data: line containing a JSON delta object in the standard OpenAI chunk format. The stream closes with a final data: [DONE] event. To consume the stream with curl, use --no-buffer to disable output buffering:

curl --no-buffer https://<host>/v1/chat/completions \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "messages": [{"role": "user", "content": "Count to 5."}]
  }'
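
On the client side, the stream can be reassembled by reading data: lines until the [DONE] sentinel. The sketch below does this over any iterable of text lines. The sample lines are hand-written, abbreviated chunks in the OpenAI delta shape, not captured MonoRelay output, and the function name iter_deltas is an assumption.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "1, 2"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": ", 3"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # prints: 1, 2, 3
```

In a real client, the lines would come from the HTTP response body decoded as UTF-8 rather than from a list.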

Tool calling

MonoRelay forwards tools and tool_choice to the upstream provider unchanged. If the resolved model appears in the tool_calling.unsupported_models list in your configuration and auto_downgrade is enabled, MonoRelay automatically strips tool definitions from the request before forwarding, preventing upstream errors on models that do not support function calling.

Tool auto-downgrade is controlled by the tool_calling.auto_downgrade setting in config.yml. When enabled, requests to unsupported models silently omit tool definitions.
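
The downgrade behavior described above can be pictured with the following sketch. It is an illustration, not MonoRelay's implementation: the function maybe_strip_tools is hypothetical, it compares the model string directly instead of resolving aliases first, and dropping tool_choice along with tools is an assumption.

```python
def maybe_strip_tools(payload: dict, unsupported_models: set, auto_downgrade: bool) -> dict:
    """Return a copy of the payload without tool fields when a downgrade applies."""
    if not auto_downgrade or payload.get("model") not in unsupported_models:
        return payload  # forwarded unchanged
    # Assumption: tool_choice is removed together with the tool definitions.
    return {k: v for k, v in payload.items() if k not in ("tools", "tool_choice")}


request = {
    "model": "legacy-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hi"}],
    "tools": [{"type": "function", "function": {"name": "get_weather"}}],
    "tool_choice": "auto",
}
print(sorted(maybe_strip_tools(request, {"legacy-model"}, auto_downgrade=True)))
print(sorted(maybe_strip_tools(request, {"legacy-model"}, auto_downgrade=False)))
```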

Examples

from openai import OpenAI

client = OpenAI(
    base_url="https://<host>/v1",
    api_key="<your-access-token>",
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is MonoRelay?"},
    ],
)
print(response.choices[0].message.content)

Error responses

Errors are returned as JSON with an error object. The HTTP status code is 503 for upstream and provider failures, and 401 for authentication errors.

{
  "error": {
    "message": "[openai] No available keys for provider 'openai'",
    "type": "no_keys"
  }
}

Common error types:

Type	Description
`no_keys`	No enabled API keys are available for the resolved provider.
`provider_disabled`	The resolved provider is disabled in configuration.
`upstream_error`	The upstream provider returned a non-2xx response.
`proxy_error`	An internal network or serialization error occurred.
`cascade_error`	All models in a cascade chain failed.
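
A client can turn the error body into a readable message by looking up the type field. The sketch below assumes the body shape shown above. The helper describe_error and the fallback text for unknown types are illustrative.

```python
import json

ERROR_TYPES = {
    "no_keys": "No enabled API keys are available for the resolved provider.",
    "provider_disabled": "The resolved provider is disabled in configuration.",
    "upstream_error": "The upstream provider returned a non-2xx response.",
    "proxy_error": "An internal network or serialization error occurred.",
    "cascade_error": "All models in a cascade chain failed.",
}


def describe_error(status: int, body: str) -> str:
    """Summarize an error response using its HTTP status and error object."""
    err = json.loads(body).get("error", {})
    kind = err.get("type", "unknown")
    hint = ERROR_TYPES.get(kind, "Unrecognized error type.")
    return f"HTTP {status} [{kind}] {err.get('message', '')} ({hint})"


body = '{"error": {"message": "[openai] No available keys for provider \'openai\'", "type": "no_keys"}}'
print(describe_error(503, body))
```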
