Endpoint

POST /v1/messages
Send a message to the model and receive a response. Supports both streaming and non-streaming modes.

Request Body

model
string
required
The model to use for generation. Examples:
  • claude-opus-4-6-thinking
  • claude-sonnet-4-5-thinking
  • gemini-3-flash
Use GET /v1/models to see all available models.
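For example, to list the models exposed by the proxy (assuming the same local base URL as the request examples below):
curl http://localhost:8080/v1/models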
messages
array
required
Array of message objects representing the conversation history. Each message has:
  • role (string): Either user or assistant
  • content (string | array): Message content as text or array of content blocks
[
  {
    "role": "user",
    "content": "What is the capital of France?"
  }
]
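Content can also be an array of content blocks rather than a plain string, for example (a minimal sketch using the text block type described under Response):
[
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What is the capital of France?" }
    ]
  }
]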
max_tokens
number
default:"4096"
Maximum number of tokens to generate in the response. For Gemini models, this is automatically capped at 16384 (Gemini’s limit).
stream
boolean
default:"false"
Enable streaming mode. When true, the response is sent as Server-Sent Events (SSE).
system
string
System instruction to guide the model’s behavior.
"system": "You are a helpful coding assistant."
tools
array
Array of tool definitions for function calling. Each tool has:
  • name (string): Tool name
  • description (string): What the tool does
  • input_schema (object): JSON Schema for tool parameters
[
  {
    "name": "search_files",
    "description": "Search for files matching a pattern",
    "input_schema": {
      "type": "object",
      "properties": {
        "pattern": { "type": "string" }
      },
      "required": ["pattern"]
    }
  }
]
tool_choice
object
Control which tool the model should use:
  • {"type": "auto"} - Model decides (default)
  • {"type": "any"} - Model must use a tool
  • {"type": "tool", "name": "tool_name"} - Use specific tool
thinking
object
Enable extended thinking for supported models:
{
  "type": "enabled",
  "budget_tokens": 10000
}
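A full request with extended thinking enabled might look like this (a sketch; the budget shown is arbitrary, and max_tokens must exceed budget_tokens):
curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6-thinking",
    "max_tokens": 16000,
    "thinking": { "type": "enabled", "budget_tokens": 10000 },
    "messages": [
      { "role": "user", "content": "Why is the sky blue?" }
    ]
  }'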
temperature
number
Sampling temperature. Higher values make output more random.
top_p
number
Nucleus sampling threshold.
top_k
number
Top-K sampling parameter (Gemini only).
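The sampling parameters can be combined in one request body, for example (values are illustrative, not recommendations):
{
  "model": "gemini-3-flash",
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "messages": [
    { "role": "user", "content": "Suggest a name for a CLI tool" }
  ]
}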

Response

Non-Streaming Response

id
string
Unique message identifier.
type
string
Always "message".
role
string
Always "assistant".
content
array
Array of content blocks. Each block can be:
  • Text block: {"type": "text", "text": "..."}
  • Thinking block: {"type": "thinking", "thinking": "...", "signature": "..."}
  • Tool use block: {"type": "tool_use", "id": "...", "name": "...", "input": {...}}
model
string
The model that generated the response.
stop_reason
string
Why the model stopped generating:
  • "end_turn" - Natural completion
  • "max_tokens" - Hit token limit
  • "tool_use" - Model called a tool
  • "stop_sequence" - Hit stop sequence
usage
object
Token usage statistics:
  • input_tokens (number): Tokens in the prompt
  • output_tokens (number): Tokens generated
  • cache_creation_input_tokens (number): Tokens cached (if prompt caching is used)
  • cache_read_input_tokens (number): Tokens read from cache
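Putting the fields together, a non-streaming response looks like this (values are illustrative):
{
  "id": "msg_01...",
  "type": "message",
  "role": "assistant",
  "content": [
    { "type": "text", "text": "The capital of France is Paris." }
  ],
  "model": "claude-sonnet-4-5-thinking",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 14,
    "output_tokens": 8
  }
}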

Streaming Response

When stream: true, the response is sent as Server-Sent Events:
event: message_start
data: {"type":"message_start","message":{"id":"msg_01...","type":"message","role":"assistant","content":[],"model":"claude-sonnet-4-5-thinking"}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"The capital"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" of France is"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" Paris."}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":8}}

event: message_stop
data: {"type":"message_stop"}

Examples

Basic Request

curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5-thinking",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'

Streaming Request

curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-3-flash",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Write a haiku about coding"
      }
    ]
  }'

With Tools

curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6-thinking",
    "max_tokens": 2048,
    "messages": [
      {
        "role": "user",
        "content": "Find the package.json file"
      }
    ],
    "tools": [
      {
        "name": "search_files",
        "description": "Search for files matching a pattern",
        "input_schema": {
          "type": "object",
          "properties": {
            "pattern": { "type": "string", "description": "Glob pattern" }
          },
          "required": ["pattern"]
        }
      }
    ]
  }'
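After the model responds with a tool_use block, execute the tool and return its output as a tool_result block in a follow-up user message, per the Anthropic Messages convention (a sketch; the tool_use id and result value are illustrative):
curl -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-opus-4-6-thinking",
    "max_tokens": 2048,
    "messages": [
      { "role": "user", "content": "Find the package.json file" },
      {
        "role": "assistant",
        "content": [
          {
            "type": "tool_use",
            "id": "toolu_01...",
            "name": "search_files",
            "input": { "pattern": "**/package.json" }
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {
            "type": "tool_result",
            "tool_use_id": "toolu_01...",
            "content": "./package.json"
          }
        ]
      }
    ],
    "tools": [
      {
        "name": "search_files",
        "description": "Search for files matching a pattern",
        "input_schema": {
          "type": "object",
          "properties": {
            "pattern": { "type": "string", "description": "Glob pattern" }
          },
          "required": ["pattern"]
        }
      }
    ]
  }'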

Prompt Caching

The proxy automatically handles prompt caching to reduce latency and token usage:
  • Caching is organization-scoped (a cache hit requires the same account and session ID)
  • Session ID is derived from the SHA256 hash of the first user message
  • Cached tokens are reported in usage.cache_read_input_tokens

How It Works

  1. First request with a conversation → creates cache
  2. Subsequent requests with the same account → reads from cache
  3. If account switches → cache miss, new cache created
To maximize cache hits, use the sticky or hybrid account selection strategy.
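As a sketch of the session derivation described above (the exact input to the hash is internal to the proxy; assume here it hashes the raw text of the first user message):
# The first user message never changes as the conversation grows,
# so the derived session ID stays stable across follow-up requests.
printf '%s' "What is the capital of France?" | sha256sum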

Error Responses

400 Bad Request - Invalid Parameters

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "messages is required and must be an array"
  }
}

401 Unauthorized - Missing API Key

{
  "type": "error",
  "error": {
    "type": "authentication_error",
    "message": "Invalid or missing API key"
  }
}

503 Service Unavailable - All Accounts Exhausted

{
  "type": "error",
  "error": {
    "type": "api_error",
    "message": "No accounts available"
  }
}

400 Bad Request - Quota Exhausted

When all accounts are rate-limited for the requested model:
{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "RESOURCE_EXHAUSTED: You have exhausted your capacity on claude-opus-4-6-thinking. Quota will reset after 2h15m."
  }
}
The proxy returns 400 (not 429) for quota exhaustion to prevent clients from automatically retrying. This ensures Claude Code stops cleanly instead of entering a retry loop.
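A client that retries on its own can tell quota exhaustion apart from other 400s by inspecting the message prefix, for example (an illustrative shell check using jq):
resp=$(curl -s -X POST http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-opus-4-6-thinking","max_tokens":64,"messages":[{"role":"user","content":"hi"}]}')

# RESOURCE_EXHAUSTED marks a quota error; back off until the reported reset time.
if echo "$resp" | jq -e '(.error.message // "") | startswith("RESOURCE_EXHAUSTED")' > /dev/null; then
  echo "Quota exhausted; wait for the reset window instead of retrying."
fi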
