Create chat completion
Model identifier. Accepts model ID strings, lists for routing, or DedalusModel objects with per-model settings.
Conversation history (OpenAI: `messages`, Google: `contents`, Responses: `input`). Each message should have:
- `role`: `system` | `user` | `assistant` | `tool` | `function` | `developer`
- `content`: string or array of content parts
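For example, a minimal conversation history expressed as a plain list of message dicts (the assistant reply shown is hypothetical):

```python
# A minimal conversation history. The role/content shapes follow the
# parameter description above; the assistant turn is a made-up reply.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    # content may also be an array of content parts:
    {"role": "user", "content": [{"type": "text", "text": "And of Spain?"}]},
]
```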
Configuration parameters
Sampling temperature (0-2 for most providers). Higher values make output more random.
Maximum tokens in completion.
Maximum tokens in completion (newer parameter name).
Nucleus sampling threshold. An alternative to sampling with temperature.
Top-k sampling parameter. Limits the number of highest probability tokens to consider.
How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices.
Random seed for deterministic output.
Enable streaming responses. Set to `true` to receive incremental chunks via Server-Sent Events.
Options for the streaming response. Only set this when you set `stream: true`.
System and instructions
System instruction/prompt. Defines the behavior and personality of the assistant.
Toggles between reasoning mode and no system prompt. When set to `reasoning`, the system prompt for reasoning models will be used.
- `"reasoning"`: Use reasoning mode
Response formatting
An object specifying the format that the model must output.
- `{ "type": "json_schema", "json_schema": {...} }`: enables Structured Outputs, which ensures the model will match your supplied JSON schema.
- `{ "type": "json_object" }`: enables the older JSON mode.
- `{ "type": "text" }`: plain text (default).
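As an illustration, a `response_format` payload requesting Structured Outputs; the `weather_report` schema is a hypothetical example:

```python
# Structured Outputs sketch: the model's output must match this JSON schema.
# The schema name and fields are invented for illustration.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "weather_report",
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "temp_c": {"type": "number"},
            },
            "required": ["city", "temp_c"],
        },
    },
}
```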
Tools and function calling
Available tools/functions for the model. Each tool should have:
- `type`: `function`
- `function`: Function definition with name, description, and parameters
Controls which (if any) tool is called by the model.
- `"none"`: Model will not call any tool.
- `"auto"`: Model can pick between generating a message or calling tools (default if tools are present).
- `"required"`: Model must call one or more tools.
- `{"type": "function", "function": {"name": "my_function"}}`: Forces the model to call that specific tool.
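A sketch of a single function tool and a `tool_choice` that forces the model to call it; the `get_weather` function is hypothetical:

```python
# One function tool, following the shape described above.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]
# Force the model to call that specific tool.
tool_choice = {"type": "function", "function": {"name": "get_weather"}}
```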
Whether to enable parallel tool calls. Allows the model to call multiple tools simultaneously.
Execute tools server-side. If false, returns raw tool calls for manual handling.
Tool calling configuration (Google-specific).
Deprecated in favor of `tools`. A list of functions the model may generate JSON inputs for.
Deprecated in favor of `tool_choice`. Controls which (if any) function is called by the model.
MCP servers
MCP server identifiers. Accepts marketplace slugs, URLs, or MCPServerSpec objects. MCP tools are executed server-side and billed separately.
Credentials for MCP server authentication. Each credential is matched to servers by connection name.
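For example, an `mcp_servers` list mixing the accepted forms described above; both the slug and the URL here are hypothetical:

```python
# mcp_servers accepts marketplace slugs or URLs (MCPServerSpec objects are
# also accepted per the description above). These entries are invented.
mcp_servers = [
    "example-org/web-search",          # marketplace slug (hypothetical)
    "https://mcp.example.com/server",  # direct URL (hypothetical)
]
```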
Advanced parameters
Sequences that stop generation. Up to 4 sequences where the API will stop generating further tokens.
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics.
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim.
Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100.
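A small sketch of a `logit_bias` map; the token IDs are hypothetical, and values near the extremes effectively ban or strongly favor a token:

```python
# Maps tokenizer token IDs (string keys, as in JSON) to bias values
# in [-100, 100]. Token IDs 1234 and 5678 are invented for illustration.
logit_bias = {
    "1234": -100,  # effectively ban this token
    "5678": 5,     # mildly favor this token
}
```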
Whether to return log probabilities of the output tokens or not. If `true`, returns the log probabilities of each output token returned in the `content` of `message`.
An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability. `logprobs` must be set to `true` if this parameter is used.
Note: this field is being replaced by `safety_identifier` and `prompt_cache_key`. A stable identifier for your end-users. Used to boost cache hit rates and to help detect and prevent abuse.
A stable identifier used to help detect users of your application that may be violating usage policies. We recommend hashing their username or email address.
Audio and modalities
Output types that you would like the model to generate. Most models are capable of generating text, which is the default: `["text"]`. The `gpt-4o-audio-preview` model can also generate audio. To request both: `["text", "audio"]`.
Parameters for audio output. Required when audio output is requested with `modalities: ["audio"]`. Fields:
- `voice` (required): Voice ID or custom voice
- `format` (required): `wav` | `aac` | `mp3` | `flac` | `opus` | `pcm16`
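For instance, requesting both text and audio output; the voice ID is an example and may differ by model:

```python
# Request text plus audio output. Format values come from the field list
# above; treat the voice ID as an assumption.
modalities = ["text", "audio"]
audio = {"voice": "alloy", "format": "wav"}
```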
Caching and optimization
Used to cache responses for similar requests to optimize your cache hit rates. Replaces the `user` field.
The retention policy for the prompt cache. Set to `24h` to enable extended prompt caching, which keeps cached prefixes active for longer.
Optional. The name of the cached content to use as context to serve the prediction. Format: `cachedContents/{cachedContent}` (Google-specific).
Reasoning and thinking
Constrains effort on reasoning for reasoning models. Currently supported values are `none`, `minimal`, `low`, `medium`, `high`, and `xhigh`.
- `gpt-5.1` defaults to `none`
- Models before `gpt-5.1` default to `medium`
- `gpt-5-pro` defaults to `high`
Extended thinking configuration (Anthropic-specific).
Static predicted output content, such as the content of a text file that is being regenerated. Fields:
- `type` (required): `content`
- `content` (required): string or array of text content parts
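A minimal `prediction` payload following the field list above; the file content is a hypothetical example of text being regenerated:

```python
# Predicted-output sketch: pass the content you expect the model to
# largely reproduce. The snippet itself is invented for illustration.
prediction = {
    "type": "content",
    "content": "def add(a, b):\n    return a + b\n",
}
```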
Safety and content filtering
Whether to inject a safety prompt before all conversations.
Safety/content filtering settings (Google-specific).
Content filtering and safety policy configuration.
Model routing and handoffs
Agent attributes. Values in [0.0, 1.0]. Used for model routing decisions.
Model attributes for routing. Maps model IDs to attribute dictionaries with values in [0.0, 1.0].
Configuration for multi-model handoffs.
Handoff control. `None` or omitted: auto-detect. `true`: structured handoff (SDK). `false`: drop-in (LLM re-run for mixed turns).
Stable session ID for resuming a previous handoff. Returned by the server on handoff; echo it on the next request to resume.
Tier 2 stateless resumption. Deferred tool specs from a previous handoff response, sent back verbatim so the server can resume without Redis.
Maximum conversation turns.
Service configuration
Service tier for request processing. Options: “auto” | “default” | “flex” | “scale” | “priority”.
The inference speed mode for this request. `"fast"` enables high output-tokens-per-second inference.
- `"standard"`: Default speed
- `"fast"`: High-speed inference
Specifies the geographic region for inference processing. If not specified, the workspace's `default_inference_geo` is used.
If set to `true`, the request returns a `request_id`. You can then get the deferred response via `GET /v1/chat/deferred-completion/{request_id}`.
Metadata and tracking
Set of 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
Whether or not to store the output of this chat completion request for use in model distillation or evals products.
Provider-specific parameters
Generation parameters wrapper (Google-specific).
Output configuration (provider-specific).
Sets the parameters to be used for search data. If not set, no data will be acquired by the model.
This tool searches the web for relevant results to use in a response.
Constrains the verbosity of the model's response. Currently supported values are `low`, `medium`, and `high`.
Request options
Send extra headers with the request.
Add additional query parameters to the request.
Add additional JSON properties to the request body.
Override the client-level default timeout for this request, in seconds.
Specify a custom idempotency key for this request.
Response
Returns a `ChatCompletion` object when `stream=False` (default), or a `Stream[ChatCompletionChunk]` when `stream=True`.
ChatCompletion fields
A unique identifier for the chat completion.
The object type, which is always `chat.completion`.
The Unix timestamp (in seconds) of when the chat completion was created.
The model used for the chat completion.
A list of chat completion choices. Can be more than one if `n` is greater than 1. Each choice contains:
- `index`: The index of this choice
- `message`: The chat completion message with `role` and `content`
- `finish_reason`: `stop` | `length` | `tool_calls` | `content_filter` | null
- `logprobs`: Log probability information (if requested)
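Reading the first choice out of a non-streaming response body might look like this; the response values are invented for illustration:

```python
# A hypothetical non-streaming response body, trimmed to the choice fields
# described above.
completion = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
            "logprobs": None,
        }
    ],
}
reply = completion["choices"][0]["message"]["content"]
finish = completion["choices"][0]["finish_reason"]
```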
Usage statistics for the completion request. Fields:
- `prompt_tokens`: Number of tokens in the prompt
- `completion_tokens`: Number of tokens in the completion
- `total_tokens`: Total tokens used
- `completion_tokens_details`: Breakdown of completion tokens
- `prompt_tokens_details`: Breakdown of prompt tokens
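The token counts are related by a simple sum, which is handy as a sanity check when aggregating costs (counts below are invented):

```python
# total_tokens is the sum of prompt and completion tokens.
usage = {"prompt_tokens": 12, "completion_tokens": 5, "total_tokens": 17}
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
```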
This fingerprint represents the backend configuration that the model runs with. Can be used in conjunction with the `seed` request parameter to understand when backend changes have been made that might impact determinism.
The processing type used for serving the request.
Dedalus-specific response fields
List of tool names that were executed server-side (e.g., MCP tools). Only present when tools were executed on the server.
Detailed results of MCP tool executions including inputs, outputs, and timing. Provides full visibility into server-side tool execution for debugging and audit purposes.
MCP server failures keyed by server name. Each error contains:
- `message`: Human-readable error message
- `code`: Machine-readable error code
- `recommendation`: Suggested action for the user
Client tools to execute, with dependency ordering. Each pending tool contains:
- `id`: Unique identifier for this tool call
- `name`: Name of the tool to execute
- `arguments`: Input arguments for the tool call
- `dependencies`: IDs of other pending calls that must complete first
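A client can honor the dependency ordering with a topological sort before executing anything. A minimal sketch, assuming the field names above; the tool names and IDs are hypothetical:

```python
# Order pending client tools so each runs only after its dependencies.
from graphlib import TopologicalSorter

pending_tools = [
    {"id": "c", "name": "summarize", "arguments": {}, "dependencies": ["a", "b"]},
    {"id": "a", "name": "fetch", "arguments": {"url": "https://example.com"}, "dependencies": []},
    {"id": "b", "name": "parse", "arguments": {}, "dependencies": ["a"]},
]

# Build id -> dependency-set graph and compute an execution order.
graph = {t["id"]: set(t["dependencies"]) for t in pending_tools}
order = list(TopologicalSorter(graph).static_order())
```

Each ID then appears in `order` only after all of its dependencies, so the client can execute the calls sequentially (or batch independent ones).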
Completed server tool outputs keyed by call ID.
Stable session ID for cross-turn handoff state. Echo this on the next request to resume server-side execution.
Server tools blocked on client results.
Number of internal LLM calls made during this request. SDKs can sum this across their outer loop to track total LLM calls.