The /v1/realtime WebSocket endpoint uses the OpenAI Realtime wire protocol. Every message is a JSON object with a type field that identifies the event. This page documents every client-to-server and server-to-client event the server currently supports, along with their payload fields and representative JSON examples.
For a conceptual overview of how these events fit into the pipeline, see the Realtime API reference.
Stream raw PCM audio to the server. The server decodes the base64 payload, resamples to the internal pipeline rate of 16 kHz, buffers any partial 512-sample chunk as a remainder, and puts complete chunks on the VAD’s input queue.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "input_audio_buffer.append" |
| audio | string | Base64-encoded PCM16 audio bytes. The source sample rate is read from session.audio.input.format.rate; defaults to 16 kHz if not set. |
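As a rough illustration of the append path, the Python sketch below builds an input_audio_buffer.append message and mimics the 512-sample chunking with a carried-over remainder. The helper names make_append_event and split_chunks are hypothetical, and the chunking function only approximates what the description above says the server does; it is not the server's code.

```python
import base64
import json

def make_append_event(pcm16: bytes) -> str:
    """Serialise raw PCM16 bytes as one input_audio_buffer.append message."""
    if len(pcm16) % 2:
        raise ValueError("PCM16 data must contain whole 2-byte samples")
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16).decode("ascii"),
    })

def split_chunks(samples, remainder, size=512):
    """Prepend the previous remainder, cut complete chunks of `size`
    samples, and return (chunks, new_remainder).
    Hypothetical helper: an approximation of the buffering described above."""
    buf = list(remainder) + list(samples)
    n_full = len(buf) // size
    chunks = [buf[i * size:(i + 1) * size] for i in range(n_full)]
    return chunks, buf[n_full * size:]
```

Calling split_chunks repeatedly with the remainder from the previous call keeps every emitted chunk at exactly 512 samples, regardless of how the client slices its appends.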
Update the session configuration. The server performs a deep-merge: only explicitly-set fields are applied, so partial updates never overwrite unset fields. The updated config is immediately visible to the VAD, LLM, and TTS handlers for the next processing cycle.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "session.update" |
| session.instructions | string | System prompt injected at the start of every LLM context. |
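The deep-merge behaviour can be pictured with a small Python sketch. The function deep_merge is a hypothetical stand-in, written under the assumption that nested objects are merged key by key and any other value replaces the old one; the server's actual merge routine may differ in details such as list handling.

```python
from copy import deepcopy

def deep_merge(current: dict, update: dict) -> dict:
    """Return a new config where keys present in `update` override `current`.
    Nested dicts are merged recursively; keys absent from `update` are kept.
    Hypothetical helper, not the server's implementation."""
    merged = deepcopy(current)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

For example, merging {"audio": {"input": {"format": {"rate": 24000}}}} into a session that already has instructions set changes only the input rate and leaves the instructions untouched.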
Inject an input_text message or a function_call_output into the LLM conversation context without triggering generation. A subsequent response.create is required to generate a reply.
If a response is currently generating, the item is deferred: it is buffered and applied in arrival order once the active response completes. This prevents a function_call_output from arriving before its paired function_call is recorded in the chat history.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "conversation.item.create" |
| item.type | string | "message" for text, "function_call_output" for tool results. |
| item.role | string | "user" or "system" for message items. |
| item.content | array | For message items: array of { "type": "input_text", "text": "..." } objects. |
| item.call_id | string | For function_call_output items: the call_id from the matching response.function_call_arguments.done event. |
| item.output | string | For function_call_output items: the serialised result of the tool execution. |
{ "type": "conversation.item.create", "item": { "type": "message", "role": "user", "content": [ { "type": "input_text", "text": "What is the capital of France?" } ] }}
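The deferral rule described above can be modelled with a short Python sketch. ConversationContext and its attributes are hypothetical names chosen for illustration; the sketch assumes a single active-response flag and a FIFO buffer, which matches the described behaviour but is not taken from the server's source.

```python
from collections import deque

class ConversationContext:
    """Toy model of item deferral while a response is generating."""

    def __init__(self):
        self.history = []          # items visible to the LLM
        self.deferred = deque()    # items waiting for the active response
        self.response_active = False

    def create_item(self, item: dict) -> bool:
        """Apply the item now, or buffer it. Returns True if applied."""
        if self.response_active:
            self.deferred.append(item)
            return False
        self.history.append(item)
        return True

    def finish_response(self) -> list:
        """Mark the response complete and apply buffered items in arrival order."""
        self.response_active = False
        applied = []
        while self.deferred:
            item = self.deferred.popleft()
            self.history.append(item)
            applied.append(item)
        return applied
```

In this model, each item returned by finish_response corresponds to a point where the server would emit conversation.item.created for a deferred item.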
Trigger LLM generation immediately. For the voice-activity-detection path, generation is triggered automatically after STT completes; this event is used for text-only inputs, tool-result follow-ups, or to override generation parameters per-response.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.create" |
| response.instructions | string | Per-response system prompt override. Merged with (or replaces) the session-level instructions. |
| response.tool_choice | string | "auto", "required", or "none". Only string values are currently supported. |
| response.input | array | Additional conversation items to include for this response only (in-band). Leave empty to continue the default conversation. |
{ "type": "response.create", "response": { "instructions": "Answer in exactly one sentence.", "tool_choice": "auto" }}
Sending response.create while another response is in progress returns an error event with code conversation_already_has_active_response. Cancel the active response first with response.cancel.
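A client can reduce the chance of hitting this error by tracking the response lifecycle locally. The Python sketch below is one possible approach; ResponseGate is a hypothetical class that only watches response.created and response.done. Because response.created on the voice path is sent with the first outbound audio chunk, the gate is best-effort, and the error event should still be handled.

```python
import json

class ResponseGate:
    """Best-effort client-side guard against overlapping responses."""

    def __init__(self):
        self.active_id = None  # ID of the response currently in progress

    def on_server_event(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "response.created":
            self.active_id = event["response"]["id"]
        elif kind == "response.done":
            self.active_id = None

    def can_create(self) -> bool:
        return self.active_id is None

    def build_create(self, **overrides) -> str:
        """Serialise response.create, refusing if a response is known to be active."""
        if not self.can_create():
            raise RuntimeError(
                f"response {self.active_id} is in progress; send response.cancel first"
            )
        return json.dumps({"type": "response.create", "response": overrides})
```

For instance, gate.build_create(instructions="Answer in exactly one sentence.", tool_choice="auto") produces the same message as the JSON example above when no response is active.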
Cancel the in-progress response and re-enable listening. The server stops LLM and TTS generation, flushes the output queues, and emits response.done with status="cancelled" and reason="client_cancelled".
{ "type": "error", "event_id": "event_xyz789", "error": { "type": "conversation_already_has_active_response", "message": "Cannot create response while another response is in progress." }}
Emitted by the VAD when user speech is detected. Carries the timestamp (in milliseconds from the start of the current audio buffer) at which speech began.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "input_audio_buffer.speech_started" |
| event_id | string | Unique event identifier. |
| audio_start_ms | integer | Millisecond offset within the audio buffer where speech was detected. |
| item_id | string | ID of the input audio conversation item being recorded. |
Acknowledgement for a successful conversation.item.create request. Also emitted when a deferred item (buffered during an active response) is applied after the response completes.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "conversation.item.created" |
| event_id | string | Unique event identifier. |
| previous_item_id | string or null | ID of the item that immediately precedes this one in the conversation. |
Final transcript for the user turn, emitted once STT has finished processing the full utterance. The transcript field is what the LLM receives as the user message.
Emitted when the server begins generating a response. For the explicit response.create path this is sent immediately; for the VAD → STT → LLM path it is sent on the first outbound audio chunk.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.created" |
| event_id | string | Unique event identifier. |
| response.id | string | Unique response identifier, referenced by all subsequent response events. |
| response.status | string | "in_progress" |
| response.conversation_id | string or null | Conversation ID, or null for out-of-band responses. |
| response.usage | object | Token counters (zero at creation time; final values appear in response.done). |
One chunk of base64-encoded PCM audio from TTS. Audio chunks are batched by the _send_loop (up to 6400 bytes per message) to reduce WebSocket frame overhead.
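To make the batching concrete, here is a minimal Python sketch of coalescing TTS output into messages of at most 6400 bytes. The functions batch_audio and encode_batches are hypothetical and only approximate what a send loop of this kind might do; the real _send_loop may also flush on a timer or at the end of an utterance.

```python
import base64

MAX_BATCH_BYTES = 6400  # per-message cap stated above

def batch_audio(chunks, max_bytes=MAX_BATCH_BYTES):
    """Coalesce an iterable of PCM byte strings into batches of at most
    `max_bytes`, flushing whatever is left at the end."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= max_bytes:
            yield bytes(buf[:max_bytes])
            del buf[:max_bytes]
    if buf:
        yield bytes(buf)

def encode_batches(chunks):
    """Base64-encode each batch, ready to be placed in an audio delta event."""
    return [base64.b64encode(b).decode("ascii") for b in batch_audio(chunks)]
```

On the receiving side, a client would base64-decode each payload and append the bytes to its playback buffer in arrival order.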
Audio stream complete for the current output item. Sent after the final response.output_audio.delta chunk, and also on cancellation (with no preceding delta events for that item).
Full assistant text transcript for the turn. Emitted once, after response.output_audio.done, carrying the complete LLM output text that was synthesised into audio.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.output_audio_transcript.done" |
| event_id | string | Unique event identifier. |
| response_id | string | ID of the enclosing response. |
| item_id | string | ID of the output audio item. |
| output_index | integer | Index of the output item within the response. |
| content_index | integer | Content part index. |
| transcript | string | Full assistant response text. |
{ "type": "response.output_audio_transcript.done", "event_id": "event_009", "response_id": "resp_abc123", "item_id": "item_out001", "output_index": 0, "content_index": 0, "transcript": "It is currently 3:42 PM UTC."}
Emitted for each tool call produced during generation. The client should execute the named function with the provided arguments, then send a conversation.item.create with type: "function_call_output" to return the result.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.function_call_arguments.done" |
| event_id | string | Unique event identifier. |
| response_id | string | ID of the enclosing response. |
| item_id | string | ID of the output item. |
| output_index | integer | Index of this output item within the response. |
| call_id | string | Unique identifier for this tool invocation. Pass this back in conversation.item.create as call_id. |
| name | string | Name of the function to call. |
| arguments | string | JSON-encoded arguments object matching the tool’s parameter schema. |
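A client-side handler for this event might look like the following Python sketch. The function handle_function_call and the tools registry (a dict mapping function names to callables) are hypothetical; reporting failures as an {"error": ...} object in output is an assumption made for illustration, not something the protocol prescribes.

```python
import json

def handle_function_call(event: dict, tools: dict) -> dict:
    """Run the tool named in a response.function_call_arguments.done event
    and build the conversation.item.create message that returns its result."""
    try:
        func = tools[event["name"]]
    except KeyError:
        result = {"error": f"unknown tool: {event['name']}"}
    else:
        try:
            # `arguments` is a JSON string, so it must be decoded first.
            result = func(**json.loads(event["arguments"]))
        except Exception as exc:  # report failures to the LLM instead of crashing
            result = {"error": str(exc)}
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],   # must match the originating call
            "output": json.dumps(result),  # output is a serialised string
        },
    }
```

The returned dict can be serialised with json.dumps and sent over the WebSocket as-is.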
Response lifecycle complete. Sent after response.output_audio.done (for audio responses) or response.output_text.done (for text-only responses). Always emitted, even for cancelled or failed responses.
| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.done" |
| event_id | string | Unique event identifier. |
| response.id | string | ID of the completed response. |
| response.status | string | "completed", "cancelled", or "failed". |
| response.status_details.type | string | Mirrors response.status. |
| response.status_details.reason | string or null | "turn_detected" (barge-in), "client_cancelled" (response.cancel), or null for "completed". |
| response.usage | object | Final input_tokens, output_tokens, and total_tokens for this response. |
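Since response.done is always emitted, it is a convenient place to decide what happened to a turn. The Python sketch below maps the status and reason values listed above to a short label; describe_done is a hypothetical helper, and it assumes status_details may be missing or null on completed responses.

```python
def describe_done(event: dict) -> str:
    """Summarise a response.done event using the status and reason fields."""
    response = event["response"]
    status = response["status"]
    # status_details is assumed to be optional; treat a missing value as empty.
    reason = (response.get("status_details") or {}).get("reason")
    total = (response.get("usage") or {}).get("total_tokens", 0)

    if status == "completed":
        label = "completed"
    elif status == "cancelled" and reason == "turn_detected":
        label = "interrupted by user speech"
    elif status == "cancelled" and reason == "client_cancelled":
        label = "cancelled by client"
    else:
        label = status  # "failed" or a combination not listed above
    return f"{label} ({total} tokens)"
```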
The sequence below illustrates the full round-trip for a single tool call, from the server’s response.function_call_arguments.done to a follow-up spoken response.
1. Server emits tool call. The server sends response.function_call_arguments.done with call_id, name, and arguments. The response may or may not include audio alongside the tool call, depending on whether the LLM generated a lead-in phrase.
2. Server closes the response. response.output_audio.done (if audio was generated) and response.done are sent with status="completed".
3. Client executes the tool. The client calls the function locally and serialises the result as a JSON string.
4. Client submits the result. The client sends conversation.item.create with type: "function_call_output", call_id matching the one from step 1, and output containing the result string.
5. Server acknowledges. The server appends the tool output to the conversation context and sends conversation.item.created. Generation is not triggered yet.
6. Client triggers follow-up (optional). If the result should be spoken aloud (e.g. a search result), the client sends response.create. The LLM generates a reply that incorporates the tool output, and the TTS speaks it. For fire-and-forget actions (e.g. a robot movement command), the client can stop after conversation.item.created, because the assistant has already spoken its lead-in phrase before the tool call.
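The round-trip above can be sketched as a small client-side state machine in Python. ToolCallFlow, its tools registry, and the speak_result flag are hypothetical names; the sketch waits for response.done before submitting results, which is one valid ordering, and it assumes a bare {"type": "response.create"} with no response object is accepted by the server.

```python
import json

class ToolCallFlow:
    """Maps incoming server events to the client events that should follow."""

    def __init__(self, tools: dict, speak_result: bool = True):
        self.tools = tools                # name -> callable (hypothetical registry)
        self.speak_result = speak_result  # False for fire-and-forget actions
        self.pending = []                 # tool calls seen in the current response
        self.awaiting_ack = 0             # submitted results not yet acknowledged

    def on_event(self, event: dict) -> list:
        kind = event["type"]
        if kind == "response.function_call_arguments.done":   # step 1
            self.pending.append(event)
            return []
        if kind == "response.done":                           # steps 2 to 4
            outgoing = []
            for call in self.pending:
                result = self.tools[call["name"]](**json.loads(call["arguments"]))
                outgoing.append({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": call["call_id"],
                        "output": json.dumps(result),
                    },
                })
            self.pending = []
            self.awaiting_ack += len(outgoing)
            return outgoing
        if kind == "conversation.item.created" and self.awaiting_ack:  # step 5
            self.awaiting_ack -= 1
            if self.awaiting_ack == 0 and self.speak_result:           # step 6
                return [{"type": "response.create"}]
        return []
```

Setting speak_result=False models the fire-and-forget case, where the flow ends once the server acknowledges the tool output.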