The /v1/realtime WebSocket endpoint uses the OpenAI Realtime wire protocol. Every message is a JSON object with a type field that identifies the event. This page documents every client-to-server and server-to-client event the server currently supports, along with their payload fields and representative JSON examples. For a conceptual overview of how these events fit into the pipeline, see the Realtime API reference.

Client → Server events

Clients send these events over the established WebSocket connection. Unknown type values produce an error event with code unknown_or_invalid_event.

input_audio_buffer.append

Stream raw PCM audio to the server. The server decodes the base64 payload, resamples to the internal pipeline rate of 16 kHz, buffers any partial 512-sample chunk as a remainder, and puts complete chunks on the VAD’s input queue.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "input_audio_buffer.append" |
| audio | string | Base64-encoded PCM16 audio bytes. The source sample rate is read from session.audio.input.format.rate; defaults to 16 kHz if not set. |

{
  "type": "input_audio_buffer.append",
  "audio": "AAAAAAAAAAAA..."
}
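
As a minimal sketch of the client side, the snippet below packs 16-bit samples into PCM16 bytes and wraps them in an append event. It assumes little-endian, mono samples, which this page does not state explicitly. The `split_chunks` helper is hypothetical and only illustrates the 512-sample remainder arithmetic described above; it is not the server's code.

```python
import base64
import json
import struct

CHUNK_SAMPLES = 512  # chunk size this page describes for the VAD input queue


def make_append_event(samples):
    """Pack signed 16-bit samples (assumed little-endian) into an append event."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    }


def split_chunks(buffered, incoming):
    """Illustrative only: return (complete 512-sample chunks, leftover remainder)."""
    data = list(buffered) + list(incoming)
    full = len(data) // CHUNK_SAMPLES * CHUNK_SAMPLES
    chunks = [data[i:i + CHUNK_SAMPLES] for i in range(0, full, CHUNK_SAMPLES)]
    return chunks, data[full:]


event = make_append_event([0] * 1300)
message = json.dumps(event)  # the JSON text a client would send over the socket
chunks, remainder = split_chunks([], [0] * 1300)
print(len(chunks), len(remainder))  # 2 276
```

With 1300 samples at 16 kHz, two complete chunks go forward and 276 samples wait for the next append.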

session.update

Update the session configuration. The server performs a deep merge: only explicitly set fields are applied, so a partial update leaves every omitted field at its current value. The updated config is immediately visible to the VAD, LLM, and TTS handlers for the next processing cycle.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "session.update" |
| session.instructions | string | System prompt injected at the start of every LLM context. |
| session.tools | array | JSON Schema tool definitions (see Tool calling). |
| session.voice | string | TTS voice identifier passed to the active TTS backend. |
| session.turn_detection.type | string | Must be "server_vad". |
| session.turn_detection.interrupt_response | boolean | Whether new user speech cancels an in-progress response. Defaults to true. |
| session.audio | object | Input and output audio format configuration. The server resamples internally to 16 kHz regardless of the chosen format. |

{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful assistant. Be concise.",
    "voice": "af_heart",
    "turn_detection": {
      "type": "server_vad",
      "interrupt_response": true
    },
    "tools": [
      {
        "type": "function",
        "name": "get_time",
        "description": "Return the current UTC time.",
        "parameters": {
          "type": "object",
          "properties": {},
          "required": []
        }
      }
    ]
  }
}
Only realtime session types are accepted. Sending a RealtimeTranscriptionSessionCreateRequest returns an error event with code invalid_session_type.
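
The merge semantics of session.update can be pictured with a small recursive function. This is an illustration under assumptions, not the server's implementation: `deep_merge` is a hypothetical helper, and it treats lists (such as tools) and explicit null values as plain replacements, which this page does not confirm.

```python
import copy


def deep_merge(current, update):
    """Return a new config in which only keys present in `update` change."""
    merged = copy.deepcopy(current)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the previous value (assumption).
            merged[key] = copy.deepcopy(value)
    return merged


session = {
    "instructions": "You are a helpful assistant.",
    "voice": "af_heart",
    "turn_detection": {"type": "server_vad", "interrupt_response": True},
}
update = {"turn_detection": {"interrupt_response": False}}
merged = deep_merge(session, update)
print(merged["turn_detection"])
# {'type': 'server_vad', 'interrupt_response': False}
```

The update touches one nested field; instructions, voice, and turn_detection.type keep their earlier values.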

conversation.item.create

Inject an input_text message or a function_call_output into the LLM conversation context without triggering generation. A subsequent response.create is required to generate a reply. If a response is currently generating, the item is deferred: it is buffered and applied in arrival order once the active response completes. This prevents a function_call_output from arriving before its paired function_call is recorded in the chat history.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "conversation.item.create" |
| item.type | string | "message" for text, "function_call_output" for tool results. |
| item.role | string | "user" or "system" for message items. |
| item.content | array | For message items: array of { "type": "input_text", "text": "..." } objects. |
| item.call_id | string | For function_call_output items: the call_id from the matching response.function_call_arguments.done event. |
| item.output | string | For function_call_output items: the serialised result of the tool execution. |

{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      { "type": "input_text", "text": "What is the capital of France?" }
    ]
  }
}
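
Because injecting an item does not trigger generation, a text-only turn takes two client events. The sketch below builds both; `user_text_item` and `text_turn` are hypothetical helper names, and sending response.create without a response object is an assumption carried over from the OpenAI Realtime protocol.

```python
import json


def user_text_item(text, role="user"):
    """Build a conversation.item.create event carrying one input_text part."""
    if role not in ("user", "system"):
        raise ValueError("message items accept role 'user' or 'system'")
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": role,
            "content": [{"type": "input_text", "text": text}],
        },
    }


def text_turn(text):
    """Events for a text-only turn: inject the message, then request a reply."""
    return [user_text_item(text), {"type": "response.create"}]


for event in text_turn("What is the capital of France?"):
    print(json.dumps(event))
```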

response.create

Trigger LLM generation immediately. For the voice-activity-detection path, generation is triggered automatically after STT completes; this event is used for text-only inputs, tool-result follow-ups, or to override generation parameters per-response.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.create" |
| response.instructions | string | Per-response system prompt override. Merged with (or replaces) the session-level instructions. |
| response.tool_choice | string | "auto", "required", or "none". Only string values are currently supported. |
| response.input | array | Additional conversation items to include for this response only (in-band). Leave empty to continue the default conversation. |

{
  "type": "response.create",
  "response": {
    "instructions": "Answer in exactly one sentence.",
    "tool_choice": "auto"
  }
}
Sending response.create while another response is in progress returns an error event with code conversation_already_has_active_response. Cancel the active response first with response.cancel.
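
A client can reduce the chance of hitting this error by tracking response.created and response.done itself. The `ResponseGate` class below is a hypothetical, best-effort sketch: on the VAD path response.created only arrives with the first audio chunk, so a response can already be active before the client knows about it, and the error event still needs handling.

```python
class ResponseGate:
    """Track whether a response is active, based on observed server events."""

    def __init__(self):
        self.active_response_id = None

    def on_server_event(self, event):
        if event["type"] == "response.created":
            self.active_response_id = event["response"]["id"]
        elif event["type"] == "response.done":
            self.active_response_id = None

    def next_events(self, response_params=None, cancel_active=False):
        """Return the client events to send now in order to start a response."""
        create = {"type": "response.create"}
        if response_params:
            create["response"] = response_params
        if self.active_response_id is None:
            return [create]
        if cancel_active:
            # Send the cancel first; retry response.create after response.done.
            return [{"type": "response.cancel"}]
        return []


gate = ResponseGate()
gate.on_server_event({"type": "response.created", "response": {"id": "resp_abc123"}})
print(gate.next_events())  # []
gate.on_server_event({"type": "response.done", "response": {"id": "resp_abc123"}})
print(gate.next_events({"tool_choice": "auto"}))
```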

response.cancel

Cancel the in-progress response and re-enable listening. The server stops LLM and TTS generation, flushes the output queues, and emits response.done with status="cancelled" and reason="client_cancelled".

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.cancel" |

{
  "type": "response.cancel"
}

Server → Client events

The server sends these events over the WebSocket connection. Every event includes an event_id field with a unique identifier string.

session.created

Sent immediately after the WebSocket connection is accepted, before any other event. Contains the full current session configuration.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "session.created" |
| event_id | string | Unique event identifier. |
| session | object | Current RealtimeSessionCreateRequest; includes all fields from the active RuntimeConfig. |

{
  "type": "session.created",
  "event_id": "event_abc123",
  "session": {
    "type": "realtime",
    "instructions": null,
    "voice": null,
    "turn_detection": null,
    "tools": []
  }
}

error

Sent when a protocol or processing error occurs. The error.type field contains a machine-readable code.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "error" |
| event_id | string | Unique event identifier. |
| error.type | string | Machine-readable error code (see table below). |
| error.message | string | Human-readable description of the error. |

| Error code | Trigger |
| --- | --- |
| session_limit_reached | All pipeline slots are occupied; new connection rejected. |
| unknown_or_invalid_event | Client sent an event with an unrecognised or missing type field. |
| invalid_session_type | session.update targeted a transcription session, which is not supported. |
| conversation_already_has_active_response | response.create was sent while another response is still active. |
| response_failed | Generation failed (e.g. invalid out-of-band input, provider rejected empty context). |

{
  "type": "error",
  "event_id": "event_xyz789",
  "error": {
    "type": "conversation_already_has_active_response",
    "message": "Cannot create response while another response is in progress."
  }
}
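
A client will usually branch on error.type rather than on the message text. The sketch below parses an error event and looks up a suggested reaction. The `SUGGESTED_ACTION` mapping and its action names are hypothetical client policy, not something the server prescribes.

```python
import json

# Hypothetical client-side policy keyed by the documented error codes.
SUGGESTED_ACTION = {
    "session_limit_reached": "reconnect_later",
    "unknown_or_invalid_event": "fix_request",
    "invalid_session_type": "fix_request",
    "conversation_already_has_active_response": "cancel_or_wait",
    "response_failed": "report",
}


def parse_error(raw):
    """Return (code, message, action) for an error event, or None otherwise."""
    event = json.loads(raw)
    if event.get("type") != "error":
        return None
    err = event.get("error") or {}
    code = err.get("type", "unknown")
    return code, err.get("message", ""), SUGGESTED_ACTION.get(code, "report")


raw = json.dumps({
    "type": "error",
    "event_id": "event_xyz789",
    "error": {
        "type": "conversation_already_has_active_response",
        "message": "Cannot create response while another response is in progress.",
    },
})
print(parse_error(raw)[2])  # cancel_or_wait
```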

input_audio_buffer.speech_started

Emitted by the VAD when user speech is detected. Carries the timestamp (in milliseconds from the start of the current audio buffer) at which speech began.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "input_audio_buffer.speech_started" |
| event_id | string | Unique event identifier. |
| audio_start_ms | integer | Millisecond offset within the audio buffer where speech was detected. |
| item_id | string | ID of the input audio conversation item being recorded. |

{
  "type": "input_audio_buffer.speech_started",
  "event_id": "event_001",
  "audio_start_ms": 320,
  "item_id": "item_aaa111"
}

input_audio_buffer.speech_stopped

Emitted by the VAD when the end of a speech segment is detected. After this event, the full utterance audio is forwarded to STT.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "input_audio_buffer.speech_stopped" |
| event_id | string | Unique event identifier. |
| audio_end_ms | integer | Millisecond offset within the audio buffer where speech ended. |
| item_id | string | ID of the input audio conversation item. |

{
  "type": "input_audio_buffer.speech_stopped",
  "event_id": "event_002",
  "audio_end_ms": 2140,
  "item_id": "item_aaa111"
}

conversation.item.created

Acknowledgement for a successful conversation.item.create request. Also emitted when a deferred item (buffered during an active response) is applied after the response completes.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "conversation.item.created" |
| event_id | string | Unique event identifier. |
| previous_item_id | string or null | ID of the item that immediately precedes this one in the conversation. |
| item | object | The ConversationItem that was injected. |

{
  "type": "conversation.item.created",
  "event_id": "event_003",
  "previous_item_id": "item_aaa111",
  "item": {
    "id": "item_bbb222",
    "type": "message",
    "role": "user",
    "content": [{ "type": "input_text", "text": "What time is it?" }]
  }
}

conversation.item.input_audio_transcription.delta

Streaming partial transcript emitted by TranscriptionNotifier as STT tokens arrive. Only sent when --enable_live_transcription is active.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "conversation.item.input_audio_transcription.delta" |
| event_id | string | Unique event identifier. |
| item_id | string | ID of the in-progress input audio item. |
| content_index | integer | Index within the item’s content array. |
| delta | string | Incremental transcript text. |

{
  "type": "conversation.item.input_audio_transcription.delta",
  "event_id": "event_004",
  "item_id": "item_aaa111",
  "content_index": 0,
  "delta": "What time"
}

conversation.item.input_audio_transcription.completed

Final transcript for the user turn, emitted once STT has finished processing the full utterance. The transcript field is what the LLM receives as the user message.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "conversation.item.input_audio_transcription.completed" |
| event_id | string | Unique event identifier. |
| item_id | string | ID of the input audio conversation item. |
| content_index | integer | Index within the item’s content array. |
| transcript | string | Full recognised transcript text. |
| usage.type | string | "duration": indicates the usage metric is measured in seconds. |
| usage.seconds | number | Duration of the audio segment in seconds. |

{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "event_005",
  "item_id": "item_aaa111",
  "content_index": 0,
  "transcript": "What time is it right now?",
  "usage": {
    "type": "duration",
    "seconds": 2.14
  }
}
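
For a live caption display, a client can append delta events per item_id and switch to the final transcript when the completed event arrives. The `TranscriptTracker` class is a hypothetical sketch; it assumes each delta is a pure suffix to concatenate, as the "incremental" wording suggests.

```python
PREFIX = "conversation.item.input_audio_transcription."


class TranscriptTracker:
    """Accumulate partial transcripts and keep the final text per item."""

    def __init__(self):
        self.partial = {}  # item_id -> text received so far
        self.final = {}    # item_id -> completed transcript

    def on_event(self, event):
        kind = event["type"]
        if kind == PREFIX + "delta":
            item_id = event["item_id"]
            self.partial[item_id] = self.partial.get(item_id, "") + event["delta"]
        elif kind == PREFIX + "completed":
            item_id = event["item_id"]
            self.partial.pop(item_id, None)
            self.final[item_id] = event["transcript"]

    def text(self, item_id):
        """Prefer the final transcript; fall back to the partial text."""
        return self.final.get(item_id, self.partial.get(item_id, ""))


tracker = TranscriptTracker()
tracker.on_event({"type": PREFIX + "delta", "item_id": "item_aaa111", "delta": "What time"})
tracker.on_event({"type": PREFIX + "delta", "item_id": "item_aaa111", "delta": " is it"})
print(tracker.text("item_aaa111"))  # What time is it
```

Preferring the final transcript matters because it is the text the LLM actually receives, and it may differ from the concatenated deltas.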

response.created

Emitted when the server begins generating a response. For the explicit response.create path this is sent immediately; for the VAD → STT → LLM path it is sent on the first outbound audio chunk.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.created" |
| event_id | string | Unique event identifier. |
| response.id | string | Unique response identifier, referenced by all subsequent response events. |
| response.status | string | "in_progress" |
| response.conversation_id | string or null | Conversation ID, or null for out-of-band responses. |
| response.usage | object | Token counters (zero at creation time; final values appear in response.done). |

{
  "type": "response.created",
  "event_id": "event_006",
  "response": {
    "id": "resp_abc123",
    "object": "realtime.response",
    "status": "in_progress",
    "conversation_id": "conv_xyz",
    "usage": {
      "input_tokens": 0,
      "output_tokens": 0,
      "total_tokens": 0
    }
  }
}

response.output_audio.delta

One chunk of base64-encoded PCM audio from TTS. Audio chunks are batched by the _send_loop (up to 6400 bytes per message) to reduce WebSocket frame overhead.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.output_audio.delta" |
| event_id | string | Unique event identifier. |
| response_id | string | ID of the enclosing response. |
| item_id | string | ID of the output audio item. |
| output_index | integer | Index of the output item within the response. |
| content_index | integer | Content part index within the output item. |
| delta | string | Base64-encoded PCM16 audio. Output sample rate reflects session.audio.output.format.rate (default 16 kHz). |

{
  "type": "response.output_audio.delta",
  "event_id": "event_007",
  "response_id": "resp_abc123",
  "item_id": "item_out001",
  "output_index": 0,
  "content_index": 0,
  "delta": "AAAAAAAAAAAA..."
}
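
To play or measure the audio, the client decodes each delta back into 16-bit samples. The sketch below assumes little-endian, mono PCM16 (neither is stated on this page) and also works out how much audio a full 6400-byte message holds at the default rate.

```python
import base64
import struct

BYTES_PER_SAMPLE = 2  # PCM16 uses two bytes per sample


def decode_delta(event):
    """Decode the base64 payload into a list of signed 16-bit samples."""
    pcm = base64.b64decode(event["delta"])
    count = len(pcm) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"<{count}h", pcm[:count * BYTES_PER_SAMPLE]))


def duration_ms(num_bytes, sample_rate=16000):
    """Playback time of `num_bytes` of mono PCM16 at `sample_rate`."""
    return num_bytes * 1000 / (BYTES_PER_SAMPLE * sample_rate)


print(duration_ms(6400))  # 200.0
```

At 16 kHz, 6400 bytes are 3200 samples, so a full batched message carries 200 ms of audio.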

response.output_audio.done

Audio stream complete for the current output item. Sent after the final response.output_audio.delta chunk, and also on cancellation (with no preceding delta events for that item).

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.output_audio.done" |
| event_id | string | Unique event identifier. |
| response_id | string | ID of the enclosing response. |
| item_id | string | ID of the output audio item. |
| output_index | integer | Index of the output item within the response. |
| content_index | integer | Content part index. |

{
  "type": "response.output_audio.done",
  "event_id": "event_008",
  "response_id": "resp_abc123",
  "item_id": "item_out001",
  "output_index": 0,
  "content_index": 0
}

response.output_audio_transcript.done

Full assistant text transcript for the turn. Emitted once, after response.output_audio.done, carrying the complete LLM output text that was synthesised into audio.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.output_audio_transcript.done" |
| event_id | string | Unique event identifier. |
| response_id | string | ID of the enclosing response. |
| item_id | string | ID of the output audio item. |
| output_index | integer | Index of the output item within the response. |
| content_index | integer | Content part index. |
| transcript | string | Full assistant response text. |

{
  "type": "response.output_audio_transcript.done",
  "event_id": "event_009",
  "response_id": "resp_abc123",
  "item_id": "item_out001",
  "output_index": 0,
  "content_index": 0,
  "transcript": "It is currently 3:42 PM UTC."
}

response.function_call_arguments.done

Emitted for each tool call produced during generation. The client should execute the named function with the provided arguments, then send a conversation.item.create with type: "function_call_output" to return the result.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.function_call_arguments.done" |
| event_id | string | Unique event identifier. |
| response_id | string | ID of the enclosing response. |
| item_id | string | ID of the output item. |
| output_index | integer | Index of this output item within the response. |
| call_id | string | Unique identifier for this tool invocation. Pass this back in conversation.item.create as call_id. |
| name | string | Name of the function to call. |
| arguments | string | JSON-encoded arguments object matching the tool’s parameter schema. |

{
  "type": "response.function_call_arguments.done",
  "event_id": "event_010",
  "response_id": "resp_abc123",
  "item_id": "item_out002",
  "output_index": 1,
  "call_id": "call_xyz456",
  "name": "get_weather",
  "arguments": "{\"city\": \"Paris\"}"
}
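
Since arguments arrives as a JSON string, the client has to parse it before calling the function. The sketch below dispatches to a local registry and builds the function_call_output item. `get_weather` is a stub, `TOOLS` and `run_tool_call` are hypothetical names, and returning failures as a JSON error object in output is one possible convention, not one this page specifies.

```python
import json


def get_weather(city):
    """Stub tool for illustration; a real client would call a weather service."""
    return {"city": city, "forecast": "unknown"}


TOOLS = {"get_weather": get_weather}  # hypothetical local tool registry


def run_tool_call(event, tools):
    """Execute the named tool and build the conversation.item.create reply."""
    try:
        args = json.loads(event["arguments"] or "{}")
        output = json.dumps(tools[event["name"]](**args))
    except Exception as exc:  # unknown tool, bad JSON, or wrong arguments
        output = json.dumps({"error": str(exc)})
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],  # must echo the server's call_id
            "output": output,
        },
    }


event = {
    "type": "response.function_call_arguments.done",
    "call_id": "call_xyz456",
    "name": "get_weather",
    "arguments": "{\"city\": \"Paris\"}",
}
print(json.dumps(run_tool_call(event, TOOLS)))
```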

response.done

Response lifecycle complete. Sent after response.output_audio.done (for audio responses) or response.output_text.done (for text-only responses). Always emitted, even for cancelled or failed responses.

| Field | Type | Description |
| --- | --- | --- |
| type | string | "response.done" |
| event_id | string | Unique event identifier. |
| response.id | string | ID of the completed response. |
| response.status | string | "completed", "cancelled", or "failed". |
| response.status_details.type | string | Mirrors response.status. |
| response.status_details.reason | string or null | "turn_detected" (barge-in), "client_cancelled" (response.cancel), or null for "completed". |
| response.usage | object | Final input_tokens, output_tokens, and total_tokens for this response. |

{
  "type": "response.done",
  "event_id": "event_011",
  "response": {
    "id": "resp_abc123",
    "object": "realtime.response",
    "status": "completed",
    "status_details": { "type": "completed", "reason": null },
    "usage": {
      "input_tokens": 42,
      "output_tokens": 18,
      "total_tokens": 60
    }
  }
}
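
The status and reason fields together tell the client how a response ended. The helper below is a hypothetical sketch that collapses them into a single label. What a client does with each label, such as discarding queued playback audio on a barge-in, is left to the application.

```python
def classify_done(event):
    """Map a response.done event to one label describing how it ended."""
    response = event["response"]
    status = response["status"]
    details = response.get("status_details") or {}
    reason = details.get("reason")
    if status == "cancelled" and reason == "turn_detected":
        return "barge_in"
    if status == "cancelled" and reason == "client_cancelled":
        return "client_cancelled"
    return status  # "completed", "failed", or a cancel with another reason


done = {
    "type": "response.done",
    "response": {
        "id": "resp_abc123",
        "status": "cancelled",
        "status_details": {"type": "cancelled", "reason": "turn_detected"},
    },
}
print(classify_done(done))  # barge_in
```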

Tool result flow

The sequence below illustrates the full round-trip for a single tool call, from the server’s response.function_call_arguments.done to a follow-up spoken response.
1. Server emits tool call

   The server sends response.function_call_arguments.done with call_id, name, and arguments. The response may or may not include audio alongside the tool call, depending on whether the LLM generated a lead-in phrase.

2. Server closes the response

   The server sends response.output_audio.done (if audio was generated), then response.done with status="completed".

3. Client executes the tool

   The client calls the function locally and serialises the result as a JSON string.

4. Client submits the result

   The client sends conversation.item.create with type: "function_call_output", call_id matching the one from step 1, and output containing the result string.

5. Server acknowledges

   The server appends the tool output to the conversation context and sends conversation.item.created. Generation is not triggered yet.

6. Client triggers follow-up (optional)

   If the result should be spoken aloud (e.g. a search result), the client sends response.create. The LLM generates a reply that incorporates the tool output, and the TTS speaks it.

   For fire-and-forget actions (e.g. a robot movement command), the client can stop after conversation.item.created, because the assistant has already spoken any lead-in phrase before the tool call.
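
The timing of this round trip can be sketched as a small event handler that remembers tool calls while the response is open and only replies once response.done arrives. `ToolRoundTrip` and its `execute` callback are hypothetical. The sketch also sends response.create right after the tool output without waiting for conversation.item.created, and runs tools regardless of the response status, both of which are simplifying assumptions.

```python
class ToolRoundTrip:
    """Collect tool calls during a response and answer them once it closes."""

    def __init__(self, execute, speak_result=True):
        self.execute = execute            # callable(name, arguments_json) -> str
        self.speak_result = speak_result  # False for fire-and-forget actions
        self.pending = []                 # tool-call events of the open response

    def on_server_event(self, event):
        """Return the client events to send in reaction to `event`."""
        kind = event["type"]
        if kind == "response.function_call_arguments.done":
            self.pending.append(event)    # step 1: remember the call
            return []
        if kind == "response.done" and self.pending:
            outbound = []
            for call in self.pending:     # steps 3 and 4: execute and submit
                outbound.append({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": call["call_id"],
                        "output": self.execute(call["name"], call["arguments"]),
                    },
                })
            self.pending = []
            if self.speak_result:         # step 6: optional spoken follow-up
                outbound.append({"type": "response.create"})
            return outbound
        return []


flow = ToolRoundTrip(lambda name, arguments: '{"temp_c": 21}')
flow.on_server_event({
    "type": "response.function_call_arguments.done",
    "call_id": "call_xyz456", "name": "get_weather", "arguments": "{}",
})
sent = flow.on_server_event({"type": "response.done", "response": {"status": "completed"}})
print([e["type"] for e in sent])
# ['conversation.item.create', 'response.create']
```

Waiting for response.done before replying avoids the conversation_already_has_active_response error that a premature response.create would trigger.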

Event ordering summary

The table below shows the typical sequence of events for a complete VAD-triggered voice turn with one tool call.

| # | Direction | Event |
| --- | --- | --- |
| 1 | Server → Client | input_audio_buffer.speech_started |
| 2 | Server → Client | input_audio_buffer.speech_stopped |
| 3 | Server → Client | conversation.item.input_audio_transcription.delta (repeated; only with --enable_live_transcription) |
| 4 | Server → Client | conversation.item.input_audio_transcription.completed |
| 5 | Server → Client | response.created |
| 6 | Server → Client | response.output_audio.delta (repeated) |
| 7 | Server → Client | response.function_call_arguments.done |
| 8 | Server → Client | response.output_audio.done |
| 9 | Server → Client | response.output_audio_transcript.done |
| 10 | Server → Client | response.done |
| 11 | Client → Server | conversation.item.create (function_call_output) |
| 12 | Server → Client | conversation.item.created |
| 13 | Client → Server | response.create (optional follow-up) |
| 14 | Server → Client | response.created, audio deltas, response.done |

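
When testing a client against recorded traffic, it can help to check that response events are bracketed the way this typical sequence suggests. The checker below is a hypothetical sketch that only encodes the ordering shown in the table. Sequences the table does not cover, such as a response that fails before response.created is sent, may legitimately be flagged.

```python
CLIENT_EVENTS = ("response.create", "response.cancel")


def check_response_bracketing(event_types):
    """Return a list of problems; an empty list means the bracketing holds."""
    problems = []
    open_response = False
    for index, kind in enumerate(event_types, start=1):
        if kind == "response.created":
            if open_response:
                problems.append(f"{index}: response.created while a response is open")
            open_response = True
        elif kind == "response.done":
            if not open_response:
                problems.append(f"{index}: response.done without response.created")
            open_response = False
        elif kind.startswith("response.") and kind not in CLIENT_EVENTS:
            if not open_response:
                problems.append(f"{index}: {kind} outside a response")
    if open_response:
        problems.append("stream ended with an open response")
    return problems


typical = [
    "input_audio_buffer.speech_started",
    "input_audio_buffer.speech_stopped",
    "conversation.item.input_audio_transcription.completed",
    "response.created",
    "response.output_audio.delta",
    "response.function_call_arguments.done",
    "response.output_audio.done",
    "response.output_audio_transcript.done",
    "response.done",
    "conversation.item.created",
]
print(check_response_bracketing(typical))  # []
```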
