The speech-to-speech server implements the OpenAI Realtime protocol at /v1/realtime, so any client built against that API (including the official openai Python SDK and compatible web clients) connects without modification. The server listens on 0.0.0.0:8765 by default and handles everything from VAD and STT through LLM generation and TTS entirely on-device, with no outbound API calls unless you choose a remote LLM backend.
The default port is 8765. Pass --port to override it. Any OpenAI Realtime-compatible client can connect: the Python SDK, a browser WebSocket, or third-party SDKs that target the same wire protocol all work without changes.
Connecting with the OpenAI Python client
Because the server speaks the OpenAI Realtime wire protocol, you can point the official Python client at it by overriding base_url.
The api_key value is ignored by the local server but must be a non-empty string to satisfy the SDK’s validation. The model parameter in connect() is recorded and logged but has no effect on which models are actually used; those are controlled by the CLI flags you pass when starting the server.
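The connection described above can be sketched as follows. This is a minimal sketch, not the project's official snippet: it assumes a recent openai SDK release that exposes client.realtime.connect (older releases expose the same call under client.beta.realtime), the helper name client_kwargs and the model string "local" are hypothetical, and websocket_base_url is set explicitly on the assumption that the server is reached over plain ws://. Drop it if base_url alone works in your setup.

```python
import asyncio


def client_kwargs(host: str = "localhost", port: int = 8765) -> dict:
    """Build constructor arguments for AsyncOpenAI aimed at the local server."""
    return {
        # The SDK appends /realtime to this path, giving /v1/realtime.
        "base_url": f"http://{host}:{port}/v1",
        # Assumption: the local server is reached without TLS, so use ws://.
        "websocket_base_url": f"ws://{host}:{port}/v1",
        # Ignored by the server, but the SDK rejects an empty key.
        "api_key": "not-needed",
    }


async def main() -> None:
    # Imported lazily so the helper above works without the SDK installed.
    from openai import AsyncOpenAI

    client = AsyncOpenAI(**client_kwargs())
    # The model name is logged by the server but does not select a model.
    async with client.realtime.connect(model="local") as connection:
        async for event in connection:
            print(event.type)  # show the first server event, then stop
            break


# To try it against a running server:
# asyncio.run(main())
```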
Starting the server
- Local LLM (Transformers)
- Local LLM (MLX-LM)
- Remote LLM (OpenAI-compatible API)
Session configuration (session.update)
Send a session.update event at any point during a connection to change the session configuration. The server performs a deep-merge, so only the fields you include are overwritten — unset fields retain their current values.
| Field | Type | Description |
|---|---|---|
| instructions | string | System prompt injected at the start of every LLM context. |
| tools | array | JSON Schema tool definitions. See Tool calling. |
| voice | string | TTS voice identifier (passed to the active TTS backend). |
| turn_detection.type | string | Must be "server_vad". |
| turn_detection.interrupt_response | boolean | When true (default), new user speech cancels an in-progress response. Set to false to disable barge-in. |
| audio | object | Input and output audio format configuration (audio.input.format and audio.output.format). The server always resamples internally to 16 kHz. |
Session settings are held in a RuntimeConfig Pydantic model. VAD reads the turn_detection thresholds, the LLM reads instructions and tools, and TTS reads voice, all at processing time, so a session.update takes effect from the next utterance onward.
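The deep-merge behaviour of session.update can be illustrated with a small standalone sketch. The deep_merge helper and the starting session values are hypothetical, and lists such as tools are replaced wholesale here, which may differ from what the server actually does.

```python
import copy
import json


def deep_merge(current: dict, update: dict) -> dict:
    """Return a copy of `current` with `update` merged in recursively."""
    merged = copy.deepcopy(current)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists overwrite the previous value.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical session state before the update.
session = {
    "instructions": "You are a helpful assistant.",
    "voice": "default",
    "turn_detection": {"type": "server_vad", "interrupt_response": True},
}

# Client event that only disables barge-in.
event = {
    "type": "session.update",
    "session": {"turn_detection": {"interrupt_response": False}},
}

session = deep_merge(session, event["session"])
print(json.dumps(session, indent=2))
```

Only interrupt_response changes; instructions, voice, and turn_detection.type keep their earlier values.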
Pipeline architecture
Each connection is served by one isolated PipelineUnit drawn from a fixed pool. The unit owns its own queues and handler chain, so concurrent sessions never share state.
1. Inbound audio: input_audio_buffer.append carries base64 PCM. RealtimeService decodes it, resamples to 16 kHz, and splits it into 512-sample chunks that are placed on the VAD’s input_queue.
2. Speech detection: VAD emits speech_started / speech_stopped events on text_output_queue. A full utterance’s audio is forwarded to STT.
3. Transcription: STT output passes through TranscriptionNotifier, which emits transcription.delta and transcription.completed events before handing the transcript to the LLM.
4. LLM generation: The LLM generates text and optional tool calls. LMOutputProcessor forwards clean text to TTS and puts assistant_text + tool call dicts on text_output_queue for the router.
Concurrent sessions and the pipeline pool
Start the server with --num_pipelines N to allow up to N simultaneous connections. Each pipeline unit in the pool is completely isolated: separate queues, separate handler instances, separate RealtimeService.
When all units are busy, new connections receive a session_limit_reached error event and are immediately closed with WebSocket code 1008.
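The admission rule above can be sketched with a fixed-size pool. This is an illustration under stated assumptions: PipelinePool and admit are hypothetical names, integer ids stand in for real pipeline units, the error event shape follows the general OpenAI Realtime error format rather than anything confirmed for this server, and a released slot is reusable immediately.

```python
import threading
from typing import Optional

WS_POLICY_VIOLATION = 1008  # WebSocket close code named in the text


class PipelinePool:
    """Hypothetical fixed-size pool; unit ids stand in for PipelineUnit objects."""

    def __init__(self, num_pipelines: int) -> None:
        self._lock = threading.Lock()
        self._free = list(range(num_pipelines))

    def acquire(self) -> Optional[int]:
        """Return a free unit id, or None when every unit is busy."""
        with self._lock:
            return self._free.pop(0) if self._free else None

    def release(self, unit_id: int) -> None:
        """Hand a unit back to the pool."""
        with self._lock:
            self._free.append(unit_id)


def admit(pool: PipelinePool):
    """Return (unit_id, error_event, close_code) for a new connection."""
    unit_id = pool.acquire()
    if unit_id is None:
        error_event = {
            "type": "error",
            "error": {"code": "session_limit_reached"},  # assumed event shape
        }
        return None, error_event, WS_POLICY_VIOLATION
    return unit_id, None, None


pool = PipelinePool(num_pipelines=2)
first = admit(pool)
second = admit(pool)
third = admit(pool)       # pool exhausted: rejected
pool.release(first[0])
fourth = admit(pool)      # a slot is free again
```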
You can inspect pool state at runtime via the HTTP endpoint:
After a session ends, its unit stays in the "draining" state (and is unavailable for new connections) until the SESSION_END sentinel propagates through the entire handler chain, ensuring no stale output from the previous session leaks to the next client. The draining_for_s field shows how long the unit has been draining, which is useful for spotting stuck handlers.
Interruption and barge-in
When turn_detection.interrupt_response is true (the default), user speech detected while a response is playing cancels generation immediately. The mechanism is coordinated by a CancelScope object shared between the async router and the pipeline threads.
How CancelScope works
CancelScope manages two state variables:
| Variable | Type | Purpose |
|---|---|---|
| generation | int | Monotonically incrementing counter. Pipeline threads capture this at the start of each response and call is_stale(gen) on every streaming token to detect supersession. |
| discarding | bool | When True, the _send_loop silently drops stale audio and assistant-text events that arrive between cancel() and the __RESPONSE_DONE__ sentinel. |
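The two variables in the table can be modelled in a few lines. This is a hypothetical re-creation for illustration, not the server's class: the current() and response_done() helpers and the stream_tokens generator are assumptions, and only generation, discarding, is_stale(gen), and cancel() come from the description above.

```python
import threading


class CancelScope:
    """Illustrative stand-in for the generation counter and discard flag."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.generation = 0
        self.discarding = False

    def current(self) -> int:
        """Snapshot the generation at the start of a response (assumed helper)."""
        with self._lock:
            return self.generation

    def is_stale(self, gen: int) -> bool:
        """True once a newer generation has superseded `gen`."""
        with self._lock:
            return gen != self.generation

    def cancel(self) -> None:
        """Supersede the running response and start dropping its output."""
        with self._lock:
            self.generation += 1
            self.discarding = True

    def response_done(self) -> None:
        """Assumed hook: stop discarding once the done sentinel has been seen."""
        with self._lock:
            self.discarding = False


def stream_tokens(scope: CancelScope, tokens):
    """Yield tokens until the scope reports that this response is stale."""
    gen = scope.current()
    for token in tokens:
        if scope.is_stale(gen):
            return
        yield token


scope = CancelScope()
spoken = []
for token in stream_tokens(scope, ["Hello", ",", " world", "!"]):
    spoken.append(token)
    if token == ",":
        scope.cancel()  # simulated barge-in after the second token
```

Because the check runs once per token, the producer stops at the next token boundary without any timer.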
Barge-in sequence
1. VAD detects speech: VAD puts a SpeechStartedEvent on text_output_queue. The _send_loop processes text events before audio (priority), so the cancellation path runs before any buffered audio chunks are sent.
2. Response cancelled: If a response is active and interrupt_response is enabled, RealtimeService emits response.output_audio.done + response.done with status="cancelled" and reason="turn_detected".
3. Cancel and flush: cancel_scope.cancel() increments the generation counter and sets discarding=True. Both output_queue (audio) and text_output_queue are flushed; __RESPONSE_DONE__ sentinels are preserved.
4. LLM and TTS abort: Both handlers call cancel_scope.is_stale(gen) on every streaming token. Once stale, they stop generating immediately, with no timer or polling required.
A client-sent response.cancel calls cancel_scope.cancel() (only if a response is active), triggers finish_response(status="cancelled", reason="client_cancelled"), and re-enables should_listen.
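Two details of the sequence above, text-before-audio priority and a flush that keeps the done sentinels, can be sketched with standard queues. The function names are hypothetical and the sentinel is modelled as a plain string. The drain-then-requeue flush is not atomic, so a real implementation would need to coordinate with producers.

```python
import queue

RESPONSE_DONE = "__RESPONSE_DONE__"


def flush_preserving_sentinels(q: queue.Queue) -> int:
    """Drop every queued item except done sentinels; return the drop count."""
    kept, dropped = [], 0
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            break
        if isinstance(item, str) and item == RESPONSE_DONE:
            kept.append(item)
        else:
            dropped += 1
    for item in kept:
        q.put_nowait(item)  # sentinels go back so the response can still close
    return dropped


def next_outgoing(text_q: queue.Queue, audio_q: queue.Queue):
    """Pick the next item to send, checking the text queue before audio."""
    for q in (text_q, audio_q):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None


audio_q, text_q = queue.Queue(), queue.Queue()
for chunk in (b"a1", b"a2", RESPONSE_DONE, b"a3"):
    audio_q.put(chunk)
text_q.put("speech_started")

first = next_outgoing(text_q, audio_q)         # the text event wins over audio
dropped = flush_preserving_sentinels(audio_q)  # stale audio is discarded
remaining = list(audio_q.queue)
```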
Live transcription
When --enable_live_transcription is passed at startup, TranscriptionNotifier taps STT output and emits two event types:
| Event | Description |
|---|---|
| conversation.item.input_audio_transcription.delta | Streaming partial transcript text as tokens arrive from STT. |
| conversation.item.input_audio_transcription.completed | Final transcript for the user turn, including audio duration in the usage field. |
The transcript field in the completed event is what gets injected into the LLM context.
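On the client side, the two events can be folded into a running caption. The sketch below assumes the payload fields item_id, delta, and transcript from the OpenAI Realtime event schema, and assumes that each delta carries only newly recognised text rather than the cumulative transcript. TranscriptCollector is a hypothetical name.

```python
DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


class TranscriptCollector:
    """Accumulate partial transcripts and keep the final text per user turn."""

    def __init__(self) -> None:
        self.partial = {}  # item_id -> text received so far
        self.final = {}    # item_id -> completed transcript

    def handle(self, event: dict) -> None:
        kind = event.get("type", "")
        item_id = event.get("item_id", "")
        if kind == DELTA:
            # Assumed incremental: append the new fragment to what we have.
            self.partial[item_id] = self.partial.get(item_id, "") + event.get("delta", "")
        elif kind == COMPLETED:
            # The completed transcript supersedes any partial text.
            self.final[item_id] = event.get("transcript", "")
            self.partial.pop(item_id, None)


collector = TranscriptCollector()
for event in (
    {"type": DELTA, "item_id": "item_1", "delta": "turn on"},
    {"type": DELTA, "item_id": "item_1", "delta": " the lights"},
    {"type": COMPLETED, "item_id": "item_1", "transcript": "Turn on the lights."},
):
    collector.handle(event)
```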
Tool calling
Tools are defined in session.update as standard JSON Schema objects and are supported for both local and remote LLM backends.
Local LLM path (transformers / mlx-lm)
Tools are converted to FunctionTool objects. Each tool’s JSON Schema parameters are translated into a Python inspect.Signature, and to_code_prompt() renders a human-readable function signature.
The model is prompted to emit tool calls between <code> delimiters. The server extracts the <code> blocks, parses name(kwargs) calls, and validates them against registered tools. Valid calls become ResponseFunctionToolCall dicts with generated call_id values.
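The local path can be approximated with the standard library alone. This sketch is not the project's FunctionTool implementation: schema_to_signature and parse_tool_calls are hypothetical names, every parameter is treated as keyword-only, only one call per <code> block is handled, and the call_id format is invented for illustration.

```python
import ast
import inspect
import re
import uuid

JSON_TO_PY = {"string": str, "number": float, "integer": int, "boolean": bool}
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def schema_to_signature(parameters: dict) -> inspect.Signature:
    """Translate a JSON Schema object into keyword-only parameters."""
    required = set(parameters.get("required", []))
    params = []
    for name, spec in parameters.get("properties", {}).items():
        params.append(
            inspect.Parameter(
                name,
                inspect.Parameter.KEYWORD_ONLY,
                default=inspect.Parameter.empty if name in required else None,
                annotation=JSON_TO_PY.get(spec.get("type"), object),
            )
        )
    return inspect.Signature(params)


def parse_tool_calls(text: str, signatures: dict) -> list:
    """Extract name(kwargs) calls from <code> blocks and validate them."""
    calls = []
    for block in CODE_BLOCK.findall(text):
        try:
            node = ast.parse(block.strip(), mode="eval").body
        except SyntaxError:
            continue
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        sig = signatures.get(node.func.id)
        if sig is None or node.args:
            continue  # unknown tool, or positional arguments were used
        try:
            # literal_eval only accepts literals, so no model code is executed.
            kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            sig.bind(**kwargs)  # TypeError on missing or unexpected names
        except (ValueError, TypeError):
            continue
        calls.append(
            {
                "call_id": f"call_{uuid.uuid4().hex}",  # invented id format
                "name": node.func.id,
                "arguments": kwargs,
            }
        )
    return calls


weather_schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
    "required": ["city"],
}
signatures = {"get_weather": schema_to_signature(weather_schema)}
reply = 'Let me check. <code>get_weather(city="Paris")</code> <code>launch(x=1)</code>'
calls = parse_tool_calls(reply, signatures)
```

The unregistered launch call is dropped, and get_weather passes because its only required argument, city, is present.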
Remote LLM path (responses-api)
Tools are passed natively as the tools= parameter to client.responses.create. The API returns structured function_call items directly, so no prompt engineering or regex parsing is required. Per-response tool_choice overrides from response.create are supported.
Tool result flow
1. Server emits tool call: The router sends response.function_call_arguments.done with call_id, name, and arguments to the client.
2. Client executes the tool: The client runs the tool and sends conversation.item.create with type: "function_call_output" and output: "<result>".
3. Server acknowledges: RealtimeService appends the tool output to the chat context and emits conversation.item.created. This does not trigger generation.
4. Optional follow-up generation: If the result should be spoken (e.g. a search result or sensor reading), the client sends response.create to trigger follow-up LLM generation. For fire-and-forget actions (e.g. robot movement), the client can stop after conversation.item.created; the assistant should already have spoken the lead-in phrase before the tool call.
Error codes
| Code | When emitted |
|---|---|
| session_limit_reached | All pipeline slots are occupied; new connection rejected. |
| unknown_or_invalid_event | The server received a client event with an unrecognised or missing type field. |
| invalid_session_type | A session.update targeted a transcription session (RealtimeTranscriptionSessionCreateRequest), which is not supported; only realtime sessions are accepted. |
| conversation_already_has_active_response | response.create was sent while another response is still in progress. |
| response_failed | Generation failed (e.g. invalid out-of-band input, empty context rejected by provider). |
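Tying the tool result flow to the error codes above, a client-side dispatcher could be sketched as follows. The function name, the tools registry, and the get_temperature tool are hypothetical, and the nested error object follows the OpenAI Realtime event schema, which this server is assumed to mirror. For brevity the sketch queues response.create immediately instead of waiting for conversation.item.created.

```python
import json


def handle_server_event(event: dict, tools: dict, speak_result: bool = True) -> list:
    """Return the client events to send in reply to one server event."""
    kind = event.get("type")
    if kind == "response.function_call_arguments.done":
        func = tools.get(event["name"])
        if func is None:
            return []  # tool not registered on this client
        result = func(**json.loads(event.get("arguments") or "{}"))
        outgoing = [
            {
                "type": "conversation.item.create",
                "item": {
                    "type": "function_call_output",
                    "call_id": event["call_id"],
                    "output": json.dumps(result),
                },
            }
        ]
        if speak_result:
            # Only needed when the assistant should talk about the result.
            outgoing.append({"type": "response.create"})
        return outgoing
    if kind == "error":
        err = event.get("error", {})
        print(f"server error {err.get('code')}: {err.get('message')}")
    return []


tools = {"get_temperature": lambda room: {"room": room, "celsius": 21}}
tool_event = {
    "type": "response.function_call_arguments.done",
    "call_id": "call_1",
    "name": "get_temperature",
    "arguments": '{"room": "lab"}',
}
replies = handle_server_event(tool_event, tools)
```

Pass speak_result=False for fire-and-forget tools so that no follow-up generation is requested.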