Realtime mode is the default operating mode for Speech to Speech. It starts a FastAPI/uvicorn server that exposes a WebSocket endpoint at `/v1/realtime`, fully compatible with the OpenAI Realtime API protocol. Any client that speaks the OpenAI Realtime protocol (the official Python SDK, a custom client, or a voice UI library) can connect and begin streaming audio immediately. The pipeline handles VAD, STT, LLM generation, and TTS in parallel threads, streaming audio back as base64-encoded PCM delta events.
Starting the Server
Running `speech-to-speech` with no arguments launches realtime mode using Parakeet TDT for STT, the OpenAI Responses API for the LLM, and Qwen3-TTS for speech output.
The server binds to `0.0.0.0:8765` by default. The WebSocket endpoint is `ws://<host>:8765/v1/realtime`.
Server Configuration Flags
| Flag | Default | Description |
|---|---|---|
| `--ws_host` | `0.0.0.0` | Host IP address the WebSocket server binds to |
| `--ws_port` | `8765` | Port the WebSocket server listens on |
| `--num_pipelines` | `1` | Size of the isolated pipeline pool (max concurrent sessions) |
| `--enable_live_transcription` | `true` | Stream partial user transcripts as `transcription.delta` events |
Binding to a custom host and port
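A minimal Python sketch of launching the server on a custom address, using only the `--ws_host` and `--ws_port` flags from the table above. The helper names `build_server_command` and `realtime_url` are illustrative and not part of the project, and the sketch assumes the `speech-to-speech` executable is on `PATH`.

```python
import shlex


def build_server_command(host: str, port: int) -> list[str]:
    """Build the argv list for starting the server on a custom host and port."""
    return ["speech-to-speech", "--ws_host", host, "--ws_port", str(port)]


def realtime_url(host: str, port: int) -> str:
    """WebSocket URL that clients would connect to for that host and port."""
    return f"ws://{host}:{port}/v1/realtime"


cmd = build_server_command("127.0.0.1", 9000)
print(shlex.join(cmd))
print(realtime_url("127.0.0.1", 9000))

# To actually start the server (blocks until it exits):
#   import subprocess; subprocess.run(cmd, check=True)
```

Binding to `127.0.0.1` instead of the default `0.0.0.0` keeps the endpoint reachable from the local machine only.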
Concurrent session pool
By default, only one WebSocket session is active at a time. Use `--num_pipelines` to create a pool of isolated VAD/STT/LLM/TTS handler chains so multiple clients can connect simultaneously. Connections beyond the pool size are rejected with a `session_limit_reached` error.
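The admission rule described above can be sketched as a fixed-size slot pool. This is an illustration of the idea, not the server's implementation: `PipelinePool`, `on_connect`, and the `slot` bookkeeping key are hypothetical names, and the error envelope assumes the OpenAI Realtime `error` event shape.

```python
import threading


class PipelinePool:
    """Fixed-size pool of pipeline slots (illustrative, not the server's class)."""

    def __init__(self, num_pipelines: int = 1) -> None:
        self._free = list(range(num_pipelines))  # slot ids not currently in use
        self._lock = threading.Lock()

    def acquire(self):
        """Return a free slot id, or None when every pipeline is busy."""
        with self._lock:
            return self._free.pop() if self._free else None

    def release(self, slot: int) -> None:
        """Return a slot to the pool when its session disconnects."""
        with self._lock:
            self._free.append(slot)


def on_connect(pool: PipelinePool) -> dict:
    """Admit a connection if a pipeline is free, otherwise build an error event."""
    slot = pool.acquire()
    if slot is None:
        # Error code taken from the docs; the envelope shape is an assumption.
        return {"type": "error", "error": {"code": "session_limit_reached"}}
    # "slot" is hypothetical bookkeeping so the caller can release it later.
    return {"type": "session.created", "slot": slot}


pool = PipelinePool(num_pipelines=2)
first, second, third = on_connect(pool), on_connect(pool), on_connect(pool)
print(first["type"], second["type"], third["error"]["code"])
```

With `num_pipelines=2`, the third simultaneous connection is refused until one of the first two sessions releases its slot.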
Connecting with the OpenAI Python Client
Any client implementing the OpenAI Realtime protocol can connect. The official `openai` Python SDK works out of the box: point `base_url` at your server's HTTP address.
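A hedged sketch of what such a connection might look like with a recent `openai` SDK release that provides `client.realtime.connect` (installed with the `realtime` extra). Several details are assumptions rather than facts from this page: the `api_key` is a dummy value, the `model` string is a placeholder, and `websocket_base_url` is set explicitly so the SDK uses plain `ws://` for a local server.

```python
import asyncio

HOST, PORT = "localhost", 8765
BASE_URL = f"http://{HOST}:{PORT}/v1"   # the server's HTTP address
WS_BASE_URL = f"ws://{HOST}:{PORT}/v1"  # assumption: pin an unencrypted WebSocket scheme


async def main() -> None:
    # Imported here so this module loads even if the SDK is not installed.
    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        base_url=BASE_URL,
        websocket_base_url=WS_BASE_URL,
        api_key="not-needed",  # dummy value; assumed not to be checked by a local server
    )

    # The model value is a placeholder (assumption).
    async with client.realtime.connect(model="gpt-realtime") as connection:
        async for event in connection:
            print(event.type)
            if event.type == "session.created":
                break  # the server is up and the session is configured


# Run against a live server with: asyncio.run(main())
```

The loop stops at `session.created`, which the server sends on connection, so the sketch only verifies that the handshake works.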
`scripts/listen_and_play_realtime.py` provides a ready-to-run microphone/speaker client. It accepts the following flags:
| Flag | Default | Description |
|---|---|---|
| `--voice` | (none) | TTS voice (e.g. `bm_fable` for Kokoro, `marin` for OpenAI) |
| `--send-rate` | `16000` | Microphone sample rate in Hz |
| `--recv-rate` | `16000` | Speaker sample rate in Hz |
| `--chunk-size` | `1024` | Audio callback block size in samples |
| `--print-json` | `false` | Print raw event payloads for debugging |
| `--block-mic-during-playback` | `false` | Pause mic capture while audio is playing |
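To make the defaults in the table above concrete, the following arithmetic shows how much audio one microphone block represents. It assumes 16-bit mono PCM (2 bytes per sample), which this page does not state explicitly.

```python
import base64

SEND_RATE_HZ = 16000  # --send-rate default
CHUNK_SAMPLES = 1024  # --chunk-size default
BYTES_PER_SAMPLE = 2  # assumption: 16-bit mono PCM

chunk_ms = 1000 * CHUNK_SAMPLES / SEND_RATE_HZ
chunk_bytes = CHUNK_SAMPLES * BYTES_PER_SAMPLE
b64_chars = len(base64.b64encode(bytes(chunk_bytes)))
chunks_per_second = SEND_RATE_HZ / CHUNK_SAMPLES

print(f"{chunk_ms:.0f} ms of audio per chunk")
print(f"{chunk_bytes} raw bytes, {b64_chars} base64 characters per chunk")
print(f"{chunks_per_second:.3f} chunks per second")
```

Under these assumptions each block carries 64 ms of audio, so a smaller `--chunk-size` lowers capture latency at the cost of more messages per second.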
Session Configuration via session.update
After connecting, send a `session.update` event to configure behaviour for the session. Settings deep-merge into the running `RuntimeConfig` and take effect on the next turn.
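The deep-merge behaviour can be illustrated with a small sketch. `deep_merge` is a hypothetical helper, not the server's code, and the field names inside `session` (`voice`, `turn_detection`, `threshold`) are illustrative; only the `session.update` event type comes from this page.

```python
import json
from copy import deepcopy


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a copy of base with patch merged in; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical running configuration (stand-in for RuntimeConfig).
running = {
    "instructions": "You are a helpful assistant.",
    "voice": "marin",
    "turn_detection": {"type": "server_vad", "threshold": 0.5},
}

# The event a client would send as JSON text over the WebSocket.
event = {
    "type": "session.update",
    "session": {
        "voice": "bm_fable",
        "turn_detection": {"threshold": 0.7},
    },
}

updated = deep_merge(running, event["session"])
print(json.dumps(event))
print(updated["turn_detection"])
```

Because the merge is deep, sending only `threshold` leaves the sibling `type` key untouched, and fields absent from the update (here `instructions`) keep their previous values.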
Live Transcription
When `--enable_live_transcription` is set (the default), the server emits streaming partial transcripts while the user is speaking:
- `conversation.item.input_audio_transcription.delta`: partial hypothesis, updated every ~500 ms
- `conversation.item.input_audio_transcription.completed`: final transcript with duration usage
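A client-side handler for the two events above might look like the following sketch. `TranscriptTracker` is a hypothetical name, and the payload fields are assumptions borrowed from the OpenAI Realtime protocol: `delta` on partial events and `transcript` on the final one. It also assumes each delta replaces the previous hypothesis; if the server sends incremental text, the deltas would need to be concatenated.

```python
class TranscriptTracker:
    """Tracks the live partial transcript and the finalized user turns."""

    DELTA = "conversation.item.input_audio_transcription.delta"
    COMPLETED = "conversation.item.input_audio_transcription.completed"

    def __init__(self) -> None:
        self.partial = ""
        self.final: list[str] = []

    def handle(self, event: dict) -> None:
        if event["type"] == self.DELTA:
            # Assumption: "delta" holds the current hypothesis for the turn.
            self.partial = event["delta"]
        elif event["type"] == self.COMPLETED:
            # Assumption: "transcript" holds the final text for the turn.
            self.final.append(event["transcript"])
            self.partial = ""


tracker = TranscriptTracker()
for event in [
    {"type": TranscriptTracker.DELTA, "delta": "turn on"},
    {"type": TranscriptTracker.DELTA, "delta": "turn on the lights"},
    {"type": TranscriptTracker.COMPLETED, "transcript": "Turn on the lights."},
]:
    tracker.handle(event)

print(tracker.final)
```

A UI would typically render `partial` in a lighter style while the user speaks and swap in the final text once the completed event arrives.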
Barge-In and Interruption Handling
Interruption (barge-in) is handled by a shared `CancelScope` object. When VAD detects the user speaking during assistant playback:
1. Send loop cancels the active response. The `_send_loop` calls `cancel_scope.cancel()`, which increments the generation counter and sets a discard flag. The client receives `response.done` with `status="cancelled"` and `reason="turn_detected"`.
2. LLM and TTS abort. Each handler captures the generation number at the start of the response. On every streaming token it calls `cancel_scope.is_stale(gen)` and aborts immediately once the generation has been superseded.
3. Discard guard clears. Stale audio or text arriving between `cancel()` and `__RESPONSE_DONE__` is silently dropped. The discard guard clears when `__RESPONSE_DONE__` arrives.
A client can also cancel the in-progress response explicitly by sending a `response.cancel` event.
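The generation-counter pattern behind this flow can be sketched as follows. Only `cancel()` and `is_stale(gen)` are named on this page; `current()`, `should_discard()`, `response_done()`, and `stream_tokens` are hypothetical additions for illustration, and the project's real `CancelScope` may differ.

```python
import threading


class CancelScope:
    """Generation counter plus discard flag, shared by a session's handlers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0
        self._discarding = False

    def current(self) -> int:
        """Generation to capture at the start of a response (hypothetical)."""
        with self._lock:
            return self._generation

    def cancel(self) -> None:
        """Supersede the active response and start dropping stale output."""
        with self._lock:
            self._generation += 1
            self._discarding = True

    def is_stale(self, gen: int) -> bool:
        with self._lock:
            return gen != self._generation

    def should_discard(self) -> bool:
        with self._lock:
            return self._discarding

    def response_done(self) -> None:
        """Clear the discard guard once the cancelled response has finished."""
        with self._lock:
            self._discarding = False


def stream_tokens(scope: CancelScope, tokens, interrupt_at=None):
    """Emit tokens until the captured generation has been superseded."""
    gen = scope.current()
    emitted = []
    for i, token in enumerate(tokens):
        if i == interrupt_at:
            scope.cancel()  # simulates a barge-in detected by the send loop
        if scope.is_stale(gen):
            break
        emitted.append(token)
    return emitted


scope = CancelScope()
interrupted = stream_tokens(scope, ["Hello", "there", "friend"], interrupt_at=1)
print(interrupted)
scope.response_done()
print(stream_tokens(scope, ["Sure", "thing"]))
```

Comparing generation numbers, rather than sharing a single boolean, lets a new response start immediately after a cancel without being mistaken for the one that was interrupted.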
LLM Backend Examples
- OpenAI (Responses API)
- vLLM (local)
- MLX-LM (Apple Silicon)
- HF Inference Providers
- Transformers (local)
Supported OpenAI Realtime Events
Client → Server
| Event | Description |
|---|---|
| `input_audio_buffer.append` | Stream base64 PCM audio. Decoded, resampled to 16 kHz, and chunked for VAD. |
| `session.update` | Deep-merge session config (instructions, tools, voice, turn detection, audio format). |
| `conversation.item.create` | Inject `input_text` or `function_call_output` into the LLM context without triggering generation. |
| `response.create` | Trigger LLM generation. Supports per-response `instructions` and `tool_choice` overrides. |
| `response.cancel` | Cancel the in-progress response and re-enable listening. |
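As an illustration of the first row, the sketch below builds an `input_audio_buffer.append` message from raw samples. It assumes 16-bit little-endian mono PCM and the `audio` field name used by the OpenAI Realtime protocol; `append_event` is a hypothetical helper.

```python
import base64
import json
import struct


def append_event(pcm16: bytes) -> str:
    """Serialize one block of PCM bytes as an input_audio_buffer.append message."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16).decode("ascii"),  # assumed field name
        }
    )


samples = [0, 1000, -1000, 32767]
pcm = struct.pack(f"<{len(samples)}h", *samples)  # little-endian signed 16-bit
message = append_event(pcm)
print(message)
```

The resulting string is what a client would send as a text frame for each captured microphone block.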
Server → Client
| Event | Description |
|---|---|
| `session.created` | Sent on connection with current session config. |
| `error` | Protocol errors such as `session_limit_reached`, `unknown_or_invalid_event`, `invalid_session_type`, `conversation_already_has_active_response`. |
| `input_audio_buffer.speech_started` | VAD detected user speech. |
| `input_audio_buffer.speech_stopped` | End of user speech segment. |
| `conversation.item.created` | Acknowledges injected `input_text` from `conversation.item.create`. |
| `conversation.item.input_audio_transcription.delta` | Streaming partial transcript (when live transcription is enabled). |
| `conversation.item.input_audio_transcription.completed` | Final transcript for the user turn with duration usage. |
| `response.created` | Emitted on the first outbound audio chunk (response is `in_progress`). |
| `response.output_audio.delta` | Base64 PCM audio chunk from TTS. |
| `response.output_audio.done` | Audio stream complete for the current output item. |
| `response.output_audio_transcript.done` | Full assistant text transcript for the turn. |
| `response.function_call_arguments.done` | Tool call with `call_id`, `name`, and JSON `arguments`. |
| `response.done` | Response finished. Status is `completed`, or `cancelled` with reason `turn_detected` or `client_cancelled`. |
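A client loop over these server events could collect one response's audio as sketched below. The payload layout is an assumption based on the OpenAI Realtime protocol: audio bytes in a base64 `delta` field, and the outcome under `response.status` and `response.status_details.reason`. The server's exact nesting is not shown on this page, and `collect_response` is a hypothetical helper.

```python
import base64


def collect_response(events):
    """Accumulate audio for one response; return (pcm_bytes, status, reason)."""
    pcm = bytearray()
    for event in events:
        kind = event["type"]
        if kind == "response.output_audio.delta":
            pcm.extend(base64.b64decode(event["delta"]))  # assumed field name
        elif kind == "response.done":
            response = event.get("response", {})
            details = response.get("status_details") or {}
            return bytes(pcm), response.get("status"), details.get("reason")
    return bytes(pcm), None, None  # stream ended without response.done


chunk = base64.b64encode(b"\x01\x00\x02\x00").decode("ascii")
events = [
    {"type": "response.created"},
    {"type": "response.output_audio.delta", "delta": chunk},
    {"type": "response.output_audio.delta", "delta": chunk},
    {
        "type": "response.done",
        "response": {
            "status": "cancelled",
            "status_details": {"reason": "turn_detected"},
        },
    },
]

audio, status, reason = collect_response(events)
print(len(audio), status, reason)
```

A real player would write each decoded chunk to the speaker as it arrives and flush its buffer when the status is `cancelled`, so the interrupted reply stops promptly.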