Realtime mode is the default operating mode for Speech to Speech. It starts a FastAPI/uvicorn server that exposes a WebSocket endpoint at /v1/realtime, fully compatible with the OpenAI Realtime API protocol. Any client that speaks the OpenAI Realtime protocol — the official Python SDK, a custom client, or a voice UI library — can connect and begin streaming audio immediately. The pipeline handles VAD, STT, LLM generation, and TTS in parallel threads, streaming audio back as base64-encoded PCM delta events.

Starting the Server

Running speech-to-speech with no arguments launches realtime mode using Parakeet TDT for STT, the OpenAI Responses API for the LLM, and Qwen3-TTS for speech output:
speech-to-speech
This is equivalent to the explicit form:
speech-to-speech \
    --thresh 0.6 \
    --stt parakeet-tdt \
    --llm_backend responses-api \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden \
    --qwen3_tts_language auto \
    --qwen3_tts_backend ggml \
    --qwen3_tts_non_streaming_mode True \
    --qwen3_tts_mlx_quantization 6bit \
    --model_name gpt-5.4-mini \
    --chat_size 30 \
    --responses_api_stream \
    --enable_live_transcription \
    --mode realtime
The server binds to 0.0.0.0:8765 by default. The WebSocket endpoint is ws://<host>:8765/v1/realtime.

Server Configuration Flags

| Flag | Default | Description |
| --- | --- | --- |
| --ws_host | 0.0.0.0 | Host IP address the WebSocket server binds to |
| --ws_port | 8765 | Port the WebSocket server listens on |
| --num_pipelines | 1 | Size of the isolated pipeline pool (max concurrent sessions) |
| --enable_live_transcription | true | Stream partial user transcripts as transcription.delta events |

Binding to a custom host and port

speech-to-speech --mode realtime --ws_host 0.0.0.0 --ws_port 9000

Concurrent session pool

By default, only one WebSocket session is active at a time. Use --num_pipelines to create a pool of isolated VAD/STT/LLM/TTS handler chains so multiple clients can connect simultaneously. Connections beyond the pool size are rejected with a session_limit_reached error.
speech-to-speech --mode realtime --num_pipelines 4
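
The pool behaviour described above can be pictured with a small sketch. It is not the project's implementation: `PipelinePool`, `acquire`, `release`, and `on_connect` are hypothetical names, and integer slot ids stand in for real handler chains. Only the `session_limit_reached` error code comes from the text above.

```python
import threading
from typing import Optional


class PipelinePool:
    """Hypothetical fixed-size pool of pipeline slots (illustration only)."""

    def __init__(self, num_pipelines: int = 1) -> None:
        # Integer slot ids stand in for isolated VAD/STT/LLM/TTS handler chains.
        self._free = list(range(num_pipelines))
        self._lock = threading.Lock()

    def acquire(self) -> Optional[int]:
        """Return a free slot id, or None when every pipeline is busy."""
        with self._lock:
            return self._free.pop() if self._free else None

    def release(self, slot: int) -> None:
        """Hand a slot back to the pool when its session disconnects."""
        with self._lock:
            self._free.append(slot)


def on_connect(pool: PipelinePool):
    """Accept the connection if a slot is free, otherwise reject it."""
    slot = pool.acquire()
    if slot is None:
        return ("rejected", "session_limit_reached")
    return ("accepted", slot)
```

With `num_pipelines=2`, the third simultaneous call to `on_connect` is rejected until an earlier session releases its slot.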

Connecting with the OpenAI Python Client

Any client implementing the OpenAI Realtime protocol can connect. The official openai Python SDK works out of the box — point base_url at your server’s HTTP address:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")

with client.beta.realtime.connect(model="model_name") as conn:
    conn.session.update(
        session={
            "instructions": "You are a helpful assistant.",
            "turn_detection": {"type": "server_vad", "interrupt_response": True},
        }
    )

    for event in conn:
        print(event.type)
The companion script scripts/listen_and_play_realtime.py provides a ready-to-run microphone/speaker client:
python scripts/listen_and_play_realtime.py \
    --host 127.0.0.1 \
    --port 8765 \
    --model local \
    --instructions "You are a helpful assistant."
Additional flags for the script:
| Flag | Default | Description |
| --- | --- | --- |
| --voice | (none) | TTS voice (e.g. bm_fable for Kokoro, marin for OpenAI) |
| --send-rate | 16000 | Microphone sample rate in Hz |
| --recv-rate | 16000 | Speaker sample rate in Hz |
| --chunk-size | 1024 | Audio callback block size in samples |
| --print-json | false | Print raw event payloads for debugging |
| --block-mic-during-playback | false | Pause mic capture while audio is playing |

Session Configuration via session.update

After connecting, send a session.update event to configure behaviour for the session. Settings deep-merge into the running RuntimeConfig and take effect on the next turn:
conn.session.update(
    session={
        "instructions": "You are a concise assistant. Reply in one sentence.",
        "turn_detection": {
            "type": "server_vad",
            "interrupt_response": True,
        },
        "tools": [
            {
                "type": "function",
                "name": "get_weather",
                "description": "Return current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"}
                    },
                    "required": ["city"],
                },
            }
        ],
    }
)
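
To illustrate what "deep-merge" means for a partial update, here is a minimal sketch. The `deep_merge` helper is hypothetical and is not taken from the server code. It assumes nested dicts are merged key by key while scalars and lists (such as `tools`) are replaced wholesale; the server may treat lists differently.

```python
from copy import deepcopy


def deep_merge(base: dict, update: dict) -> dict:
    """Return a new dict with `update` merged into `base`, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists replace the previous value entirely.
            merged[key] = deepcopy(value)
    return merged


current = {
    "instructions": "You are a helpful assistant.",
    "turn_detection": {"type": "server_vad", "interrupt_response": True},
}
# A partial update that touches a single nested key.
update = {"turn_detection": {"interrupt_response": False}}
result = deep_merge(current, update)
```

Under these assumptions, `result` keeps the original `instructions` and `turn_detection.type` and changes only `interrupt_response`, so a client only needs to send the fields it wants to alter.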

Live Transcription

When --enable_live_transcription is set (the default), the server emits streaming partial transcripts while the user is speaking:
  • conversation.item.input_audio_transcription.delta — partial hypothesis, updated every ~500 ms
  • conversation.item.input_audio_transcription.completed — final transcript with duration usage
speech-to-speech --mode realtime --enable_live_transcription
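
A client might track these two events as sketched below. `TranscriptTracker` is a hypothetical helper. The field names `item_id`, `delta`, and `transcript` follow the OpenAI Realtime event shapes and are assumed to apply here. The sketch also assumes each delta carries the full current hypothesis, so it replaces rather than appends; if the server sends incremental text, the deltas would need to be concatenated instead.

```python
class TranscriptTracker:
    """Keep the latest partial and final user transcript per conversation item."""

    DELTA = "conversation.item.input_audio_transcription.delta"
    COMPLETED = "conversation.item.input_audio_transcription.completed"

    def __init__(self) -> None:
        self.partial = {}  # item_id -> latest partial hypothesis
        self.final = {}    # item_id -> final transcript

    def handle(self, event: dict) -> None:
        """Update state from one decoded server event; other types are ignored."""
        item_id = event.get("item_id", "")
        event_type = event.get("type")
        if event_type == self.DELTA:
            # Assumption: the delta is the whole hypothesis so far, so overwrite.
            self.partial[item_id] = event.get("delta", "")
        elif event_type == self.COMPLETED:
            self.final[item_id] = event.get("transcript", "")
            self.partial.pop(item_id, None)
```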

Barge-In and Interruption Handling

Interruption (barge-in) is handled by a shared CancelScope object. When VAD detects the user speaking during assistant playback:
1. VAD emits speech_started

The VAD places a speech_started event on the internal text_output_queue.

2. Send loop cancels the active response

The _send_loop calls cancel_scope.cancel(), which increments the generation counter and sets a discard flag. The client receives response.done with status="cancelled" and reason="turn_detected".

3. LLM and TTS abort

Each handler captured the generation number at the start of the response. On every streaming token they call cancel_scope.is_stale(gen), and immediately abort when the generation has been superseded.

4. Discard guard clears

Stale audio/text arriving between cancel() and __RESPONSE_DONE__ is silently dropped. The discard guard clears when __RESPONSE_DONE__ arrives.

5. Pipeline processes the new utterance

should_listen is re-enabled and the pipeline begins processing the user’s new speech.

The client can also cancel programmatically by sending a response.cancel event.
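
The generation-counter pattern behind these steps can be sketched as follows. This is an illustrative reimplementation, not the project's actual CancelScope: only `cancel()` and `is_stale(gen)` are named in the text above, while `current()`, `should_discard()`, `response_done()`, and `stream_tokens()` are hypothetical names added for the example.

```python
import threading


class CancelScope:
    """Illustrative generation-counter cancel scope (not the project's class)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0
        self._discarding = False

    def current(self) -> int:
        """Generation number a handler captures when its response starts."""
        with self._lock:
            return self._generation

    def cancel(self) -> None:
        """Supersede the active response and start dropping its late output."""
        with self._lock:
            self._generation += 1
            self._discarding = True

    def is_stale(self, gen: int) -> bool:
        """True once the captured generation has been superseded."""
        with self._lock:
            return gen != self._generation

    def should_discard(self) -> bool:
        """True between cancel() and the end of the cancelled response."""
        with self._lock:
            return self._discarding

    def response_done(self) -> None:
        """Clear the discard guard once the cancelled response has ended."""
        with self._lock:
            self._discarding = False


def stream_tokens(scope: CancelScope, tokens):
    """Yield tokens until the response this loop belongs to is superseded."""
    gen = scope.current()
    for token in tokens:
        if scope.is_stale(gen):
            return  # a newer generation exists: abort immediately
        yield token
```

Comparing integers keeps the per-token check cheap, and a handler that started under an old generation can never mistake a later response for its own.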

LLM Backend Examples

export OPENAI_API_KEY=sk-...
speech-to-speech \
    --mode realtime \
    --stt parakeet-tdt \
    --llm_backend responses-api \
    --tts qwen3 \
    --model_name gpt-4o-mini \
    --responses_api_stream \
    --enable_live_transcription

Supported OpenAI Realtime Events

Client → Server

| Event | Description |
| --- | --- |
| input_audio_buffer.append | Stream base64 PCM audio. Decoded, resampled to 16 kHz, and chunked for VAD. |
| session.update | Deep-merge session config (instructions, tools, voice, turn detection, audio format). |
| conversation.item.create | Inject input_text or function_call_output into the LLM context without triggering generation. |
| response.create | Trigger LLM generation. Supports per-response instructions and tool_choice overrides. |
| response.cancel | Cancel the in-progress response and re-enable listening. |
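
For clients that talk to the WebSocket directly rather than through the SDK, an input_audio_buffer.append message might be built as sketched here. The helper name `audio_append_event` is hypothetical. The `audio` field name follows the OpenAI Realtime protocol, and 16-bit little-endian samples are an assumption about the PCM format the server expects.

```python
import base64
import json
import struct


def audio_append_event(samples) -> str:
    """Serialize signed 16-bit samples as an input_audio_buffer.append JSON message."""
    # Assumption: PCM16, little-endian, mono.
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii"),
        }
    )


# One tiny chunk of audio; a real client would send microphone blocks.
message = audio_append_event([0, 1, -1, 32767])
```

The resulting string can be sent as a text frame over the /v1/realtime connection.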

Server → Client

| Event | Description |
| --- | --- |
| session.created | Sent on connection with current session config. |
| error | Protocol errors such as session_limit_reached, unknown_or_invalid_event, invalid_session_type, conversation_already_has_active_response. |
| input_audio_buffer.speech_started | VAD detected user speech. |
| input_audio_buffer.speech_stopped | End of user speech segment. |
| conversation.item.created | Acknowledges injected input_text from conversation.item.create. |
| conversation.item.input_audio_transcription.delta | Streaming partial transcript (when live transcription is enabled). |
| conversation.item.input_audio_transcription.completed | Final transcript for the user turn with duration usage. |
| response.created | Emitted on the first outbound audio chunk (response is in_progress). |
| response.output_audio.delta | Base64 PCM audio chunk from TTS. |
| response.output_audio.done | Audio stream complete for the current output item. |
| response.output_audio_transcript.done | Full assistant text transcript for the turn. |
| response.function_call_arguments.done | Tool call with call_id, name, and JSON arguments. |
| response.done | Response finished: completed, cancelled with reason turn_detected or client_cancelled. |
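
A receiving client could route a few of these events as in the sketch below. `handle_server_event` is a hypothetical helper. The payload fields it reads (`delta`, `transcript`, `response.status`, `error.code`) follow the OpenAI Realtime event shapes and are assumptions about this server's payloads.

```python
import base64


def handle_server_event(event: dict, audio_out: bytearray, log: list) -> None:
    """Route one decoded server event; unknown event types are ignored."""
    event_type = event.get("type")
    if event_type == "response.output_audio.delta":
        # Decode the base64 PCM chunk and append it to the playback buffer.
        audio_out.extend(base64.b64decode(event.get("delta", "")))
    elif event_type == "response.output_audio_transcript.done":
        log.append(("assistant", event.get("transcript", "")))
    elif event_type == "response.done":
        # Assumed shape: the status lives under the nested "response" object.
        log.append(("done", event.get("response", {}).get("status")))
    elif event_type == "error":
        log.append(("error", event.get("error", {}).get("code")))
```

Passing the buffer and log in as arguments keeps the handler free of global state, so the same function can serve several connections.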
