WebSocket mode exposes a plain ws:// endpoint for bidirectional audio streaming. Unlike Realtime mode, which implements the full OpenAI Realtime protocol with JSON event messages and base64-encoded PCM, WebSocket mode uses raw binary frames: the client sends raw PCM bytes to the server and receives raw PCM bytes back. This makes it the right choice when you are building a custom client (a browser app, a mobile app, or a device) and want full control over the audio framing without adopting the OpenAI event schema.
Starting the WebSocket Server
Launching the pipeline in WebSocket mode starts a websockets async server and logs a startup message.
Configuration Flags
| Flag | Default | Description |
|---|---|---|
| --ws_host | 0.0.0.0 | Host IP address the WebSocket server binds to |
| --ws_port | 8765 | Port the WebSocket server listens on |
Audio Format
Both directions carry the same raw PCM format:

| Property | Value |
|---|---|
| Sample rate | 16,000 Hz |
| Bit depth | 16-bit signed integer (int16) |
| Channels | mono (1 channel) |
| Frame encoding | Raw binary bytes (no base64, no JSON wrapper) |
Sending audio: send raw int16 PCM bytes as binary WebSocket frames. The WebSocketStreamer accumulates incoming bytes into a per-client remainder buffer and chops them into aligned 512-sample (1,024-byte) chunks before placing them on the VAD input queue. Frames that straddle a 512-sample boundary are never dropped; the remainder carries over to the next frame.
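The remainder-buffer behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual WebSocketStreamer code: the `FrameChunker` class and its `feed` method are hypothetical names, and only the 512-sample (1,024-byte) chunk size is taken from the text.

```python
from typing import List

CHUNK_SAMPLES = 512
BYTES_PER_SAMPLE = 2  # int16
CHUNK_BYTES = CHUNK_SAMPLES * BYTES_PER_SAMPLE  # 1,024 bytes


class FrameChunker:
    """Hypothetical helper: re-chunks arbitrary-length frames into fixed-size chunks."""

    def __init__(self, chunk_bytes: int = CHUNK_BYTES) -> None:
        self.chunk_bytes = chunk_bytes
        self.remainder = b""  # bytes left over from earlier frames

    def feed(self, frame: bytes) -> List[bytes]:
        """Return every complete chunk now available; keep the tail for the next call."""
        data = self.remainder + frame
        usable = len(data) - (len(data) % self.chunk_bytes)
        chunks = [
            data[i:i + self.chunk_bytes]
            for i in range(0, usable, self.chunk_bytes)
        ]
        self.remainder = data[usable:]
        return chunks
```

Because the tail is carried over instead of discarded, a client can send frames of any length without aligning them to 1,024 bytes itself.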
Receiving audio: the server buffers outbound TTS chunks until at least 100 ms of audio has accumulated (3,200 bytes at 16 kHz int16) before sending, reducing the number of WebSocket frames the client must handle.
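The 3,200-byte threshold follows from 16,000 samples/s × 2 bytes × 0.1 s. A minimal sketch of this kind of outbound batching is shown below; the `OutboundBuffer` class is a hypothetical name, and sending the whole accumulated buffer once the threshold is reached is an assumption, since the text only states the minimum.

```python
from typing import Optional

SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2  # int16
MIN_SEND_MS = 100
MIN_SEND_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * MIN_SEND_MS // 1000  # 3,200


class OutboundBuffer:
    """Hypothetical helper: batches small TTS chunks into larger frames."""

    def __init__(self, min_bytes: int = MIN_SEND_BYTES) -> None:
        self.min_bytes = min_bytes
        self._buf = bytearray()

    def push(self, chunk: bytes) -> Optional[bytes]:
        """Add a chunk; return a payload once at least min_bytes have accumulated."""
        self._buf.extend(chunk)
        if len(self._buf) < self.min_bytes:
            return None
        payload = bytes(self._buf)
        self._buf.clear()
        return payload

    def flush(self) -> Optional[bytes]:
        """Return whatever is left (e.g. at the end of an utterance), or None if empty."""
        if not self._buf:
            return None
        payload = bytes(self._buf)
        self._buf.clear()
        return payload
```

On the client side, this means received frames are typically at least 100 ms long, so a playback queue should not assume any fixed frame size.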
Connecting a Client
Any WebSocket library can connect. Example using the Python websockets package:
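A minimal client sketch under stated assumptions: the server is reachable at `ws://localhost:8765` (the default port on the local machine), samples are little-endian, and the helper names `tone_pcm16` and `stream_once` are hypothetical. The generated tone is only a stand-in for microphone audio; a sine wave is unlikely to be treated as speech by the VAD, so substitute real 16 kHz mono int16 speech to get a reply.

```python
import asyncio
import math
import struct

SAMPLE_RATE = 16_000
URI = "ws://localhost:8765"  # assumption: default --ws_port, server on this machine


def tone_pcm16(duration_s: float = 1.0, freq_hz: float = 440.0) -> bytes:
    """Generate a mono int16 little-endian test tone as raw PCM bytes."""
    n = int(SAMPLE_RATE * duration_s)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


async def stream_once(pcm: bytes, uri: str = URI, idle_timeout_s: float = 5.0) -> bytes:
    """Send PCM as binary frames, then collect reply audio until the socket goes idle."""
    import websockets  # third-party; imported here so tone_pcm16 works without it

    received = bytearray()
    async with websockets.connect(uri) as ws:
        # Any frame size is acceptable; 1,024 bytes matches one 512-sample chunk.
        for i in range(0, len(pcm), 1024):
            await ws.send(pcm[i:i + 1024])  # bytes are sent as a binary frame
        while True:
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=idle_timeout_s)
            except (asyncio.TimeoutError, websockets.ConnectionClosed):
                break
            if isinstance(msg, bytes):  # ignore any non-binary frames
                received.extend(msg)
    return bytes(received)
```

With a server running, `asyncio.run(stream_once(tone_pcm16()))` returns the concatenated reply audio, which can be written to a WAV file or passed to an audio output library. A real-time client would normally send and receive concurrently instead of in two phases.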
Difference from Realtime Mode
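WebSocket mode differs from Realtime mode, which implements the full OpenAI Realtime protocol with JSON event messages and base64-encoded PCM, in the following ways: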
- Plain binary WebSocket frames
- Raw int16 PCM bytes in both directions
- No JSON event envelope
- No session configuration protocol
- No built-in barge-in/cancellation signalling
- Suitable for custom clients that manage their own session logic
Multiple Clients
The server keeps a set[ServerConnection] of all connected clients. When audio is ready to send, it broadcasts to all connected clients with asyncio.gather. Incoming audio from any client is forwarded to the shared VAD input queue.
When the last client disconnects, a SESSION_END control message is placed on the input queue to cleanly flush pipeline state.
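The broadcast and last-disconnect behaviour can be sketched as below. This is an illustration, not the server's actual code: the `Hub` class is a hypothetical name, `SESSION_END` is represented by a placeholder sentinel because the real control message type is not shown here, and passing `return_exceptions=True` to `asyncio.gather` is an assumption made so that one failing client does not interrupt the others.

```python
import asyncio
import queue

SESSION_END = object()  # placeholder; stands in for the real control message


class Hub:
    """Hypothetical helper: tracks clients that share one pipeline."""

    def __init__(self) -> None:
        self.clients = set()              # connected client objects
        self.input_queue = queue.Queue()  # shared VAD input queue

    def add(self, conn) -> None:
        self.clients.add(conn)

    def remove(self, conn) -> None:
        """Drop a client; signal session end when no clients remain."""
        self.clients.discard(conn)
        if not self.clients:
            self.input_queue.put(SESSION_END)

    async def broadcast(self, payload: bytes) -> None:
        """Send the same audio payload to every connected client concurrently."""
        if not self.clients:
            return
        await asyncio.gather(
            *(conn.send(payload) for conn in self.clients),
            return_exceptions=True,
        )
```

Any object with an async `send(bytes)` method works as a connection here, which makes the fan-out logic easy to exercise without a live socket.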
The WebSocket server accepts any number of concurrent clients, but all clients share a single pipeline instance. If you need isolated conversation state per client, use Realtime mode with --num_pipelines.
LLM Backend Examples
- Responses API
- MLX-LM (Apple Silicon)
Installing the WebSocket Extra
WebSocket mode requires the websockets package. Install it with the bundled extra: