Server/Client mode splits the pipeline across two processes: the server runs VAD, STT, LLM, and TTS on a remote machine (or a GPU workstation on your local network), while the client captures microphone audio and plays back the generated speech locally. Audio travels over two persistent TCP socket connections, one for each direction, using raw 16 kHz, int16, mono PCM. This mode is ideal when you want GPU-accelerated inference on a server but need the audio I/O to happen on a laptop or a different endpoint.
Architecture
Starting the Server
Bind both sockets to all interfaces so the client can reach them from any IP. The default mode is realtime, so --mode socket must be given explicitly. All four port and host flags can be specified together.
Running the Client
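Install sounddevice and transformers on the client machine, then run the client script with --host pointing at the server.
The client opens two TCP connections, one to the server's recv_port (12345) to send microphone data and one to send_port (12346) to receive generated audio, and bridges them to sounddevice streams.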
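The two-connection layout can be sketched without audio hardware by replacing the sounddevice streams with plain callables. This is an illustration only, not the actual client script: `run_client`, `read_mic`, and `play` are hypothetical names, and a real client would feed these from sounddevice input and output streams.

```python
import socket
import threading


def run_client(host, send_port, recv_port, read_mic, play, chunk_size=1024):
    """Open one TCP connection per direction and pump audio until both finish.

    read_mic() returns the next chunk of microphone bytes, or b"" to stop.
    play(data) consumes generated audio bytes.
    """
    send_sock = socket.create_connection((host, send_port))  # mic -> server
    recv_sock = socket.create_connection((host, recv_port))  # server -> speakers

    def uplink():
        while True:
            chunk = read_mic()
            if not chunk:
                break
            send_sock.sendall(chunk)
        send_sock.shutdown(socket.SHUT_WR)  # signal end of the microphone stream

    def downlink():
        while True:
            data = recv_sock.recv(chunk_size)
            if not data:  # server closed the connection
                break
            play(data)

    threads = [threading.Thread(target=uplink), threading.Thread(target=downlink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    send_sock.close()
    recv_sock.close()
```

Running each direction on its own thread keeps playback from stalling while the microphone side is blocked on a send, and the reverse.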
Client Arguments
| Argument | Default | Description |
|---|---|---|
| --host | localhost | Server hostname or IP address |
| --send_port | 12345 | Port to send microphone audio to |
| --recv_port | 12346 | Port to receive generated audio from |
| --send_rate | 16000 | Microphone sample rate in Hz |
| --recv_rate | 16000 | Speaker sample rate in Hz |
| --list_play_chunk_size | 1024 | Size of each audio chunk in bytes |
Socket Ports Reference
| Socket | Default Port | Direction | Purpose |
|---|---|---|---|
| SocketReceiver | 12345 | Client → Server | Microphone audio from client to VAD |
| SocketSender | 12346 | Server → Client | Generated TTS audio from server to speakers |
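The defaults can be overridden with --recv_port and --send_port on the server, and --send_port / --recv_port on the client script. The client's send port must match the server's receive port, and the client's receive port must match the server's send port.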
Audio Format
Both sockets carry the same raw PCM format:
- Sample rate: 16,000 Hz
- Bit depth: 16-bit signed integer (int16)
- Channels: mono (1 channel)
- Chunk size: 1,024 bytes (server default); configurable with --chunk_size
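As a quick check on these numbers, the sketch below derives how many samples and how much audio time one default chunk holds, and builds one chunk of silence with the standard library. The constant names are illustrative and do not come from the project.

```python
import array

SAMPLE_RATE = 16_000   # Hz
BYTES_PER_SAMPLE = 2   # int16
CHANNELS = 1           # mono
CHUNK_BYTES = 1024     # server default

samples_per_chunk = CHUNK_BYTES // (BYTES_PER_SAMPLE * CHANNELS)
chunk_ms = 1000 * samples_per_chunk / SAMPLE_RATE
bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS

# One chunk of silence: signed 16-bit samples ("h") serialized to raw bytes.
silence = array.array("h", [0] * samples_per_chunk).tobytes()

print(samples_per_chunk, chunk_ms, bytes_per_second, len(silence))
# prints: 512 32.0 32000 1024
```

Each default chunk therefore carries 512 samples, or 32 ms of audio, and each direction needs 32,000 bytes per second before TCP overhead.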
SocketReceiver calls receive_full_chunk in a loop so TCP segment boundaries never produce partial chunks.
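The general technique behind reading full chunks can be sketched as follows. This is not the project's receive_full_chunk implementation; `recv_exact` is a hypothetical helper that shows the usual pattern of looping on `recv` until the requested number of bytes has arrived.

```python
import socket


def recv_exact(conn: socket.socket, size: int):
    """Read exactly `size` bytes, or return None if the peer closes first."""
    buf = bytearray()
    while len(buf) < size:
        part = conn.recv(size - len(buf))  # may return fewer bytes than requested
        if not part:
            return None  # connection closed before a full chunk arrived
        buf.extend(part)
    return bytes(buf)
```

Because TCP is a byte stream, a single `recv` call can return any prefix of what the sender wrote. Accumulating into a buffer keeps every chunk handed downstream at the expected length.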
LLM Backend Examples
- Responses API (default)
- vLLM (local server)
- Transformers (CUDA)
Timeout and Stuck-Pipeline Safety
SocketReceiver includes a 30-second safety timeout: if should_listen remains cleared for longer than 30 seconds — indicating that the LLM or TTS handler may have crashed — the receiver automatically re-enables listening so the user is not permanently locked out. A warning is logged when this happens.
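The recovery logic can be sketched with a `threading.Event` and a monotonic clock. This is an illustration under assumptions, not the project's code: the class `ListenWatchdog`, its method names, and the injectable clock are hypothetical. Only the should_listen flag and the 30-second limit come from the description above.

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


class ListenWatchdog:
    """Re-enable listening if should_listen stays cleared for too long."""

    def __init__(self, should_listen: threading.Event, timeout_s: float = 30.0,
                 clock=time.monotonic):
        self.should_listen = should_listen
        self.timeout_s = timeout_s
        self.clock = clock
        self._cleared_since = None  # time at which the flag was first seen cleared

    def check(self) -> bool:
        """Call once per receive-loop iteration; returns True if it recovered."""
        if self.should_listen.is_set():
            self._cleared_since = None
            return False
        now = self.clock()
        if self._cleared_since is None:
            self._cleared_since = now
            return False
        if now - self._cleared_since > self.timeout_s:
            logger.warning(
                "should_listen cleared for more than %.0f s; re-enabling",
                self.timeout_s,
            )
            self.should_listen.set()
            self._cleared_since = None
            return True
        return False
```

Using a monotonic clock rather than wall-clock time keeps the timeout unaffected by system clock adjustments, and passing the clock in makes the timeout testable without waiting 30 seconds.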