Server/Client mode splits the pipeline across two processes: the server runs VAD, STT, LLM, and TTS on a remote machine (or a GPU workstation on your local network), while the client captures microphone audio and plays back the generated speech locally. Audio travels over two persistent TCP socket connections — one for each direction — using raw 16 kHz, int16, mono PCM. This mode is ideal when you want GPU-accelerated inference on a server but need the audio I/O to happen on a laptop or a different endpoint.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  CLIENT  (scripts/listen_and_play.py)                           │
│                                                                 │
│  Microphone → send_socket → :12345 ──────────────────────────► │
│                                                                 │
│  Speaker   ← recv_socket ← :12346 ◄────────────────────────── │
└─────────────────────────────────────────────────────────────────┘
                          │ TCP │ TCP │
┌─────────────────────────────────────────────────────────────────┐
│  SERVER  (speech-to-speech)                                     │
│                                                                 │
│  SocketReceiver (:12345) → VAD → STT → LLM → TTS               │
│                                                                 │
│  SocketSender   (:12346) ←────────────────────────────────────  │
└─────────────────────────────────────────────────────────────────┘

Starting the Server

Bind both sockets to all interfaces so the client can reach them from any IP:
speech-to-speech \
    --mode socket \
    --recv_host 0.0.0.0 \
    --send_host 0.0.0.0
The default mode is realtime, so --mode socket must be given explicitly. All four port and host flags can be specified together:
speech-to-speech \
    --mode socket \
    --recv_host 0.0.0.0 \
    --recv_port 12345 \
    --send_host 0.0.0.0 \
    --send_port 12346
The server waits on both ports for a client connection before the pipeline begins processing.
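The startup behaviour can be pictured with plain TCP sockets. The following is a minimal sketch, not the project's SocketReceiver or SocketSender code: the helper names `listen_on` and `accept_both` are hypothetical, and accepting the two connections one after the other is an assumption made for brevity.

```python
import socket


def listen_on(host: str, port: int) -> socket.socket:
    """Create a listening TCP socket bound to (host, port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)  # a single client is expected
    return srv


def accept_both(recv_srv: socket.socket, send_srv: socket.socket):
    """Block until a client has connected to both listening sockets."""
    recv_conn, _ = recv_srv.accept()  # client -> server: microphone audio
    send_conn, _ = send_srv.accept()  # server -> client: generated audio
    return recv_conn, send_conn
```

With the documented defaults this would be called as `accept_both(listen_on("0.0.0.0", 12345), listen_on("0.0.0.0", 12346))`, and nothing downstream runs until both calls to `accept()` return.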

Running the Client

Install sounddevice and transformers on the client machine, then run:
python scripts/listen_and_play.py --host <IP address of your server>
The script connects two sockets — one to recv_port (12345) to send microphone data, and one to send_port (12346) to receive generated audio — and bridges them to sounddevice streams.
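The connection step on the client side might look roughly like this. It is a sketch under stated assumptions rather than the contents of scripts/listen_and_play.py: the function name `connect_duplex` and the `timeout` parameter are hypothetical, and the sounddevice wiring is left out so the example runs without audio hardware.

```python
import socket


def connect_duplex(host: str, send_port: int = 12345, recv_port: int = 12346,
                   timeout: float = 5.0):
    """Open the two client connections: one to send audio, one to receive it."""
    # Microphone audio goes to the server's recv_port.
    send_sock = socket.create_connection((host, send_port), timeout=timeout)
    try:
        # Generated speech comes back from the server's send_port.
        recv_sock = socket.create_connection((host, recv_port), timeout=timeout)
    except OSError:
        send_sock.close()  # do not leak the first socket if the second fails
        raise
    # The timeout only guards connection setup; streaming should block.
    send_sock.settimeout(None)
    recv_sock.settimeout(None)
    return send_sock, recv_sock
```

A microphone callback would then write its PCM bytes to `send_sock`, while a playback loop reads from `recv_sock`.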

Client Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --host | localhost | Server hostname or IP address |
| --send_port | 12345 | Port to send microphone audio to |
| --recv_port | 12346 | Port to receive generated audio from |
| --send_rate | 16000 | Microphone sample rate in Hz |
| --recv_rate | 16000 | Speaker sample rate in Hz |
| --list_play_chunk_size | 1024 | Size of each audio chunk in bytes |

Socket Ports Reference

| Socket | Default Port | Direction | Purpose |
| --- | --- | --- | --- |
| SocketReceiver | 12345 | Client → Server | Microphone audio from client to VAD |
| SocketSender | 12346 | Server → Client | Generated TTS audio from server to speakers |
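Override the defaults with --recv_port and --send_port on the server, and with --send_port and --recv_port on the client script. The names are mirrored: the client's --send_port must match the server's --recv_port, and the client's --recv_port must match the server's --send_port.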

Audio Format

Both sockets carry the same raw PCM format:
  • Sample rate: 16,000 Hz
  • Bit depth: 16-bit signed integer (int16)
  • Channels: mono (1 channel)
  • Chunk size: 1,024 bytes (server default); configurable with --chunk_size
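The numbers in this list translate directly into chunk duration and bandwidth. The sketch below works through the arithmetic and decodes one chunk with the standard library. The little-endian byte order is an assumption (the page only says int16), and `decode_chunk` is a hypothetical helper.

```python
import struct

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # int16
CHANNELS = 1           # mono
CHUNK_BYTES = 1024     # server default

samples_per_chunk = CHUNK_BYTES // (BYTES_PER_SAMPLE * CHANNELS)   # 512
chunk_ms = 1000 * samples_per_chunk / SAMPLE_RATE_HZ               # 32.0
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS    # 32000


def decode_chunk(chunk: bytes) -> tuple:
    """Unpack raw PCM bytes into int16 samples (assumes little-endian)."""
    count = len(chunk) // BYTES_PER_SAMPLE
    return struct.unpack(f"<{count}h", chunk[: count * BYTES_PER_SAMPLE])
```

At the default settings each chunk therefore carries 512 samples, or 32 ms of audio, and each direction needs about 32 kB/s.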
The client streams at the same 16 kHz rate. The server’s SocketReceiver calls receive_full_chunk in a loop so TCP segment boundaries never produce partial chunks.
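A single `recv()` call on a TCP socket may return fewer bytes than requested, which is why a read-until-full loop is needed. The following is a sketch of that technique. Only the name `receive_full_chunk` comes from this page; the signature and the choice to return `None` when the peer disconnects are assumptions.

```python
import socket
from typing import Optional


def receive_full_chunk(conn: socket.socket, chunk_size: int) -> Optional[bytes]:
    """Read exactly chunk_size bytes, or return None if the peer closes first."""
    data = bytearray()
    while len(data) < chunk_size:
        # Ask only for what is still missing so the next chunk is not consumed.
        packet = conn.recv(chunk_size - len(data))
        if not packet:  # empty bytes means the connection was closed
            return None
        data.extend(packet)
    return bytes(data)
```

Because every chunk has an even number of bytes, a loop like this also guarantees that no int16 sample is ever split across two chunks.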

LLM Backend Examples

# Server
speech-to-speech \
    --mode socket \
    --recv_host 0.0.0.0 \
    --send_host 0.0.0.0 \
    --llm_backend responses-api \
    --model_name gpt-4o-mini \
    --responses_api_api_key "$OPENAI_API_KEY" \
    --responses_api_stream

# Client
python scripts/listen_and_play.py --host 192.168.1.42

Timeout and Stuck-Pipeline Safety

SocketReceiver includes a 30-second safety timeout: if should_listen remains cleared for longer than 30 seconds — indicating that the LLM or TTS handler may have crashed — the receiver automatically re-enables listening so the user is not permanently locked out. A warning is logged when this happens.
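The timeout logic can be sketched with a `threading.Event` and a monotonic clock. This illustrates the behaviour described above and is not the SocketReceiver source: the class name `ListenWatchdog`, its `check` method, and the injectable `clock` are hypothetical.

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


class ListenWatchdog:
    """Re-enable listening if should_listen stays cleared for too long."""

    def __init__(self, should_listen: threading.Event,
                 timeout_s: float = 30.0, clock=time.monotonic):
        self.should_listen = should_listen
        self.timeout_s = timeout_s
        self._clock = clock
        self._cleared_at = None  # time at which the flag was first seen cleared

    def check(self) -> bool:
        """Call periodically; returns True if listening was force-enabled."""
        if self.should_listen.is_set():
            self._cleared_at = None
            return False
        now = self._clock()
        if self._cleared_at is None:
            self._cleared_at = now  # start timing this cleared period
            return False
        if now - self._cleared_at > self.timeout_s:
            logger.warning("should_listen cleared for more than %.0f s; "
                           "re-enabling listening", self.timeout_s)
            self.should_listen.set()
            self._cleared_at = None
            return True
        return False
```

A receive loop would call `check()` once per iteration. Using a monotonic clock keeps the timeout unaffected by system clock adjustments.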
The server blocks on socket.accept() for each port and accepts only one client at a time. For multi-client or browser-based scenarios, use WebSocket mode or Realtime mode instead.
