Speech to Speech ships a single CLI command, speech-to-speech, that starts the full VAD→STT→LLM→TTS pipeline and, by default, exposes an OpenAI Realtime-compatible WebSocket server on port 8765. This guide goes from a fresh install to a working voice agent in three steps, then shows how to connect a client and explore alternative configurations.

1. Install

Install the package from PyPI:
pip install speech-to-speech
On Linux, if your CUDA version is not 12.8, pre-install the matching qwentts-cpp-python wheel first; see the Installation guide for the exact commands.

2. Set your OpenAI API key

The default pipeline routes LLM inference through the OpenAI Responses API. Export your key before launching:
export OPENAI_API_KEY=your_key_here
You can also pass it explicitly with --responses_api_api_key if you prefer not to set an environment variable.

3. Run the pipeline

speech-to-speech
That’s it. The pipeline starts, loads its models, and listens on ws://localhost:8765/v1/realtime. You should see log output as each stage initialises.
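
Because model loading can take a while, a script that launches the pipeline may want to wait until the port is open before connecting. The following is a minimal sketch using only the Python standard library. The helper name wait_for_port and its timeout values are illustrative assumptions, not part of Speech to Speech:

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8765, timeout: float = 60.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it is not shipped with Speech to Speech.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful TCP connect only shows that something is listening;
            # it does not prove that every pipeline stage has finished loading.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(0.5)
    return False
```

Calling wait_for_port() with no arguments polls the default localhost:8765 address for up to a minute. Treat a True result as "the socket is open", and still watch the server log for the per-stage initialisation messages.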

What the Default Command Does

The bare speech-to-speech command is equivalent to the following fully-expanded invocation. Every flag shown here is a default; you can override any of them:
speech-to-speech \
    --thresh 0.6 \
    --stt parakeet-tdt \
    --llm_backend responses-api \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden \
    --qwen3_tts_language auto \
    --qwen3_tts_backend ggml \
    --qwen3_tts_non_streaming_mode True \
    --qwen3_tts_mlx_quantization 6bit \
    --model_name gpt-5.4-mini \
    --chat_size 30 \
    --responses_api_stream \
    --enable_live_transcription \
    --mode realtime
The server binds to port 8765 by default and exposes the endpoint at /v1/realtime. Connect any OpenAI Realtime-compatible client to ws://localhost:8765/v1/realtime. Override the port with --ws_port and the bind address with --ws_host.
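
When launching the pipeline from another program, the "defaults plus overrides" idea above can be sketched as a small argv builder. This is an illustration only: the DEFAULTS dictionary copies a subset of the flags listed above, build_command is a hypothetical helper, and treating True as a bare switch is an assumption about how the CLI parses --responses_api_stream and --enable_live_transcription.

```python
import shlex

# Subset of the defaults from the expanded invocation above.
# Assumption: a value of True marks a switch that takes no argument.
# Flags that take an explicit value (such as --qwen3_tts_non_streaming_mode True)
# should be passed as the string "True" instead.
DEFAULTS = {
    "thresh": 0.6,
    "stt": "parakeet-tdt",
    "llm_backend": "responses-api",
    "tts": "qwen3",
    "model_name": "gpt-5.4-mini",
    "chat_size": 30,
    "responses_api_stream": True,
    "enable_live_transcription": True,
    "mode": "realtime",
}


def build_command(overrides=None):
    """Merge overrides into DEFAULTS and return an argv list for subprocess.run."""
    options = {**DEFAULTS, **(overrides or {})}
    argv = ["speech-to-speech"]
    for name, value in options.items():
        if value is True:
            argv.append(f"--{name}")  # bare switch
        elif value is False or value is None:
            continue  # switch turned off: omit it entirely
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


# Example: move the server to another port and bind address.
print(shlex.join(build_command({"ws_port": 9000, "ws_host": "0.0.0.0"})))
```

The resulting list can be handed to subprocess.run or subprocess.Popen unchanged. Building a list rather than a single string avoids shell-quoting problems.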

Alternative Quickstarts

# Uses OpenAI gpt-5.4-mini as the LLM with local Parakeet TDT + Qwen3-TTS
export OPENAI_API_KEY=your_key_here
speech-to-speech

Connect with the OpenAI Realtime Client

Once the server is running in --mode realtime (the default), connect to it from Python using the official openai package. Because Speech to Speech implements the OpenAI Realtime protocol, no special client code is needed — just point base_url at your local server:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")

with client.beta.realtime.connect(model="model_name") as conn:
    conn.session.update(
        session={
            "instructions": "You are a helpful assistant.",
            "turn_detection": {"type": "server_vad", "interrupt_response": True},
        }
    )

    # Send audio, receive events, etc.
    for event in conn:
        print(event.type)
The api_key value passed to OpenAI() is not validated by the Speech to Speech server — any non-empty string works. The actual LLM API key is configured server-side via OPENAI_API_KEY or --responses_api_api_key.
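
For clients that do not use the openai package, it may help to see roughly what conn.session.update puts on the wire. The sketch below builds the JSON text frame for a session.update client event as defined by the OpenAI Realtime protocol. It assumes Speech to Speech accepts the same event shape; which session fields the server actually honours is not stated here, and the function name session_update_event is hypothetical.

```python
import json


def session_update_event(instructions: str, interrupt_response: bool = True) -> str:
    """Serialise a session.update client event as a JSON string.

    Assumption: the server accepts the same fields that the Python example
    above passes to conn.session.update.
    """
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "turn_detection": {
                "type": "server_vad",
                "interrupt_response": interrupt_response,
            },
        },
    }
    return json.dumps(event)
```

A WebSocket client in any language would send this string as a text message to ws://localhost:8765/v1/realtime after the connection opens.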

Mac Optimal Settings Shortcut

On Apple Silicon, a single flag sets Parakeet TDT for STT, MLX LM for language model inference, Qwen3-TTS via mlx-audio for TTS, and --device mps for all models. No API key is required:
speech-to-speech --local_mac_optimal_settings
This is equivalent to:
speech-to-speech \
    --mode local \
    --device mps \
    --stt parakeet-tdt \
    --llm_backend mlx-lm \
    --tts qwen3 \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
Pocket and Kokoro are also valid TTS choices on macOS. To use one instead of the default Qwen3-TTS, add --tts pocket or --tts kokoro after --local_mac_optimal_settings.
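
A launcher script that runs on several machines could choose between the default command and the Mac shortcut automatically. The sketch below is one way to do that, under the assumption that Apple Silicon is identified by platform.system() returning "Darwin" and platform.machine() returning "arm64" (a Python build running under Rosetta reports "x86_64" instead). The function pick_launch_args is hypothetical.

```python
import platform


def pick_launch_args(system=None, machine=None, tts=None):
    """Return an argv list, using the Mac shortcut on Apple Silicon.

    system and machine default to the values reported by the platform module;
    they are parameters so the logic can be exercised on any host.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        argv = ["speech-to-speech", "--local_mac_optimal_settings"]
        # Optional TTS override, placed after the shortcut flag.
        if tts in ("pocket", "kokoro"):
            argv += ["--tts", tts]
        return argv
    # Everywhere else, fall back to the bare default command.
    return ["speech-to-speech"]
```

On other platforms the function returns the bare command, which still needs OPENAI_API_KEY to be set as described in the steps above.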

Next Steps

  • Explore all CLI flags with speech-to-speech -h or browse the arguments classes in the source.
  • Swap in a self-hosted LLM server by passing --responses_api_base_url http://localhost:8000/v1 with vLLM or llama.cpp.
  • Use --language auto with --enable_lang_prompt for automatic multilingual conversation (English, French, Spanish, Chinese, Japanese, Korean).
  • Run a pool of parallel pipelines with --num_pipelines N (requires --mode realtime) to serve multiple concurrent WebSocket sessions.
