Get Started with Speech to Speech

Speech to Speech ships a single speech-to-speech CLI command that starts the full VAD→STT→LLM→TTS pipeline and, by default, exposes an OpenAI Realtime-compatible WebSocket server on port 8765. This page goes from a fresh install to a working voice agent in three steps, then shows how to connect a client and explore alternative configurations.

Install

Install the package from PyPI:

pip install speech-to-speech

On Linux, if your CUDA version is not 12.8, pre-install the matching qwentts-cpp-python wheel first. See the Installation guide for the exact commands.

Set your OpenAI API key

The default pipeline routes LLM inference through the OpenAI Responses API. Export your key before launching:

export OPENAI_API_KEY=your_key_here

You can also pass it explicitly with --responses_api_api_key if you prefer not to set an environment variable.

Run the pipeline

speech-to-speech

That’s it. The pipeline starts, loads its models, and listens on ws://localhost:8765/v1/realtime. You should see log output as each stage initialises.
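To confirm that the server is actually accepting connections, a plain TCP probe is usually enough. The following is a minimal sketch, not part of the package: is_listening is a hypothetical helper name, and it only checks that something is listening on the port, not that the listener speaks the Realtime protocol.

```python
import socket


def is_listening(host: str = "localhost", port: int = 8765, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it only proves that a listener exists on the port.
    """
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling is_listening() with no arguments probes localhost:8765, the default address described above. Model loading can take a while on first launch, so a False result right after starting the command does not necessarily indicate a failure.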

What the Default Command Does

The bare speech-to-speech command is equivalent to the following fully-expanded invocation. Every flag shown here is a default; you can override any of them:

speech-to-speech \
    --thresh 0.6 \
    --stt parakeet-tdt \
    --llm_backend responses-api \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden \
    --qwen3_tts_language auto \
    --qwen3_tts_backend ggml \
    --qwen3_tts_non_streaming_mode True \
    --qwen3_tts_mlx_quantization 6bit \
    --model_name gpt-5.4-mini \
    --chat_size 30 \
    --responses_api_stream \
    --enable_live_transcription \
    --mode realtime

The server binds to port 8765 by default and exposes the endpoint at /v1/realtime. Connect any OpenAI Realtime-compatible client to ws://localhost:8765/v1/realtime. Override the port with --ws_port and the bind address with --ws_host.
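If you override --ws_host or --ws_port, the client-side URLs change with them. As a small sketch, the hypothetical helper below derives both the WebSocket endpoint and the HTTP-style base URL used by the OpenAI client from a host and port. The function name and the returned keys are illustrative, not part of Speech to Speech.

```python
def endpoints(host: str = "localhost", port: int = 8765) -> dict:
    """Build the client-facing URLs for a server started with --ws_host/--ws_port.

    Hypothetical helper; the path /v1/realtime is the one documented above.
    """
    return {
        # Endpoint for any OpenAI Realtime-compatible WebSocket client.
        "websocket": f"ws://{host}:{port}/v1/realtime",
        # Value to pass as base_url when using the openai Python package.
        "openai_base_url": f"http://{host}:{port}/v1",
    }


print(endpoints()["websocket"])  # ws://localhost:8765/v1/realtime
```

For example, a server started with --ws_port 9000 would be reached through endpoints(port=9000).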

Alternative Quickstarts

# Uses OpenAI gpt-5.4-mini as the LLM with local Parakeet TDT + Qwen3-TTS
export OPENAI_API_KEY=your_key_here
speech-to-speech

Connect with the OpenAI Realtime Client

Once the server is running in --mode realtime (the default), connect to it from Python using the official openai package. Because Speech to Speech implements the OpenAI Realtime protocol, no special client code is needed. Point base_url at your local server:

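from openai import OpenAI

# Point the client at the local server. The api_key is a placeholder:
# any non-empty string works.
client = OpenAI(base_url="http://localhost:8765/v1", api_key="not-needed")

with client.beta.realtime.connect(model="model_name") as conn:
    # Configure the session: system instructions plus server-side voice
    # activity detection that lets user speech interrupt a response.
    conn.session.update(
        session={
            "instructions": "You are a helpful assistant.",
            "turn_detection": {"type": "server_vad", "interrupt_response": True},
        }
    )

    # Send audio, receive events, etc.
    for event in conn:
        print(event.type)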

The api_key value passed to OpenAI() is not validated by the Speech to Speech server — any non-empty string works. The actual LLM API key is configured server-side via OPENAI_API_KEY or --responses_api_api_key.
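The "send audio" step is left open in the example above. In the OpenAI Realtime protocol, audio is sent as base64-encoded chunks in input_audio_buffer.append events. The sketch below builds such an event from raw 16-bit PCM. It assumes the server accepts this event and mono PCM16 at 24 kHz (the OpenAI default), which this page does not confirm. The function names are hypothetical, and the sine tone is only stand-in test data.

```python
import base64
import math
import struct


def sine_pcm16(freq_hz: float = 440.0, seconds: float = 0.1, sample_rate: int = 24000) -> bytes:
    """Generate a mono sine tone as little-endian 16-bit PCM (stand-in for microphone audio)."""
    n = round(seconds * sample_rate)
    samples = (
        int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


def audio_append_event(pcm16: bytes) -> dict:
    """Wrap raw PCM16 bytes in an OpenAI Realtime input_audio_buffer.append event."""
    return {
        "type": "input_audio_buffer.append",
        # The protocol carries audio as base64 text inside the JSON event.
        "audio": base64.b64encode(pcm16).decode("ascii"),
    }


event = audio_append_event(sine_pcm16())
print(event["type"], len(event["audio"]))
```

Inside the with block, the openai package exposes conn.input_audio_buffer.append(audio=event["audio"]) for sending the chunk. Check the sample rate and format your server expects before relying on the 24 kHz assumption.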

Mac Optimal Settings Shortcut

On Apple Silicon, a single flag sets Parakeet TDT for STT, MLX LM for language model inference, Qwen3-TTS via mlx-audio for TTS, and --device mps for all models. No API key is required:

speech-to-speech --local_mac_optimal_settings

This is equivalent to:

speech-to-speech \
    --mode local \
    --device mps \
    --stt parakeet-tdt \
    --llm_backend mlx-lm \
    --tts qwen3 \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16

--tts pocket and --tts kokoro are also valid TTS choices on macOS. To use one instead of the default Qwen3-TTS, pass it after --local_mac_optimal_settings.
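A launcher script can decide whether the shortcut applies by checking the platform. The sketch below is illustrative only: is_apple_silicon and mac_shortcut_args are hypothetical helpers, and the accepted TTS names are limited to the ones mentioned on this page.

```python
import platform
from typing import List, Optional


def is_apple_silicon(system: Optional[str] = None, machine: Optional[str] = None) -> bool:
    """Return True on macOS running natively on an arm64 (Apple Silicon) CPU."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # A Python interpreter running under Rosetta reports x86_64 here.
    return system == "Darwin" and machine == "arm64"


def mac_shortcut_args(tts: Optional[str] = None) -> List[str]:
    """Build the argv for the Mac shortcut, optionally overriding the TTS engine."""
    if tts is not None and tts not in ("qwen3", "pocket", "kokoro"):
        raise ValueError(f"unsupported TTS choice for this sketch: {tts}")
    args = ["speech-to-speech", "--local_mac_optimal_settings"]
    if tts is not None:
        # The override goes after the shortcut flag.
        args += ["--tts", tts]
    return args


print(mac_shortcut_args("kokoro"))
```

The resulting list can be handed to subprocess.run when is_apple_silicon() returns True. On other platforms, fall back to the default command.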

Next Steps

Explore all CLI flags with speech-to-speech -h or browse the arguments classes in the source.
Swap in a self-hosted LLM server by passing --responses_api_base_url http://localhost:8000/v1 with vLLM or llama.cpp.
Use --language auto with --enable_lang_prompt for automatic multilingual conversation (English, French, Spanish, Chinese, Japanese, Korean).
Run a pool of parallel pipelines with --num_pipelines N (requires --mode realtime) to serve multiple concurrent WebSocket sessions.
