speech-to-speech CLI: Complete Command Reference

The speech-to-speech package exposes a single command-line entrypoint — speech-to-speech — that starts the full VAD → STT → LLM → TTS pipeline. The same pipeline can also be launched via the Python module form:

```bash
python -m speech_to_speech.s2s_pipeline
```

All configuration is passed as CLI flags, or through a JSON file as described under "Passing arguments from a JSON file". Running with -h prints the full reference:

```bash
speech-to-speech -h
```

Argument groups

Every flag belongs to one of the argument classes below. Each class is parsed independently and covers one stage or concern of the pipeline:

| Argument class | Reference page | Scope |
| --- | --- | --- |
| `ModuleArguments` | Module Args | Top-level mode, backend selection, live transcription |
| `VADHandlerArguments` | VAD Args | Silero VAD v5 sensitivity, timing, and audio enhancement |
| `WhisperSTTHandlerArguments` | STT Args | Whisper (`--stt whisper`) |
| `FasterWhisperSTTHandlerArguments` | STT Args | Faster-Whisper (`--stt faster-whisper`) |
| `ParakeetTDTSTTHandlerArguments` | STT Args | Parakeet TDT (`--stt parakeet-tdt`) |
| `ParaformerSTTHandlerArguments` | STT Args | Paraformer / FunASR (`--stt paraformer`) |
| `MLXAudioWhisperSTTHandlerArguments` | STT Args | MLX Audio Whisper (`--stt mlx-audio-whisper`) |
| `LanguageModelBaseArguments` | LLM Args | Shared model name, chat history, system prompt |
| `LanguageModelHandlerArguments` | LLM Args | Transformers / mlx-lm local backends |
| `ResponsesApiLanguageModelHandlerArguments` | LLM Args | OpenAI Responses API backend |
| `ChatCompletionsLanguageModelHandlerArguments` | LLM Args | OpenAI Chat Completions backend |
| `Qwen3TTSHandlerArguments` | TTS Args | Qwen3-TTS (`--tts qwen3`) |
| `KokoroTTSHandlerArguments` | TTS Args | Kokoro-82M (`--tts kokoro`) |
| `PocketTTSHandlerArguments` | TTS Args | Pocket TTS (`--tts pocket`) |
| `ChatTTSHandlerArguments` | TTS Args | ChatTTS (`--tts chatTTS`) |
| `FacebookMMSTTSHandlerArguments` | TTS Args | Facebook MMS (`--tts facebookMMS`) |
| `SocketReceiverArguments` | Connection Args | TCP socket receiver |
| `SocketSenderArguments` | Connection Args | TCP socket sender |
| `WebSocketStreamerArguments` | Connection Args | WebSocket streamer |
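
As a rough illustration of how several argument classes can feed one command line, the sketch below registers the fields of two dataclasses on a single standard-library parser and then splits the parsed values back out per class. The two `Demo...` classes and the `parse_into_dataclasses` helper are hypothetical stand-ins, not the package's real definitions, and the actual CLI may build its parser differently. Only the `--stt`, `--tts`, and `--thresh` flag names come from this page.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class DemoModuleArguments:
    """Hypothetical stand-in for a top-level argument class."""
    stt: str = "parakeet-tdt"
    tts: str = "qwen3"


@dataclass
class DemoVADArguments:
    """Hypothetical stand-in for a VAD argument class."""
    thresh: float = 0.6


def parse_into_dataclasses(argv, *classes):
    """Register every dataclass field as a --flag, then rebuild one instance per class."""
    parser = argparse.ArgumentParser()
    for cls in classes:
        group = parser.add_argument_group(cls.__name__)
        for f in fields(cls):
            # The flag's type is inferred from the default value (str and float only here).
            group.add_argument(f"--{f.name}", type=type(f.default), default=f.default)
    namespace = vars(parser.parse_args(argv))
    return tuple(
        cls(**{f.name: namespace[f.name] for f in fields(cls)}) for cls in classes
    )


module_args, vad_args = parse_into_dataclasses(
    ["--stt", "faster-whisper", "--thresh", "0.4"],
    DemoModuleArguments,
    DemoVADArguments,
)
print(module_args)
print(vad_args)
```

Each class receives only its own fields, so a handler can be constructed from its argument object without seeing flags that belong to other stages.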

Flag prefix system

Because multiple STT, LLM, and TTS backends can coexist in the argument namespace, each backend’s flags are namespaced by a prefix:

| Backend | CLI prefix | Example |
| --- | --- | --- |
| Whisper (Transformers) | `--stt_` | `--stt_model_name openai/whisper-large-v3` |
| Faster-Whisper | `--faster_whisper_stt_` | `--faster_whisper_stt_model_name large-v3` |
| Parakeet TDT | `--parakeet_tdt_` | `--parakeet_tdt_device mps` |
| Paraformer | `--paraformer_stt_` | `--paraformer_stt_model_name paraformer-zh` |
| MLX Audio Whisper | `--mlx_audio_whisper_` | `--mlx_audio_whisper_model_name mlx-community/whisper-large-v3-turbo` |
| Local LLM (transformers/mlx-lm) | `--llm_` | `--llm_device cuda` |
| Shared LLM (all backends) | (no prefix) | `--model_name gpt-4o-mini` |
| Responses API / Chat Completions | `--responses_api_` | `--responses_api_base_url http://localhost:8000/v1` |
| Qwen3-TTS | `--qwen3_tts_` | `--qwen3_tts_speaker Aiden` |
| Kokoro TTS | `--kokoro_` | `--kokoro_voice bm_fable` |
| Pocket TTS | `--pocket_tts_` | `--pocket_tts_voice jean` |
| ChatTTS | `--chat_tts_` | `--chat_tts_device cuda` |
| Facebook MMS | `--facebook_mms_` | `--facebook_mms_device cuda` |
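
The effect of the prefixes can be illustrated with a small helper that pulls one backend's settings out of a flat namespace. This is only a sketch: the `select_by_prefix` helper and the sample dict are hypothetical, and the flag names are taken from the table above.

```python
def select_by_prefix(flat_args: dict, prefix: str) -> dict:
    """Return the entries whose key starts with `prefix`, with the prefix removed."""
    return {
        key[len(prefix):]: value
        for key, value in flat_args.items()
        if key.startswith(prefix)
    }


# A flat namespace as it might look after parsing (values are illustrative).
flat = {
    "kokoro_voice": "bm_fable",
    "pocket_tts_voice": "jean",
    "chat_tts_device": "cuda",
    "llm_device": "cuda",
    "model_name": "gpt-4o-mini",  # shared LLM flag, no prefix
}

kokoro = select_by_prefix(flat, "kokoro_")
pocket = select_by_prefix(flat, "pocket_tts_")
print(kokoro)
print(pocket)
```

Both Kokoro and Pocket TTS end up with a `voice` setting, yet the two values never collide because each flag carries its backend's prefix on the command line.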

The `gen_kwargs` pattern

Generation parameters follow the <handler_prefix>_gen_<param> naming convention. At parse time the pipeline strips the handler prefix and collects every gen_-prefixed field into a gen_kwargs dict that is forwarded directly to the underlying model’s generate() call:

```bash
# Cap Whisper transcription length
speech-to-speech --stt_gen_max_new_tokens 128

# Sample from the LLM at temperature 0.7
speech-to-speech --llm_gen_temperature 0.7

# Use beam search for Whisper
speech-to-speech --stt_gen_num_beams 4
```

Only fields that are explicitly declared in the corresponding argument dataclass are accepted as CLI flags. Refer to each backend’s section in STT Args or LLM Args for the full list of supported gen_* parameters.
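
The collection step described in this section can be sketched as follows. The `split_gen_kwargs` helper is hypothetical and is not the package's actual code. It also assumes that the `gen_` prefix is removed from each key, so that the result matches the keyword names `generate()` expects.

```python
def split_gen_kwargs(handler_args: dict, handler_prefix: str) -> tuple[dict, dict]:
    """Strip the handler prefix, then separate gen_* fields from the rest."""
    handler_kwargs, gen_kwargs = {}, {}
    for key, value in handler_args.items():
        name = key[len(handler_prefix):] if key.startswith(handler_prefix) else key
        if name.startswith("gen_"):
            # Assumption: drop "gen_" so the key is a plain generate() keyword.
            gen_kwargs[name[len("gen_"):]] = value
        else:
            handler_kwargs[name] = value
    return handler_kwargs, gen_kwargs


stt_args = {
    "stt_model_name": "openai/whisper-large-v3",
    "stt_gen_max_new_tokens": 128,
    "stt_gen_num_beams": 4,
}
handler_kwargs, gen_kwargs = split_gen_kwargs(stt_args, "stt_")
print(handler_kwargs)
print(gen_kwargs)
```

Under these assumptions the handler would be configured with `model_name`, while `max_new_tokens` and `num_beams` travel together as one dict to the generation call.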

Passing arguments from a JSON file

When the path to a JSON config file is the only argument, with no other CLI flags, the CLI reads its configuration from that file instead of from the command line. Each key is a flag name without the leading dashes:

```json
{
  "stt": "parakeet-tdt",
  "llm_backend": "responses-api",
  "tts": "qwen3",
  "model_name": "gpt-4o-mini",
  "responses_api_stream": true,
  "enable_live_transcription": true
}
```

```bash
speech-to-speech my_config.json
```

This is useful for reproducible experiment configs and deployment scripts.
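
The "only argument" rule can be sketched as a small dispatch function. The `load_config` helper is hypothetical. Returning `None` when other flags are present is an assumption that stands in for falling back to normal flag parsing, since this page does not say how mixed input is handled.

```python
import json
import tempfile
from pathlib import Path


def load_config(argv: list[str]):
    """Return a dict from a lone .json argument, or None to signal normal CLI parsing."""
    if len(argv) == 1 and argv[0].endswith(".json"):
        return json.loads(Path(argv[0]).read_text(encoding="utf-8"))
    return None


# Demonstration with a temporary config file; argv plays the role of sys.argv[1:].
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_config.json"
    path.write_text(json.dumps({"stt": "parakeet-tdt", "tts": "qwen3"}), encoding="utf-8")
    from_file = load_config([str(path)])
    mixed = load_config([str(path), "--tts", "kokoro"])

print(from_file)
print(mixed)
```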

Default command and its full expansion

Running speech-to-speech with no arguments starts the realtime server with Parakeet TDT, the OpenAI Responses API, and Qwen3-TTS. The short form:

```bash
speech-to-speech
```

is exactly equivalent to:

```bash
speech-to-speech \
    --thresh 0.6 \
    --stt parakeet-tdt \
    --llm_backend responses-api \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden \
    --qwen3_tts_language auto \
    --qwen3_tts_backend ggml \
    --qwen3_tts_non_streaming_mode True \
    --qwen3_tts_mlx_quantization 6bit \
    --model_name gpt-5.4-mini \
    --chat_size 30 \
    --responses_api_stream \
    --enable_live_transcription \
    --mode realtime
```

The default command needs an OpenAI API key: either set OPENAI_API_KEY in your environment or pass the key explicitly with --responses_api_api_key. For non-OpenAI providers, also set --responses_api_base_url.
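
A launcher script might check for the key before starting the pipeline. The sketch below is one way to do that. The `missing_credentials` helper is hypothetical, the `--flag=value` spelling is accepted only as a convenience assumption, and the check is meaningful only when the Responses API backend is in use.

```python
def missing_credentials(argv: list[str], env: dict) -> bool:
    """True when neither --responses_api_api_key nor OPENAI_API_KEY supplies a key."""
    has_flag = any(
        arg == "--responses_api_api_key" or arg.startswith("--responses_api_api_key=")
        for arg in argv
    )
    return not has_flag and not env.get("OPENAI_API_KEY")


# Illustrative inputs; "sk-example" is a placeholder, not a real key.
no_key = missing_credentials([], {})
env_key = missing_credentials([], {"OPENAI_API_KEY": "sk-example"})
flag_key = missing_credentials(["--responses_api_api_key", "sk-example"], {})
print(no_key, env_key, flag_key)
```

In a real script the arguments would be `sys.argv[1:]` and `os.environ`, and a `True` result would be a reason to stop with a clear message before any model is loaded.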
