The speech-to-speech package exposes a single command-line entry point, speech-to-speech, that starts the full VAD → STT → LLM → TTS pipeline. The same pipeline can also be launched as a Python module:
python -m speech_to_speech.s2s_pipeline
All configuration is passed as CLI flags. Running with -h prints the full reference:
speech-to-speech -h

Argument groups

Every flag belongs to one of the argument classes below. Each class is parsed independently and covers one stage or concern of the pipeline:
| Argument class | Reference page | Scope |
|---|---|---|
| ModuleArguments | Module Args | Top-level mode, backend selection, live transcription |
| VADHandlerArguments | VAD Args | Silero VAD v5 sensitivity, timing, and audio enhancement |
| WhisperSTTHandlerArguments | STT Args | Whisper (--stt whisper) |
| FasterWhisperSTTHandlerArguments | STT Args | Faster-Whisper (--stt faster-whisper) |
| ParakeetTDTSTTHandlerArguments | STT Args | Parakeet TDT (--stt parakeet-tdt) |
| ParaformerSTTHandlerArguments | STT Args | Paraformer / FunASR (--stt paraformer) |
| MLXAudioWhisperSTTHandlerArguments | STT Args | MLX Audio Whisper (--stt mlx-audio-whisper) |
| LanguageModelBaseArguments | LLM Args | Shared model name, chat history, system prompt |
| LanguageModelHandlerArguments | LLM Args | Transformers / mlx-lm local backends |
| ResponsesApiLanguageModelHandlerArguments | LLM Args | OpenAI Responses API backend |
| ChatCompletionsLanguageModelHandlerArguments | LLM Args | OpenAI Chat Completions backend |
| Qwen3TTSHandlerArguments | TTS Args | Qwen3-TTS (--tts qwen3) |
| KokoroTTSHandlerArguments | TTS Args | Kokoro-82M (--tts kokoro) |
| PocketTTSHandlerArguments | TTS Args | Pocket TTS (--tts pocket) |
| ChatTTSHandlerArguments | TTS Args | ChatTTS (--tts chatTTS) |
| FacebookMMSTTSHandlerArguments | TTS Args | Facebook MMS (--tts facebookMMS) |
| SocketReceiverArguments | Connection Args | TCP socket receiver |
| SocketSenderArguments | Connection Args | TCP socket sender |
| WebSocketStreamerArguments | Connection Args | WebSocket streamer |

Flag prefix system

Because multiple STT, LLM, and TTS backends can coexist in the argument namespace, each backend’s flags are namespaced by a prefix:
| Backend | CLI prefix | Example |
|---|---|---|
| Whisper (Transformers) | --stt_ | --stt_model_name openai/whisper-large-v3 |
| Faster-Whisper | --faster_whisper_stt_ | --faster_whisper_stt_model_name large-v3 |
| Parakeet TDT | --parakeet_tdt_ | --parakeet_tdt_device mps |
| Paraformer | --paraformer_stt_ | --paraformer_stt_model_name paraformer-zh |
| MLX Audio Whisper | --mlx_audio_whisper_ | --mlx_audio_whisper_model_name mlx-community/whisper-large-v3-turbo |
| Local LLM (transformers/mlx-lm) | --llm_ | --llm_device cuda |
| Shared LLM (all backends) | (no prefix) | --model_name gpt-4o-mini |
| Responses API / Chat Completions | --responses_api_ | --responses_api_base_url http://localhost:8000/v1 |
| Qwen3-TTS | --qwen3_tts_ | --qwen3_tts_speaker Aiden |
| Kokoro TTS | --kokoro_ | --kokoro_voice bm_fable |
| Pocket TTS | --pocket_tts_ | --pocket_tts_voice jean |
| ChatTTS | --chat_tts_ | --chat_tts_device cuda |
| Facebook MMS | --facebook_mms_ | --facebook_mms_device cuda |
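To illustrate why the prefixes matter, here is a minimal sketch of two backends sharing one flag namespace. It uses only the standard library and is not the package's actual parser; the dataclass names, the helper functions, and the default values are hypothetical stand-ins. Only the two flag names come from the table above.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class WhisperArgs:
    # Hypothetical stand-in for the Whisper argument class.
    stt_model_name: str = "openai/whisper-large-v3"


@dataclass
class FasterWhisperArgs:
    # Hypothetical stand-in for the Faster-Whisper argument class.
    faster_whisper_stt_model_name: str = "large-v3"


def build_parser(*arg_classes):
    """Register every dataclass field as a --flag on one shared parser."""
    parser = argparse.ArgumentParser()
    for cls in arg_classes:
        for f in fields(cls):
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


def split_by_class(namespace, *arg_classes):
    """Rebuild one dataclass instance per class from the flat namespace."""
    values = vars(namespace)
    return [
        cls(**{f.name: values[f.name] for f in fields(cls)})
        for cls in arg_classes
    ]


parser = build_parser(WhisperArgs, FasterWhisperArgs)
ns = parser.parse_args([
    "--stt_model_name", "openai/whisper-small",
    "--faster_whisper_stt_model_name", "medium",
])
whisper, faster = split_by_class(ns, WhisperArgs, FasterWhisperArgs)
print(whisper.stt_model_name)
print(faster.faster_whisper_stt_model_name)
```

In this sketch, both classes could not declare a bare model_name field: argparse rejects a second registration of the same option string, which is the collision the prefixes avoid.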

The gen_kwargs pattern

Generation parameters follow the <handler_prefix>_gen_<param> naming convention. At parse time the pipeline strips the handler prefix and collects every gen_-prefixed field into a gen_kwargs dict that is forwarded directly to the underlying model’s generate() call:
# Cap Whisper transcription length
speech-to-speech --stt_gen_max_new_tokens 128

# Sample from the LLM at temperature 0.7
speech-to-speech --llm_gen_temperature 0.7

# Use beam search for Whisper
speech-to-speech --stt_gen_num_beams 4
Only fields that are explicitly declared in the corresponding argument dataclass are accepted as CLI flags. Refer to each backend’s section in STT Args or LLM Args for the full list of supported gen_* parameters.
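The prefix stripping and gen_kwargs collection described above can be sketched roughly as follows. This is an illustration of the naming convention, not the pipeline's actual code: ExampleSTTArgs, split_gen_kwargs, and the default values are assumptions, and only the stt_gen_max_new_tokens and stt_gen_num_beams field names are taken from the flags shown earlier.

```python
from dataclasses import asdict, dataclass


@dataclass
class ExampleSTTArgs:
    # Hypothetical argument dataclass; defaults are illustrative only.
    stt_model_name: str = "openai/whisper-large-v3"
    stt_gen_max_new_tokens: int = 128
    stt_gen_num_beams: int = 1


def split_gen_kwargs(args, prefix):
    """Strip `prefix` from each field and gather gen_* fields into gen_kwargs."""
    handler_kwargs = {}
    gen_kwargs = {}
    for name, value in asdict(args).items():
        if not name.startswith(prefix + "_"):
            # Field without the handler prefix: pass it through unchanged.
            handler_kwargs[name] = value
            continue
        short = name[len(prefix) + 1:]
        if short.startswith("gen_"):
            # e.g. stt_gen_num_beams -> gen_kwargs["num_beams"]
            gen_kwargs[short[len("gen_"):]] = value
        else:
            # e.g. stt_model_name -> handler_kwargs["model_name"]
            handler_kwargs[short] = value
    handler_kwargs["gen_kwargs"] = gen_kwargs
    return handler_kwargs


kwargs = split_gen_kwargs(ExampleSTTArgs(stt_gen_num_beams=4), prefix="stt")
print(kwargs)
```

Under these assumptions, a handler would receive model_name as a regular keyword argument and could forward gen_kwargs to generate() with `**kwargs["gen_kwargs"]`.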

Passing arguments from a JSON file

When a JSON config file is the only argument (no other CLI flags), the CLI reads all configuration from that file instead of from command-line flags. Each key is the flag name without the leading dashes:
{
  "stt": "parakeet-tdt",
  "llm_backend": "responses-api",
  "tts": "qwen3",
  "model_name": "gpt-4o-mini",
  "responses_api_stream": true,
  "enable_live_transcription": true
}
speech-to-speech my_config.json
This is useful for reproducible experiment configs and deployment scripts.
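The "JSON file as the only argument" rule can be sketched as a small dispatch check. This is an assumption about the general shape of such logic rather than the package's implementation; load_config is a hypothetical helper, and the config keys are borrowed from the example above.

```python
import json
import tempfile
from pathlib import Path


def load_config(argv):
    """Return a config dict when argv is a single .json path, else None."""
    if len(argv) == 1 and argv[0].endswith(".json"):
        return json.loads(Path(argv[0]).read_text(encoding="utf-8"))
    return None  # caller falls back to normal flag parsing


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_config.json"
    path.write_text(
        json.dumps({"stt": "parakeet-tdt", "tts": "qwen3"}),
        encoding="utf-8",
    )
    # A lone JSON path is read as the full configuration.
    only_json = load_config([str(path)])
    # Any extra flag means the arguments are parsed as ordinary CLI flags.
    mixed = load_config([str(path), "--mode", "realtime"])

print(only_json)
print(mixed)
```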

Default command and its full expansion

Running speech-to-speech with no arguments starts the realtime server with Parakeet TDT, the OpenAI Responses API, and Qwen3-TTS. The short form:
speech-to-speech
is exactly equivalent to:
speech-to-speech \
    --thresh 0.6 \
    --stt parakeet-tdt \
    --llm_backend responses-api \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden \
    --qwen3_tts_language auto \
    --qwen3_tts_backend ggml \
    --qwen3_tts_non_streaming_mode True \
    --qwen3_tts_mlx_quantization 6bit \
    --model_name gpt-5.4-mini \
    --chat_size 30 \
    --responses_api_stream \
    --enable_live_transcription \
    --mode realtime
The default command requires an OpenAI API key: either set OPENAI_API_KEY in your environment or pass the key explicitly via --responses_api_api_key. For non-OpenAI providers, also set --responses_api_base_url.
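A deployment script might check for the key before launching the default command. The sketch below is one way to do that under stated assumptions: build_default_command is a hypothetical helper, not part of the package, and it only assembles the argument list using the flag names mentioned above.

```python
def build_default_command(env, base_url=None):
    """Return the argv list for the default pipeline, or raise if no key is set."""
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError(
            "Set OPENAI_API_KEY or pass --responses_api_api_key explicitly"
        )
    cmd = ["speech-to-speech"]
    if base_url:
        # Needed only for non-OpenAI providers.
        cmd += ["--responses_api_base_url", base_url]
    return cmd


# Illustrative environment; a real script would pass os.environ instead.
cmd = build_default_command(
    {"OPENAI_API_KEY": "sk-example"},
    base_url="http://localhost:8000/v1",
)
print(cmd)
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the environment is known to be valid.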
