ModuleArguments is the first argument class parsed by the CLI and controls the high-level shape of the pipeline: which mode it runs in, which backend is selected for each stage, and global settings that apply across all handlers. These flags have no prefix — pass them directly, for example --mode local or --tts kokoro.

Fields

device
string
If specified, overrides the device for all handlers in the pipeline (VAD, STT, LLM, TTS). When omitted, each handler uses its own default device (typically cuda). Useful for forcing every handler onto cpu or mps with a single flag.
speech-to-speech --device mps
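The override behaviour described above can be sketched in a few lines. This is an illustration only: `HANDLER_DEFAULT_DEVICES` and `resolve_devices` are hypothetical names, and the per-handler defaults are assumed to be cuda as the description suggests, not read from the project's code.

```python
from typing import Optional

# Hypothetical per-handler defaults; the real handlers define their own.
HANDLER_DEFAULT_DEVICES = {"vad": "cuda", "stt": "cuda", "llm": "cuda", "tts": "cuda"}


def resolve_devices(global_device: Optional[str] = None) -> dict:
    """Return the device each handler would use, given an optional global override."""
    if global_device is None:
        # No --device flag: every handler keeps its own default.
        return dict(HANDLER_DEFAULT_DEVICES)
    # --device given: the same value replaces every handler's device.
    return {name: global_device for name in HANDLER_DEFAULT_DEVICES}


devices = resolve_devices("mps")
```

With `--device mps`, every entry in `devices` is `"mps"`; with no flag, each handler keeps its own default.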
mode
'local' | 'socket' | 'websocket' | 'realtime'
default:"realtime"
Selects the pipeline’s I/O mode:
| Value | Description |
| --- | --- |
| realtime | Exposes an OpenAI Realtime-compatible WebSocket server at /v1/realtime |
| local | Reads from the local microphone and plays audio through the local speaker |
| socket | Streams audio in/out over TCP sockets (see --recv_host / --send_host) |
| websocket | Streams audio in/out over a WebSocket (see --ws_host / --ws_port) |
speech-to-speech --mode local
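As a rough illustration of how unprefixed flags with a fixed set of values behave, the sketch below uses the standard-library `argparse` as a stand-in. It is an assumption that this mirrors the real parser; the project's own argument classes may be wired differently. The choices and defaults are the ones listed on this page.

```python
import argparse

# Stand-in parser: only two of the documented flags are modelled here.
parser = argparse.ArgumentParser(prog="speech-to-speech")
parser.add_argument(
    "--mode",
    choices=["local", "socket", "websocket", "realtime"],
    default="realtime",
)
parser.add_argument(
    "--tts",
    choices=["melo", "chatTTS", "facebookMMS", "pocket", "kokoro", "qwen3"],
    default="qwen3",
)

# Equivalent to running: speech-to-speech --mode local
args = parser.parse_args(["--mode", "local"])
```

Here `args.mode` is `"local"` and `args.tts` falls back to its default. A value outside the listed choices makes `argparse` exit with a usage error.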
local_mac_optimal_settings
boolean
default:"false"
When true, applies an opinionated preset for Apple Silicon: sets --device mps, selects Parakeet TDT for STT, MLX LM for the language model, and Qwen3-TTS for TTS. Flags specified after this one override individual parts of the preset.
speech-to-speech --local_mac_optimal_settings \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
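The "preset first, explicit flags win" behaviour can be pictured as a dictionary merge. This is a minimal sketch under stated assumptions: `MAC_PRESET` and `apply_preset` are hypothetical names, and the mapping of the preset onto the `device`, `stt`, `llm_backend`, and `tts` flags is inferred from the description above.

```python
# Values the preset applies, as described above (key names are assumed).
MAC_PRESET = {
    "device": "mps",
    "stt": "parakeet-tdt",
    "llm_backend": "mlx-lm",
    "tts": "qwen3",
}


def apply_preset(explicit_flags: dict) -> dict:
    """Start from the preset, then let explicitly passed flags win."""
    settings = dict(MAC_PRESET)
    settings.update(explicit_flags)
    return settings


# Equivalent in spirit to: --local_mac_optimal_settings --tts kokoro
settings = apply_preset({"tts": "kokoro"})
```

Only the TTS backend changes in `settings`; the device, STT, and LLM choices still come from the preset.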
stt
'whisper' | 'whisper-mlx' | 'mlx-audio-whisper' | 'faster-whisper' | 'parakeet-tdt' | 'paraformer'
default:"parakeet-tdt"
Selects the Speech-to-Text backend:
| Value | Backend | Best for |
| --- | --- | --- |
| parakeet-tdt | NVIDIA Parakeet TDT 0.6B v3 | Low-latency streaming; default |
| whisper | Transformers Whisper | GPU servers with full HF integration |
| whisper-mlx | Lightning Whisper MLX | Apple Silicon (legacy) |
| mlx-audio-whisper | MLX Audio Whisper | Apple Silicon (fast) |
| faster-whisper | CTranslate2 Faster-Whisper | CPU/GPU with quantized inference |
| paraformer | FunASR Paraformer | Mandarin and multilingual ASR |
speech-to-speech --stt faster-whisper
llm_backend
'transformers' | 'mlx-lm' | 'responses-api' | 'chat-completions'
default:"responses-api"
Selects the language model backend:
| Value | Description |
| --- | --- |
| responses-api | OpenAI-compatible Responses API (OpenAI, HF Inference, vLLM, llama.cpp, …) |
| chat-completions | OpenAI-compatible /v1/chat/completions endpoint |
| transformers | Local inference via Hugging Face Transformers |
| mlx-lm | Local inference on Apple Silicon via MLX |
speech-to-speech --llm_backend mlx-lm --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
tts
'melo' | 'chatTTS' | 'facebookMMS' | 'pocket' | 'kokoro' | 'qwen3'
default:"qwen3"
Selects the Text-to-Speech backend:
| Value | Backend | Notes |
| --- | --- | --- |
| qwen3 | Qwen3-TTS | Default; GGML on Linux/Windows, MLX on Apple Silicon |
| kokoro | Kokoro-82M | Fast, high-quality; optimised for Apple Silicon |
| pocket | Pocket TTS | Streaming TTS with voice cloning from Kyutai Labs |
| chatTTS | ChatTTS | Streaming synthesis |
| facebookMMS | Facebook MMS | Multilingual coverage |
| melo | MeloTTS | Deprecated; available in archive/ |
speech-to-speech --tts kokoro --kokoro_voice bm_fable
log_level
string
default:"info"
Sets the logging verbosity for all pipeline components. Accepts standard Python logging level names.
speech-to-speech --log_level debug
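For readers unfamiliar with Python's level names, the sketch below shows one way a name such as `debug` can be mapped to a numeric level with the standard `logging` module. `parse_log_level` is a hypothetical helper, not a function from this project.

```python
import logging


def parse_log_level(name: str) -> int:
    """Map a level name such as 'debug' to its numeric logging constant."""
    # getLevelName returns an int for known names and a string otherwise.
    level = logging.getLevelName(name.upper())
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {name!r}")
    return level


level = parse_log_level("debug")
logging.basicConfig(level=level)
```

The standard names are `debug`, `info`, `warning`, `error`, and `critical`.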
enable_live_transcription
boolean
default:"true"
When true, streams partial transcription results to connected clients while the user is speaking. Works with the Parakeet TDT backend in realtime mode and surfaces as conversation.item.input_audio_transcription.delta events.
speech-to-speech --enable_live_transcription
live_transcription_update_interval
float
default:"0.5"
How often (in seconds) the live transcription result is updated and emitted. Smaller values give more frequent partial updates at the cost of slightly higher CPU usage.
speech-to-speech --live_transcription_update_interval 0.25
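The effect of the interval can be sketched as a simple time-based throttle. This is a toy model, not the project's implementation: `UpdateThrottle` is a hypothetical class, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class UpdateThrottle:
    """Allow an emit only when at least `interval` seconds have passed since the last one."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self._last_emit = None  # timestamp of the most recent emit

    def should_emit(self, now: float) -> bool:
        if self._last_emit is None or now - self._last_emit >= self.interval:
            self._last_emit = now
            return True
        return False


throttle = UpdateThrottle(interval=0.25)
# Simulated timestamps (seconds) at which new audio arrives.
emitted = [t for t in (0.0, 0.1, 0.25, 0.3, 0.5) if throttle.should_emit(t)]
```

With an interval of 0.25, only the timestamps 0.0, 0.25, and 0.5 produce an update; the ones in between are skipped.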
live_transcription_min_silence_ms
integer
default:"500"
Minimum silence duration in milliseconds before the live transcription considers the speech segment complete. Increase this value if transcription cuts off too early in quiet environments.
speech-to-speech --live_transcription_min_silence_ms 800
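One way to picture the silence threshold is to count trailing non-speech frames. The sketch below assumes fixed-length frames that have already been labelled as speech or silence; `segment_complete` and the 30 ms frame length are illustrative assumptions, not details taken from the project.

```python
def segment_complete(frame_is_speech, frame_ms: int, min_silence_ms: int = 500) -> bool:
    """Return True once the trailing silence reaches min_silence_ms."""
    trailing_silence_ms = 0
    # Walk backwards from the newest frame until speech is found.
    for is_speech in reversed(frame_is_speech):
        if is_speech:
            break
        trailing_silence_ms += frame_ms
    return trailing_silence_ms >= min_silence_ms


# 10 speech frames followed by 20 silent frames of 30 ms each (600 ms of silence).
frames = [True] * 10 + [False] * 20
done_default = segment_complete(frames, frame_ms=30)
done_strict = segment_complete(frames, frame_ms=30, min_silence_ms=800)
```

The same 600 ms pause ends the segment at the default of 500 but not at 800, which is why raising the value tolerates longer pauses.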
num_pipelines
integer
default:"1"
Number of isolated realtime pipeline instances in the pool. Each pipeline has its own VAD, STT, LLM, and TTS handlers plus its own conversation state. The single uvicorn server on --ws_port routes each incoming WebSocket to the next free pipeline. Concurrent connections beyond num_pipelines are rejected. Only valid with --mode realtime.
speech-to-speech --mode realtime --num_pipelines 4
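The routing rule (next free pipeline, reject when full) can be modelled with a small pool. This is a sketch under stated assumptions: `PipelinePool` is a hypothetical class, each slot stands in for a full pipeline, and the real server's bookkeeping may differ.

```python
from typing import Optional


class PipelinePool:
    """Toy model of a fixed-size pool of isolated pipelines."""

    def __init__(self, num_pipelines: int = 1):
        # One slot per pipeline; None means the pipeline is free.
        self._owners = [None] * num_pipelines

    def acquire(self, connection_id: str) -> Optional[int]:
        """Assign the next free pipeline, or return None to signal rejection."""
        for index, owner in enumerate(self._owners):
            if owner is None:
                self._owners[index] = connection_id
                return index
        return None  # pool is full

    def release(self, connection_id: str) -> None:
        """Free the pipeline held by a connection when it disconnects."""
        for index, owner in enumerate(self._owners):
            if owner == connection_id:
                self._owners[index] = None


pool = PipelinePool(num_pipelines=2)
first = pool.acquire("client-a")
second = pool.acquire("client-b")
third = pool.acquire("client-c")  # no free pipeline left
```

Once `client-a` disconnects and its slot is released, a later connection can take that pipeline.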

Common invocation patterns

speech-to-speech --mode realtime --num_pipelines 2
