ModuleArguments is the first argument class parsed by the CLI and controls the high-level shape of the pipeline: which mode it runs in, which backend is selected for each stage, and global settings that apply across all handlers. These flags have no prefix — pass them directly, for example --mode local or --tts kokoro.
Fields
If specified, overrides the device for all handlers in the pipeline (VAD, STT, LLM, TTS). When omitted, each handler uses its own default device (typically cuda). Useful for forcing everything onto cpu or mps with a single flag.
Selects the pipeline’s I/O mode:
| Value | Description |
|---|---|
| realtime | Exposes an OpenAI Realtime-compatible WebSocket server at /v1/realtime |
| local | Reads from the local microphone and plays audio through the local speaker |
| socket | Streams audio in/out over TCP sockets (see --recv_host / --send_host) |
| websocket | Streams audio in/out over a WebSocket (see --ws_host / --ws_port) |
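The device override and mode selection described above can be pictured with a small argparse sketch. This is an illustration rather than the project's actual parser: `build_parser`, `resolve_devices`, and the `HANDLER_DEFAULTS` mapping are hypothetical names, and the per-handler defaults are assumed to be `cuda` only because the text calls that typical.

```python
import argparse

# Mode names come from the table above; everything else is illustrative.
MODES = ("realtime", "local", "socket", "websocket")

# Hypothetical per-handler defaults, used when --device is omitted.
HANDLER_DEFAULTS = {"vad": "cuda", "stt": "cuda", "llm": "cuda", "tts": "cuda"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for two of the unprefixed flags."""
    parser = argparse.ArgumentParser(description="Sketch of the module-level flags")
    # The default is left unset in this sketch.
    parser.add_argument("--mode", choices=MODES, default=None)
    parser.add_argument("--device", default=None)
    return parser


def resolve_devices(device):
    """Return the device each handler would use.

    A global --device wins for every handler; otherwise each handler
    keeps its own default.
    """
    if device is None:
        return dict(HANDLER_DEFAULTS)
    return {name: device for name in HANDLER_DEFAULTS}


args = build_parser().parse_args(["--mode", "local", "--device", "cpu"])
print(args.mode, resolve_devices(args.device))
```

Passing an unknown mode such as `--mode pipe` would make argparse exit with a usage error, which is one simple way to keep the accepted values in sync with the table.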
When true, applies an opinionated preset for Apple Silicon: sets --device mps, selects Parakeet TDT for STT, MLX LM for the language model, and Qwen3-TTS for TTS. Flags specified after this one override individual parts of the preset.
stt
'whisper' | 'whisper-mlx' | 'mlx-audio-whisper' | 'faster-whisper' | 'parakeet-tdt' | 'paraformer'
default:"parakeet-tdt"
Selects the Speech-to-Text backend:
| Value | Backend | Best for |
|---|---|---|
| parakeet-tdt | NVIDIA Parakeet TDT 0.6B v3 | Low-latency streaming; default |
| whisper | Transformers Whisper | GPU servers with full HF integration |
| whisper-mlx | Lightning Whisper MLX | Apple Silicon (legacy) |
| mlx-audio-whisper | MLX Audio Whisper | Apple Silicon (fast) |
| faster-whisper | CTranslate2 Faster-Whisper | CPU/GPU with quantized inference |
| paraformer | FunASR Paraformer | Mandarin and multilingual ASR |
Selects the language model backend:
| Value | Description |
|---|---|
| responses-api | OpenAI-compatible Responses API (OpenAI, HF Inference, vLLM, llama.cpp, …) |
| chat-completions | OpenAI-compatible /v1/chat/completions endpoint |
| transformers | Local inference via Hugging Face Transformers |
| mlx-lm | Local inference on Apple Silicon via MLX |
Selects the Text-to-Speech backend:
| Value | Backend | Notes |
|---|---|---|
| qwen3 | Qwen3-TTS | Default; GGML on Linux/Windows, MLX on Apple Silicon |
| kokoro | Kokoro-82M | Fast, high-quality; optimised for Apple Silicon |
| pocket | Pocket TTS | Streaming TTS with voice cloning from Kyutai Labs |
| chatTTS | ChatTTS | Streaming synthesis |
| facebookMMS | Facebook MMS | Multilingual coverage |
| melo | MeloTTS | Deprecated; available in archive/ |
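With all three backend selectors described, the Apple Silicon preset mentioned earlier can be sketched as a dictionary of defaults that explicit flags override. The preset values below follow the description of the preset and the backend tables; the dictionary keys, `MAC_PRESET`, and `apply_preset` are hypothetical names. The sketch also treats any explicitly passed flag as an override, whereas the text says that flags specified after the preset flag override it.

```python
# Values taken from the preset description: mps device, Parakeet TDT,
# MLX LM, Qwen3-TTS. The key names are assumptions for illustration.
MAC_PRESET = {
    "device": "mps",
    "stt": "parakeet-tdt",
    "llm": "mlx-lm",
    "tts": "qwen3",
}


def apply_preset(explicit: dict, use_preset: bool) -> dict:
    """Return effective settings: preset values first, explicit flags on top."""
    settings = dict(MAC_PRESET) if use_preset else {}
    # Only flags the user actually passed (non-None) replace preset values.
    settings.update({key: value for key, value in explicit.items() if value is not None})
    return settings


# Preset enabled, but the user also asked for the Kokoro TTS backend.
effective = apply_preset({"tts": "kokoro", "stt": None}, use_preset=True)
print(effective)
```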
Sets the logging verbosity for all pipeline components. Accepts standard Python logging level names.
When true, streams partial transcription results to connected clients while the user is speaking. Works with the Parakeet TDT backend in realtime mode and surfaces as conversation.item.input_audio_transcription.delta events.
How often (in seconds) the live transcription result is updated and emitted. Smaller values give more frequent partial updates at the cost of slightly higher CPU usage.
Minimum silence duration in milliseconds before the live transcription considers the speech segment complete. Increase this value if transcription cuts off too early in quiet environments.
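On the client side, the partial results described above might be consumed roughly as follows. This is a minimal sketch: the event type string comes from the text, while the `item_id` and `delta` payload fields are assumed to follow the OpenAI Realtime event shape, and each delta is assumed to carry only newly recognised text. `collect_partials` is a hypothetical helper.

```python
from collections import defaultdict

DELTA_EVENT = "conversation.item.input_audio_transcription.delta"


def collect_partials(events):
    """Concatenate transcription deltas per conversation item."""
    transcripts = defaultdict(str)
    for event in events:
        if event.get("type") != DELTA_EVENT:
            continue  # ignore unrelated server events
        # "item_id" and "delta" are assumed payload fields (see lead-in).
        transcripts[event.get("item_id", "")] += event.get("delta", "")
    return dict(transcripts)


# Hand-written sample events, not captured from a real server.
sample = [
    {"type": DELTA_EVENT, "item_id": "item_1", "delta": "hello"},
    {"type": "response.created"},
    {"type": DELTA_EVENT, "item_id": "item_1", "delta": " world"},
]
print(collect_partials(sample))
```

If the server instead resends the full partial transcript on every update, the `+=` would need to become a plain assignment.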
Number of isolated realtime pipeline instances in the pool. Each pipeline has its own VAD, STT, LLM, and TTS handlers plus its own conversation state. The single uvicorn server on --ws_port routes each incoming WebSocket to the next free pipeline. Connections beyond num_pipelines are rejected. Only valid with --mode realtime.
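The routing behaviour described above (hand each connection the next free pipeline, reject once all are busy) can be modelled with a small pool. This is a sketch of the idea only, not the project's implementation: `PipelinePool`, `acquire`, and `release` are hypothetical names, and pipelines are represented by plain integer indices.

```python
class PipelinePool:
    """Toy model of a fixed-size pool of isolated pipelines."""

    def __init__(self, num_pipelines: int):
        self._free = list(range(num_pipelines))  # indices of idle pipelines
        self._busy = {}  # connection id -> pipeline index

    def acquire(self, connection_id):
        """Assign the next free pipeline, or return None if the pool is full."""
        if not self._free:
            return None  # the caller would reject this connection
        index = self._free.pop(0)
        self._busy[connection_id] = index
        return index

    def release(self, connection_id):
        """Return a connection's pipeline to the pool when it disconnects."""
        index = self._busy.pop(connection_id, None)
        if index is not None:
            self._free.append(index)


pool = PipelinePool(num_pipelines=2)
first = pool.acquire("client-a")
second = pool.acquire("client-b")
third = pool.acquire("client-c")  # pool exhausted, so this is rejected
print(first, second, third)
```

Because every pipeline carries its own handlers and conversation state, raising num_pipelines presumably increases memory use roughly in proportion, so the value is worth sizing to the expected number of concurrent clients.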