The TTS stage converts text tokens from the LLM into audio streamed back to the client. Five backends are available, selected via --tts <value> on ModuleArguments. Each backend uses its own flag prefix so all argument classes can coexist in the same namespace.
# Qwen3-TTS (default)
speech-to-speech --tts qwen3 --qwen3_tts_speaker Aiden
# Kokoro
speech-to-speech --tts kokoro --kokoro_voice bm_fable
# Pocket TTS
speech-to-speech --tts pocket --pocket_tts_voice jean
# ChatTTS
speech-to-speech --tts chatTTS --chat_tts_device cuda
# Facebook MMS
speech-to-speech --tts facebookMMS --tts_language en
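The prefix scheme described above can be illustrated with a small sketch. It assumes plain dataclasses and the standard-library argparse; the classes `Qwen3TTSArgs` and `KokoroArgs` and the helper `build_parser` are hypothetical stand-ins rather than the project's own argument classes, and the default values are placeholders. Only the field names are taken from flags documented on this page.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Qwen3TTSArgs:  # hypothetical stand-in, not the project's class
    qwen3_tts_speaker: Optional[str] = None
    qwen3_tts_language: str = "auto"  # placeholder default


@dataclass
class KokoroArgs:  # hypothetical stand-in, not the project's class
    kokoro_voice: Optional[str] = None
    kokoro_speed: float = 1.0  # placeholder default


def build_parser(*arg_classes) -> argparse.ArgumentParser:
    """Register every dataclass field as a --flag on one shared parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--tts", default="qwen3")
    for cls in arg_classes:
        for f in fields(cls):
            # Fields that default to None are treated as strings in this sketch.
            arg_type = str if f.default is None else type(f.default)
            # Prefixed names are unique, so one flat namespace holds them all.
            parser.add_argument(f"--{f.name}", type=arg_type, default=f.default)
    return parser


args = build_parser(Qwen3TTSArgs, KokoroArgs).parse_args(
    ["--tts", "kokoro", "--kokoro_voice", "bm_fable", "--kokoro_speed", "1.15"]
)
print(args.tts, args.kokoro_voice, args.kokoro_speed)
```

If two backends both declared an unprefixed field such as `voice`, argparse would reject the second registration as a conflicting option, which is the collision the prefixes avoid.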
Qwen3-TTS
Prefix: --qwen3_tts_
Backend value: --tts qwen3 (default)
Qwen3-TTS is the default TTS backend. On non-macOS platforms it uses the faster-qwen3-tts GGML backend by default. On Apple Silicon it automatically selects mlx-audio with a 6-bit quantized MLX variant.
qwen3_tts_model_name
string
default:"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
Hugging Face Hub ID or local path for the Qwen3-TTS model. On Apple Silicon, Qwen/* model IDs are automatically mapped to the corresponding mlx-community/* model (defaulting to the 6-bit MLX variant).
speech-to-speech --tts qwen3 \
--qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
qwen3_tts_device
Preferred device. Options: cuda, cpu, mps, auto. On Apple Silicon the mlx-audio backend is selected automatically regardless of this flag.
speech-to-speech --tts qwen3 --qwen3_tts_device cuda
qwen3_tts_dtype
Data type for inference. Options: auto, float16, bfloat16, float32.
speech-to-speech --tts qwen3 --qwen3_tts_dtype float16
qwen3_tts_attn_implementation
Attention implementation. Options: eager, flash_attention_2, sdpa. Use eager on Jetson devices.
speech-to-speech --tts qwen3 --qwen3_tts_attn_implementation sdpa
qwen3_tts_backend
'ggml' | 'torch'
default:"ggml"
Selects the faster-qwen3-tts backend on non-macOS platforms. ggml uses the GGML path from qwentts-cpp-python; torch uses the CUDA-graphs implementation. On Apple Silicon this flag is ignored and mlx-audio is used.
speech-to-speech --tts qwen3 --qwen3_tts_backend torch
qwen3_tts_speaker
Speaker name for CustomVoice model checkpoints; ignored by other variants. If not provided, the first supported speaker is used.
speech-to-speech --tts qwen3 --qwen3_tts_speaker Aiden
qwen3_tts_language
Target synthesis language. auto lets the model determine the language from the text content.
speech-to-speech --tts qwen3 --qwen3_tts_language en
qwen3_tts_non_streaming_mode
When true, pre-fills the full target text before decoding on faster-qwen3-tts. Currently ignored on Apple Silicon because mlx-audio does not expose this option yet.
speech-to-speech --tts qwen3 --qwen3_tts_non_streaming_mode True
qwen3_tts_mlx_quantization
MLX quantization level on Apple Silicon. Options: bf16, 4bit, 6bit, 8bit. Only used when mlx-audio is selected automatically on macOS.
speech-to-speech --tts qwen3 --qwen3_tts_mlx_quantization 4bit
qwen3_tts_ref_audio
Path to a reference audio file for voice cloning. Leave unset when using a CustomVoice model.
speech-to-speech --tts qwen3 --qwen3_tts_ref_audio /path/to/ref.wav
qwen3_tts_ref_text
Transcription of the reference audio file used for voice cloning. Required when --qwen3_tts_ref_audio is set.
speech-to-speech --tts qwen3 \
--qwen3_tts_ref_audio ref.wav \
--qwen3_tts_ref_text "Hello, this is my reference voice."
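The dependency between the two voice-cloning flags can be expressed as a small pre-flight check. This is a minimal sketch under the rule stated above (a transcription is required whenever reference audio is given); the function `check_voice_clone_args` is hypothetical, and whether the real handler validates this at startup or fails later is not stated on this page.

```python
from typing import Optional


def check_voice_clone_args(ref_audio: Optional[str], ref_text: Optional[str]) -> None:
    """Raise early if reference audio is given without its transcription."""
    if ref_audio and not ref_text:
        raise ValueError(
            "--qwen3_tts_ref_text is required when --qwen3_tts_ref_audio is set"
        )


# Passes: both values supplied, or neither (CustomVoice usage).
check_voice_clone_args("ref.wav", "Hello, this is my reference voice.")
check_voice_clone_args(None, None)
```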
qwen3_tts_instruct
Instruction text for VoiceDesign model variants. Required when using a VoiceDesign checkpoint.
speech-to-speech --tts qwen3 \
--qwen3_tts_instruct "Speak in a calm, professional tone."
qwen3_tts_xvec_only
Use x-vector-only voice cloning mode. Recommended for cleaner utterance starts and for language-switching scenarios.
speech-to-speech --tts qwen3 --qwen3_tts_xvec_only
qwen3_tts_parity_mode
Disable the CUDA-graph streaming path and fall back to parity mode, which improves stability on hardware where CUDA graphs cause issues.
speech-to-speech --tts qwen3 --qwen3_tts_parity_mode
qwen3_tts_streaming_chunk_size
Codec steps per streaming chunk. When unset, the handler uses a backend-specific default: 8 on faster-qwen3-tts and 4 on mlx-audio. Smaller values reduce first-audio latency; larger values reduce overhead.
speech-to-speech --tts qwen3 --qwen3_tts_streaming_chunk_size 4
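To get a feel for the trade-off, the sketch below resolves the default chunk size per backend and converts it to the amount of audio each chunk carries. It assumes a codec rate of about 12 steps per second, the approximate figure this page gives for codec tokens per second of audio; the names `resolve_chunk_size` and `audio_seconds_per_chunk` are hypothetical helpers, not part of the handler.

```python
from typing import Optional

# Backend defaults as quoted in the parameter description above.
DEFAULT_CHUNK_SIZE = {"faster-qwen3-tts": 8, "mlx-audio": 4}

# Assumption: roughly 12 codec steps per second of audio (approximate).
CODEC_STEPS_PER_SECOND = 12.0


def resolve_chunk_size(backend: str, chunk_size: Optional[int] = None) -> int:
    """Use the explicit value if given, else the backend-specific default."""
    return chunk_size if chunk_size is not None else DEFAULT_CHUNK_SIZE[backend]


def audio_seconds_per_chunk(chunk_size: int) -> float:
    """Approximate duration of audio contained in one streaming chunk."""
    return chunk_size / CODEC_STEPS_PER_SECOND


for backend in DEFAULT_CHUNK_SIZE:
    size = resolve_chunk_size(backend)
    print(backend, size, round(audio_seconds_per_chunk(size), 2))
```

Under that assumption, a chunk of 8 steps holds about two thirds of a second of audio and a chunk of 4 about one third. This is audio duration per chunk, not measured wall-clock latency, which also depends on how fast the backend decodes.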
qwen3_tts_max_new_tokens
Upper limit on codec tokens generated per utterance. The handler estimates a per-utterance budget from the text and clamps it to this ceiling. Audio uses approximately 12 tokens per second, so 1536 tokens cover roughly 128 seconds; raise the value above 1536 for longer utterances.
speech-to-speech --tts qwen3 --qwen3_tts_max_new_tokens 2048
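The estimate-then-clamp behaviour can be sketched as follows. Only the clamp to the ceiling and the figure of about 12 tokens per second come from this page; the speaking rate, the headroom factor, and the function `estimate_token_budget` are illustrative assumptions, and the handler's actual heuristic may differ.

```python
import math


def estimate_token_budget(
    text: str,
    max_new_tokens: int = 1536,
    tokens_per_second: float = 12.0,  # approximate figure quoted above
    words_per_second: float = 2.5,    # assumed speaking rate (illustrative)
    headroom: float = 1.5,            # assumed safety margin (illustrative)
) -> int:
    """Estimate a codec-token budget from text length, clamped to the ceiling."""
    words = max(1, len(text.split()))
    est_seconds = words / words_per_second
    budget = math.ceil(est_seconds * tokens_per_second * headroom)
    return min(budget, max_new_tokens)


print(estimate_token_budget("hello world"))
print(estimate_token_budget(" ".join(["word"] * 1000)))
print(estimate_token_budget(" ".join(["word"] * 1000), max_new_tokens=2048))
```

Short utterances get a small budget, while long inputs hit the ceiling, which is when raising --qwen3_tts_max_new_tokens matters.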
qwen3_tts_blocksize
Audio chunk size in samples for streaming output. Must match the LocalAudioStreamer blocksize.
speech-to-speech --tts qwen3 --qwen3_tts_blocksize 512
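What a fixed blocksize means for the output stream can be sketched in a few lines. The generator `iter_blocks` below is hypothetical, and zero-padding the final block is an assumption made so that every block has the same length; the handler may treat the tail differently.

```python
from typing import Iterator, List


def iter_blocks(samples: List[int], blocksize: int = 512) -> Iterator[List[int]]:
    """Yield fixed-size blocks of samples, zero-padding the last one."""
    for start in range(0, len(samples), blocksize):
        block = samples[start:start + blocksize]
        if len(block) < blocksize:
            # Assumption: pad the tail with silence to keep block sizes uniform.
            block = block + [0] * (blocksize - len(block))
        yield block


audio = list(range(1, 1201))  # 1200 dummy samples
blocks = list(iter_blocks(audio, blocksize=512))
print(len(blocks), [len(b) for b in blocks])
```

For scale, a 512-sample block lasts 32 ms at a 16 kHz output rate. A consumer that expects a different block length would receive misaligned buffers, which is why the two settings need to agree.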
Kokoro
Prefix: --kokoro_
Backend value: --tts kokoro
Kokoro-82M is a fast, high-quality TTS model optimized for Apple Silicon. Install with pip install "speech-to-speech[kokoro]". On MPS it auto-selects mlx-community/Kokoro-82M-bf16; on CUDA/CPU it uses hexgrad/Kokoro-82M.
kokoro_model_name
Kokoro model to load. When unset, the model is auto-selected based on device: mlx-community/Kokoro-82M-bf16 for MPS, hexgrad/Kokoro-82M for CUDA/CPU.
speech-to-speech --tts kokoro --kokoro_model_name hexgrad/Kokoro-82M
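The selection rule above amounts to a short fallback. A minimal sketch, assuming the device string has already been resolved to one of mps, cuda, or cpu; `select_kokoro_model` is a hypothetical helper, and only the two model IDs are taken from this page.

```python
from typing import Optional


def select_kokoro_model(device: str, model_name: Optional[str] = None) -> str:
    """Return the explicit model if set, else pick one based on the device."""
    if model_name:
        return model_name
    if device == "mps":
        return "mlx-community/Kokoro-82M-bf16"
    return "hexgrad/Kokoro-82M"  # CUDA and CPU share the same checkpoint


print(select_kokoro_model("mps"))
print(select_kokoro_model("cuda"))
print(select_kokoro_model("mps", "hexgrad/Kokoro-82M"))
```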
kokoro_device
Device to run Kokoro on. Options: auto, cuda, cpu, mps. auto selects MPS on Mac and CUDA on GPU systems.
speech-to-speech --tts kokoro --kokoro_device mps
kokoro_voice
Voice preset to use for synthesis. See the VOICES.md file in the Kokoro repository for a full list of available voices.
speech-to-speech --tts kokoro --kokoro_voice af_sky
kokoro_lang_code
Language code prefix: a for American English, b for British English, j for Japanese, and other codes as documented in the Kokoro model card.
speech-to-speech --tts kokoro --kokoro_lang_code a --kokoro_voice af_sky
kokoro_speed
Speech speed multiplier. Values above 1.0 speed up delivery; values below 1.0 slow it down.
speech-to-speech --tts kokoro --kokoro_speed 1.15
kokoro_blocksize
Audio chunk size in samples for streaming output.
speech-to-speech --tts kokoro --kokoro_blocksize 512
Pocket TTS
Prefix: --pocket_tts_
Backend value: --tts pocket
Pocket TTS from Kyutai Labs provides streaming TTS with voice cloning. Requires numpy>=2 and is incompatible with DeepFilterNet audio enhancement. Install with pip install "speech-to-speech[pocket]".
Available built-in voice presets: alba, marius, javert, jean, fantine, cosette, eponine, azelma.
pocket_tts_device
Device to run the Pocket TTS model on. Options: cpu, cuda, mps.
speech-to-speech --tts pocket --pocket_tts_device cuda
pocket_tts_voice
Voice to use for synthesis. Can be:
- A built-in preset name (alba, marius, javert, jean, fantine, cosette, eponine, azelma)
- A local audio file path for voice cloning
- A Hugging Face path such as hf://kyutai/tts-voices/...
speech-to-speech --tts pocket --pocket_tts_voice fantine
pocket_tts_sample_rate
Output sample rate in Hz. Pocket TTS synthesizes at 24 kHz internally and resamples to this rate to match the pipeline’s audio streamer.
speech-to-speech --tts pocket --pocket_tts_sample_rate 16000
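The resampling step can be pictured with a deliberately simple example. The sketch below uses plain linear interpolation in pure Python to convert 24 kHz samples to 16 kHz. This is a swapped-in illustrative technique, not necessarily what the handler does; a real pipeline would normally use a filtered (polyphase or windowed-sinc) resampler to avoid aliasing. The function `resample_linear` is hypothetical.

```python
from typing import List


def resample_linear(
    samples: List[float], src_rate: int = 24000, dst_rate: int = 16000
) -> List[float]:
    """Resample by linear interpolation (illustration only, no anti-alias filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


one_second = [0.0] * 24000
print(len(resample_linear(one_second)))  # one second of audio at the new rate
```

Going from 24 kHz to 16 kHz keeps two output samples for every three input samples, so one second of synthesized audio becomes 16000 samples.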
pocket_tts_blocksize
Size of the audio blocks, in samples, yielded per streaming iteration.
speech-to-speech --tts pocket --pocket_tts_blocksize 512
pocket_tts_max_tokens
Maximum number of tokens to generate per sentence in streaming mode.
speech-to-speech --tts pocket --pocket_tts_max_tokens 80
Pocket TTS requires numpy>=2 and conflicts with DeepFilterNet audio enhancement (--audio_enhancement). Do not combine both in the same environment.
ChatTTS
Prefix: --chat_tts_
Backend value: --tts chatTTS
ChatTTS provides streaming Chinese and English synthesis.
chat_tts_stream
Whether to use ChatTTS in streaming mode. Keep true for low-latency voice pipelines.
speech-to-speech --tts chatTTS --chat_tts_stream
chat_tts_device
Device to run ChatTTS on.
speech-to-speech --tts chatTTS --chat_tts_device cpu
chat_tts_chunk_size
Size, in samples, of the audio chunk processed per cycle. Smaller values reduce playback latency; larger values reduce CPU overhead.
speech-to-speech --tts chatTTS --chat_tts_chunk_size 256
Facebook MMS
Prefix: --facebook_mms_ (model/device/dtype) and --tts_ (language)
Backend value: --tts facebookMMS
Facebook MMS (Massively Multilingual Speech) provides broad language coverage across hundreds of languages.
facebook_mms_model_name
string
default:"facebook/mms-tts-eng"
Hugging Face Hub model ID for the MMS TTS checkpoint. Change the language suffix to select a different language (e.g. facebook/mms-tts-fra for French).
speech-to-speech --tts facebookMMS \
--facebook_mms_model_name facebook/mms-tts-fra
tts_language
ISO 639-3 language code that is forwarded to the model. Ensure this matches the language of the loaded checkpoint.
speech-to-speech --tts facebookMMS --tts_language fra \
--facebook_mms_model_name facebook/mms-tts-fra
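Because the checkpoint name and the language code have to agree, it can help to derive one from the other. A minimal sketch, assuming checkpoints follow the facebook/mms-tts-<code> pattern shown in the examples above; `mms_model_id` is a hypothetical helper, and it only checks the shape of the code, not whether a checkpoint for that language actually exists on the Hub.

```python
def mms_model_id(lang_iso639_3: str) -> str:
    """Build an MMS TTS model ID from a three-letter ISO 639-3 code."""
    code = lang_iso639_3.strip().lower()
    if len(code) != 3 or not code.isalpha():
        raise ValueError(f"expected a three-letter ISO 639-3 code, got {lang_iso639_3!r}")
    return f"facebook/mms-tts-{code}"


print(mms_model_id("fra"))
print(mms_model_id("eng"))
```

Passing the same code to --tts_language and to this helper keeps the two flags consistent.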
facebook_mms_device
Device to run the MMS model on.
speech-to-speech --tts facebookMMS --facebook_mms_device cpu
facebook_mms_torch_dtype
PyTorch data type for the MMS model. MMS is a smaller model and runs well in float32.
speech-to-speech --tts facebookMMS --facebook_mms_torch_dtype float16
Comparing TTS backends on Apple Silicon
python scripts/benchmark_tts.py \
--handlers qwen3 \
--iterations 3 \
--qwen3_mlx_quantizations bf16 4bit 6bit 8bit
On Apple Silicon, Qwen3-TTS with --qwen3_tts_mlx_quantization 6bit typically delivers the best balance of quality and latency. Kokoro and Pocket TTS are also solid alternatives for different voice styles.