The TTS stage converts text tokens from the LLM into audio streamed back to the client. Five backends are available, selected via --tts <value> on ModuleArguments. Each backend uses its own flag prefix so all argument classes can coexist in the same namespace.
# Qwen3-TTS (default)
speech-to-speech --tts qwen3 --qwen3_tts_speaker Aiden

# Kokoro
speech-to-speech --tts kokoro --kokoro_voice bm_fable

# Pocket TTS
speech-to-speech --tts pocket --pocket_tts_voice jean

# ChatTTS
speech-to-speech --tts chatTTS --chat_tts_device cuda

# Facebook MMS
speech-to-speech --tts facebookMMS --tts_language en
Prefix: --qwen3_tts_
Backend value: --tts qwen3 (default)
Qwen3-TTS is the default TTS backend. On non-macOS platforms it uses the faster-qwen3-tts GGML backend by default. On Apple Silicon it automatically selects mlx-audio with a 6-bit quantized MLX variant.
qwen3_tts_model_name
string
default:"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
Hugging Face Hub ID or local path for the Qwen3-TTS model. On Apple Silicon, Qwen/* model IDs are automatically mapped to the corresponding mlx-community/* model (defaulting to the 6-bit MLX variant).
speech-to-speech --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
qwen3_tts_device
string
default:"cuda"
Preferred device. Options: cuda, cpu, mps, auto. On Apple Silicon the mlx-audio backend is selected automatically regardless of this flag.
speech-to-speech --tts qwen3 --qwen3_tts_device cuda
qwen3_tts_dtype
string
default:"auto"
Data type for inference. Options: auto, float16, bfloat16, float32.
speech-to-speech --tts qwen3 --qwen3_tts_dtype float16
qwen3_tts_attn_implementation
string
default:"eager"
Attention implementation. Options: eager, flash_attention_2, sdpa. Use eager on Jetson devices.
speech-to-speech --tts qwen3 --qwen3_tts_attn_implementation sdpa
qwen3_tts_backend
'ggml' | 'torch'
default:"ggml"
Selects the faster-qwen3-tts backend on non-macOS platforms. ggml uses the GGML path from qwentts-cpp-python; torch uses the CUDA-graphs implementation. On Apple Silicon this flag is ignored and mlx-audio is used.
speech-to-speech --tts qwen3 --qwen3_tts_backend torch
qwen3_tts_speaker
string
default:"Aiden"
Speaker name for CustomVoice model checkpoints; it has no effect on other model variants. If no speaker is provided, the first supported speaker is used.
speech-to-speech --tts qwen3 --qwen3_tts_speaker Aiden
qwen3_tts_language
string
default:"auto"
Target synthesis language. auto lets the model determine the language from text content.
speech-to-speech --tts qwen3 --qwen3_tts_language en
qwen3_tts_non_streaming_mode
boolean
default:"true"
When true, pre-fills the full target text before decoding on faster-qwen3-tts. Currently ignored on Apple Silicon because mlx-audio does not yet expose this option.
speech-to-speech --tts qwen3 --qwen3_tts_non_streaming_mode True
qwen3_tts_mlx_quantization
string
default:"6bit"
MLX quantization level on Apple Silicon. Options: bf16, 4bit, 6bit, 8bit. Only used when mlx-audio is selected automatically on macOS.
speech-to-speech --tts qwen3 --qwen3_tts_mlx_quantization 4bit
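Because the backend flags that take effect depend on the platform, a launcher script may want to pick them up front. The following is a minimal sketch, not part of speech-to-speech: is_apple_silicon and qwen3_platform_flags are hypothetical helpers, and the check assumes that Apple Silicon reports Darwin and arm64 from uname (a shell running under Rosetta reports x86_64 instead).

```shell
#!/usr/bin/env bash
# Hypothetical helper: true when the OS/arch pair looks like Apple Silicon.
# Arguments are optional; without them the values come from uname.
is_apple_silicon() {
    local os="${1:-$(uname -s)}"
    local arch="${2:-$(uname -m)}"
    [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]
}

# Hypothetical helper: print the Qwen3-TTS flag that is honored on this platform.
# The values shown are the documented defaults, so this only makes them explicit.
qwen3_platform_flags() {
    if is_apple_silicon "$@"; then
        # mlx-audio is selected automatically; only the quantization flag applies.
        echo "--qwen3_tts_mlx_quantization 6bit"
    else
        # faster-qwen3-tts is used; the backend flag applies.
        echo "--qwen3_tts_backend ggml"
    fi
}

# Print the command that would be launched on the current machine.
echo "speech-to-speech --tts qwen3 $(qwen3_platform_flags)"
```

Passing an explicit OS and architecture, for example qwen3_platform_flags Linux x86_64, makes it possible to check the selection logic without running on that platform.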
qwen3_tts_ref_audio
string
Path to a reference audio file for voice cloning. Leave unset when using a CustomVoice model.
speech-to-speech --tts qwen3 --qwen3_tts_ref_audio /path/to/ref.wav
qwen3_tts_ref_text
string
Transcription of the reference audio file used for voice cloning. Required when --qwen3_tts_ref_audio is set.
speech-to-speech --tts qwen3 \
    --qwen3_tts_ref_audio ref.wav \
    --qwen3_tts_ref_text "Hello, this is my reference voice."
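Since --qwen3_tts_ref_text is required whenever --qwen3_tts_ref_audio is set, a wrapper script can catch a missing transcript before the pipeline starts loading models. This is a sketch under that single assumption; check_clone_args is a hypothetical helper and does not reflect how the handler itself validates its arguments.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: fail when reference audio is given without its transcript.
check_clone_args() {
    local ref_audio="$1"
    local ref_text="$2"
    if [ -n "$ref_audio" ] && [ -z "$ref_text" ]; then
        echo "error: --qwen3_tts_ref_text is required when --qwen3_tts_ref_audio is set" >&2
        return 1
    fi
    return 0
}

REF_AUDIO="ref.wav"
REF_TEXT="Hello, this is my reference voice."

if check_clone_args "$REF_AUDIO" "$REF_TEXT"; then
    echo "voice cloning arguments look consistent"
    # The real launch would follow here, for example:
    # speech-to-speech --tts qwen3 \
    #     --qwen3_tts_ref_audio "$REF_AUDIO" \
    #     --qwen3_tts_ref_text "$REF_TEXT"
fi
```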
qwen3_tts_instruct
string
Instruction text for VoiceDesign model variants. Required when using a VoiceDesign checkpoint.
speech-to-speech --tts qwen3 \
    --qwen3_tts_instruct "Speak in a calm, professional tone."
qwen3_tts_xvec_only
boolean
default:"false"
Use x-vector-only voice cloning mode. Recommended for cleaner utterance starts and for language-switching scenarios.
speech-to-speech --tts qwen3 --qwen3_tts_xvec_only
qwen3_tts_parity_mode
boolean
default:"false"
Disable the CUDA-graph streaming path and fall back to parity mode for improved stability on hardware where CUDA graphs cause issues.
speech-to-speech --tts qwen3 --qwen3_tts_parity_mode
qwen3_tts_streaming_chunk_size
integer
Codec steps per streaming chunk. When unset the handler uses a backend-specific default: 8 on faster-qwen3-tts and 4 on mlx-audio. Smaller values reduce first-audio latency; larger values reduce overhead.
speech-to-speech --tts qwen3 --qwen3_tts_streaming_chunk_size 4
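To get a feel for the latency trade-off, the amount of audio carried by one chunk can be estimated from the chunk size. The sketch below assumes roughly 12 codec steps per second of audio, the approximate rate this page quotes for the token budget, and treats one codec step as one token; chunk_audio_ms is a hypothetical helper, and the figures are rough estimates rather than measurements.

```shell
#!/usr/bin/env bash
# Assumption: about 12 codec steps correspond to one second of audio.
CODEC_STEPS_PER_SECOND=12

# Hypothetical helper: approximate milliseconds of audio per streaming chunk
# (integer division, so the result is rounded down).
chunk_audio_ms() {
    local steps="$1"
    echo $(( steps * 1000 / CODEC_STEPS_PER_SECOND ))
}

for steps in 4 8 16; do
    echo "chunk_size=${steps} -> about $(chunk_audio_ms "$steps") ms of audio per chunk"
done
```

Under this assumption the mlx-audio default of 4 steps corresponds to about a third of a second of audio per chunk, and the faster-qwen3-tts default of 8 steps to about two thirds of a second.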
qwen3_tts_max_new_tokens
integer
default:"1536"
Upper limit on codec tokens generated per utterance. The handler estimates a per-utterance token budget from the text (approximately 12 tokens per second of audio) and clamps it to this ceiling. Raise the value above 1536 for longer utterances.
speech-to-speech --tts qwen3 --qwen3_tts_max_new_tokens 2048
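The ceiling can be translated into an approximate audio duration using the quoted rate. The following sketch assumes exactly 12 tokens per second, which is only an approximation, and uses two hypothetical helpers (seconds_for_tokens and tokens_for_seconds) that are not part of the CLI.

```shell
#!/usr/bin/env bash
# Assumption: about 12 codec tokens are generated per second of audio.
TOKENS_PER_SECOND=12

# Hypothetical helper: approximate seconds of audio a token ceiling allows (rounded down).
seconds_for_tokens() {
    echo $(( $1 / TOKENS_PER_SECOND ))
}

# Hypothetical helper: approximate token ceiling needed for a target duration in seconds.
tokens_for_seconds() {
    echo $(( $1 * TOKENS_PER_SECOND ))
}

echo "default 1536 tokens -> about $(seconds_for_tokens 1536) s of audio"
echo "2048 tokens         -> about $(seconds_for_tokens 2048) s of audio"
echo "180 s utterance     -> about $(tokens_for_seconds 180) tokens"
```

With these assumptions the default ceiling covers a little over two minutes of audio per utterance, which gives a starting point when choosing a larger value.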
qwen3_tts_blocksize
integer
default:"512"
Audio chunk size in samples for streaming output. Must match the LocalAudioStreamer blocksize.
speech-to-speech --tts qwen3 --qwen3_tts_blocksize 512

Comparing TTS backends on Apple Silicon

python scripts/benchmark_tts.py \
    --handlers qwen3 \
    --iterations 3 \
    --qwen3_mlx_quantizations bf16 4bit 6bit 8bit
On Apple Silicon, Qwen3-TTS with --qwen3_tts_mlx_quantization 6bit typically delivers the best balance of quality and latency. Kokoro and Pocket TTS are also solid alternatives for different voice styles.
