The Text-to-Speech stage converts LLM sentence chunks into 16-bit PCM audio that is streamed to the client. Select a backend with --tts. The default backend is qwen3, which is bundled in the standard install; all other backends require an optional extra or a separate install.
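As a rough illustration of what streaming 16-bit PCM in blocks involves, the sketch below converts floating-point samples to signed 16-bit little-endian PCM and splits the result into fixed-size blocks. The function names, the 512-sample block size (borrowed from the chunk-size defaults listed on this page), and the zero-padding of the final block are assumptions made for illustration, not the pipeline's actual implementation.

```python
import struct
from typing import Iterator, List

BLOCK_SIZE = 512  # samples per block; illustrative value for this sketch


def float_to_pcm16(samples: List[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    # Clip first so out-of-range samples cannot overflow the int16 range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def iter_blocks(pcm: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size blocks; the last one is zero-padded (an assumption)."""
    step = block_size * 2  # two bytes per 16-bit sample
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step].ljust(step, b"\x00")
```

With these helpers, 1000 samples would be emitted as two 512-sample blocks, the second padded with silence.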
Backend selection
| --tts value | Handler class | Default install | Best for |
|---|---|---|---|
| qwen3 | Qwen3TTSHandler | ✅ included | Default; CUDA/CPU via GGML, Apple Silicon via mlx-audio |
| pocket | PocketTTSHandler | ❌ [pocket] extra | Streaming TTS with voice cloning from Kyutai Labs |
| kokoro | KokoroTTSHandler | ❌ [kokoro] extra | 82 M lightweight multilingual TTS, MLX on Apple Silicon |
| chatTTS | ChatTTSHandler | ❌ separate install | Streaming TTS, chunk-based generation |
| facebookMMS | FacebookMMSTTSHandler | ❌ separate install | Multilingual MMS with automatic language switching |
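The table above can be read as a small lookup from the --tts value to a handler class and an install requirement. The following sketch encodes it that way; the Backend tuple, the BACKENDS dictionary, and install_hint are hypothetical names used only to restate the table, and the project may organize this differently.

```python
from typing import NamedTuple, Optional


class Backend(NamedTuple):
    handler: str          # handler class name from the table
    extra: Optional[str]  # pip extra name, or None if there is no extra
    bundled: bool         # True if included in the standard install


BACKENDS = {
    "qwen3": Backend("Qwen3TTSHandler", None, True),
    "pocket": Backend("PocketTTSHandler", "pocket", False),
    "kokoro": Backend("KokoroTTSHandler", "kokoro", False),
    "chatTTS": Backend("ChatTTSHandler", None, False),
    "facebookMMS": Backend("FacebookMMSTTSHandler", None, False),
}


def install_hint(tts: str) -> str:
    """Return a short install hint for a --tts value (raises KeyError if unknown)."""
    backend = BACKENDS[tts]
    if backend.bundled:
        return "included in the standard install"
    if backend.extra:
        return f'pip install "speech-to-speech[{backend.extra}]"'
    return "separate install required"
```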
qwen3 — Qwen3-TTS (default)
Qwen3TTSHandler is the default TTS backend. It supports three generation modes — voice cloning (reference audio), custom voice (preset speakers), and voice design (instruct prompt) — and automatically selects the right inference stack per platform.
Platform dispatch:
- Apple Silicon (Darwin): uses mlx-audio with an mlx-community/ model. Qwen/ model IDs are automatically mapped to mlx-community/ equivalents and default to the 6bit MLX quantization unless overridden.
- Linux / Windows (CUDA or CPU): uses faster-qwen3-tts with the GGML backend by default. Pass --qwen3_tts_backend torch to use the CUDA-graphs implementation.
Default model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Default speaker: Aiden
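The platform dispatch described above might look roughly like the sketch below. It is a simplified reading of the prose, not the handler's code: the function names are hypothetical, the returned strings are plain labels rather than importable modules, and the exact naming scheme for mapped model IDs (appending "-6bit" to the model name) is an assumption.

```python
import platform
from typing import Optional


def select_qwen3_stack(system: Optional[str] = None, backend: str = "ggml") -> str:
    """Pick an inference stack label from the OS name and the backend flag."""
    system = system or platform.system()
    if system == "Darwin":
        # On Apple Silicon the backend flag is ignored.
        return "mlx-audio"
    if backend not in ("ggml", "torch"):
        raise ValueError(f"unsupported backend: {backend!r}")
    return f"faster-qwen3-tts[{backend}]"


def map_to_mlx(model_id: str, quantization: str = "6bit") -> str:
    """Map a Qwen/ model ID to an mlx-community/ ID (naming scheme assumed)."""
    if not model_id.startswith("Qwen/"):
        return model_id  # local paths and other orgs pass through unchanged
    name = model_id.split("/", 1)[1]
    return f"mlx-community/{name}-{quantization}"
```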
CUDA wheel note
The default PyPI qwentts-cpp-python wheel targets CUDA 12.8. If your CUDA runtime differs, install the matching wheel from the Hugging Face wheelhouse before installing speech-to-speech.
Configuration reference
HuggingFace Hub model ID or local path. On Apple Silicon, Qwen/* IDs are mapped to the corresponding mlx-community/* model with the 6bit suffix by default.
Device for inference: cuda, cpu, mps, auto. On Apple Silicon the mlx-audio backend is selected automatically regardless of this flag.
Data type for inference: auto, float16, bfloat16, float32. Default is auto.
Attention implementation: eager, flash_attention_2, sdpa. Use eager on Jetson and other edge devices. Default is eager.
faster-qwen3-tts backend on non-macOS platforms: ggml or torch. ggml uses the GGML quantized kernel (default); torch uses the CUDA-graphs PyTorch path. Ignored on Apple Silicon.
Speaker name for CustomVoice models. If unset, the first supported speaker is used. To see available speakers, query model.get_supported_speakers().
Target language for synthesis. auto lets the model infer the language. Supported aliases include zh, en, ja, ko, de, fr, ru, pt, es, it and their variants.
Path to a reference audio file for voice cloning. Leave unset when using a CustomVoice or VoiceDesign model.
Transcription of the reference audio for voice cloning. Only used when --qwen3_tts_ref_audio is set.
Instruction text for VoiceDesign models. Required when using a voice design model; leave unset for CustomVoice or voice cloning.
Use x-vector only voice cloning mode. Recommended for cleaner starts and language switching when doing voice cloning.
Disable the CUDA-graph streaming path and use parity mode for stability. Useful for debugging or on hardware where CUDA graphs cause issues.
MLX quantization variant on Apple Silicon: bf16, 4bit, 6bit, or 8bit. Defaults to 6bit for a good quality/speed balance.
When True, pre-fills the full target text before decoding on faster-qwen3-tts. Currently ignored on Apple Silicon (mlx-audio does not expose this option yet).
Upper cap for Qwen3-TTS codec tokens. The handler estimates a per-utterance budget from the text length (~12 tokens/s of audio) and clamps it to this ceiling. Raise above 1536 for very long utterances.
Codec steps per streaming chunk. When unset, defaults to 8 on faster-qwen3-tts and 4 on mlx-audio.
Audio chunk size in samples for streaming output. Must match the LocalAudioStreamer blocksize. Default is 512.
- Default (CUDA + GGML)
- Apple Silicon (6-bit MLX)
- Voice cloning
- Torch backend (CUDA-graphs)
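The codec-token ceiling in the configuration reference above is easier to reason about with numbers. The sketch below estimates a budget from text length and clamps it to the ceiling. Only the ~12 tokens/s rate and the 1536 figure come from the description; the speaking rate of 15 characters per second and the function name are hypothetical, and the handler's real estimate may use a different formula.

```python
import math

TOKENS_PER_SECOND = 12    # ~12 codec tokens per second of audio (from the description)
DEFAULT_CEILING = 1536    # ceiling figure mentioned in the description
CHARS_PER_SECOND = 15.0   # hypothetical speaking rate, for illustration only


def estimate_token_budget(
    text: str,
    ceiling: int = DEFAULT_CEILING,
    chars_per_second: float = CHARS_PER_SECOND,
) -> int:
    """Estimate codec tokens for an utterance and clamp to the ceiling."""
    est_seconds = len(text) / chars_per_second
    budget = math.ceil(est_seconds * TOKENS_PER_SECOND)
    # Always allow at least one token, never more than the ceiling.
    return max(1, min(budget, ceiling))
```

If the ~12 tokens/s figure holds, a ceiling of 1536 tokens corresponds to about 128 seconds of audio (1536 / 12), which is why only very long utterances should need a higher value.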
pocket — Pocket TTS (Kyutai Labs)
PocketTTSHandler uses Pocket TTS from Kyutai Labs for streaming TTS with voice cloning. The model generates audio at 24 kHz internally; the output is resampled to 16 kHz for the pipeline.
Install: pip install "speech-to-speech[pocket]"
Voice presets
alba, marius, javert, jean (default), fantine, cosette, eponine, azelma
You can also pass a local audio file path or a HuggingFace path (hf://kyutai/tts-voices/...) to --pocket_tts_voice for custom voice cloning.
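Since --pocket_tts_voice accepts three kinds of value, it can help to see how they might be told apart. The sketch below is one plausible way to classify the argument; the function name and the returned labels are hypothetical, and the handler may resolve voices differently.

```python
# Preset names as listed above.
PRESETS = ("alba", "marius", "javert", "jean", "fantine", "cosette", "eponine", "azelma")


def classify_voice(value: str) -> str:
    """Classify a --pocket_tts_voice value as 'preset', 'hf', or 'local_file'."""
    if value in PRESETS:
        return "preset"
    if value.startswith("hf://"):
        return "hf"
    # Anything else is treated as a path to a local reference recording.
    return "local_file"
```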
Configuration reference
Device to run the model on: cpu, cuda, mps. Default is cpu for broad compatibility; switch to cuda or mps for faster inference.
Voice preset name, local audio file path, or hf://kyutai/tts-voices/... path for voice cloning.
Output sample rate in Hz. The model generates at 24 kHz internally and resamples to this rate. Default of 16000 matches the pipeline audio streamer.
Audio block size in samples for streaming output.
Maximum tokens to generate per sentence.
- CPU with preset voice
- CUDA with voice cloning
- HuggingFace voice preset
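To make the 24 kHz to 16 kHz step concrete, here is a minimal resampling sketch using linear interpolation in pure Python. It is a stand-in chosen for readability: it applies no anti-aliasing filter, the function name is hypothetical, and the handler most likely relies on a proper resampling routine instead.

```python
from typing import List


def resample_linear(
    samples: List[float], src_rate: int = 24000, dst_rate: int = 16000
) -> List[float]:
    """Resample by linear interpolation between neighbouring input samples."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last input sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

Every three input samples produce two output samples, so one second of model output (24000 samples) becomes 16000 samples at the pipeline rate.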
kokoro — Kokoro-82M
KokoroTTSHandler uses the 82 M-parameter Kokoro model. It supports 8 languages and auto-switches voice and language based on the STT language code received from upstream.
Install: pip install "speech-to-speech[kokoro]"
Platform dispatch:
- Apple Silicon (MPS): loads mlx-community/Kokoro-82M-bf16 via mlx-audio.
- CUDA / CPU: loads hexgrad/Kokoro-82M via the native kokoro library (requires espeak-ng).
| Lang code | Language | Default voice |
|---|---|---|
| a | American English | af_heart |
| b | British English | bm_fable (default) |
| e | Spanish | ef_dora |
| f | French | ff_siwis |
| h | Hindi | hf_alpha |
| i | Italian | if_sara |
| j | Japanese | jf_alpha |
| p | Portuguese | pf_dora |
| z | Chinese | zf_xiaobei |
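The table above, together with the fallback to British English that this page describes for languages without a native Kokoro voice, can be sketched as a pair of lookups. The voice table is taken from the rows above; the STT-code mapping (two-letter codes, with en mapped to American English) and the function name are assumptions for illustration and may not match the handler's actual mapping.

```python
# Default voice per Kokoro language code, from the table above.
KOKORO_VOICES = {
    "a": "af_heart", "b": "bm_fable", "e": "ef_dora",
    "f": "ff_siwis", "h": "hf_alpha", "i": "if_sara",
    "j": "jf_alpha", "p": "pf_dora", "z": "zf_xiaobei",
}

# Assumed mapping from two-letter STT codes to Kokoro language codes.
STT_TO_KOKORO = {
    "en": "a", "es": "e", "fr": "f", "hi": "h",
    "it": "i", "ja": "j", "pt": "p", "zh": "z",
}

FALLBACK_LANG = "b"  # British English, used when no native voice exists


def resolve_kokoro(stt_code: str) -> tuple:
    """Return (kokoro_lang_code, default_voice) for an STT language code."""
    lang = STT_TO_KOKORO.get(stt_code.lower(), FALLBACK_LANG)
    return lang, KOKORO_VOICES[lang]
```

Under these assumptions, a German or Dutch STT code resolves to the British English voice, matching the documented fallback.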
Configuration reference
Model ID override. Auto-selects mlx-community/Kokoro-82M-bf16 on MPS and hexgrad/Kokoro-82M on CUDA/CPU.
Device: auto, cuda, mps, cpu.
Voice identifier. See the Kokoro VOICES.md for the full list.
Language code: a (American English), b (British English), e (Spanish), f (French), h (Hindi), i (Italian), j (Japanese), p (Portuguese), z (Chinese).
Speech speed multiplier. Values above 1.0 speed up; values below 1.0 slow down.
Audio chunk size in samples for streaming output. Default is 512.
- British English
- Japanese (Apple Silicon)
- Spanish
When --language auto is set, the Kokoro handler maps incoming STT language codes to the nearest Kokoro language and auto-switches voice. Languages without a native Kokoro voice (e.g. German, Dutch) fall back to British English (b).
chatTTS — ChatTTS
ChatTTSHandler uses ChatTTS for streaming chunk-based synthesis. Arguments use the --chat_tts_* prefix.
| Argument | Default | Description |
|---|---|---|
| --chat_tts_device | cuda | Device for inference |
| --chat_tts_stream | True | Enable chunk-level streaming |
| --chat_tts_chunk_size | 512 | Tokens per synthesis chunk |
- CUDA streaming
facebookMMS — Facebook MMS
FacebookMMSTTSHandler uses Meta’s Massively Multilingual Speech (MMS) TTS models. It maps STT language codes (e.g. en, fr, es) to MMS model suffixes (e.g. eng, fra, spa) and reloads the model automatically on language changes. Use this backend for multilingual pipelines where the TTS must match the user’s language.
Arguments use the --facebook_mms_* prefix. The TTS language is set with --tts_language.
| Argument | Default | Description |
|---|---|---|
| --facebook_mms_model_name | facebook/mms-tts-eng | HuggingFace model ID for the MMS TTS model |
| --tts_language | en | Language code for synthesis (e.g. en, fr, es) |
| --facebook_mms_device | cuda | Device for inference |
| --facebook_mms_torch_dtype | float32 | Precision: float32, float16, bfloat16 |
- English
- Auto language switching
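The language switching described for this backend amounts to mapping a language code to a model suffix and reloading only when the resulting model ID changes. The sketch below shows that idea with the three example codes given above. MMSModelCache, mms_model_id, and the injected loader callable are hypothetical names; the real handler's mapping table and loading code are not shown on this page.

```python
from typing import Any, Callable, Optional

# Example STT code to MMS suffix pairs, taken from the description above.
MMS_SUFFIX = {"en": "eng", "fr": "fra", "es": "spa"}


def mms_model_id(lang: str) -> str:
    """Build a model ID following the facebook/mms-tts-<suffix> pattern."""
    return f"facebook/mms-tts-{MMS_SUFFIX[lang]}"


class MMSModelCache:
    """Keep one loaded model and reload only when the language changes."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self.loader = loader              # e.g. a function that loads from the Hub
        self.model_id: Optional[str] = None
        self.model: Any = None

    def get(self, lang: str) -> Any:
        model_id = mms_model_id(lang)
        if model_id != self.model_id:     # language changed: reload
            self.model = self.loader(model_id)
            self.model_id = model_id
        return self.model
```

Passing the loader in as a parameter keeps the sketch free of heavy dependencies and makes the reload behavior easy to check with a stub.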
Benchmarking TTS backends on Apple Silicon
The benchmark covers qwen3[bf16], qwen3[4bit], qwen3[6bit], and qwen3[8bit], and prints time-to-first-audio and real-time factor for each variant.
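For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute them for a streaming synthesizer. It assumes real-time factor is defined as wall-clock synthesis time divided by the duration of the audio produced (this page does not state the definition), that chunks are 16-bit mono PCM at 16 kHz, and that fake_synthesize stands in for a real backend. All names are hypothetical.

```python
import time
from typing import Callable, Iterable, Tuple


def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: values below 1.0 are faster than real time."""
    return elapsed_seconds / audio_seconds


def measure(
    synthesize: Callable[[str], Iterable[bytes]], text: str, sample_rate: int = 16000
) -> Tuple[float, float]:
    """Return (time_to_first_audio_seconds, real_time_factor) for one utterance."""
    start = time.perf_counter()
    first = None
    n_bytes = 0
    for chunk in synthesize(text):
        if first is None:
            first = time.perf_counter() - start  # latency to the first chunk
        n_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = n_bytes / 2 / sample_rate    # two bytes per 16-bit sample
    if first is None or audio_seconds == 0:
        raise ValueError("synthesizer produced no audio")
    return first, real_time_factor(elapsed, audio_seconds)


def fake_synthesize(text: str) -> Iterable[bytes]:
    """Stand-in backend: yields two one-second chunks of silence."""
    for _ in range(2):
        yield b"\x00\x00" * 16000
```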