The Text-to-Speech stage converts LLM sentence chunks into 16-bit PCM audio that is streamed to the client. Select a backend with --tts. The default backend is qwen3, which is bundled in the standard install; all other backends require an optional extra.

Backend selection

| --tts value | Handler class | Default install | Best for |
| --- | --- | --- | --- |
| qwen3 | Qwen3TTSHandler | ✅ included | Default; CUDA/CPU via GGML, Apple Silicon via mlx-audio |
| pocket | PocketTTSHandler | [pocket] extra | Streaming TTS with voice cloning from Kyutai Labs |
| kokoro | KokoroTTSHandler | [kokoro] extra | 82 M lightweight multilingual TTS, MLX on Apple Silicon |
| chatTTS | ChatTTSHandler | ❌ separate install | Streaming TTS, chunk-based generation |
| facebookMMS | FacebookMMSTTSHandler | ❌ separate install | Multilingual MMS with automatic language switching |

qwen3 — Qwen3-TTS (default)

Qwen3TTSHandler is the default TTS backend. It supports three generation modes: voice cloning (reference audio), custom voice (preset speakers), and voice design (instruct prompt). It automatically selects the right inference stack per platform.
Platform dispatch:
  • Apple Silicon (Darwin): uses mlx-audio with an mlx-community/ model. Qwen/ model IDs are automatically mapped to mlx-community/ equivalents and default to the 6bit MLX quantization unless overridden.
  • Linux / Windows (CUDA or CPU): uses faster-qwen3-tts with the GGML backend by default. Pass --qwen3_tts_backend torch to use the CUDA-graphs implementation.
Default model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Default speaker: Aiden
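
The dispatch rules above can be sketched in a few lines of Python. This is an illustration only, not the handler's actual code: the function names are hypothetical, and the `mlx-community/<repo>-<quantization>` naming pattern is an assumption inferred from the "6bit suffix" wording, so check the Hub for the real repository names.

```python
import platform
from typing import Optional, Tuple

MLX_QUANTIZATIONS = ("bf16", "4bit", "6bit", "8bit")
BACKENDS = ("ggml", "torch")


def is_apple_silicon() -> bool:
    """True on macOS running on an ARM64 CPU."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def to_mlx_model_id(model_name: str, quantization: str = "6bit") -> str:
    """Map a Qwen/ model ID to an assumed mlx-community/ equivalent."""
    if quantization not in MLX_QUANTIZATIONS:
        raise ValueError(f"unknown MLX quantization: {quantization!r}")
    if not model_name.startswith("Qwen/"):
        # Local paths and explicit mlx-community/ IDs pass through unchanged.
        return model_name
    repo = model_name.split("/", 1)[1]
    # Assumption: the quantization is appended as a "-<quantization>" suffix.
    return f"mlx-community/{repo}-{quantization}"


def pick_stack(
    model_name: str,
    backend: str = "ggml",
    quantization: str = "6bit",
    apple_silicon: Optional[bool] = None,
) -> Tuple[str, str]:
    """Return (inference stack, model ID) following the documented rules."""
    if apple_silicon is None:
        apple_silicon = is_apple_silicon()
    if apple_silicon:
        # The backend flag is ignored on Apple Silicon.
        return "mlx-audio", to_mlx_model_id(model_name, quantization)
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    return f"faster-qwen3-tts ({backend})", model_name
```

Calling `pick_stack("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice", apple_silicon=False)` returns the GGML stack with the model ID untouched, while `apple_silicon=True` switches to mlx-audio and rewrites the ID.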

CUDA wheel note

The default qwentts-cpp-python wheel on PyPI targets CUDA 12.8. If your CUDA runtime differs, install the matching wheel from the Hugging Face wheelhouse before installing speech-to-speech:
# CUDA 13.x
pip install "qwentts-cpp-python==0.3.0+cu130" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu130

# CUDA 12.4
pip install "qwentts-cpp-python==0.3.0+cu124" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu124

# CPU-only
pip install "qwentts-cpp-python==0.3.0+cpu" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cpu

pip install speech-to-speech

Configuration reference

--qwen3_tts_model_name
str
default:"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
HuggingFace Hub model ID or local path. On Apple Silicon, Qwen/* IDs are mapped to the corresponding mlx-community/* model with the 6bit suffix by default.
--qwen3_tts_device
str
default:"cuda"
Device for inference: cuda, cpu, mps, auto. On Apple Silicon the mlx-audio backend is selected automatically regardless of this flag.
--qwen3_tts_dtype
str
default:"auto"
Data type for inference: auto, float16, bfloat16, float32.
--qwen3_tts_attn_implementation
str
default:"eager"
Attention implementation: eager, flash_attention_2, sdpa. Use eager on Jetson and other edge devices.
--qwen3_tts_backend
str
default:"ggml"
faster-qwen3-tts backend on non-macOS platforms: ggml or torch. ggml uses the GGML quantized kernel (default); torch uses the CUDA-graphs PyTorch path. Ignored on Apple Silicon.
--qwen3_tts_speaker
str
default:"Aiden"
Speaker name for CustomVoice models. If unset, the first supported speaker is used. To see available speakers, query model.get_supported_speakers().
--qwen3_tts_language
str
default:"auto"
Target language for synthesis. auto lets the model infer the language. Supported aliases include zh, en, ja, ko, de, fr, ru, pt, es, it and their variants.
--qwen3_tts_ref_audio
str
default:"None"
Path to a reference audio file for voice cloning. Leave unset when using a CustomVoice or VoiceDesign model.
--qwen3_tts_ref_text
str
default:"(built-in sample text)"
Transcription of the reference audio for voice cloning. Only used when --qwen3_tts_ref_audio is set.
--qwen3_tts_instruct
str
default:"None"
Instruction text for VoiceDesign models. Required when using a voice design model; leave unset for CustomVoice or voice cloning.
--qwen3_tts_xvec_only
bool
default:"False"
Use x-vector only voice cloning mode. Recommended for cleaner starts and language switching when doing voice cloning.
--qwen3_tts_parity_mode
bool
default:"False"
Disable the CUDA-graph streaming path and use parity mode for stability. Useful for debugging or on hardware where CUDA graphs cause issues.
--qwen3_tts_mlx_quantization
str
default:"6bit"
MLX quantization variant on Apple Silicon: bf16, 4bit, 6bit, or 8bit. Defaults to 6bit for a good quality/speed balance.
--qwen3_tts_non_streaming_mode
bool
default:"True"
When True, pre-fills the full target text before decoding on faster-qwen3-tts. Currently ignored on Apple Silicon (mlx-audio does not expose this option yet).
--qwen3_tts_max_new_tokens
int
default:"1536"
Upper bound on Qwen3-TTS codec tokens per utterance. The handler estimates a per-utterance budget from the text length (about 12 codec tokens per second of audio) and clamps it to this ceiling. Raise it above 1536 for very long utterances.
--qwen3_tts_streaming_chunk_size
int
default:"None"
Codec steps per streaming chunk. When unset, defaults to 8 on faster-qwen3-tts and 4 on mlx-audio.
--qwen3_tts_blocksize
int
default:"512"
Audio chunk size in samples for streaming output. Must match the LocalAudioStreamer blocksize.
speech-to-speech \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden \
    --qwen3_tts_backend ggml
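
To make the interaction between some of these flags concrete, here is a rough Python sketch of two decisions described in the configuration reference: picking a generation mode from --qwen3_tts_ref_audio and --qwen3_tts_instruct, and clamping a token budget to --qwen3_tts_max_new_tokens. The function names, the 15 characters-per-second speaking rate, and the error raised when both flags are set are assumptions made for illustration; the handler's real estimate may differ.

```python
import math
from typing import Optional

CODEC_TOKENS_PER_SECOND = 12   # rate quoted in the configuration reference
ASSUMED_CHARS_PER_SECOND = 15  # hypothetical speaking rate, not from the handler


def select_mode(ref_audio: Optional[str] = None, instruct: Optional[str] = None) -> str:
    """Pick a generation mode from the two optional flags."""
    if ref_audio and instruct:
        # Assumption: the two flags are treated as mutually exclusive.
        raise ValueError("set either ref_audio or instruct, not both")
    if ref_audio:
        return "voice_cloning"
    if instruct:
        return "voice_design"
    return "custom_voice"


def token_budget(text: str, max_new_tokens: int = 1536) -> int:
    """Estimate codec tokens for `text` and clamp to the ceiling."""
    estimate = math.ceil(len(text) * CODEC_TOKENS_PER_SECOND / ASSUMED_CHARS_PER_SECOND)
    # Keep at least one token so empty input still yields a valid budget.
    return max(1, min(estimate, max_new_tokens))
```

At 12 tokens per second, the default ceiling of 1536 tokens corresponds to 128 seconds of audio, which is why only very long utterances need a higher value.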

pocket — Pocket TTS (Kyutai Labs)

PocketTTSHandler uses Pocket TTS from Kyutai Labs for streaming TTS with voice cloning. The model generates audio at 24 kHz internally, and the output is resampled to 16 kHz for the pipeline.
Install: pip install "speech-to-speech[pocket]"
Pocket TTS requires numpy>=2, which conflicts with DeepFilterNet (numpy<2). Do not use --audio_enhancement in the same environment as Pocket TTS.

Voice presets

alba, marius, javert, jean (default), fantine, cosette, eponine, azelma
You can also pass a local audio file path or a Hugging Face path (hf://kyutai/tts-voices/...) to --pocket_tts_voice for custom voice cloning.
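
Because --pocket_tts_voice accepts three kinds of value, it can help to see how they might be told apart. The sketch below is a hypothetical classifier, not the handler's logic; the function name and the order of the checks are assumptions.

```python
import os

POCKET_PRESETS = (
    "alba", "marius", "javert", "jean",
    "fantine", "cosette", "eponine", "azelma",
)


def classify_voice(voice: str = "jean") -> str:
    """Return 'preset', 'hf_path', or 'local_file' for a voice argument."""
    if voice in POCKET_PRESETS:
        return "preset"
    if voice.startswith("hf://"):
        return "hf_path"
    if os.path.isfile(voice):
        return "local_file"
    raise ValueError(f"not a preset, hf:// path, or existing file: {voice!r}")
```

Checking presets first means a file that happens to be named like a preset is never picked up by accident; pass an explicit path such as ./jean.wav to force the file interpretation.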

Configuration reference

--pocket_tts_device
str
default:"cpu"
Device to run the model on: cpu, cuda, mps. Default is cpu for broad compatibility; switch to cuda or mps for faster inference.
--pocket_tts_voice
str
default:"jean"
Voice preset name, local audio file path, or hf://kyutai/tts-voices/... path for voice cloning.
--pocket_tts_sample_rate
int
default:"16000"
Output sample rate in Hz. The model generates at 24 kHz internally and resamples to this rate. Default of 16000 matches the pipeline audio streamer.
--pocket_tts_blocksize
int
default:"512"
Audio block size in samples for streaming output.
--pocket_tts_max_tokens
int
default:"50"
Maximum tokens to generate per sentence.
speech-to-speech \
    --tts pocket \
    --pocket_tts_voice jean \
    --pocket_tts_device cpu
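
The block size and sample rate settings determine how audio leaves the handler. As a standard-library-only sketch (the helper names are hypothetical, and zero-padding the final block is an assumption rather than documented behaviour), float samples can be converted to 16-bit PCM and cut into 512-sample blocks like this:

```python
from array import array
from typing import Iterator, Sequence


def floats_to_pcm16(samples: Sequence[float]) -> array:
    """Clip samples to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    pcm = array("h")
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        pcm.append(int(round(clipped * 32767)))
    return pcm


def iter_blocks(pcm: array, blocksize: int = 512) -> Iterator[array]:
    """Yield fixed-size blocks; the last block is zero-padded (an assumption)."""
    for start in range(0, len(pcm), blocksize):
        block = pcm[start:start + blocksize]
        if len(block) < blocksize:
            block.extend([0] * (blocksize - len(block)))
        yield block


def block_duration_ms(blocksize: int = 512, sample_rate: int = 16000) -> float:
    """Playback duration of one block in milliseconds."""
    return 1000.0 * blocksize / sample_rate
```

With the defaults, one block covers 512 / 16000 s = 32 ms of audio, which is the granularity at which the client receives sound.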

kokoro — Kokoro-82M

KokoroTTSHandler uses the 82 M-parameter Kokoro model. It supports 8 languages through 9 language codes (American and British English have separate codes) and auto-switches voice and language based on the STT language code received from upstream.
Install: pip install "speech-to-speech[kokoro]"
Platform dispatch:
  • Apple Silicon (MPS): loads mlx-community/Kokoro-82M-bf16 via mlx-audio.
  • CUDA / CPU: loads hexgrad/Kokoro-82M via the native kokoro library (requires espeak-ng).
Supported languages and default voices:
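| Lang code | Language | Default voice |
| --- | --- | --- |
| a | American English | af_heart |
| b | British English | bm_fable (default) |
| e | Spanish | ef_dora |
| f | French | ff_siwis |
| h | Hindi | hf_alpha |
| i | Italian | if_sara |
| j | Japanese | jf_alpha |
| p | Portuguese | pf_dora |
| z | Chinese | zf_xiaobei |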

Configuration reference

--kokoro_model_name
str
default:"None (auto)"
Model ID override. Auto-selects mlx-community/Kokoro-82M-bf16 on MPS and hexgrad/Kokoro-82M on CUDA/CPU.
--kokoro_device
str
default:"auto"
Device: auto, cuda, mps, cpu.
--kokoro_voice
str
default:"bm_fable"
Voice identifier. See the Kokoro VOICES.md for the full list.
--kokoro_lang_code
str
default:"b"
Language code: a (American English), b (British English), e (Spanish), f (French), h (Hindi), i (Italian), j (Japanese), p (Portuguese), z (Chinese).
--kokoro_speed
float
default:"1.0"
Speech speed multiplier. Values above 1.0 speed up; values below 1.0 slow down.
--kokoro_blocksize
int
default:"512"
Audio chunk size in samples for streaming output.
speech-to-speech \
    --tts kokoro \
    --kokoro_voice bm_fable \
    --kokoro_lang_code b
When --language auto is set, the Kokoro handler maps incoming STT language codes to the nearest Kokoro language and auto-switches voice. Languages without a native Kokoro voice (e.g. German, Dutch) fall back to British English (b).
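
The auto-switching behaviour can be pictured as a lookup with a fallback. The sketch below reuses the default voices from the table above; the STT-code mapping is an assumption for illustration (in particular, sending en to British English simply mirrors the default --kokoro_lang_code), and the handler's own table may differ.

```python
from typing import Tuple

# Default voice per Kokoro language code, taken from the table above.
DEFAULT_VOICES = {
    "a": "af_heart", "b": "bm_fable", "e": "ef_dora",
    "f": "ff_siwis", "h": "hf_alpha", "i": "if_sara",
    "j": "jf_alpha", "p": "pf_dora", "z": "zf_xiaobei",
}

# Assumed mapping from two-letter STT codes to Kokoro language codes.
STT_TO_KOKORO = {
    "en": "b", "es": "e", "fr": "f", "hi": "h",
    "it": "i", "ja": "j", "pt": "p", "zh": "z",
}

FALLBACK_LANG = "b"  # British English, as described above


def resolve_kokoro(stt_code: str) -> Tuple[str, str]:
    """Return (Kokoro language code, default voice) for an STT language code."""
    base = stt_code.lower().split("-")[0]  # "zh-CN" -> "zh"
    lang = STT_TO_KOKORO.get(base, FALLBACK_LANG)
    return lang, DEFAULT_VOICES[lang]
```

For example, `resolve_kokoro("de")` yields the British English pair because German has no native Kokoro voice.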

chatTTS — ChatTTS

ChatTTSHandler uses ChatTTS for streaming chunk-based synthesis. Arguments use the --chat_tts_* prefix.
| Argument | Default | Description |
| --- | --- | --- |
| --chat_tts_device | cuda | Device for inference |
| --chat_tts_stream | True | Enable chunk-level streaming |
| --chat_tts_chunk_size | 512 | Tokens per synthesis chunk |
speech-to-speech \
    --tts chatTTS \
    --chat_tts_device cuda \
    --chat_tts_stream true \
    --chat_tts_chunk_size 512

facebookMMS — Facebook MMS

FacebookMMSTTSHandler uses Meta’s Massively Multilingual Speech (MMS) TTS models. It maps STT language codes (e.g. en, fr, es) to MMS model suffixes (e.g. eng, fra, spa) and reloads the model automatically on language changes. Use this backend for multilingual pipelines where the TTS must match the user’s language. Arguments use the --facebook_mms_* prefix. The TTS language is set with --tts_language.
| Argument | Default | Description |
| --- | --- | --- |
| --facebook_mms_model_name | facebook/mms-tts-eng | HuggingFace model ID for the MMS TTS model |
| --tts_language | en | Language code for synthesis (e.g. en, fr, es) |
| --facebook_mms_device | cuda | Device for inference |
| --facebook_mms_torch_dtype | float32 | Precision: float32, float16, bfloat16 |
speech-to-speech \
    --tts facebookMMS \
    --facebook_mms_device cuda \
    --tts_language en
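
The reload-on-language-change behaviour described above might look roughly like the following. This is a minimal sketch under stated assumptions: LanguageSwitcher and the load_model callback are hypothetical names, and the mapping contains only the three code pairs given as examples in this section.

```python
from typing import Callable

# Only the example pairs from the text; the handler covers more languages.
STT_TO_MMS = {"en": "eng", "fr": "fra", "es": "spa"}


def mms_model_id(stt_code: str) -> str:
    """Build the MMS TTS model ID for an STT language code."""
    try:
        suffix = STT_TO_MMS[stt_code]
    except KeyError:
        raise ValueError(f"no MMS mapping for language {stt_code!r}") from None
    return f"facebook/mms-tts-{suffix}"


class LanguageSwitcher:
    """Reload the model only when the requested language changes."""

    def __init__(self, load_model: Callable[[str], object], initial: str = "en"):
        self._load_model = load_model  # hypothetical loader callback
        self.language = initial
        self.model = load_model(mms_model_id(initial))

    def ensure(self, stt_code: str) -> bool:
        """Switch to `stt_code` if needed; return True when a reload happened."""
        if stt_code == self.language:
            return False
        # Resolve the ID first so an unknown code leaves the state untouched.
        self.model = self._load_model(mms_model_id(stt_code))
        self.language = stt_code
        return True
```

Skipping the reload when the language is unchanged matters because loading a model is far more expensive than synthesising one sentence.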

Benchmarking TTS backends on Apple Silicon

python scripts/benchmark_tts.py \
    --handlers qwen3 \
    --iterations 3 \
    --qwen3_mlx_quantizations bf16 4bit 6bit 8bit
This runs separate benchmark entries for qwen3[bf16], qwen3[4bit], qwen3[6bit], and qwen3[8bit] and prints time-to-first-audio and real-time factor for each variant.
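
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from a stream of audio chunks. It is not taken from scripts/benchmark_tts.py: the function name is hypothetical, and the real-time factor is assumed to mean processing time divided by audio duration (values below 1.0 are faster than real time), which may differ from the script's convention.

```python
import time
from typing import Callable, Iterable, Sequence, Tuple


def measure_stream(
    chunks: Iterable[Sequence],
    sample_rate: int = 16000,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (time to first audio in seconds, real-time factor)."""
    start = clock()
    first_audio = None
    total_samples = 0
    for chunk in chunks:
        if first_audio is None:
            # Latency until the first chunk is available to play.
            first_audio = clock() - start
        total_samples += len(chunk)
    elapsed = clock() - start
    if total_samples == 0:
        raise ValueError("the stream produced no audio")
    audio_seconds = total_samples / sample_rate
    return first_audio, elapsed / audio_seconds
```

Passing a generator that yields chunks as the TTS backend produces them lets the same function time any handler; the clock parameter exists so the measurement can be tested deterministically.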
