Text-to-Speech Backends and Configuration

The Text-to-Speech stage converts LLM sentence chunks into 16-bit PCM audio that is streamed to the client. Select a backend with --tts. The default backend is qwen3, which is bundled in the standard install; all other backends require an optional extra.

Backend selection

`--tts` value	Handler class	Default install	Best for
`qwen3`	`Qwen3TTSHandler`	✅ included	Default; CUDA/CPU via GGML, Apple Silicon via mlx-audio
`pocket`	`PocketTTSHandler`	❌ `[pocket]` extra	Streaming TTS with voice cloning from Kyutai Labs
`kokoro`	`KokoroTTSHandler`	❌ `[kokoro]` extra	Lightweight 82M-parameter multilingual TTS; MLX on Apple Silicon
`chatTTS`	`ChatTTSHandler`	❌ separate install	Streaming TTS, chunk-based generation
`facebookMMS`	`FacebookMMSTTSHandler`	❌ separate install	Multilingual MMS with automatic language switching
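
The table can be captured as a small lookup, for example to validate a --tts value before launching and to report what needs installing. This is only a sketch: the BACKENDS dict and the describe_backend helper are hypothetical and are not part of speech-to-speech; the handler names and extras are copied from the table.

```python
# Hypothetical lookup built from the backend table; not the project's own registry.
BACKENDS = {
    "qwen3": ("Qwen3TTSHandler", None),                    # bundled in the standard install
    "pocket": ("PocketTTSHandler", "pocket"),              # optional extra
    "kokoro": ("KokoroTTSHandler", "kokoro"),              # optional extra
    "chatTTS": ("ChatTTSHandler", "separate"),             # installed separately
    "facebookMMS": ("FacebookMMSTTSHandler", "separate"),  # installed separately
}


def describe_backend(tts: str = "qwen3") -> str:
    """Return the handler name and install hint for a --tts value."""
    if tts not in BACKENDS:
        raise ValueError(f"unknown --tts value {tts!r}; expected one of {sorted(BACKENDS)}")
    handler, extra = BACKENDS[tts]
    if extra is None:
        return f"{handler}: included in the default install"
    if extra == "separate":
        return f"{handler}: requires a separate install"
    return f'{handler}: pip install "speech-to-speech[{extra}]"'
```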

`qwen3` — Qwen3-TTS (default)

Qwen3TTSHandler is the default TTS backend. It supports three generation modes: voice cloning (reference audio), custom voice (preset speakers), and voice design (instruct prompt). It automatically selects the right inference stack per platform.

Platform dispatch:

Apple Silicon (Darwin): uses mlx-audio with an mlx-community/ model. Qwen/ model IDs are automatically mapped to mlx-community/ equivalents and default to the 6bit MLX quantization unless overridden.
Linux / Windows (CUDA or CPU): uses faster-qwen3-tts with the GGML backend by default. Pass --qwen3_tts_backend torch to use the CUDA-graphs implementation.
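
The dispatch rules above can be sketched roughly as follows. resolve_qwen3_stack is a hypothetical helper, and the naming of the mlx-community repositories (here, the same model name with a "-<quantization>" suffix) is an assumption based on the description rather than a confirmed scheme.

```python
import platform


def resolve_qwen3_stack(model_name, backend="ggml", mlx_quantization="6bit",
                        system=None, machine=None):
    """Return (inference_stack, model_id) following the dispatch described above."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        # Apple Silicon: mlx-audio is used and --qwen3_tts_backend is ignored.
        if model_name.startswith("Qwen/"):
            base = model_name.split("/", 1)[1]
            # Assumed naming scheme for the MLX conversions.
            model_name = f"mlx-community/{base}-{mlx_quantization}"
        return "mlx-audio", model_name
    if backend not in ("ggml", "torch"):
        raise ValueError("backend must be 'ggml' or 'torch'")
    return f"faster-qwen3-tts ({backend})", model_name
```

A model ID that already starts with mlx-community/ passes through unchanged, which is one way to read "unless overridden".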

Default model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Default speaker: Aiden

CUDA wheel note

The default PyPI qwentts-cpp-python wheel targets CUDA 12.8. Install the matching wheel from the Hugging Face wheelhouse before installing speech-to-speech if your CUDA runtime differs:

# CUDA 13.x
pip install "qwentts-cpp-python==0.3.0+cu130" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu130

# CUDA 12.4
pip install "qwentts-cpp-python==0.3.0+cu124" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu124

# CPU-only
pip install "qwentts-cpp-python==0.3.0+cpu" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cpu

pip install speech-to-speech
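
For scripted installs, the choice above reduces to a lookup from the local CUDA runtime to a wheel tag. The helper below is a hypothetical sketch: it covers only the variants listed in this note, and detecting the CUDA version is left to the caller.

```python
WHEEL_VERSION = "0.3.0"  # version used in the commands above


def qwentts_requirement(cuda_version):
    """Map a CUDA runtime version string (None for CPU-only) to a pip requirement.

    Returns None when the default PyPI wheel (CUDA 12.8) already matches.
    """
    if cuda_version is None:
        return f"qwentts-cpp-python=={WHEEL_VERSION}+cpu"
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if major == 13:
        return f"qwentts-cpp-python=={WHEEL_VERSION}+cu130"
    if (major, minor) == (12, 4):
        return f"qwentts-cpp-python=={WHEEL_VERSION}+cu124"
    if (major, minor) == (12, 8):
        return None
    raise ValueError(f"no wheel listed in this note for CUDA {cuda_version}")
```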

Configuration reference

--qwen3_tts_model_name

str

default:"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"

HuggingFace Hub model ID or local path. On Apple Silicon, Qwen/* IDs are mapped to the corresponding mlx-community/* model with the 6bit suffix by default.

--qwen3_tts_device

str

default:"cuda"

Device for inference: cuda, cpu, mps, auto. On Apple Silicon the mlx-audio backend is selected automatically regardless of this flag.

--qwen3_tts_dtype

str

default:"auto"

Data type for inference: auto, float16, bfloat16, float32. Default is auto.

--qwen3_tts_attn_implementation

str

default:"eager"

Attention implementation: eager, flash_attention_2, sdpa. Use eager on Jetson and other edge devices. Default is eager.

--qwen3_tts_backend

str

default:"ggml"

faster-qwen3-tts backend on non-macOS platforms: ggml or torch. ggml uses the GGML quantized kernel (default); torch uses the CUDA-graphs PyTorch path. Ignored on Apple Silicon.

--qwen3_tts_speaker

str

default:"Aiden"

Speaker name for CustomVoice models. If unset, the first supported speaker is used. To see available speakers, query model.get_supported_speakers().

--qwen3_tts_language

str

default:"auto"

Target language for synthesis. auto lets the model infer the language. Supported aliases include zh, en, ja, ko, de, fr, ru, pt, es, it and their variants.

--qwen3_tts_ref_audio

str

default:"None"

Path to a reference audio file for voice cloning. Leave unset when using a CustomVoice or VoiceDesign model.

--qwen3_tts_ref_text

str

default:"(built-in sample text)"

Transcription of the reference audio for voice cloning. Only used when --qwen3_tts_ref_audio is set.

--qwen3_tts_instruct

str

default:"None"

Instruction text for VoiceDesign models. Required when using a voice design model; leave unset for CustomVoice or voice cloning.

--qwen3_tts_xvec_only

bool

default:"False"

Use x-vector only voice cloning mode. Recommended for cleaner starts and language switching when doing voice cloning.

--qwen3_tts_parity_mode

bool

default:"False"

Disable the CUDA-graph streaming path and use parity mode for stability. Useful for debugging or on hardware where CUDA graphs cause issues.

--qwen3_tts_mlx_quantization

str

default:"6bit"

MLX quantization variant on Apple Silicon: bf16, 4bit, 6bit, or 8bit. Defaults to 6bit for a good quality/speed balance.

--qwen3_tts_non_streaming_mode

bool

default:"True"

When True, pre-fills the full target text before decoding on faster-qwen3-tts. Currently ignored on Apple Silicon (mlx-audio does not expose this option yet).

--qwen3_tts_max_new_tokens

int

default:"1536"

Upper bound on Qwen3-TTS codec tokens per utterance. The handler estimates a per-utterance budget from the text length (~12 tokens per second of audio) and clamps it to this ceiling. The default of 1536 corresponds to roughly 128 s of audio; raise it for very long utterances.
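
A budget estimate of this kind might look like the sketch below. Only the ~12 tokens/s rate and the clamping to --qwen3_tts_max_new_tokens come from the description; the speaking rate (chars_per_second) and the safety margin are illustrative assumptions, not the handler's actual constants.

```python
import math

TOKENS_PER_SECOND = 12  # from the description: ~12 codec tokens per second of audio


def estimate_token_budget(text, max_new_tokens=1536,
                          chars_per_second=15.0, margin=1.5):
    """Estimate a codec-token budget for `text`, clamped to max_new_tokens.

    chars_per_second and margin are assumed values for illustration only.
    """
    est_audio_seconds = len(text) / chars_per_second
    budget = math.ceil(est_audio_seconds * TOKENS_PER_SECOND * margin)
    return max(1, min(budget, max_new_tokens))
```

Under these assumptions a 150-character sentence gets a budget of 180 tokens, while very long inputs are capped at the configured ceiling.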

--qwen3_tts_streaming_chunk_size

int

default:"None"

Codec steps per streaming chunk. When unset, defaults to 8 on faster-qwen3-tts and 4 on mlx-audio.
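
The fallback and its effect on chunk duration can be sketched as follows. The helper names are hypothetical, and the duration estimate assumes one codec step covers about 1/12 s of audio, in line with the ~12 tokens/s figure used for the token budget.

```python
CODEC_STEPS_PER_SECOND = 12  # approximate; assumed from the ~12 tokens/s figure


def resolve_chunk_size(chunk_size=None, stack="faster-qwen3-tts"):
    """Apply the documented defaults when --qwen3_tts_streaming_chunk_size is unset."""
    if chunk_size is not None:
        return chunk_size
    return 4 if stack == "mlx-audio" else 8


def chunk_audio_seconds(chunk_size):
    """Approximate audio duration carried by one streaming chunk."""
    return chunk_size / CODEC_STEPS_PER_SECOND
```

With these assumptions the default chunk holds about 0.67 s of audio on faster-qwen3-tts and about 0.33 s on mlx-audio; smaller chunks reach the client sooner at the cost of more generation calls.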

--qwen3_tts_blocksize

int

default:"512"

Audio chunk size in samples for streaming output. Must match the LocalAudioStreamer blocksize. Default is 512.
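
To illustrate what a fixed block size means for 16-bit PCM, the sketch below splits a byte buffer into blocks of blocksize samples. It is not the handler's code; in particular, zero-padding the final block is an assumption.

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM


def iter_pcm_blocks(pcm: bytes, blocksize: int = 512):
    """Yield fixed-size blocks of 16-bit PCM, zero-padding the final block."""
    block_bytes = blocksize * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), block_bytes):
        block = pcm[start:start + block_bytes]
        # Pad a short trailing block with silence so every block has equal length.
        yield block.ljust(block_bytes, b"\x00")
```

At a 16 kHz pipeline rate, a 512-sample block is 32 ms of audio (1024 bytes).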

Default (CUDA + GGML)

speech-to-speech \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden \
    --qwen3_tts_backend ggml

Apple Silicon (6-bit MLX)

speech-to-speech \
    --local_mac_optimal_settings \
    --tts qwen3 \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_mlx_quantization 6bit \
    --qwen3_tts_speaker Aiden

Voice cloning

speech-to-speech \
    --tts qwen3 \
    --qwen3_tts_ref_audio /path/to/reference.wav \
    --qwen3_tts_ref_text "The transcription of the reference audio." \
    --qwen3_tts_language en

Torch backend (CUDA-graphs)

speech-to-speech \
    --tts qwen3 \
    --qwen3_tts_backend torch \
    --qwen3_tts_model_name Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --qwen3_tts_speaker Aiden

`pocket` — Pocket TTS (Kyutai Labs)

PocketTTSHandler uses Pocket TTS from Kyutai Labs for streaming TTS with voice cloning. The model generates audio at 24 kHz internally, and the output is resampled to 16 kHz for the pipeline.

Install: pip install "speech-to-speech[pocket]"

Pocket TTS requires numpy>=2, which conflicts with DeepFilterNet (numpy<2). Do not use --audio_enhancement in the same environment as Pocket TTS.

Voice presets

Available presets: alba, marius, javert, jean (default), fantine, cosette, eponine, azelma.

You can also pass a local audio file path or a HuggingFace path (hf://kyutai/tts-voices/...) to --pocket_tts_voice for custom voice cloning.
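
The three accepted forms of --pocket_tts_voice can be told apart with a check like the one below. classify_voice is a hypothetical helper written for illustration; how the handler actually resolves the value is not shown in this page.

```python
# Preset names copied from the list above.
PRESETS = {"alba", "marius", "javert", "jean", "fantine", "cosette", "eponine", "azelma"}


def classify_voice(value: str = "jean") -> str:
    """Classify a --pocket_tts_voice value as a preset, a HuggingFace path, or a local file."""
    if value in PRESETS:
        return "preset"
    if value.startswith("hf://"):
        return "huggingface"
    # Anything else is treated as a path to a local reference recording.
    return "local_file"
```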

Configuration reference

--pocket_tts_device

str

default:"cpu"

Device to run the model on: cpu, cuda, mps. Default is cpu for broad compatibility; switch to cuda or mps for faster inference.

--pocket_tts_voice

str

default:"jean"

Voice preset name, local audio file path, or hf://kyutai/tts-voices/... path for voice cloning.

--pocket_tts_sample_rate

int

default:"16000"

Output sample rate in Hz. The model generates at 24 kHz internally and resamples to this rate. Default of 16000 matches the pipeline audio streamer.
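
To make the 24 kHz to 16 kHz step concrete, the sketch below resamples by linear interpolation in pure Python. It is a simplified stand-in and not the method the handler uses: production resamplers apply an anti-aliasing low-pass filter before reducing the rate, which this example omits.

```python
def resample_linear(samples, src_rate=24000, dst_rate=16000):
    """Resample a sequence of float samples by linear interpolation (no anti-alias filter)."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate  # source samples advanced per output sample (1.5 here)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

Every 3 input samples become 2 output samples, so one second of model output (24000 samples) yields 16000 samples for the pipeline.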

--pocket_tts_blocksize

int

default:"512"

Audio block size in samples for streaming output.

--pocket_tts_max_tokens

int

default:"50"

Maximum tokens to generate per sentence.

CPU with preset voice

speech-to-speech \
    --tts pocket \
    --pocket_tts_voice jean \
    --pocket_tts_device cpu

CUDA with voice cloning

speech-to-speech \
    --tts pocket \
    --pocket_tts_voice /path/to/speaker.wav \
    --pocket_tts_device cuda

HuggingFace voice preset

speech-to-speech \
    --tts pocket \
    --pocket_tts_voice "hf://kyutai/tts-voices/alba.wav" \
    --pocket_tts_device cpu

`kokoro` — Kokoro-82M

KokoroTTSHandler uses the 82M-parameter Kokoro model. It supports 8 languages through 9 language codes (American and British English have separate codes) and auto-switches voice and language based on the STT language code received from upstream.

Install: pip install "speech-to-speech[kokoro]"

Platform dispatch:

Apple Silicon (MPS): loads mlx-community/Kokoro-82M-bf16 via mlx-audio.
CUDA / CPU: loads hexgrad/Kokoro-82M via the native kokoro library (requires espeak-ng).

Supported languages and default voices:

Lang code	Language	Default voice
`a`	American English	`af_heart`
`b`	British English	`bm_fable` (default)
`e`	Spanish	`ef_dora`
`f`	French	`ff_siwis`
`h`	Hindi	`hf_alpha`
`i`	Italian	`if_sara`
`j`	Japanese	`jf_alpha`
`p`	Portuguese	`pf_dora`
`z`	Chinese	`zf_xiaobei`

Configuration reference

--kokoro_model_name

str

default:"None (auto)"

Model ID override. Auto-selects mlx-community/Kokoro-82M-bf16 on MPS and hexgrad/Kokoro-82M on CUDA/CPU.

--kokoro_device

str

default:"auto"

Device: auto, cuda, mps, cpu.

--kokoro_voice

str

default:"bm_fable"

Voice identifier. See the Kokoro VOICES.md for the full list.

--kokoro_lang_code

str

default:"b"

Language code: a (American English), b (British English), e (Spanish), f (French), h (Hindi), i (Italian), j (Japanese), p (Portuguese), z (Chinese).

--kokoro_speed

float

default:"1.0"

Speech speed multiplier. Values above 1.0 speed up; values below 1.0 slow down.

--kokoro_blocksize

int

default:"512"

Audio chunk size in samples for streaming output. Default is 512.

British English

speech-to-speech \
    --tts kokoro \
    --kokoro_voice bm_fable \
    --kokoro_lang_code b

Japanese (Apple Silicon)

speech-to-speech \
    --local_mac_optimal_settings \
    --tts kokoro \
    --kokoro_voice jf_alpha \
    --kokoro_lang_code j

Spanish

speech-to-speech \
    --tts kokoro \
    --kokoro_voice ef_dora \
    --kokoro_lang_code e

When --language auto is set, the Kokoro handler maps incoming STT language codes to the nearest Kokoro language and auto-switches voice. Languages without a native Kokoro voice (e.g. German, Dutch) fall back to British English (b).
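
The language switching described above might be sketched as follows. The default voices come from the table of supported languages; the STT-code mapping is an assumption (in particular, mapping "en" to "b" simply mirrors the documented default), and kokoro_lang_and_voice is a hypothetical helper rather than the handler's own function.

```python
# Default voices copied from the supported-languages table.
DEFAULT_VOICES = {
    "a": "af_heart", "b": "bm_fable", "e": "ef_dora", "f": "ff_siwis",
    "h": "hf_alpha", "i": "if_sara", "j": "jf_alpha", "p": "pf_dora",
    "z": "zf_xiaobei",
}

# Assumed mapping from STT language codes to Kokoro language codes.
STT_TO_KOKORO = {
    "en": "b", "es": "e", "fr": "f", "hi": "h",
    "it": "i", "ja": "j", "pt": "p", "zh": "z",
}


def kokoro_lang_and_voice(stt_code, fallback="b"):
    """Return (kokoro_lang_code, voice) for an STT language code."""
    primary = stt_code.lower().split("-")[0]  # "pt-BR" -> "pt"
    lang = STT_TO_KOKORO.get(primary, fallback)  # unsupported languages fall back
    return lang, DEFAULT_VOICES[lang]
```

For example, "ja" resolves to ("j", "jf_alpha"), while "de" has no native voice and falls back to ("b", "bm_fable").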

`chatTTS` — ChatTTS

ChatTTSHandler uses ChatTTS for streaming chunk-based synthesis. Arguments use the --chat_tts_* prefix.

Argument	Default	Description
`--chat_tts_device`	`cuda`	Device for inference
`--chat_tts_stream`	`True`	Enable chunk-level streaming
`--chat_tts_chunk_size`	`512`	Tokens per synthesis chunk

CUDA streaming

speech-to-speech \
    --tts chatTTS \
    --chat_tts_device cuda \
    --chat_tts_stream true \
    --chat_tts_chunk_size 512

`facebookMMS` — Facebook MMS

FacebookMMSTTSHandler uses Meta’s Massively Multilingual Speech (MMS) TTS models. It maps STT language codes (e.g. en, fr, es) to MMS model suffixes (e.g. eng, fra, spa) and reloads the model automatically on language changes. Use this backend for multilingual pipelines where the TTS must match the user’s language. Arguments use the --facebook_mms_* prefix. The TTS language is set with --tts_language.

Argument	Default	Description
`--facebook_mms_model_name`	`facebook/mms-tts-eng`	HuggingFace model ID for the MMS TTS model
`--tts_language`	`en`	Language code for synthesis (e.g. `en`, `fr`, `es`)
`--facebook_mms_device`	`cuda`	Device for inference
`--facebook_mms_torch_dtype`	`float32`	Precision: `float32`, `float16`, `bfloat16`
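
The reload-on-language-change behaviour can be sketched as below. The class and the load_model callable are hypothetical, only the three language pairs named in the text are mapped, and the facebook/mms-tts-<suffix> pattern is inferred from the default model ID.

```python
# Only the pairs mentioned in the text; the real mapping covers more languages.
STT_TO_MMS = {"en": "eng", "fr": "fra", "es": "spa"}


class MMSModelSwitcher:
    """Reload the TTS model only when the incoming language changes."""

    def __init__(self, load_model, language="en"):
        self._load_model = load_model  # hypothetical callable: model_id -> model
        self.language = None
        self.model = None
        self.set_language(language)

    def set_language(self, stt_code):
        """Switch language; return True if a model reload happened."""
        if stt_code == self.language:
            return False  # same language: keep the loaded model
        suffix = STT_TO_MMS[stt_code]  # raises KeyError for unmapped codes
        self.model = self._load_model(f"facebook/mms-tts-{suffix}")
        self.language = stt_code
        return True
```

Skipping the reload when the language is unchanged matters because loading a model is far more expensive than synthesizing a sentence.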

English

speech-to-speech \
    --tts facebookMMS \
    --facebook_mms_device cuda \
    --tts_language en

Auto language switching

speech-to-speech \
    --stt whisper \
    --language auto \
    --tts facebookMMS \
    --facebook_mms_device cuda \
    --tts_language auto

Benchmarking TTS backends on Apple Silicon

python scripts/benchmark_tts.py \
    --handlers qwen3 \
    --iterations 3 \
    --qwen3_mlx_quantizations bf16 4bit 6bit 8bit

This runs separate benchmark entries for qwen3[bf16], qwen3[4bit], qwen3[6bit], and qwen3[8bit] and prints time-to-first-audio and real-time factor for each variant.
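
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed from a stream of audio chunks. It is not taken from scripts/benchmark_tts.py; measure_stream is a hypothetical helper, and it assumes the usual definition of real-time factor as synthesis time divided by audio duration (values below 1.0 mean faster than real time).

```python
import time


def measure_stream(chunks, sample_rate=16000, clock=time.perf_counter):
    """Consume an iterable of audio chunks; return (time_to_first_audio, real_time_factor)."""
    start = clock()
    first_audio = None
    total_samples = 0
    for chunk in chunks:
        if first_audio is None:
            # Latency until the first chunk of audio is available.
            first_audio = clock() - start
        total_samples += len(chunk)
    elapsed = clock() - start
    if total_samples == 0:
        raise ValueError("stream produced no audio")
    audio_seconds = total_samples / sample_rate
    return first_audio, elapsed / audio_seconds
```

Passing a generator that yields sample arrays as they are synthesized measures the streaming path; the clock parameter exists only so the function can be tested with a fake timer.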
