The speech-to-speech pipeline supports six Speech-to-Text backends that cover the full spectrum from cloud-free local inference on Apple Silicon to high-accuracy multilingual transcription on CUDA. Select a backend with --stt; all other STT parameters follow an --<handler_prefix>_* naming convention described below.
Backend selection
| --stt value | Handler class | Best for |
|---|---|---|
| whisper | WhisperSTTHandler | Any Whisper checkpoint on Hugging Face Hub via Transformers |
| whisper-mlx | LightningWhisperSTTHandler | Fast Whisper inference on Apple Silicon via Lightning Whisper MLX |
| mlx-audio-whisper | MLXAudioWhisperSTTHandler | Fast Whisper on Apple Silicon via mlx-audio |
| faster-whisper | FasterWhisperSTTHandler | CTranslate2-accelerated Whisper on CUDA/CPU |
| parakeet-tdt | ParakeetTDTSTTHandler | Default; streaming ASR on Apple Silicon (MLX) and CUDA (nano-parakeet) |
| paraformer | ParaformerSTTHandler | Chinese-optimized FunASR Paraformer model |
The default backend is parakeet-tdt. It is the only STT backend included in the standard pip install speech-to-speech installation. All other backends require an optional extra or a separate install.
Language support matrix
| Backend | Supported languages | Auto-detect |
|---|---|---|
| whisper | en, fr, es, zh, ja, ko, hi, de, pt, pl, it, nl (+ fallback) | ✅ (--language auto) |
| whisper-mlx | en, fr, es, zh, ja, ko, hi, de, pt, pl, it, nl | ✅ |
| mlx-audio-whisper | en, fr, es, zh, ja, ko, hi, de, pt, pl, it, nl | ✅ |
| faster-whisper | Depends on checkpoint; gen arg --faster_whisper_stt_gen_language | Depends on model |
| parakeet-tdt | 25 European languages (en, de, fr, es, it, pt, nl, pl, ru, uk, cs, sk, hu, ro, bg, hr, sl, sr, da, no, sv, fi, et, lv, lt) | ✅ (lingua-py) |
| paraformer | Depends on FunASR checkpoint; default paraformer-zh is Chinese | ❌ |
Use --language <code> to fix the language or --language auto to enable per-utterance language detection. Pass --enable_lang_prompt to append a "Please reply to my message in <language>" instruction to the LLM so smaller models stay in the detected language.
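The effect of --enable_lang_prompt can be pictured with a short sketch. This is illustrative only and not the pipeline's code: the function name, the `LANGUAGE_NAMES` mapping, and the exact placement of the instruction relative to the user message are all assumptions.

```python
# Hypothetical mapping from language codes to names used in the instruction.
# Only a subset is shown; the real pipeline's mapping is not documented here.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "es": "Spanish",
    "de": "German",
    "zh": "Chinese",
    "ja": "Japanese",
}


def add_language_prompt(user_text, language_code, enabled=True):
    """Append the reply-language instruction when the feature is enabled.

    Unknown codes leave the message unchanged (an assumption of this sketch).
    """
    name = LANGUAGE_NAMES.get(language_code)
    if not enabled or name is None:
        return user_text
    return f"{user_text} Please reply to my message in {name}."


print(add_language_prompt("Bonjour, comment ça va ?", "fr"))
```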
Live transcription flags
Two flags enable in-stream partial transcription (Parakeet TDT and Paraformer support this natively). When --enable_live_transcription is set, the VAD stage emits VADAudio(mode="progressive") chunks at --realtime_processing_pause intervals, the STT handler transcribes them, and the realtime server forwards partial transcripts as conversation.item.input_audio_transcription.delta WebSocket events.
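On the client side, the partial transcripts can be picked out of the WebSocket message stream by event type. The sketch below is a minimal illustration, assuming each message is a JSON object with a "type" field and that delta events carry their text in a "delta" field (the convention of the OpenAI Realtime API); this page only names the event type, so the payload shape is an assumption. Whether each delta is incremental or a full re-transcription is also not specified here, so the helper just collects the payloads in order.

```python
import json

DELTA_EVENT = "conversation.item.input_audio_transcription.delta"


def collect_partials(raw_messages):
    """Return the text of every partial-transcript event, in arrival order.

    Assumption: messages are JSON strings with a "type" field, and delta
    events hold their text under "delta". Other event types are ignored.
    """
    partials = []
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("type") == DELTA_EVENT:
            partials.append(event.get("delta", ""))
    return partials


messages = [
    json.dumps({"type": DELTA_EVENT, "delta": "hello"}),
    json.dumps({"type": "session.created"}),
    json.dumps({"type": DELTA_EVENT, "delta": "hello world"}),
]
print(collect_partials(messages))
```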
Argument prefix pattern
STT parameters follow the --<handler_prefix>_* naming convention, with generation parameters under --<handler_prefix>_gen_*. The prefix for each backend is:
| Backend | Argument prefix | Example |
|---|---|---|
| whisper | --stt_ | --stt_model_name distil-whisper/distil-large-v3 |
| whisper-mlx | --stt_ | --stt_model_name large-v3 |
| mlx-audio-whisper | --mlx_audio_whisper_ | --mlx_audio_whisper_model_name mlx-community/whisper-large-v3-turbo |
| faster-whisper | --faster_whisper_stt_ | --faster_whisper_stt_model_name large-v3 |
| parakeet-tdt | --parakeet_tdt_ | --parakeet_tdt_language de |
| paraformer | --paraformer_stt_ | --paraformer_stt_model_name paraformer-zh |
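The prefix convention lends itself to a small lookup when launching the pipeline from a script. The following is a minimal sketch, not part of the package: `STT_PREFIXES` and `build_stt_args` are hypothetical names, and the function only assembles a flag list from the conventions in the table above. How the resulting list is passed to the pipeline entry point is left out.

```python
# Hypothetical helper: maps each --stt value to its argument prefix,
# copied from the prefix table on this page.
STT_PREFIXES = {
    "whisper": "stt",
    "whisper-mlx": "stt",
    "mlx-audio-whisper": "mlx_audio_whisper",
    "faster-whisper": "faster_whisper_stt",
    "parakeet-tdt": "parakeet_tdt",
    "paraformer": "paraformer_stt",
}


def build_stt_args(backend, params=None, gen_params=None):
    """Return a flat list of CLI arguments for the chosen STT backend.

    params become --<prefix>_<name>, gen_params become --<prefix>_gen_<name>.
    """
    if backend not in STT_PREFIXES:
        raise ValueError(f"unknown STT backend: {backend!r}")
    prefix = STT_PREFIXES[backend]
    args = ["--stt", backend]
    for name, value in (params or {}).items():
        args += [f"--{prefix}_{name}", str(value)]
    for name, value in (gen_params or {}).items():
        args += [f"--{prefix}_gen_{name}", str(value)]
    return args


print(build_stt_args("faster-whisper", {"model_name": "large-v3"}, {"language": "en"}))
```

Keeping the prefix in one table means a backend switch only changes the `backend` argument, not every flag name at the call site.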
Per-backend details
whisper — Transformers Whisper
Handler: WhisperSTTHandler

Loads any Whisper-compatible checkpoint from the Hugging Face Hub via the Transformers library. Supports distil-whisper/distil-large-v3 (default), openai/whisper-large-v3, and any other Whisper checkpoint.

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --stt_model_name | distil-whisper/distil-large-v3 | Hugging Face model ID |
| --stt_device | cuda | Device: cuda, cpu, mps |
| --stt_torch_dtype | float16 | Precision: float16, bfloat16, float32 |
| --stt_compile_mode | None | torch.compile mode: default, reduce-overhead, max-autotune |
| --stt_gen_max_new_tokens | 128 | Maximum tokens to generate |
| --stt_gen_num_beams | 1 | Number of beams for beam search; 1 = greedy decoding |
| --stt_gen_return_timestamps | False | Whether to return timestamps with transcriptions |
| --stt_gen_task | transcribe | Task to perform; typically transcribe |
| --language | en | Language code or auto for detection |

Language detection reads the <\|lang\|> token from the generated IDs. If the detected language is outside the supported list, the handler falls back to the last known language.
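The fallback behaviour described for the Whisper handler can be sketched in a few lines. This is an illustration under stated assumptions, not the handler's source: `parse_language_token` and `LanguageTracker` are hypothetical names, and the sketch works on an already decoded token string rather than on generated IDs.

```python
import re

# Languages listed for the whisper backend in the support matrix.
SUPPORTED_LANGUAGES = ("en", "fr", "es", "zh", "ja", "ko", "hi", "de", "pt", "pl", "it", "nl")

# Whisper language tokens look like <|en|>, <|fr|>, and so on.
_LANG_TOKEN = re.compile(r"<\|([a-z]{2,3})\|>")


def parse_language_token(token):
    """Extract the language code from a decoded token, or None if malformed."""
    match = _LANG_TOKEN.fullmatch(token)
    return match.group(1) if match else None


class LanguageTracker:
    """Remember the last supported language and fall back to it when needed."""

    def __init__(self, initial="en"):
        self.last_language = initial

    def resolve(self, token):
        code = parse_language_token(token)
        if code in SUPPORTED_LANGUAGES:
            self.last_language = code
        # Unsupported or unparsable detections keep the previous language.
        return self.last_language


tracker = LanguageTracker()
print(tracker.resolve("<|fr|>"), tracker.resolve("<|sw|>"))
```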
whisper-mlx — Lightning Whisper MLX
Handler: LightningWhisperSTTHandler

Uses Lightning Whisper MLX for fast on-device Whisper inference on Apple Silicon. It takes the same --stt_model_name and --language flags as the standard Whisper backend.

Language detection falls back to the last supported language when the model returns a code outside the supported list.
mlx-audio-whisper — MLX Audio Whisper
Handler: MLXAudioWhisperSTTHandler

Uses mlx-audio for Whisper inference on Apple Silicon. The model is controlled by --mlx_audio_whisper_model_name; language detection still uses the shared --language flag.

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --mlx_audio_whisper_model_name | mlx-community/whisper-large-v3-turbo | MLX Audio Whisper model ID or local path |
| --mlx_audio_whisper_gen_kwargs | {} | Additional generation kwargs passed to the model |
faster-whisper — CTranslate2 Whisper
Handler: FasterWhisperSTTHandler

Uses faster-whisper (CTranslate2) for quantized, low-latency Whisper inference on CUDA or CPU. Language is set via the generation kwarg --faster_whisper_stt_gen_language rather than the shared --language flag.

Install: pip install "speech-to-speech[faster-whisper]"

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --faster_whisper_stt_model_name | tiny.en | Model: tiny, base, small, medium, large-v3, distil-large-v3, etc. |
| --faster_whisper_stt_device | auto | Device: cpu, cuda, auto |
| --faster_whisper_stt_compute_type | auto | Quantization: int8, float16, bfloat16, auto, etc. |
| --faster_whisper_stt_gen_language | en | Language code for transcription |
| --faster_whisper_stt_gen_max_new_tokens | 128 | Max tokens to generate |
| --faster_whisper_stt_gen_beam_size | 1 | Number of beams for beam search; 1 = greedy |
| --faster_whisper_stt_gen_return_timestamps | False | Whether to return timestamps |
| --faster_whisper_stt_gen_task | transcribe | Task to perform; typically transcribe |
parakeet-tdt — NVIDIA Parakeet TDT (default)
Handler: ParakeetTDTSTTHandler

NVIDIA Parakeet TDT 0.6B v3 is a 600M-parameter multilingual ASR model supporting 25 European languages. It is the default STT backend and the only one bundled in the standard install.

Backend dispatch:

- Apple Silicon (MPS): loads mlx-community/parakeet-tdt-0.6b-v3 via mlx-audio. Sub-100 ms latency per utterance.
- CUDA / CPU: loads nvidia/parakeet-tdt-0.6b-v3 via nano-parakeet (pure PyTorch, no NeMo dependency).

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --parakeet_tdt_model_name | auto | Override default model ID |
| --parakeet_tdt_device | auto | auto, cuda, mps, cpu |
| --parakeet_tdt_compute_type | float16 | float16 or float32 |
| --parakeet_tdt_language | None | Fix language; omit for auto-detection |
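The backend dispatch for Parakeet TDT amounts to a lookup from device to model ID and runtime. The sketch below mirrors the dispatch described above; `resolve_parakeet` and `PARAKEET_DEFAULTS` are hypothetical names, and the function expects an already resolved device because this page does not describe how auto picks one.

```python
# Default model ID and runtime per device, as described in the dispatch list.
PARAKEET_DEFAULTS = {
    "mps": ("mlx-community/parakeet-tdt-0.6b-v3", "mlx-audio"),
    "cuda": ("nvidia/parakeet-tdt-0.6b-v3", "nano-parakeet"),
    "cpu": ("nvidia/parakeet-tdt-0.6b-v3", "nano-parakeet"),
}


def resolve_parakeet(device, model_name="auto"):
    """Return (model_id, runtime) for a resolved device.

    A model_name other than "auto" overrides the default model ID,
    matching the role of --parakeet_tdt_model_name.
    """
    if device not in PARAKEET_DEFAULTS:
        raise ValueError(f"unsupported device: {device!r}")
    default_model, runtime = PARAKEET_DEFAULTS[device]
    model_id = default_model if model_name == "auto" else model_name
    return model_id, runtime


print(resolve_parakeet("mps"))
print(resolve_parakeet("cuda"))
```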
On Apple Silicon, Parakeet TDT shares the MLX execution context with MLX-backed TTS (e.g. Qwen3-TTS or Kokoro MLX). The handlers serialize access via an MLXLockContext to avoid contention.
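Serialized access of this kind is commonly built on a shared lock wrapped in a context manager. The sketch below shows the general pattern with Python's standard library; it is a stand-in, not the project's MLXLockContext, whose implementation is not shown on this page. The names `mlx_lock_context` and `run_exclusive` are hypothetical.

```python
import threading
from contextlib import contextmanager

# One process-wide lock shared by every handler that touches the accelerator.
_MLX_LOCK = threading.Lock()


@contextmanager
def mlx_lock_context():
    """Hold the shared lock for the duration of the with-block."""
    with _MLX_LOCK:
        yield


def run_exclusive(name, log):
    # Everything inside the block runs without interleaving from other threads.
    with mlx_lock_context():
        log.append(("start", name))
        log.append(("end", name))


log = []
threads = [threading.Thread(target=run_exclusive, args=(n, log)) for n in ("stt", "tts")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)
```

Each start entry is immediately followed by the matching end entry, whichever thread acquires the lock first.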
paraformer — FunASR Paraformer
Handler: ParaformerSTTHandler

Uses FunASR to load Paraformer models. The default model paraformer-zh is optimized for Mandarin Chinese. The handler supports live transcription via PartialTranscription messages.

Install: pip install "speech-to-speech[paraformer]"

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --paraformer_stt_model_name | paraformer-zh | FunASR model name or path |
| --paraformer_stt_device | cuda | Device for inference |