The speech-to-speech pipeline supports six speech-to-text (STT) backends, ranging from fully local inference on Apple Silicon to high-accuracy multilingual transcription on CUDA. Select a backend with --stt; every other STT parameter follows the --<handler_prefix>_* naming convention described below.

Backend selection

| --stt value | Handler class | Best for |
|---|---|---|
| whisper | WhisperSTTHandler | Any Whisper checkpoint on Hugging Face Hub via Transformers |
| whisper-mlx | LightningWhisperSTTHandler | Fast Whisper inference on Apple Silicon via Lightning Whisper MLX |
| mlx-audio-whisper | MLXAudioWhisperSTTHandler | Fast Whisper on Apple Silicon via mlx-audio |
| faster-whisper | FasterWhisperSTTHandler | CTranslate2-accelerated Whisper on CUDA/CPU |
| parakeet-tdt | ParakeetTDTSTTHandler | Default; streaming ASR on Apple Silicon (MLX) and CUDA (nano-parakeet) |
| paraformer | ParaformerSTTHandler | Chinese-optimized FunASR Paraformer model |

The default backend is parakeet-tdt. It is the only STT backend included in a standard pip install speech-to-speech; every other backend requires an optional extra or a separate install.

Language support matrix

| Backend | Supported languages | Auto-detect |
|---|---|---|
| whisper | en, fr, es, zh, ja, ko, hi, de, pt, pl, it, nl (+ fallback) | ✅ (--language auto) |
| whisper-mlx | en, fr, es, zh, ja, ko, hi, de, pt, pl, it, nl | |
| mlx-audio-whisper | en, fr, es, zh, ja, ko, hi, de, pt, pl, it, nl | |
| faster-whisper | Depends on checkpoint; gen arg --faster_whisper_stt_gen_language | Depends on model |
| parakeet-tdt | 25 European languages (en, de, fr, es, it, pt, nl, pl, ru, uk, cs, sk, hu, ro, bg, hr, sl, sr, da, no, sv, fi, et, lv, lt) | ✅ (lingua-py) |
| paraformer | Depends on FunASR checkpoint; default paraformer-zh is Chinese | |

Use --language <code> to fix the language or --language auto to enable per-utterance language detection. Pass --enable_lang_prompt to append a "Please reply to my message in <language>" instruction to the LLM so smaller models stay in the detected language.

Live transcription flags

Two flags enable in-stream partial transcription (Parakeet TDT and Paraformer support this natively):
--enable_live_transcription           # emit PartialTranscription messages during speech
--live_transcription_update_interval  # seconds between progressive STT calls (default 0.5)
When --enable_live_transcription is set, the VAD stage emits VADAudio(mode="progressive") chunks at --realtime_processing_pause intervals, the STT handler transcribes them, and the realtime server forwards partial transcripts as conversation.item.input_audio_transcription.delta WebSocket events.
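
On the client side, consuming these partial transcripts could look roughly like the sketch below. It is an illustration only: the event type string comes from the paragraph above, but the assumption that each WebSocket message is a JSON object carrying the partial text in a "delta" field is not confirmed here, and partial_transcripts is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator

PARTIAL_EVENT = "conversation.item.input_audio_transcription.delta"


def partial_transcripts(raw_messages: Iterable[str]) -> Iterator[str]:
    """Yield the text of each partial-transcript event in a message stream.

    Assumption: each message is a JSON object with a "type" field and the
    partial text in a "delta" field. Check the server's actual payload.
    """
    for raw in raw_messages:
        event = json.loads(raw)
        if event.get("type") == PARTIAL_EVENT:
            yield event.get("delta", "")


# Hand-written sample messages, not captured from a real session.
sample = [
    json.dumps({"type": PARTIAL_EVENT, "delta": "turn on"}),
    json.dumps({"type": "some.other.event"}),
    json.dumps({"type": PARTIAL_EVENT, "delta": "turn on the lights"}),
]
for text in partial_transcripts(sample):
    print(text)
```

The sketch only filters by event type; whether successive partials replace or extend the previous text depends on the server and is not assumed here.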

Argument prefix pattern

STT parameters follow the --<handler_prefix>_* naming convention, with generation parameters under --<handler_prefix>_gen_*. The prefix for each backend is:

| Backend | Argument prefix | Example |
|---|---|---|
| whisper | --stt_ | --stt_model_name distil-whisper/distil-large-v3 |
| whisper-mlx | --stt_ | --stt_model_name large-v3 |
| mlx-audio-whisper | --mlx_audio_whisper_ | --mlx_audio_whisper_model_name mlx-community/whisper-large-v3-turbo |
| faster-whisper | --faster_whisper_stt_ | --faster_whisper_stt_model_name large-v3 |
| parakeet-tdt | --parakeet_tdt_ | --parakeet_tdt_language de |
| paraformer | --paraformer_stt_ | --paraformer_stt_model_name paraformer-zh |
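
When generating launch commands from a script, the naming convention can be captured in a small lookup. The following is a minimal sketch; STT_ARG_PREFIX and stt_flag are hypothetical names that are not part of the speech-to-speech package, and the prefixes are copied from the prefix table.

```python
# Hypothetical helper, not part of the speech-to-speech package.
# Prefixes are copied from the argument prefix table.
STT_ARG_PREFIX = {
    "whisper": "stt",
    "whisper-mlx": "stt",
    "mlx-audio-whisper": "mlx_audio_whisper",
    "faster-whisper": "faster_whisper_stt",
    "parakeet-tdt": "parakeet_tdt",
    "paraformer": "paraformer_stt",
}


def stt_flag(backend: str, param: str, gen: bool = False) -> str:
    """Return the CLI flag for `param` on `backend`.

    With gen=True the flag targets a generation parameter,
    i.e. --<handler_prefix>_gen_<param>.
    """
    prefix = STT_ARG_PREFIX[backend]  # raises KeyError for an unknown backend
    middle = "gen_" if gen else ""
    return f"--{prefix}_{middle}{param}"


print(stt_flag("faster-whisper", "model_name"))          # --faster_whisper_stt_model_name
print(stt_flag("faster-whisper", "language", gen=True))  # --faster_whisper_stt_gen_language
print(stt_flag("whisper-mlx", "model_name"))             # --stt_model_name
```

Shared flags such as --language and --stt itself sit outside this convention and are passed unprefixed.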

Per-backend details

Handler: WhisperSTTHandler

Loads any Whisper-compatible checkpoint from the Hugging Face Hub via the Transformers library. Supports distil-whisper/distil-large-v3 (default), openai/whisper-large-v3, and any other Whisper checkpoint.

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --stt_model_name | distil-whisper/distil-large-v3 | Hugging Face model ID |
| --stt_device | cuda | Device: cuda, cpu, mps |
| --stt_torch_dtype | float16 | Precision: float16, bfloat16, float32 |
| --stt_compile_mode | None | torch.compile mode: default, reduce-overhead, max-autotune |
| --stt_gen_max_new_tokens | 128 | Maximum tokens to generate |
| --stt_gen_num_beams | 1 | Number of beams for beam search; 1 = greedy decoding |
| --stt_gen_return_timestamps | False | Whether to return timestamps with transcriptions |
| --stt_gen_task | transcribe | Task to perform; typically transcribe |
| --language | en | Language code or auto for detection |

Language detection reads the <|lang|> token from the generated IDs. If the detected language is outside the supported list, the handler falls back to the last known language.
speech-to-speech \
    --stt whisper \
    --stt_model_name distil-whisper/distil-large-v3 \
    --stt_device cuda \
    --language en
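
The fallback rule for the Whisper handler can be illustrated with a short sketch. It is a simplification under stated assumptions: it works on decoded token text rather than on the generated IDs the handler actually reads, and resolve_language, SUPPORTED_LANGUAGES, and LANG_TOKEN are hypothetical names. The language list is the one given for whisper in the support matrix.

```python
import re

# Language codes listed for the whisper backend in the support matrix.
SUPPORTED_LANGUAGES = {
    "en", "fr", "es", "zh", "ja", "ko", "hi", "de", "pt", "pl", "it", "nl",
}

# Whisper language tokens look like "<|en|>" once decoded to text.
LANG_TOKEN = re.compile(r"<\|([a-z]{2,3})\|>")


def resolve_language(decoded_prefix: str, last_language: str) -> str:
    """Pick the language for this utterance, falling back when needed."""
    match = LANG_TOKEN.search(decoded_prefix)
    if match and match.group(1) in SUPPORTED_LANGUAGES:
        return match.group(1)
    # No language token found, or the code is outside the supported list.
    return last_language


last = "en"
for prefix in [
    "<|startoftranscript|><|fr|><|transcribe|>",  # supported: switch to fr
    "<|startoftranscript|><|sv|><|transcribe|>",  # unsupported: keep fr
]:
    last = resolve_language(prefix, last)
    print(last)
```

Keeping the previous language rather than defaulting to English means a single misdetected utterance does not switch the conversation language.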

Handler: LightningWhisperSTTHandler

Uses Lightning Whisper MLX for fast on-device Whisper inference on Apple Silicon. It takes the same --stt_model_name and --language flags as the standard Whisper backend.

Language detection falls back to the last supported language when the model returns a code outside the supported list.
speech-to-speech \
    --stt whisper-mlx \
    --stt_model_name large-v3 \
    --language auto \
    --device mps

Handler: MLXAudioWhisperSTTHandler

Uses mlx-audio for Whisper inference on Apple Silicon. The model is selected with --mlx_audio_whisper_model_name; language detection still uses the shared --language flag.

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --mlx_audio_whisper_model_name | mlx-community/whisper-large-v3-turbo | MLX Audio Whisper model ID or local path |
| --mlx_audio_whisper_gen_kwargs | {} | Additional generation kwargs passed to the model |

speech-to-speech \
    --stt mlx-audio-whisper \
    --mlx_audio_whisper_model_name mlx-community/whisper-large-v3-turbo \
    --language auto

Handler: FasterWhisperSTTHandler

Uses faster-whisper (CTranslate2) for quantized, low-latency Whisper inference on CUDA or CPU. Language is set via the generation kwarg --faster_whisper_stt_gen_language rather than the shared --language flag.

Install: pip install "speech-to-speech[faster-whisper]"

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --faster_whisper_stt_model_name | tiny.en | Model: tiny, base, small, medium, large-v3, distil-large-v3, etc. |
| --faster_whisper_stt_device | auto | Device: cpu, cuda, auto |
| --faster_whisper_stt_compute_type | auto | Quantization: int8, float16, bfloat16, auto, etc. |
| --faster_whisper_stt_gen_language | en | Language code for transcription |
| --faster_whisper_stt_gen_max_new_tokens | 128 | Max tokens to generate |
| --faster_whisper_stt_gen_beam_size | 1 | Number of beams for beam search; 1 = greedy |
| --faster_whisper_stt_gen_return_timestamps | False | Whether to return timestamps |
| --faster_whisper_stt_gen_task | transcribe | Task to perform; typically transcribe |

speech-to-speech \
    --stt faster-whisper \
    --faster_whisper_stt_model_name large-v3 \
    --faster_whisper_stt_device cuda \
    --faster_whisper_stt_compute_type float16 \
    --faster_whisper_stt_gen_language en

Handler: ParakeetTDTSTTHandler

NVIDIA Parakeet TDT 0.6B v3 is a 600M-parameter multilingual ASR model supporting 25 European languages. It is the default STT backend and the only one bundled in the standard install.

Backend dispatch:
  • Apple Silicon (MPS): loads mlx-community/parakeet-tdt-0.6b-v3 via mlx-audio. Sub-100 ms latency per utterance.
  • CUDA / CPU: loads nvidia/parakeet-tdt-0.6b-v3 via nano-parakeet (pure PyTorch, no NeMo dependency).

Language detection uses lingua-py on the transcribed text (text-based, not acoustic). For utterances shorter than 20 characters, lingua-py is skipped and the last known language is reused.

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --parakeet_tdt_model_name | auto | Override default model ID |
| --parakeet_tdt_device | auto | auto, cuda, mps, cpu |
| --parakeet_tdt_compute_type | float16 | float16 or float32 |
| --parakeet_tdt_language | None | Fix language; omit for auto-detection |

speech-to-speech \
    --stt parakeet-tdt
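
The short-utterance rule for Parakeet language detection can be sketched as follows. This is an illustration, not the handler's code: detect_language and MIN_CHARS_FOR_DETECTION are hypothetical names, the detector is passed in as a plain callable so the example runs without lingua-py installed, and reusing the last language when the detector is inconclusive is an assumption.

```python
from typing import Callable, Optional

MIN_CHARS_FOR_DETECTION = 20  # threshold stated in the Parakeet description


def detect_language(
    text: str,
    last_language: str,
    detector: Callable[[str], Optional[str]],
) -> str:
    """Return a language code for `text`, reusing `last_language` when the
    transcript is too short for text-based detection."""
    if len(text) < MIN_CHARS_FOR_DETECTION:
        return last_language
    detected = detector(text)
    # Assumption: an inconclusive detector result also keeps the last language.
    return detected or last_language


# Stand-in detector for the demo; a real one would wrap lingua-py.
def fake_detector(text: str) -> Optional[str]:
    return "de" if "Guten" in text else None


print(detect_language("Guten Tag", "en", fake_detector))
print(detect_language("Guten Tag, wie geht es Ihnen heute?", "en", fake_detector))
```

The first call is under the threshold and keeps "en"; the second is long enough to reach the detector and switches to "de".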
On Apple Silicon, Parakeet TDT shares the MLX execution context with MLX-backed TTS (e.g. Qwen3-TTS or Kokoro MLX). The handlers serialize access via an MLXLockContext to avoid contention.
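
The serialization idea can be pictured with a plain threading lock. The sketch below is a generic stand-in, not the project's MLXLockContext, whose interface is not described here; run_on_mlx, _mlx_lock, and call_log are hypothetical names.

```python
import threading

# Stand-in for the shared lock; the real MLXLockContext may differ.
_mlx_lock = threading.Lock()
call_log = []


def run_on_mlx(handler_name: str, work) -> None:
    """Run `work` while holding the shared lock, recording entry and exit."""
    with _mlx_lock:
        call_log.append(f"{handler_name}:start")
        work()
        call_log.append(f"{handler_name}:end")


threads = [
    threading.Thread(target=run_on_mlx, args=("stt", lambda: None)),
    threading.Thread(target=run_on_mlx, args=("tts", lambda: None)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each start is immediately followed by its own end: no interleaving.
print(call_log)
```

Which handler runs first is up to the scheduler, but the two critical sections never overlap.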

Handler: ParaformerSTTHandler

Uses FunASR to load Paraformer models. The default model paraformer-zh is optimized for Mandarin Chinese. The handler supports live transcription via PartialTranscription messages.

Install: pip install "speech-to-speech[paraformer]"

Key arguments:

| Argument | Default | Description |
|---|---|---|
| --paraformer_stt_model_name | paraformer-zh | FunASR model name or path |
| --paraformer_stt_device | cuda | Device for inference |

speech-to-speech \
    --stt paraformer \
    --paraformer_stt_model_name paraformer-zh

Multi-language pipeline example

# Automatic language detection with Parakeet on Mac, MLX LLM
speech-to-speech \
    --local_mac_optimal_settings \
    --stt parakeet-tdt \
    --language auto \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16 \
    --enable_lang_prompt

# Override to Whisper large-v3 for broader language coverage (Chinese)
speech-to-speech \
    --local_mac_optimal_settings \
    --stt whisper-mlx \
    --stt_model_name large-v3 \
    --language zh \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
