

The STT stage converts raw audio frames into text that is forwarded to the LLM. Five backends are available: Parakeet TDT (the default), Whisper (Transformers), Faster-Whisper, Paraformer, and MLX Audio Whisper. Select one with --stt <value>, which is defined on ModuleArguments. Each backend's flags carry their own prefix so that all argument classes can coexist in the same namespace. Pick a backend, then pass the flags with the matching prefix:
# Parakeet TDT (default)
speech-to-speech --stt parakeet-tdt --parakeet_tdt_device auto

# Whisper (Transformers)
speech-to-speech --stt whisper --stt_model_name openai/whisper-large-v3

# Faster-Whisper
speech-to-speech --stt faster-whisper --faster_whisper_stt_model_name large-v3

# Paraformer
speech-to-speech --stt paraformer --paraformer_stt_model_name paraformer-zh

# MLX Audio Whisper (Apple Silicon)
speech-to-speech --stt mlx-audio-whisper --mlx_audio_whisper_model_name mlx-community/whisper-large-v3-turbo
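The prefix convention can be illustrated with a minimal sketch in Python. It assumes plain argparse and two hypothetical dataclasses (WhisperSTTArguments, FasterWhisperSTTArguments); the class names, the build_parser helper, and the Faster-Whisper default are illustrative and not taken from the project. Only the flag names mirror this page.

```python
import argparse
from dataclasses import dataclass, fields


# Hypothetical argument classes; only the flag names come from this page.
@dataclass
class WhisperSTTArguments:
    stt_model_name: str = "distil-whisper/distil-large-v3"
    stt_device: str = "cuda"


@dataclass
class FasterWhisperSTTArguments:
    # Placeholder default for illustration; the page does not state one.
    faster_whisper_stt_model_name: str = "large-v3"


def build_parser(*arg_classes):
    """Register every dataclass field as a flag on one shared parser."""
    # allow_abbrev=False keeps --stt from being read as a prefix of --stt_model_name.
    parser = argparse.ArgumentParser(prog="speech-to-speech", allow_abbrev=False)
    parser.add_argument("--stt", default="parakeet-tdt")
    for cls in arg_classes:
        for f in fields(cls):
            # Unique prefixes mean no two classes register the same flag name.
            parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return parser


parser = build_parser(WhisperSTTArguments, FasterWhisperSTTArguments)
ns = parser.parse_args(["--stt", "whisper", "--stt_model_name", "openai/whisper-large-v3"])
print(ns.stt, ns.stt_model_name, ns.faster_whisper_stt_model_name)
```

If both classes used an unprefixed model_name field, the second add_argument call would raise a conflict error, which is the problem the per-backend prefixes avoid.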
Prefix: --stt_
Backend value: --stt whisper
Uses any Whisper checkpoint available on the Hugging Face Hub through the Transformers library, including openai/whisper-large-v3 and distil-whisper/distil-large-v3.
stt_model_name
string
default:"distil-whisper/distil-large-v3"
The Hugging Face Hub model ID of the Whisper checkpoint to load. Any compatible checkpoint works, including distilled variants.
speech-to-speech --stt whisper --stt_model_name openai/whisper-large-v3
stt_device
string
default:"cuda"
Device to run the Whisper model on. Set to cpu for CPU-only inference or mps for Apple Silicon.
speech-to-speech --stt whisper --stt_device cpu
stt_torch_dtype
string
default:"float16"
PyTorch data type for model weights and activations. One of float32 (full precision), float16, or bfloat16; the latter two are 16-bit formats. Use float32 on CPU.
speech-to-speech --stt whisper --stt_torch_dtype bfloat16
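The device and dtype flags are usually chosen together. The sketch below encodes the one rule this page states (float32 on CPU) and otherwise falls back to the documented default, float16. The helper names pick_stt_dtype and whisper_flags are hypothetical, and whether float16 is the best choice on mps is an assumption to verify on your hardware.

```python
from typing import List


def pick_stt_dtype(device: str) -> str:
    """Return a --stt_torch_dtype value for a given --stt_device value."""
    # float32 on CPU, as recommended above; otherwise the documented default.
    return "float32" if device == "cpu" else "float16"


def whisper_flags(device: str) -> List[str]:
    """Build the Whisper device/dtype flags as an argv-style list."""
    return [
        "--stt", "whisper",
        "--stt_device", device,
        "--stt_torch_dtype", pick_stt_dtype(device),
    ]


print(" ".join(["speech-to-speech"] + whisper_flags("cpu")))
```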
stt_compile_mode
string
Torch compile mode. One of default, reduce-overhead, or max-autotune. When unset (default), compilation is disabled. reduce-overhead typically gives the best latency reduction for streaming inference.
speech-to-speech --stt whisper --stt_compile_mode reduce-overhead
stt_gen_max_new_tokens
integer
default:"128"
Maximum number of new tokens to generate per transcription call. Raise this for very long utterances.
speech-to-speech --stt whisper --stt_gen_max_new_tokens 256
stt_gen_num_beams
integer
default:"1"
Number of beams for beam search. The default 1 uses greedy decoding, which is fastest. Increase to improve accuracy at the cost of latency.
speech-to-speech --stt whisper --stt_gen_num_beams 4
stt_gen_return_timestamps
boolean
default:"false"
Whether to include word-level or segment-level timestamps in the transcription output.
speech-to-speech --stt whisper --stt_gen_return_timestamps
stt_gen_task
string
default:"transcribe"
The generation task. Use transcribe to output text in the source language, or translate to output English regardless of the input language.
speech-to-speech --stt whisper --stt_gen_task translate
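The stt_gen_ flags read naturally as generation settings once the prefixes are removed. The following sketch shows one plausible way a handler could split the parsed flags into constructor arguments and generation arguments. The split_stt_args helper is hypothetical, and this is an assumption about how such flags might be consumed, not a description of the project's internals.

```python
def split_stt_args(args: dict, prefix: str = "stt_"):
    """Strip the backend prefix and collect gen_* entries into their own dict."""
    handler_kwargs, gen_kwargs = {}, {}
    for name, value in args.items():
        if not name.startswith(prefix):
            continue  # flags without the prefix (e.g. language) are left alone
        short = name[len(prefix):]
        if short.startswith("gen_"):
            gen_kwargs[short[len("gen_"):]] = value
        else:
            handler_kwargs[short] = value
    return handler_kwargs, gen_kwargs


# Defaults as documented on this page.
parsed = {
    "stt_model_name": "distil-whisper/distil-large-v3",
    "stt_gen_max_new_tokens": 128,
    "stt_gen_num_beams": 1,
    "stt_gen_task": "transcribe",
    "language": "en",
}
handler_kwargs, gen_kwargs = split_stt_args(parsed)
print(handler_kwargs)  # {'model_name': 'distil-whisper/distil-large-v3'}
print(gen_kwargs)      # {'max_new_tokens': 128, 'num_beams': 1, 'task': 'transcribe'}
```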
language
string
default:"en"
BCP-47 language code for transcription. Set to auto to let Whisper detect the language for each utterance. Supported codes include en, fr, es, zh, ko, ja, and hi.
speech-to-speech --stt whisper --language auto

Multi-language usage

All Whisper-based backends support --language auto for dynamic language detection. Parakeet TDT auto-detects across its 25 supported European languages when --parakeet_tdt_language is omitted. Paraformer is best suited for Mandarin with --paraformer_stt_model_name paraformer-zh.
# Whisper with automatic language detection
speech-to-speech --stt whisper --language auto \
    --stt_model_name openai/whisper-large-v3

# Force Chinese with Whisper
speech-to-speech --stt whisper --language zh \
    --stt_model_name openai/whisper-large-v3
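The per-backend language rules above can be summarized in a small sketch that builds the relevant flags. The stt_language_flags helper is hypothetical; it only uses flag names that appear on this page and assumes that an omitted language means "detect automatically" where the backend supports it.

```python
from typing import List, Optional

# Backends this page describes as Whisper-based.
WHISPER_BACKENDS = {"whisper", "faster-whisper", "mlx-audio-whisper"}


def stt_language_flags(backend: str, language: Optional[str] = None) -> List[str]:
    """Return the backend selection and language flags as an argv-style list."""
    flags = ["--stt", backend]
    if backend in WHISPER_BACKENDS:
        # Whisper-based backends take --language; auto enables detection.
        flags += ["--language", language or "auto"]
    elif backend == "parakeet-tdt":
        # Omitting --parakeet_tdt_language lets Parakeet TDT auto-detect.
        if language is not None:
            flags += ["--parakeet_tdt_language", language]
    elif backend == "paraformer":
        # Paraformer is suggested for Mandarin with the paraformer-zh model.
        flags += ["--paraformer_stt_model_name", "paraformer-zh"]
    else:
        raise ValueError(f"unknown STT backend: {backend}")
    return flags


print(stt_language_flags("whisper"))            # ['--stt', 'whisper', '--language', 'auto']
print(stt_language_flags("parakeet-tdt"))       # ['--stt', 'parakeet-tdt']
print(stt_language_flags("faster-whisper", "zh"))
```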
Both the STT and the LLM checkpoints must support your target language(s). For multilingual TTS output, pair them with ChatTTS or another TTS backend that covers your target language.
