The scripts/ directory ships two standalone benchmark scripts, benchmark_tts.py and benchmark_stt.py, that measure handler performance independently of the full pipeline. Both scripts run multiple iterations, capture detailed timing statistics, print a sorted comparison table, and save results to a JSON file for further analysis.
TTS benchmarking with benchmark_tts.py
benchmark_tts.py synthesises a fixed text string through one or more TTS handlers, records per-iteration timing, and computes average inference time, min/max/std, average audio duration, real-time factor (RTF), and time-to-first-chunk (TTFC).
Command-line flags
| Flag | Default | Description |
|---|---|---|
| --handlers | kokoro qwen3 pocket_tts | Space-separated list of handlers to benchmark |
| --iterations | 3 | Number of synthesis passes per handler |
| --text | Hello from the speech to speech benchmark. This is a latency test. | Text to synthesise |
| --language_code | en | Language code passed to each handler |
| --qwen3_mlx_quantizations | (none) | One or more MLX quantization variants to benchmark as separate qwen3[*] entries |
| --output | tts_benchmark_results.json | JSON file to write results to |
Comparing Qwen3 MLX quantization variants
Use --qwen3_mlx_quantizations to expand a single qwen3 handler entry into separate benchmark targets, one per quantization level. This is the recommended way to find the best quality/latency trade-off on Apple Silicon before committing to a setting in production.
Each variant is reported as its own entry, for example qwen3[bf16], qwen3[4bit], qwen3[6bit], and qwen3[8bit].
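A quantization sweep can also be launched from a small Python wrapper. The sketch below only assembles the argument list from the flags in the table above. The helper name build_tts_benchmark_command is hypothetical, and two things are assumptions: that --qwen3_mlx_quantizations accepts several space-separated values, and that those values are spelled like the entry names (bf16, 4bit, 6bit, 8bit).

```python
import shlex
import sys


def build_tts_benchmark_command(handlers, quantizations=None, iterations=3,
                                output="tts_benchmark_results.json"):
    """Assemble an argv list for scripts/benchmark_tts.py (hypothetical helper)."""
    cmd = [
        sys.executable, "scripts/benchmark_tts.py",
        "--handlers", *handlers,
        "--iterations", str(iterations),
        "--output", output,
    ]
    if quantizations:
        # Assumption: the flag takes several space-separated values.
        cmd += ["--qwen3_mlx_quantizations", *quantizations]
    return cmd


cmd = build_tts_benchmark_command(
    ["qwen3"], quantizations=["bf16", "4bit", "6bit", "8bit"]
)
# Print the command without the interpreter path for readability.
print(shlex.join(cmd[1:]))
```

Passing the list to subprocess.run(cmd, check=True) from the repository root would then start the benchmark.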
Benchmarking multiple handlers side by side
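Compare all available TTS backends in one run by passing every handler name to --handlers.
Running a single handler
To benchmark one backend in isolation, pass only that handler's name to --handlers.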
What the TTS benchmark measures
For each handler and each iteration the script records:
- Warmup time: time from handler instantiation to the first successful synthesis (model load + JIT warm-up)
- Inference time: wall-clock time from submitting text to consuming the last audio chunk
- Time to first chunk (TTFC): latency until the very first audio sample is available (important for streaming perception)
- Audio duration: total length of the synthesised waveform in seconds
- RTF (Real-Time Factor): audio duration ÷ inference time; values > 1 mean faster-than-real-time
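The per-iteration metrics can be illustrated with a minimal sketch. It is not the script's actual code: fake_tts_stream is a hypothetical stand-in for a streaming handler, and the 16 kHz sample rate is an assumption. RTF follows the definition given above (audio duration divided by inference time).

```python
import time

SAMPLE_RATE = 16_000  # assumed output sample rate in Hz


def fake_tts_stream(text, n_chunks=4, chunk_samples=8_000):
    """Hypothetical stand-in for a streaming TTS handler; yields sample lists."""
    for _ in range(n_chunks):
        time.sleep(0.01)  # pretend to do some synthesis work
        yield [0.0] * chunk_samples


def measure_synthesis(stream, sample_rate=SAMPLE_RATE):
    """Consume an audio chunk stream and return the per-iteration metrics."""
    start = time.perf_counter()
    ttfc = None
    total_samples = 0
    for chunk in stream:
        if ttfc is None:
            # Latency until the first chunk becomes available.
            ttfc = time.perf_counter() - start
        total_samples += len(chunk)
    inference_time = time.perf_counter() - start
    audio_duration = total_samples / sample_rate
    return {
        "inference_time": inference_time,
        "ttfc": ttfc,
        "audio_duration": audio_duration,
        # Values above 1 mean faster than real time.
        "rtf": audio_duration / inference_time,
    }


metrics = measure_synthesis(fake_tts_stream("Hello from the benchmark."))
print(metrics)
```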
Reading the comparison table
The final comparison table ranks handlers by average inference time and shows relative slowdowns versus the fastest handler. A ratio of 1.00x means that handler is the fastest in the set. Higher ratios indicate proportionally longer inference time relative to the winner.
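The ranking logic described above can be sketched in a few lines. The function name and the timing values are made up for illustration and are not measured results.

```python
def rank_by_inference_time(avg_times):
    """Sort handlers by average inference time and attach a slowdown ratio."""
    ordered = sorted(avg_times.items(), key=lambda item: item[1])
    fastest = ordered[0][1]
    # The fastest handler gets a ratio of exactly 1.0.
    return [(name, avg, avg / fastest) for name, avg in ordered]


# Illustrative numbers only, not measured results.
example = {"handler_a": 1.0, "handler_b": 0.5, "handler_c": 0.75}
ranked = rank_by_inference_time(example)
for name, avg, ratio in ranked:
    print(f"{name:<12} {avg:6.2f}s  {ratio:.2f}x")
```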
STT benchmarking with benchmark_stt.py
benchmark_stt.py loads a WAV file and transcribes it through one or more STT handlers, recording inference time, time-to-first-token (TTFT), and the sample transcription text for quality inspection.
Command-line flags
| Flag | Default | Description |
|---|---|---|
| --audio_file | (required) | Path to a WAV audio file (16 kHz mono recommended) |
| --handlers | whisper whisper-mlx mlx-audio-whisper faster-whisper parakeet-tdt parakeet-tdt-progressive | Space-separated list of handlers |
| --iterations | 5 | Number of transcription passes per handler |
| --output | stt_benchmark_results.json | JSON file to write results to |
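Comparing all STT backends
Omit --handlers to run every default backend against the same audio file.
Running a single backend
Pass a single handler name to --handlers to benchmark one backend in isolation.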
Comparing standard vs. progressive Parakeet TDT
parakeet-tdt-progressive enables live transcription (enable_live_transcription=True) with a short update interval. Benchmarking both variants side by side shows the latency cost of the progressive path.
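As a rough sketch, the two Parakeet variants could be benchmarked together by passing both names to --handlers. The helper below only builds the argument list from the documented flags; build_stt_benchmark_command and the file name sample.wav are hypothetical.

```python
import shlex
import sys


def build_stt_benchmark_command(audio_file, handlers, iterations=5,
                                output="stt_benchmark_results.json"):
    """Assemble an argv list for scripts/benchmark_stt.py (hypothetical helper)."""
    return [
        sys.executable, "scripts/benchmark_stt.py",
        "--audio_file", audio_file,
        "--handlers", *handlers,
        "--iterations", str(iterations),
        "--output", output,
    ]


# sample.wav is a placeholder; use a 16 kHz mono WAV file of your own.
cmd = build_stt_benchmark_command(
    "sample.wav", ["parakeet-tdt", "parakeet-tdt-progressive"]
)
print(shlex.join(cmd[1:]))
```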
What the STT benchmark measures
For each handler and each iteration the script records:
- Warmup time: initial model load time (measured once, excluded from iteration timings; an additional warmup pass on real audio is run before timing begins)
- Inference time: wall-clock time from submitting the audio array to receiving the final transcript
- Time to first token (TTFT): latency until the first output token or partial transcript arrives (relevant for streaming STT handlers)
- Sample transcription: the transcript from the first recorded iteration, printed for sanity-checking transcription quality
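The warmup-then-measure pattern in the list above can be sketched as follows. This is an illustration rather than the script's implementation: fake_transcribe is a hypothetical stand-in for an STT handler, and using the sample standard deviation is an assumption.

```python
import statistics
import time


def benchmark(transcribe, audio, iterations=5):
    """Run one untimed warmup pass, then time `iterations` passes."""
    transcribe(audio)  # warmup pass, excluded from the timings
    timings = []
    sample = None
    for i in range(iterations):
        start = time.perf_counter()
        text = transcribe(audio)
        timings.append(time.perf_counter() - start)
        if i == 0:
            sample = text  # keep the first transcript for a sanity check
    return {
        "avg": statistics.mean(timings),
        "min": min(timings),
        "max": max(timings),
        # Assumption: sample standard deviation; needs at least two timings.
        "std": statistics.stdev(timings) if len(timings) > 1 else 0.0,
        "sample_transcription": sample,
        "timings": timings,
    }


def fake_transcribe(audio):
    """Hypothetical stand-in for an STT handler."""
    time.sleep(0.005)
    return "hello world"


result = benchmark(fake_transcribe, audio=[0.0] * 16_000, iterations=5)
print(f"avg={result['avg']:.4f}s sample={result['sample_transcription']!r}")
```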
librosa or scipy as a fallback).
Torch compile cache for repeated runs
The pipeline sets TORCHINDUCTOR_CACHE_DIR at startup to cache compiled kernel artefacts between runs. This cache reduces torch.compile compilation time by roughly 50% on subsequent launches. The Whisper STT handler supports torch.compile via the --stt_compile_mode flag.
On the first run, torch.compile traces and compiles the model graph. On every subsequent run with the same model and compile mode, the cached kernels are reused, which skips the compilation step and lowers startup latency significantly.
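When calling torch.compile outside the pipeline, a similar cache can be set up by exporting the variable before the first compilation. The sketch below shows one way to do that; it does not reproduce how the pipeline sets the variable, and both the helper name and the directory are hypothetical.

```python
import os
import tempfile


def configure_inductor_cache(cache_dir):
    """Point TorchInductor at a persistent cache directory unless one is set."""
    # setdefault keeps any value the user exported before launching.
    active = os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", cache_dir)
    os.makedirs(active, exist_ok=True)
    return active


# Hypothetical location; call this before torch.compile runs for the first time.
active_dir = configure_inductor_cache(
    os.path.join(tempfile.gettempdir(), "torchinductor_cache")
)
print(active_dir)
```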
Saving and analysing results
Both scripts write a timestamped JSON results file. Use --output to customise the file path; the defaults are tts_benchmark_results.json and stt_benchmark_results.json.
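The exact layout of the results file is not shown here, so a schema-agnostic outline is a safe first step before writing analysis code. The sketch below lists keys and value types for any parsed JSON document. The demo file and its keys are made up for illustration and are not the scripts' real schema.

```python
import json
import os
import tempfile


def describe(value, depth=0, max_depth=2):
    """Return indented lines outlining keys and value types in parsed JSON."""
    pad = "  " * depth
    if isinstance(value, dict) and depth < max_depth:
        lines = []
        for key, child in value.items():
            lines.append(f"{pad}{key}: {type(child).__name__}")
            lines.extend(describe(child, depth + 1, max_depth))
        return lines
    if isinstance(value, list) and value and depth < max_depth:
        # Only the first element is inspected, as a representative.
        first = value[0]
        return [f"{pad}[0]: {type(first).__name__}"] + describe(first, depth + 1, max_depth)
    return []


def outline_results(path):
    """Load a JSON results file and outline its structure."""
    with open(path, encoding="utf-8") as fh:
        return describe(json.load(fh))


# Demo on a throwaway file with made-up content (hypothetical keys).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.json")
with open(demo_path, "w", encoding="utf-8") as fh:
    json.dump({"timestamp": "2024-01-01T00:00:00",
               "results": [{"handler": "demo", "avg_time": 0.5}]}, fh)
outline = outline_results(demo_path)
print("\n".join(outline))
```

Pointing outline_results at tts_benchmark_results.json or stt_benchmark_results.json shows which fields are actually available.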