Speech to Speech is published on PyPI as speech-to-speech and requires Python 3.10 or later. The default install covers the standard realtime voice-agent path: Parakeet TDT for STT, the OpenAI Responses API for the LLM, and Qwen3-TTS for speech output. Optional extras install additional backends without affecting the default configuration.
1
Install the package
Default (pip)
The default install bundles all dependencies for the recommended Parakeet TDT + Responses API + Qwen3-TTS pipeline on your platform:
pip install speech-to-speech
On macOS, the MLX stack (mlx, mlx-audio, mlx-lm, mlx-metal, misaki, spacy, and related packages) is pulled in automatically via platform markers in pyproject.toml. On Linux and Windows, the Qwen3-TTS GGML backend (faster-qwen3-tts[ggml]) and Parakeet TDT (nano-parakeet) are installed instead.
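As a rough illustration of the platform split described above, the sketch below maps a platform string to the backend packages the default install is said to pull in. The package names come from the paragraph above; the `default_stack` helper and the two constants are hypothetical names for this example and do not reproduce the project's actual pyproject.toml markers.

```python
import sys

# Package names copied from the paragraph above. The mapping is an
# illustrative summary, not the project's real dependency markers.
MACOS_STACK = ["mlx", "mlx-audio", "mlx-lm", "mlx-metal", "misaki", "spacy"]
OTHER_STACK = ["faster-qwen3-tts[ggml]", "nano-parakeet"]


def default_stack(platform: str = sys.platform) -> list[str]:
    """Return the backend packages the default install is described as adding."""
    # sys.platform is "darwin" on macOS, "linux" on Linux, "win32" on Windows.
    return MACOS_STACK if platform == "darwin" else OTHER_STACK


if __name__ == "__main__":
    print(default_stack())
```

Running it on your machine prints the list that corresponds to your own platform, which can help when reading pip's resolver output.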
Linux — CUDA variant
On Linux, faster-qwen3-tts[ggml] ships a qwentts-cpp-python wheel that targets CUDA 12.8 by default. If your machine runs a different CUDA version, install the matching wheel from the Hugging Face wheelhouse before running pip install speech-to-speech:
# CUDA 13.x
pip install "qwentts-cpp-python==0.3.0+cu130" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu130

# CUDA 12.4
pip install "qwentts-cpp-python==0.3.0+cu124" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu124

# CPU-only fallback
pip install "qwentts-cpp-python==0.3.0+cpu" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cpu

pip install speech-to-speech
If you want to use the previous CUDA-graphs (PyTorch) implementation instead of GGML, skip the wheel above and pass --qwen3_tts_backend torch at runtime.
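The version-to-wheel mapping above can be sketched as a small helper that builds the matching pre-install command. The wheel tags, pinned version, and wheelhouse URL are copied from the commands above; the function names are hypothetical, and the helper returns None both for CUDA 12.8 (no pre-install needed) and for versions this guide does not list.

```python
from typing import Optional

# Values copied from the commands above.
WHEELHOUSE = "https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl"
PINNED_VERSION = "0.3.0"


def wheel_tag(cuda_version: Optional[str]) -> Optional[str]:
    """Map a CUDA version string (None means no GPU) to a wheel tag."""
    if cuda_version is None:
        return "cpu"
    major, _, rest = cuda_version.partition(".")
    minor = rest.split(".")[0]
    if major == "13":
        return "cu130"
    if (major, minor) == ("12", "4"):
        return "cu124"
    # CUDA 12.8 is the default target; other versions are not listed here.
    return None


def preinstall_command(cuda_version: Optional[str]) -> Optional[list[str]]:
    """Build the pip command for the matching wheel, or None if none applies."""
    tag = wheel_tag(cuda_version)
    if tag is None:
        return None
    return [
        "pip", "install",
        f"qwentts-cpp-python=={PINNED_VERSION}+{tag}",
        "-f", f"{WHEELHOUSE}/{tag}",
    ]


if __name__ == "__main__":
    print(preinstall_command("12.4"))
```

The command is returned as an argument list so it can be passed to `subprocess.run` without shell quoting concerns.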
Development / source
Clone the repository and use uv to install all dependencies in editable mode. This also makes the speech-to-speech CLI available immediately:
git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
uv sync
uv sync reads pyproject.toml and resolves platform-specific dependencies automatically — no separate requirements files are needed. The speech_to_speech package is installed in editable mode so local changes take effect without reinstalling.
2
Install optional backend extras
Optional extras extend the pipeline with alternative backends. Install them alongside the base package using pip extras syntax:
# Kokoro-82M TTS — fast, high-quality synthesis (non-macOS platforms)
pip install "speech-to-speech[kokoro]"

# Pocket TTS from Kyutai Labs — streaming TTS with voice cloning
pip install "speech-to-speech[pocket]"

# ChatTTS — multilingual TTS
pip install "speech-to-speech[chattts]"

# Faster Whisper STT — CTranslate2-based Whisper for accelerated CPU/CUDA transcription
pip install "speech-to-speech[faster-whisper]"

# Paraformer STT — FunASR-based Paraformer for Mandarin and multilingual transcription
pip install "speech-to-speech[paraformer]"

# Lightning Whisper MLX STT — fast Whisper on Apple Silicon (macOS only)
pip install "speech-to-speech[whisper-mlx]"

# MLX LM: explicit MLX LLM backend (macOS only; already bundled with the default macOS install)
pip install "speech-to-speech[mlx-lm]"

# WebSocket — explicit websockets dependency for the websocket run mode
pip install "speech-to-speech[websocket]"
You can combine multiple extras in one command:
pip install "speech-to-speech[faster-whisper,kokoro]"
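If you script your environment setup, the extras syntax can be assembled programmatically. The following is a minimal sketch: the extra names are the ones listed above, while `install_spec` is a hypothetical helper for this example and not part of the package.

```python
# Extra names copied from the commands above.
KNOWN_EXTRAS = {
    "kokoro", "pocket", "chattts", "faster-whisper",
    "paraformer", "whisper-mlx", "mlx-lm", "websocket",
}


def install_spec(extras: list[str]) -> str:
    """Build a pip requirement string such as 'speech-to-speech[a,b]'."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    if not extras:
        return "speech-to-speech"
    # Drop duplicates while keeping the order the caller gave.
    unique = list(dict.fromkeys(extras))
    return f"speech-to-speech[{','.join(unique)}]"


if __name__ == "__main__":
    print(install_spec(["faster-whisper", "kokoro"]))
```

Rejecting unknown names early avoids a slower failure inside pip when an extra is misspelled.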
The table below summarises each extra and when to use it:
| Extra | Package installed | When to use |
| --- | --- | --- |
| [kokoro] | kokoro>=0.9.2 | Alternative TTS on Linux/Windows with high voice quality |
| [pocket] | pocket-tts>=0.1.0 | Streaming TTS with built-in voice cloning from Kyutai Labs |
| [chattts] | ChatTTS>=0.1.1 | Multilingual TTS with ChatTTS |
| [faster-whisper] | faster-whisper>=1.0.3 | CTranslate2 Whisper for fast CPU or CUDA STT |
| [paraformer] | funasr, modelscope, onnxruntime | FunASR Paraformer for Mandarin-primary transcription |
| [whisper-mlx] | lightning-whisper-mlx>=0.0.10 | Lightning Whisper MLX for fast Whisper on Apple Silicon (macOS only) |
| [mlx-lm] | mlx-lm==0.31.1, mlx-vlm (macOS only) | Explicit MLX LM install on macOS |
| [websocket] | websockets>=12.0 | Explicit websockets install for --mode websocket |
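To see which extras are already present in an environment, one option is to look up the distributions from the table with the standard library's `importlib.metadata`. This is a sketch only: the distribution names are taken from the table above, and `extra_status` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Optional

# Distribution names copied from the table above.
EXTRA_DISTRIBUTIONS = {
    "kokoro": ["kokoro"],
    "pocket": ["pocket-tts"],
    "chattts": ["ChatTTS"],
    "faster-whisper": ["faster-whisper"],
    "paraformer": ["funasr", "modelscope", "onnxruntime"],
    "whisper-mlx": ["lightning-whisper-mlx"],
    "mlx-lm": ["mlx-lm", "mlx-vlm"],
    "websocket": ["websockets"],
}


def extra_status(extra: str) -> dict[str, Optional[str]]:
    """Return {distribution: installed version, or None if missing}."""
    status: dict[str, Optional[str]] = {}
    for dist in EXTRA_DISTRIBUTIONS[extra]:
        try:
            status[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            status[dist] = None
    return status


if __name__ == "__main__":
    for name in EXTRA_DISTRIBUTIONS:
        print(name, extra_status(name))
```

A value of None means the distribution is not installed in the current environment. The sketch does not check the version bounds from the table.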
3
Set your API key
The default pipeline uses the OpenAI Responses API for the LLM stage. Export your API key before launching:
export OPENAI_API_KEY=your_key_here
For alternative providers (Hugging Face Inference Providers, OpenRouter, vLLM, llama.cpp), see the LLM Backend section in the README and pass --responses_api_base_url and --responses_api_api_key accordingly.
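A launcher script might check for the key and assemble the provider flags before starting the pipeline. The sketch below assumes only what this page states: the OPENAI_API_KEY variable and the --responses_api_base_url and --responses_api_api_key flags. The helper names and the example URL are hypothetical placeholders.

```python
import os
from typing import Mapping, Optional


def has_default_api_key(env: Mapping[str, str] = os.environ) -> bool:
    """True when OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


def llm_cli_args(base_url: Optional[str] = None,
                 api_key: Optional[str] = None) -> list[str]:
    """Build the optional flags for an alternative Responses API provider."""
    args: list[str] = []
    if base_url:
        args += ["--responses_api_base_url", base_url]
    if api_key:
        args += ["--responses_api_api_key", api_key]
    return args


if __name__ == "__main__":
    if not has_default_api_key():
        print("OPENAI_API_KEY is not set; pass provider flags instead.")
    # Placeholder URL and key for illustration only.
    print(["speech-to-speech", *llm_cli_args("http://localhost:8000/v1", "my-key")])
```

Keeping the flags in a list makes it easy to hand the full command to `subprocess.run` once the real provider URL is known.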

Platform Notes

The recommended Linux setup uses the GGML backend for Qwen3-TTS and Parakeet TDT for STT. If you are on CUDA 12.8, a plain pip install speech-to-speech is sufficient. For other CUDA versions, pre-install the matching qwentts-cpp-python wheel, as shown in the Linux CUDA variant instructions above, before installing the package.

To verify GPU availability:
import torch
cuda_ok = torch.cuda.is_available()
print(cuda_ok)  # Should print True
if cuda_ok:  # get_device_name raises when no CUDA device is usable
    print(torch.cuda.get_device_name(0))

Python Version Requirements

Speech to Speech requires Python 3.10, 3.11, or 3.12. Python 3.9 and below are not supported. Check your Python version with:
python --version
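The same check can be done from inside Python, which is handy in setup scripts. This is a minimal sketch that treats the three versions listed above as the supported set; `is_supported` is a hypothetical helper name.

```python
import sys

# Supported (major, minor) pairs as listed above.
SUPPORTED = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """True when the interpreter's major.minor is in the supported set."""
    return tuple(version_info[:2]) in SUPPORTED


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "supported" if is_supported() else "not supported"
    print(f"Python {current} is {verdict}")
```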

Known Conflicts

DeepFilterNet and Pocket TTS cannot be installed in the same environment.

DeepFilterNet (optional audio enhancement for VAD) requires numpy<2. Pocket TTS ([pocket] extra) requires numpy>=2. Installing both in the same virtual environment will cause a dependency conflict.

Install DeepFilterNet manually only in environments where you are not using Pocket TTS.
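To see which side of the conflict an existing environment falls on, you can inspect the installed NumPy version. The sketch below encodes only the two constraints stated above (numpy<2 for DeepFilterNet, numpy>=2 for Pocket TTS); the helper names are hypothetical.

```python
from importlib import metadata


def numpy_major(version: str) -> int:
    """Extract the major version number from a string such as '1.26.4'."""
    return int(version.split(".")[0])


def compatible_package(numpy_version: str) -> str:
    """Name the package whose NumPy constraint this version satisfies."""
    # DeepFilterNet needs numpy<2; Pocket TTS needs numpy>=2.
    return "Pocket TTS" if numpy_major(numpy_version) >= 2 else "DeepFilterNet"


if __name__ == "__main__":
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("numpy is not installed in this environment")
    else:
        print(f"numpy {installed} is compatible with {compatible_package(installed)}")
```

Because the two constraints never overlap, any given environment can satisfy only one of the two packages, so separate virtual environments are the practical workaround.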
