Install Speech to Speech

Speech to Speech is published on PyPI as speech-to-speech and requires Python 3.10 or later. The default install covers the standard realtime voice-agent path: Parakeet TDT for STT, the OpenAI Responses API for the LLM, and Qwen3-TTS for speech output. Optional extras install additional backends without affecting the default configuration.

Install the package

Default (pip)

The default install bundles all dependencies for the recommended Parakeet TDT + Responses API + Qwen3-TTS pipeline on your platform:

pip install speech-to-speech

On macOS, the MLX stack (mlx, mlx-audio, mlx-lm, mlx-metal, misaki, spacy, and related packages) is pulled in automatically via platform markers in pyproject.toml. On Linux and Windows, the Qwen3-TTS GGML backend (faster-qwen3-tts[ggml]) and Parakeet TDT (nano-parakeet) are installed instead.
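As a rough illustration of the platform split described above, the sketch below maps a platform name to the default stack. The function name default_backends and the dictionary keys are hypothetical; the real selection is done by the platform markers in pyproject.toml, not by code like this.

```python
import sys


def default_backends(platform_name: str = sys.platform) -> dict:
    """Return the default stack this guide describes for a platform.

    Illustrative only: the package itself relies on platform markers in
    pyproject.toml, and this helper is not part of speech-to-speech.
    """
    if platform_name == "darwin":
        # macOS: the MLX stack (mlx, mlx-audio, mlx-lm, ...) is installed.
        return {"stack": "mlx", "tts_package": "mlx-audio"}
    # Linux and Windows: GGML Qwen3-TTS backend plus nano-parakeet.
    return {
        "stack": "ggml",
        "tts_package": "faster-qwen3-tts[ggml]",
        "stt_package": "nano-parakeet",
    }


print(default_backends())
```

Calling default_backends("darwin") or default_backends("linux") shows what each branch would report without needing that platform.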

Linux — CUDA variant

On Linux, faster-qwen3-tts[ggml] ships a qwentts-cpp-python wheel that targets CUDA 12.8 by default. If your machine runs a different CUDA version, install the matching wheel from the Hugging Face wheelhouse before running pip install speech-to-speech:

# CUDA 13.x
pip install "qwentts-cpp-python==0.3.0+cu130" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu130

# CUDA 12.4
pip install "qwentts-cpp-python==0.3.0+cu124" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cu124

# CPU-only fallback
pip install "qwentts-cpp-python==0.3.0+cpu" \
  -f https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl/cpu

pip install speech-to-speech

If you want to use the previous CUDA-graphs (PyTorch) implementation instead of GGML, skip the wheel above and pass --qwen3_tts_backend torch at runtime.
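The choice between the wheels above can be sketched as a small lookup. This is a minimal sketch that covers only the variants listed in this guide; the helper name preinstall_command is hypothetical, and detecting the CUDA version on your machine is left to you.

```python
from typing import Optional

# Wheelhouse URL copied from the commands above.
WHEELHOUSE = "https://huggingface.co/datasets/andito/qwentts-cpp-python-wheels/tree/main/whl"


def preinstall_command(cuda_version: Optional[str]) -> Optional[str]:
    """Return the pip pre-install command for a CUDA version, or None.

    cuda_version is a string such as "12.4" or "13.0"; pass None for a
    CPU-only machine. None is returned for CUDA 12.8 because the default
    wheel already targets it.
    """
    if cuda_version is None:
        tag = "cpu"
    elif cuda_version.startswith("12.8"):
        return None
    elif cuda_version.split(".")[0] == "13":
        tag = "cu130"
    elif cuda_version.startswith("12.4"):
        tag = "cu124"
    else:
        # Only the variants listed in this guide are handled here.
        raise ValueError(f"no wheel listed in this guide for CUDA {cuda_version}")
    return f'pip install "qwentts-cpp-python==0.3.0+{tag}" -f {WHEELHOUSE}/{tag}'


print(preinstall_command("12.4"))
```

Run the printed command first, then pip install speech-to-speech as shown above.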

Development / source

Clone the repository and use uv to install all dependencies in editable mode. This also makes the speech-to-speech CLI available immediately:

git clone https://github.com/huggingface/speech-to-speech.git
cd speech-to-speech
uv sync

uv sync reads pyproject.toml and resolves platform-specific dependencies automatically — no separate requirements files are needed. The speech_to_speech package is installed in editable mode so local changes take effect without reinstalling.

Install optional backend extras

Optional extras extend the pipeline with alternative backends. Install them alongside the base package using pip extras syntax:

# Kokoro-82M TTS — fast, high-quality synthesis (non-macOS platforms)
pip install "speech-to-speech[kokoro]"

# Pocket TTS from Kyutai Labs — streaming TTS with voice cloning
pip install "speech-to-speech[pocket]"

# ChatTTS — multilingual TTS
pip install "speech-to-speech[chattts]"

# Faster Whisper STT — CTranslate2-based Whisper for accelerated CPU/CUDA transcription
pip install "speech-to-speech[faster-whisper]"

# Paraformer STT — FunASR-based Paraformer for Mandarin and multilingual transcription
pip install "speech-to-speech[paraformer]"

# Lightning Whisper MLX STT — fast Whisper on Apple Silicon (macOS only)
pip install "speech-to-speech[whisper-mlx]"

# MLX LM — explicit MLX LLM backend (already bundled on macOS; use on macOS only)
pip install "speech-to-speech[mlx-lm]"

# WebSocket — explicit websockets dependency for the websocket run mode
pip install "speech-to-speech[websocket]"

You can combine multiple extras in one command:

pip install "speech-to-speech[faster-whisper,kokoro]"

The table below summarises each extra and when to use it:

| Extra | Package installed | When to use |
| --- | --- | --- |
| [kokoro] | kokoro>=0.9.2 | Alternative TTS on Linux/Windows with high voice quality |
| [pocket] | pocket-tts>=0.1.0 | Streaming TTS with built-in voice cloning from Kyutai Labs |
| [chattts] | ChatTTS>=0.1.1 | Multilingual TTS with ChatTTS |
| [faster-whisper] | faster-whisper>=1.0.3 | CTranslate2 Whisper for fast CPU or CUDA STT |
| [paraformer] | funasr, modelscope, onnxruntime | FunASR Paraformer for Mandarin-primary transcription |
| [whisper-mlx] | lightning-whisper-mlx>=0.0.10 | Lightning Whisper MLX for fast Whisper on Apple Silicon (macOS only) |
| [mlx-lm] | mlx-lm==0.31.1, mlx-vlm (macOS only) | Explicit MLX LM install on macOS |
| [websocket] | websockets>=12.0 | Explicit websockets install for --mode websocket |
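To see which of these optional packages are already present in an environment, a check along the following lines may help. It is a sketch using the standard-library importlib.metadata; the distribution names are taken from the table, with one representative package per extra, and the helper installed_extras is hypothetical.

```python
from importlib import metadata

# One representative distribution per extra, copied from the table above.
EXTRA_DISTRIBUTIONS = {
    "kokoro": "kokoro",
    "pocket": "pocket-tts",
    "chattts": "ChatTTS",
    "faster-whisper": "faster-whisper",
    "paraformer": "funasr",
    "whisper-mlx": "lightning-whisper-mlx",
    "mlx-lm": "mlx-lm",
    "websocket": "websockets",
}


def installed_extras() -> dict:
    """Map each extra to the installed version of its package, or None."""
    result = {}
    for extra, dist in EXTRA_DISTRIBUTIONS.items():
        try:
            result[extra] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            result[extra] = None
    return result


for extra, version in installed_extras().items():
    print(f"[{extra}] {version or 'not installed'}")
```

A package being present does not prove the extra itself was requested, since another dependency may have pulled it in; the check only reports what is importable from the current environment.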

Set your API key

The default pipeline uses the OpenAI Responses API for the LLM stage. Export your API key before launching:

export OPENAI_API_KEY=your_key_here

For alternative providers (Hugging Face Inference Providers, OpenRouter, vLLM, llama.cpp), see the LLM Backend section in the README and pass --responses_api_base_url and --responses_api_api_key accordingly.
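If you launch the pipeline from a wrapper script, it can be useful to fail early when the key is missing. The sketch below is one way to do that; require_api_key is a hypothetical helper and not part of the speech-to-speech package.

```python
import os


def require_api_key(env=os.environ, name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before launching")
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "your_key_here"}))
```

Called with no arguments, the helper reads os.environ, so it reflects whatever was exported in the shell that started the script.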

Platform Notes

Linux (CUDA)

The recommended Linux setup uses the GGML backend for Qwen3-TTS and Parakeet TDT for STT. If you are on CUDA 12.8, a plain pip install speech-to-speech is sufficient. For other CUDA versions, pre-install the matching qwentts-cpp-python wheel as shown in the Linux CUDA variant section above before installing the package.

To verify GPU availability:

import torch

if torch.cuda.is_available():  # True when PyTorch can see a CUDA device
    print(torch.cuda.get_device_name(0))  # Name of the first GPU
else:
    print("CUDA is not available")

macOS (Apple Silicon)

On macOS, the full MLX stack is installed automatically with the base package; no extras are needed for local LLM and TTS inference on Apple Silicon. The --local_mac_optimal_settings flag selects the best backends for M-series chips in one go:

speech-to-speech --local_mac_optimal_settings

This sets:

--device mps for all models
--stt parakeet-tdt (Parakeet TDT via nano-parakeet)
--llm_backend mlx-lm (MLX LM for local LLM inference)
--tts qwen3 (Qwen3-TTS via mlx-audio)

--tts pocket and --tts kokoro are also valid TTS options on macOS.
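The flag list above can be spelled out as an explicit command, for example to swap in a different TTS backend. This sketch assumes that passing the listed flags individually has the same effect as --local_mac_optimal_settings, which the text above does not state outright; explicit_command is a hypothetical helper.

```python
import shlex

# Flags and values copied from the list above.
MAC_OPTIMAL_FLAGS = [
    ("--device", "mps"),
    ("--stt", "parakeet-tdt"),
    ("--llm_backend", "mlx-lm"),
    ("--tts", "qwen3"),
]

# TTS options this guide names as valid on macOS.
MAC_TTS_OPTIONS = {"qwen3", "pocket", "kokoro"}


def explicit_command(tts: str = "qwen3") -> list:
    """Build an argv list with the macOS settings, optionally changing TTS."""
    if tts not in MAC_TTS_OPTIONS:
        raise ValueError(f"unsupported TTS option in this sketch: {tts}")
    argv = ["speech-to-speech"]
    for flag, value in MAC_OPTIMAL_FLAGS:
        argv += [flag, tts if flag == "--tts" else value]
    return argv


print(shlex.join(explicit_command("kokoro")))
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting issues.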

Development

The dev dependency group (ruff, mypy, pytest, pytest-asyncio) is installed automatically by uv sync. To run the test suite:

uv run pytest

To lint:

uv run ruff check src/
uv run mypy src/

The project uses a single pyproject.toml for both production and dev dependencies; no separate dev-requirements.txt is needed.

Python Version Requirements

Speech to Speech requires Python 3.10, 3.11, or 3.12. Python 3.9 and below are not supported. Check your Python version with:

python --version
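The same check can be made from inside Python, for instance at the top of a setup script. The sketch below takes the version list in this section literally (3.10, 3.11, and 3.12); is_supported is a hypothetical helper.

```python
import sys

# Minor versions listed as supported in this section.
SUPPORTED = {(3, 10), (3, 11), (3, 12)}


def is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) pair is in the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED


status = "supported" if is_supported() else "not supported"
print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is {status}")
```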

Known Conflicts

DeepFilterNet and Pocket TTS cannot be installed in the same environment.

DeepFilterNet (optional audio enhancement for VAD) requires numpy<2. Pocket TTS ([pocket] extra) requires numpy>=2. Installing both in the same virtual environment will cause a dependency conflict.

Install DeepFilterNet manually only in environments where you are not using Pocket TTS.
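To find out which side of this conflict an existing environment falls on, you can inspect the installed numpy version. The sketch below applies the two constraints stated above (numpy<2 and numpy>=2); compatible_component is a hypothetical helper and only looks at the major version.

```python
from importlib import metadata


def compatible_component(numpy_version: str) -> str:
    """Return which component the given numpy version can coexist with."""
    major = int(numpy_version.split(".")[0])
    # numpy>=2 satisfies Pocket TTS; numpy<2 satisfies DeepFilterNet.
    return "Pocket TTS" if major >= 2 else "DeepFilterNet"


try:
    installed = metadata.version("numpy")
    print(f"numpy {installed}: compatible with {compatible_component(installed)}")
except metadata.PackageNotFoundError:
    print("numpy is not installed in this environment")
```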
