

Speech to Speech is a modular, cascaded pipeline that chains Voice Activity Detection → Speech-to-Text → Language Model → Text-to-Speech into a single CLI command. Run it locally, expose an OpenAI Realtime-compatible WebSocket API, or deploy it as a Docker server — all with open-source models from the Hugging Face Hub.

Introduction

Understand the pipeline architecture and how each component fits together.

Installation

Install via pip with optional extras for your preferred backends.

Quickstart

Launch your first voice agent in under five minutes.

Pipeline Modes

Choose between Realtime, Local, Socket, WebSocket, and Docker modes.

Pipeline at a Glance

The pipeline runs four stages in parallel threads, passing audio and text through queues:

Microphone / Network
        ↓
  VAD (Silero): detects speech boundaries
        ↓
  STT (Whisper / Parakeet / …): transcribes audio to text
        ↓
  LLM (Transformers / MLX / API): generates a response
        ↓
  TTS (Qwen3 / Kokoro / Pocket / …): synthesizes audio
        ↓
Speaker / Network

Every component is independently swappable via a single CLI flag.
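
The threads-and-queues layout described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the project's actual code: the `stage` and `run_pipeline` helpers, the `STOP` sentinel, and the string-tagging lambdas that stand in for the VAD, STT, LLM, and TTS models are all hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(transform, inbox, outbox):
    """Run one pipeline stage: read from inbox, transform, write to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            break
        outbox.put(transform(item))


def run_pipeline(chunks, transforms):
    """Chain the transforms with one thread per stage and a queue between each."""
    queues = [queue.Queue() for _ in range(len(transforms) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(transforms)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(STOP)

    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Placeholder stages: string tags stand in for the real models (assumption).
fake_stages = [
    lambda audio: f"speech({audio})",   # VAD
    lambda speech: f"text({speech})",   # STT
    lambda text: f"reply({text})",      # LLM
    lambda reply: f"audio({reply})",    # TTS
]
output = run_pipeline(["mic-chunk-1", "mic-chunk-2"], fake_stages)
```

Because each stage only talks to its neighbours through a queue, replacing one stage means replacing one function, which is one plausible reading of why the components can be swapped independently.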

Key Features

OpenAI Realtime Compatible

Exposes /v1/realtime — connect any OpenAI Realtime client directly to your local pipeline.

Modular Components

Six STT backends, four LLM backends, five TTS backends. Mix and match freely.

Apple Silicon Optimized

First-class MLX support for Parakeet, Whisper, LLMs, and Qwen3-TTS on Apple Silicon.

Multi-language

Supports 25+ languages with automatic per-utterance language detection.

Tool Calling

Function calling works with both local LLMs (via prompt engineering) and OpenAI-compatible APIs.

Barge-in Detection

CancelScope interruption handling lets users speak over the assistant at any time.
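
As a rough illustration of the barge-in idea, the sketch below shows a cancellation flag that a playback loop checks between audio chunks. The `CancelScope` name comes from the text above, but the class body, the `speak` helper, and the `on_chunk` callback are hypothetical stand-ins and may differ substantially from the project's implementation.

```python
import threading


class CancelScope:
    """Hypothetical stand-in: a thread-safe flag checked between audio chunks."""

    def __init__(self):
        self._cancelled = threading.Event()

    def cancel(self):
        """Signal that the current response should stop."""
        self._cancelled.set()

    @property
    def cancelled(self):
        return self._cancelled.is_set()


def speak(chunks, scope, on_chunk=None):
    """Play chunks until finished or cancelled; return the chunks actually played."""
    played = []
    for chunk in chunks:
        if scope.cancelled:
            break  # the user spoke over the assistant, so stop here
        played.append(chunk)
        if on_chunk:
            on_chunk(chunk)
    return played


scope = CancelScope()
# Simulate the user starting to speak right after the second chunk is played.
played = speak(
    ["a", "b", "c", "d"],
    scope,
    on_chunk=lambda c: scope.cancel() if c == "b" else None,
)
```

In a real pipeline the `cancel()` call would presumably come from the VAD thread when it detects new speech, while the TTS or playback thread polls the flag.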

Get Started in 3 Steps

1. Install the package

   pip install speech-to-speech

2. Set your API key

   export OPENAI_API_KEY=your_key_here

3. Run the pipeline

   speech-to-speech

The server starts on ws://localhost:8765/v1/realtime. Connect any OpenAI Realtime client or use the built-in local audio mode.
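
To give a sense of what a client sends once connected, here is a minimal sketch that builds a `session.update` event as a JSON text frame using only the standard library. The URL is the one quoted above. The `session_update` helper is hypothetical, and whether this server accepts this particular event and field is an assumption, so check the Realtime API page for the supported events.

```python
import json

# URL taken from the text above; adjust host and port if your server differs.
REALTIME_URL = "ws://localhost:8765/v1/realtime"


def session_update(instructions):
    """Serialize a session.update client event as a JSON text frame."""
    event = {
        "type": "session.update",
        "session": {"instructions": instructions},
    }
    return json.dumps(event)


frame = session_update("You are a concise voice assistant.")
```

Sending `frame` requires a WebSocket client library of your choice connected to `REALTIME_URL`; the connection step is left out here so the example stays dependency-free.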

Explore the Docs

Pipeline Components

Deep-dives into VAD, STT, LLM, and TTS — all available backends and their configuration options.

Guides

Apple Silicon setup, multi-language configuration, tool calling, and benchmarking.

CLI Reference

Every flag, default value, and argument class documented with examples.

Realtime API

WebSocket protocol, supported events, session config, and tool calling flow.
