Speech to Speech is a modular, cascaded pipeline that chains Voice Activity Detection → Speech-to-Text → Language Model → Text-to-Speech into a single CLI command. Run it locally, expose an OpenAI Realtime-compatible WebSocket API, or deploy it as a Docker server, all with open-source models from the Hugging Face Hub.
Documentation Index
Fetch the complete documentation index at: https://mintlify.com/huggingface/speech-to-speech/llms.txt
Use this file to discover all available pages before exploring further.
Introduction
Understand the pipeline architecture and how each component fits together.
Installation
Install via pip with optional extras for your preferred backends.
Quickstart
Launch your first voice agent in under five minutes.
Pipeline Modes
Choose between Realtime, Local, Socket, WebSocket, and Docker modes.
Pipeline at a Glance
The pipeline runs four stages (VAD, STT, LLM, and TTS) in parallel threads, passing audio and text through queues.
Key Features
OpenAI Realtime Compatible
Exposes /v1/realtime: connect any OpenAI Realtime client directly to your local pipeline.
Modular Components
Six STT backends, four LLM backends, five TTS backends. Mix and match freely.
Apple Silicon Optimized
First-class MLX support for Parakeet, Whisper, LLMs, and Qwen3-TTS on Apple Silicon.
Multi-language
Supports 25+ languages with automatic per-utterance language detection.
Tool Calling
Function calling works with both local LLMs (via prompt engineering) and OpenAI-compatible APIs.
Barge-in Detection
CancelScope interruption handling lets users speak over the assistant at any time.
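The threaded, queue-connected design described under "Pipeline at a Glance" and the barge-in behavior above can be illustrated with a small standard-library sketch. This is not the project's implementation: the stage functions (`vad`, `stt`, `llm`, `tts`), `run_stage`, `run_pipeline`, and `play` are hypothetical placeholders, and a plain `threading.Event` stands in for the project's CancelScope, whose actual API is not shown here.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, and push results to outbox until STOP."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        result = fn(item)
        if result is not None:  # a stage may drop an item (e.g. VAD on silence)
            outbox.put(result)


# Placeholder stages: stand-ins for real VAD / STT / LLM / TTS models.
def vad(chunk):
    return chunk if chunk.strip() else None  # drop "silent" chunks


def stt(speech):
    return speech.lower()  # pretend transcription


def llm(text):
    return f"you said: {text}"  # pretend generation


def tts(reply):
    return reply.encode("utf-8")  # pretend audio bytes


def run_pipeline(chunks):
    """Run the four stages in separate threads, linked by FIFO queues."""
    stages = [vad, stt, llm, tts]
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(STOP)

    outputs = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        outputs.append(item)
    for t in threads:
        t.join()
    return outputs


def play(audio_chunks, interrupted):
    """Emit chunks until `interrupted` is set (the user started speaking)."""
    played = []
    for chunk in audio_chunks:
        if interrupted.is_set():  # cooperative check between chunks
            break
        played.append(chunk)
    return played
```

Because each stage owns one thread and reads from a FIFO queue, items keep their order while later stages can already work on earlier utterances. Checking the interruption flag between audio chunks is one simple way to let playback stop promptly when the user speaks over the assistant.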
Get Started in 3 Steps
Explore the Docs
Pipeline Components
Deep dives into VAD, STT, LLM, and TTS, covering all available backends and their configuration options.
Guides
Apple Silicon setup, multi-language configuration, tool calling, and benchmarking.
CLI Reference
Every flag, default value, and argument class documented with examples.
Realtime API
WebSocket protocol, supported events, session config, and tool calling flow.
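As a rough idea of what talking to the /v1/realtime endpoint might involve, the sketch below builds the connection URL and two JSON client events using only the standard library. The event types (`session.update`, `input_audio_buffer.append`) come from the OpenAI Realtime API; whether this server accepts each of them, and with which session fields, is an assumption to check against the Realtime API page. The host, port 8000, and the helper names are hypothetical.

```python
import base64
import json


def realtime_url(host="localhost", port=8000, secure=False):
    """Build the WebSocket URL; host and port are assumed defaults, not documented ones."""
    scheme = "wss" if secure else "ws"
    return f"{scheme}://{host}:{port}/v1/realtime"


def session_update(instructions):
    """Serialize a session.update event carrying system instructions."""
    return json.dumps(
        {"type": "session.update", "session": {"instructions": instructions}}
    )


def audio_append(pcm_bytes):
    """Serialize an input_audio_buffer.append event with base64-encoded audio."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        }
    )
```

A WebSocket client of your choice would open `realtime_url()` and send these strings as text frames; the helpers only prepare the payloads and do not open a connection themselves.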