Speech to Speech: Build Local Voice Agents

Speech to Speech is a modular, cascaded pipeline that chains Voice Activity Detection → Speech-to-Text → Language Model → Text-to-Speech and launches from a single CLI command. Run it locally, expose an OpenAI Realtime-compatible WebSocket API, or deploy it as a Docker server, all with open-source models from the Hugging Face Hub.

Introduction

Understand the pipeline architecture and how each component fits together.

Installation

Install via pip with optional extras for your preferred backends.

Quickstart

Launch your first voice agent in under five minutes.

Pipeline Modes

Choose between Realtime, Local, Socket, WebSocket, and Docker modes.

Pipeline at a Glance

The pipeline has four stages. Each stage runs in its own thread, and stages hand audio and text to one another through queues:

Microphone / Network
       ↓
  VAD (Silero)           — detects speech boundaries
       ↓
  STT (Whisper / Parakeet / …)  — transcribes audio to text
       ↓
  LLM (Transformers / MLX / API) — generates a response
       ↓
  TTS (Qwen3 / Kokoro / Pocket / …) — synthesizes audio
       ↓
Speaker / Network

Every component is independently swappable via a single CLI flag.
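
The thread-and-queue layout described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the `stage` and `run_pipeline` helpers and the four toy handlers (`vad`, `stt`, `llm`, `tts`) are hypothetical stand-ins that operate on strings rather than real audio.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        result = fn(item)
        if result is not None:  # None means "nothing to forward"
            outbox.put(result)


# Hypothetical stand-ins for the real VAD, STT, LLM, and TTS handlers.
def vad(chunk):
    return chunk if chunk.strip() else None  # drop "silent" chunks


def stt(speech):
    return speech.lower()


def llm(text):
    return f"you said: {text}"


def tts(reply):
    return reply.encode("utf-8")


def run_pipeline(chunks):
    """Push chunks through four threaded stages joined by queues."""
    handlers = [vad, stt, llm, tts]
    queues = [queue.Queue() for _ in range(len(handlers) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(handlers)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(STOP)

    outputs = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        outputs.append(item)
    for t in threads:
        t.join()
    return outputs
```

Because every stage only sees its input and output queue, replacing one handler does not affect the others, which is the property that makes per-component swapping practical.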

Key Features

OpenAI Realtime Compatible

Exposes /v1/realtime — connect any OpenAI Realtime client directly to your local pipeline.

Modular Components

Six STT backends, four LLM backends, five TTS backends. Mix and match freely.

Apple Silicon Optimized

First-class MLX support for Parakeet, Whisper, LLMs, and Qwen3-TTS on Apple Silicon.

Multi-language

Supports 25+ languages with automatic per-utterance language detection.

Tool Calling

Function calling works with both local LLMs (via prompt engineering) and OpenAI-compatible APIs.

Barge-in Detection

CancelScope interruption handling lets users speak over the assistant at any time.
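
To illustrate the idea behind barge-in handling, here is a small sketch of a cancellable playback loop. It is an assumption-laden illustration only: `SimpleCancelScope` and `play_response` are hypothetical names invented for this example and do not reflect the interface of the project's own CancelScope.

```python
import threading


class SimpleCancelScope:
    """Hypothetical stand-in for a cancel scope; not the project's class."""

    def __init__(self):
        self._cancelled = threading.Event()

    def cancel(self):
        # Called when the VAD detects that the user started speaking.
        self._cancelled.set()

    @property
    def cancelled(self):
        return self._cancelled.is_set()


def play_response(audio_chunks, scope, on_chunk=None):
    """Emit chunks until the response ends or the scope is cancelled.

    Returns the list of chunks that were actually played.
    """
    played = []
    for chunk in audio_chunks:
        if scope.cancelled:  # user barged in, stop speaking
            break
        played.append(chunk)
        if on_chunk is not None:
            on_chunk(chunk)
    return played
```

Checking the flag between chunks keeps the interruption latency bounded by the length of a single chunk.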

Get Started in 3 Steps

Install the package

pip install speech-to-speech

Set your API key

export OPENAI_API_KEY=your_key_here

Run the pipeline

speech-to-speech

The server starts on ws://localhost:8765/v1/realtime. Connect any OpenAI Realtime client or use the built-in local audio mode.
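
As a rough sketch of what a client sends to that endpoint, the snippet below builds two JSON messages using event names from the OpenAI Realtime API (`session.update` and `input_audio_buffer.append`). Whether this server accepts these exact events and fields is an assumption here; the Realtime API page lists the supported events. The helper names are hypothetical, and actually sending the messages requires a WebSocket client library of your choice.

```python
import base64
import json

REALTIME_URL = "ws://localhost:8765/v1/realtime"  # address quoted above


def session_update_event(instructions):
    """Build a session.update message (field support is an assumption)."""
    return json.dumps(
        {"type": "session.update", "session": {"instructions": instructions}}
    )


def audio_append_event(pcm_bytes):
    """Wrap raw audio bytes as a base64 input_audio_buffer.append message."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_bytes).decode("ascii"),
        }
    )
```

Each returned string is one text frame to send over the WebSocket connection opened against `REALTIME_URL`.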

Explore the Docs

Pipeline Components

Deep-dives into VAD, STT, LLM, and TTS — all available backends and their configuration options.

Guides

Apple Silicon setup, multi-language configuration, tool calling, and benchmarking.

CLI Reference

Every flag, default value, and argument class documented with examples.

Realtime API

WebSocket protocol, supported events, session config, and tool calling flow.
