Local mode runs the entire speech-to-speech pipeline on your machine, reading audio directly from the default microphone and writing generated speech to the default speakers via sounddevice. There is no TCP socket or WebSocket server. Instead, LocalAudioStreamer manages a bidirectional sounddevice.Stream at 16 kHz, int16, mono, in 512-sample blocks. This makes local mode the fastest path to a working voice agent on a single machine: no client process, no network, no port configuration.
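
To make the stream parameters concrete, the short calculation below derives the block duration and block size from the figures stated above (16 kHz, int16, mono, 512 samples). The constant names are illustrative and are not taken from the project's source.

```python
# Stream parameters as stated in the text; the names are illustrative.
SAMPLE_RATE_HZ = 16_000
BLOCK_SAMPLES = 512
BYTES_PER_SAMPLE = 2  # int16
CHANNELS = 1          # mono

# Time covered by one block, size of one block, and callback rate.
block_duration_ms = BLOCK_SAMPLES / SAMPLE_RATE_HZ * 1000
block_bytes = BLOCK_SAMPLES * BYTES_PER_SAMPLE * CHANNELS
blocks_per_second = SAMPLE_RATE_HZ / BLOCK_SAMPLES

print(f"{block_duration_ms:.0f} ms per block")
print(f"{block_bytes} bytes per block")
print(f"{blocks_per_second:.2f} callbacks per second")
```

Each block therefore covers 32 ms of audio, so the stream callback fires roughly 31 times per second.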

Starting Local Mode

speech-to-speech --mode local
The pipeline starts with the default STT (Parakeet TDT), LLM backend (Responses API), and TTS (Qwen3). Set OPENAI_API_KEY (or --responses_api_api_key) before launching if you are using the default remote LLM backend.
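
A launcher script can check for the API key before starting the pipeline. The sketch below is a hypothetical helper, not part of the project: `build_launch_command` is an assumed name, and it relies only on the `OPENAI_API_KEY` variable and the `--mode local` and `--responses_api_api_key` flags mentioned above.

```python
import os
from typing import Mapping, Optional


def build_launch_command(
    env: Mapping[str, str], api_key: Optional[str] = None
) -> list[str]:
    """Return the argv for local mode, or raise if no API key is available.

    Hypothetical helper: the project does not ship this function.
    """
    cmd = ["speech-to-speech", "--mode", "local"]
    if api_key:
        # Pass the key explicitly on the command line.
        cmd += ["--responses_api_api_key", api_key]
    elif not env.get("OPENAI_API_KEY"):
        raise RuntimeError(
            "Set OPENAI_API_KEY or pass --responses_api_api_key "
            "when using the default remote LLM backend."
        )
    return cmd


if __name__ == "__main__":
    try:
        print(" ".join(build_launch_command(os.environ)))
    except RuntimeError as err:
        print(f"Not ready to launch: {err}")
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to start the pipeline once the check passes.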

Optimal Settings for Apple Silicon

The --local_mac_optimal_settings flag applies a tuned preset that selects MPS-accelerated models for every stage:
  • STT: Parakeet TDT (fast streaming ASR on Apple Silicon)
  • LLM: MLX-LM backend
  • TTS: Qwen3-TTS (MLX variant, 6-bit quantization by default)
  • Device: --device mps for all handlers
speech-to-speech --local_mac_optimal_settings
You can override any individual setting while keeping the rest of the preset:
speech-to-speech \
    --local_mac_optimal_settings \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
On Apple Silicon, --tts pocket and --tts kokoro are both valid TTS alternatives to qwen3. Pocket TTS provides voice cloning; Kokoro-82M is a fast, high-quality option. Note that Pocket TTS requires numpy>=2 and conflicts with DeepFilterNet, which requires numpy<2.
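
Because the two numpy requirements are mutually exclusive, a single environment can satisfy Pocket TTS or DeepFilterNet, but not both. The sketch below illustrates that constraint with a hypothetical helper (`supported_components` is not a project function, and the component labels are illustrative) that maps a numpy version string to the component it can satisfy.

```python
def supported_components(numpy_version: str) -> set[str]:
    """Map an installed numpy version to the components it can satisfy.

    Hypothetical helper encoding only the two constraints stated above:
    Pocket TTS needs numpy>=2, DeepFilterNet needs numpy<2.
    """
    major = int(numpy_version.split(".")[0])
    supported = set()
    if major >= 2:
        supported.add("pocket-tts")
    if major < 2:
        supported.add("deepfilternet")
    return supported


# No version can ever return both labels.
for version in ("1.26.4", "2.1.0"):
    print(version, sorted(supported_components(version)))
```

In a real environment, the installed version string can be read with `importlib.metadata.version("numpy")`.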

Selecting a Compute Device

Use --device to route all handlers to a specific device, or set per-handler device flags:
# Apple Silicon (MPS)
speech-to-speech --mode local --device mps

# NVIDIA GPU
speech-to-speech --mode local --device cuda

# CPU-only
speech-to-speech --mode local --device cpu
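
A wrapper script could choose the --device value automatically. The sketch below is an illustration under stated assumptions: `pick_device` and `device_args` are hypothetical helpers, and the CUDA, then MPS, then CPU preference order is a choice made for this example, not documented project behavior.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a value for --device from backend availability flags.

    Hypothetical helper; the preference order is an assumption.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def device_args(cuda_available: bool, mps_available: bool) -> list[str]:
    """Build the local-mode command line for the chosen device."""
    device = pick_device(cuda_available, mps_available)
    return ["speech-to-speech", "--mode", "local", "--device", device]


# Example: an Apple Silicon machine without CUDA.
print(" ".join(device_args(cuda_available=False, mps_available=True)))
```

With PyTorch installed, the two flags can be obtained from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.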

LLM Backend Examples

Fully local inference on Apple Silicon using MLX:
speech-to-speech \
    --mode local \
    --stt parakeet-tdt \
    --llm_backend mlx-lm \
    --tts qwen3 \
    --qwen3_tts_mlx_quantization 6bit \
    --model_name "mlx-community/Qwen3-4B-Instruct-2507-bf16" \
    --enable_live_transcription

Live Transcription

--enable_live_transcription (enabled by default) streams partial STT hypotheses to the terminal while the user is speaking. It works best with Parakeet TDT, which provides streaming ASR with sub-100 ms latency on Apple Silicon.
speech-to-speech --mode local --enable_live_transcription

Multi-Language Support

Pass --language auto to have the STT detect the spoken language on every turn and forward it to the LLM:
speech-to-speech \
    --local_mac_optimal_settings \
    --stt parakeet-tdt \
    --language auto \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16
For a fixed language, pass the language code directly. The following example selects Chinese and uses whisper-mlx for broader language coverage:
speech-to-speech \
    --local_mac_optimal_settings \
    --stt whisper-mlx \
    --stt_model_name large-v3 \
    --language zh \
    --model_name mlx-community/Qwen3-4B-Instruct-2507-bf16

How Local Audio Streaming Works

LocalAudioStreamer opens a single bidirectional sounddevice.Stream at 16 kHz, int16, mono with a block size of 512 samples. The stream callback drives both directions in one call:
  • Input path: when the output queue is empty (no assistant audio is playing), the callback copies the raw int16 microphone frame into the input_queue for the VAD handler.
  • Output path: when the output queue has data, the callback pops one chunk and writes it to the speaker output. A static ±1 LSB dither buffer keeps the audio sink active with negligible noise when no audio is queued.
  • Re-enabling listening: when the TTS emits an AUDIO_RESPONSE_DONE sentinel, the callback sets should_listen, allowing the next microphone frame to flow into the VAD.
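
The three paths above can be sketched as a single callback. This is a minimal illustration, not the project's implementation: `StreamerSketch`, the sentinel object, and the dither pattern are assumptions, each queued chunk is assumed to be exactly one block, and the buffers are plain bytes (as with sounddevice.RawStream) rather than the NumPy arrays that sounddevice.Stream passes, so the sketch runs without audio hardware.

```python
import queue
import threading

BLOCK_SAMPLES = 512
BLOCK_BYTES = BLOCK_SAMPLES * 2   # int16, mono
AUDIO_RESPONSE_DONE = object()    # stand-in for the TTS sentinel

# Static +/-1 LSB dither: alternating +1 and -1 as little-endian int16.
_PAIR = (1).to_bytes(2, "little", signed=True) + (-1).to_bytes(2, "little", signed=True)
DITHER_BLOCK = _PAIR * (BLOCK_SAMPLES // 2)


class StreamerSketch:
    """Hypothetical stand-in for LocalAudioStreamer's callback logic."""

    def __init__(self):
        self.input_queue = queue.Queue()    # microphone frames for the VAD
        self.output_queue = queue.Queue()   # TTS chunks for the speakers
        self.should_listen = threading.Event()

    def callback(self, indata, outdata, frames, time, status):
        # Signature mirrors sounddevice stream callbacks.
        try:
            chunk = self.output_queue.get_nowait()
        except queue.Empty:
            # Input path: nothing is playing, so forward the mic frame.
            self.input_queue.put(bytes(indata))
            outdata[:] = DITHER_BLOCK
            return
        if chunk is AUDIO_RESPONSE_DONE:
            # Re-enable listening; the next mic frame will be forwarded.
            self.should_listen.set()
            outdata[:] = DITHER_BLOCK
        else:
            # Output path: play one queued chunk.
            outdata[:] = chunk


# Simulate three callbacks without opening an audio device.
streamer = StreamerSketch()
mic, out = bytes(BLOCK_BYTES), bytearray(BLOCK_BYTES)
streamer.callback(mic, out, BLOCK_SAMPLES, None, None)   # idle
streamer.output_queue.put(b"\x10\x00" * BLOCK_SAMPLES)
streamer.output_queue.put(AUDIO_RESPONSE_DONE)
streamer.callback(mic, out, BLOCK_SAMPLES, None, None)   # playback
streamer.callback(mic, out, BLOCK_SAMPLES, None, None)   # sentinel
print(streamer.input_queue.qsize(), streamer.should_listen.is_set())
```

Using `get_nowait` rather than checking `empty()` first avoids a race between the check and the read when another thread is filling the output queue.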
Local mode is best suited for single-machine use. To stream audio from a separate device or browser, use Server/Client mode, WebSocket mode, or Realtime mode instead.
