Local mode runs the entire speech-to-speech pipeline on your machine, reading audio directly from the default microphone and writing generated speech to the default speakers via sounddevice. There is no TCP socket or WebSocket server; the LocalAudioStreamer manages a bidirectional sounddevice.Stream at 16 kHz, int16, mono (512-sample blocks). This makes local mode the fastest path to a working voice agent on a single machine: no client process, no network, no port configuration.
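The stream parameters quoted above determine how long each audio block lasts and how many bytes it carries. The short calculation below is only a worked check of that arithmetic; the constant names are illustrative and are not taken from the codebase.

```python
# Stream parameters quoted above; the constant names are illustrative only.
SAMPLE_RATE_HZ = 16_000   # 16 kHz
BLOCK_SAMPLES = 512       # samples delivered per stream block
BYTES_PER_SAMPLE = 2      # int16
CHANNELS = 1              # mono

block_duration_ms = 1000 * BLOCK_SAMPLES / SAMPLE_RATE_HZ
block_size_bytes = BLOCK_SAMPLES * BYTES_PER_SAMPLE * CHANNELS
blocks_per_second = SAMPLE_RATE_HZ / BLOCK_SAMPLES

print(f"{block_duration_ms:.0f} ms per block")        # 32 ms per block
print(f"{block_size_bytes} bytes per block")          # 1024 bytes per block
print(f"{blocks_per_second:.2f} blocks per second")   # 31.25 blocks per second
```

Each block therefore covers 32 ms of audio, which is the granularity at which microphone frames and speaker chunks are exchanged.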
Starting Local Mode
Set OPENAI_API_KEY (or pass --responses_api_api_key) before launching if you are using the default remote LLM backend.
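If you launch the pipeline from a wrapper script, a small pre-flight check can catch a missing key before any models load. The sketch below is a hypothetical helper, not part of the pipeline: the function name `resolve_api_key` is invented for illustration, and the rule that an explicit value takes precedence over the environment variable is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_key: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Return the API key to use, or raise if none is configured.

    Hypothetical helper. Assumption: a value passed explicitly (as with
    --responses_api_api_key) wins over the OPENAI_API_KEY variable.
    """
    key = cli_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found: set OPENAI_API_KEY or pass "
            "--responses_api_api_key."
        )
    return key
```

Calling `resolve_api_key()` with no arguments reads the current environment and fails fast with a readable message when nothing is configured.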
Optimal Settings for Apple Silicon
The --local_mac_optimal_settings flag applies a tuned preset that selects MPS-accelerated models for every stage:
- STT: Parakeet TDT (fast streaming ASR on Apple Silicon)
- LLM: MLX-LM backend
- TTS: Qwen3-TTS (MLX variant, 6-bit quantization by default)
- Device: --device mps for all handlers
Selecting a Compute Device
Use --device to route all handlers to a specific device, or set per-handler device flags:
LLM Backend Examples
- MLX-LM (Apple Silicon)
- Transformers (CUDA/CPU)
- OpenAI (Responses API)
- HF Inference Providers
Fully local inference on Apple Silicon using MLX:
Live Transcription
--enable_live_transcription (enabled by default) streams partial STT hypotheses to the terminal while the user is speaking. Works best with Parakeet TDT, which provides sub-100 ms latency streaming ASR on Apple Silicon.
Multi-Language Support
Pass --language auto to have the STT detect the spoken language on every turn and forward it to the LLM:
Use whisper-mlx for broader language coverage:
How Local Audio Streaming Works
LocalAudioStreamer opens a single bidirectional sounddevice.Stream at 16 kHz, int16, mono with a block size of 512 samples. The stream callback drives both directions in one call:
- Input path: when the output queue is empty (no assistant audio is playing), the callback copies the raw int16 microphone frame into the input_queue for the VAD handler.
- Output path: when the output queue has data, the callback pops one chunk and writes it to the speaker output. A static ±1 LSB dither buffer keeps the audio sink active with negligible noise when no audio is queued.
- Re-enabling listening: when the TTS emits an AUDIO_RESPONSE_DONE sentinel, the callback sets should_listen, allowing the next microphone frame to flow into the VAD.
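The three paths above can be sketched as a single callback. This is a minimal illustration under stated assumptions, not the actual LocalAudioStreamer code: it works on raw byte buffers so that it runs without audio hardware (the real streamer uses sounddevice.Stream, whose callback receives NumPy arrays), the sentinel is a stand-in object, every queued chunk is assumed to be exactly one block long, and gating the microphone on should_listen inside the callback is a guess at where that check lives.

```python
import queue
import random
import struct
import threading

BLOCK_SAMPLES = 512             # one block of int16 mono samples
AUDIO_RESPONSE_DONE = object()  # stand-in for the real sentinel

# Static dither block: every sample is -1, 0, or +1 (at most 1 LSB).
_rng = random.Random(0)
DITHER = struct.pack(
    f"<{BLOCK_SAMPLES}h",
    *(_rng.choice((-1, 0, 1)) for _ in range(BLOCK_SAMPLES)),
)

input_queue = queue.Queue()     # microphone frames for the VAD handler
output_queue = queue.Queue()    # TTS chunks (bytes) or the sentinel
should_listen = threading.Event()
should_listen.set()


def callback(indata, outdata, frames, time, status):
    """Handle one block; same argument order as a sounddevice callback."""
    try:
        chunk = output_queue.get_nowait()
    except queue.Empty:
        chunk = None

    if chunk is AUDIO_RESPONSE_DONE:
        # The assistant finished speaking: re-enable listening.
        should_listen.set()
        chunk = None

    if chunk is None:
        # Input path: nothing to play, so forward the microphone frame
        # (assumption: only while should_listen is set).
        if should_listen.is_set():
            input_queue.put(bytes(indata))
        outdata[:] = DITHER     # keep the audio sink active
    else:
        # Output path: play exactly one queued chunk.
        outdata[:] = chunk
```

Doing both directions in one callback keeps capture and playback on the same clock, and skipping the microphone frame while a chunk is playing keeps the assistant's own speech out of the VAD.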
Local mode is best suited for single-machine use. To stream audio from a separate device or browser, use Server/Client mode, WebSocket mode, or Realtime mode instead.