Apple Silicon Macs ship with a unified memory architecture that MLX is purpose-built to exploit. By routing the STT, LLM, and TTS stages entirely through the Metal Performance Shaders (MPS) device and the mlx-audio / mlx-lm libraries, you can run a fully local, low-latency voice agent without any cloud dependency or CUDA hardware.
Quick-start with --local_mac_optimal_settings
The --local_mac_optimal_settings flag applies every Apple-Silicon-specific override in one shot so you do not have to remember the individual flags:
What the flag sets
| Setting | Value | Notes |
|---|---|---|
| --device | mps | All handlers default to the Metal GPU |
| --stt | parakeet-tdt | MLX backend via mlx-community/parakeet-tdt-0.6b-v3 |
| --llm_backend | mlx-lm | Pure MLX LLM inference |
| --tts | qwen3 | mlx-audio backend, defaults to 6bit quantization |
| --mode | local | Local audio I/O (microphone + speakers) |
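The table can also be read as a plain flag-to-value mapping. The sketch below is illustrative only: it models just the five settings listed in the table, and the names LOCAL_MAC_OPTIMAL_SETTINGS and expanded_flags are hypothetical, not part of the pipeline's code.

```python
import shlex

# Flag/value pairs copied from the table above. The mapping and helper
# names are hypothetical; the real pipeline may set more than these.
LOCAL_MAC_OPTIMAL_SETTINGS = {
    "--device": "mps",
    "--stt": "parakeet-tdt",
    "--llm_backend": "mlx-lm",
    "--tts": "qwen3",
    "--mode": "local",
}


def expanded_flags():
    """Return the table's settings as a flat argv-style list."""
    args = []
    for flag, value in LOCAL_MAC_OPTIMAL_SETTINGS.items():
        args += [flag, value]
    return args


# Render the list as a shell-safe string for inspection.
print(shlex.join(expanded_flags()))
```

Printing the joined list is a quick way to compare the shortcut against a hand-written command line.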
Full equivalent expansion
The one-liner above is exactly equivalent to:
If you accidentally pass --device cuda on macOS, the pipeline raises a ValueError immediately: Cannot use CUDA on macOS. Please set the device to 'cpu' or 'mps'. Use --device mps, or omit --device entirely and let --local_mac_optimal_settings set it for you.
TTS options on macOS
Three TTS backends are actively supported on Apple Silicon. Qwen3-TTS is the default; Pocket TTS and Kokoro are opt-in alternatives.
- Qwen3 (default)
- Pocket TTS
- Kokoro
Qwen3-TTS uses the mlx-audio backend on macOS and streams audio in real time. The --local_mac_optimal_settings shortcut selects it automatically.
Qwen3-TTS MLX quantization options
On Apple Silicon the Qwen/* model ID is automatically mapped to the matching mlx-community/* MLX variant. The default quantization is 6bit, which offers a good balance between quality and memory footprint. Use --qwen3_tts_mlx_quantization to override it.
| Quantization | Memory | Notes |
|---|---|---|
| bf16 | Highest | Full precision; best quality |
| 8bit | High | Near-lossless |
| 6bit | Medium | Default; recommended for most M-series chips |
| 4bit | Low | Smallest model; audible quality drop on longer sentences |
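To make the Qwen/* to mlx-community/* mapping concrete, here is a minimal sketch. The four quantization values and the 6bit default come from the table above; the helper name resolve_mlx_tts_repo, the "-<quantization>" suffix convention, and the model ID Qwen/Some-TTS-Model are assumptions for illustration, not the pipeline's actual logic.

```python
# Values taken from the table above.
QUANTIZATION_CHOICES = ("bf16", "8bit", "6bit", "4bit")
DEFAULT_QUANTIZATION = "6bit"


def resolve_mlx_tts_repo(model_id, quantization=DEFAULT_QUANTIZATION):
    """Map a Qwen/* model ID to an mlx-community/* repo name.

    The "<name>-<quantization>" suffix is an assumed naming convention.
    """
    if quantization not in QUANTIZATION_CHOICES:
        raise ValueError(
            f"Unknown quantization {quantization!r}; "
            f"expected one of {QUANTIZATION_CHOICES}"
        )
    org, _, name = model_id.partition("/")
    if org != "Qwen" or not name:
        # Anything that is not a Qwen/* ID is passed through unchanged.
        return model_id
    return f"mlx-community/{name}-{quantization}"


# "Qwen/Some-TTS-Model" is a placeholder, not a real model ID.
print(resolve_mlx_tts_repo("Qwen/Some-TTS-Model"))
print(resolve_mlx_tts_repo("Qwen/Some-TTS-Model", "4bit"))
```

Rejecting unknown quantization values early keeps a typo from turning into a confusing download error later.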
Selecting the MLX LLM model
The default MLX LLM is mlx-community/Qwen3-4B-Instruct-2507-bf16. Any mlx-community model on the Hugging Face Hub can be swapped in via --model_name:
--model_name accepts mlx-community models when --llm_backend mlx-lm is active. Install the mlx-lm extra if it is not already installed:
The global MLX lock and --num_pipelines
MLX models (STT, LLM, TTS) cannot run concurrently from multiple threads on Apple Silicon because Metal command buffers are not re-entrant. The pipeline manages this via a global reentrant lock (mlx_lock.py). Each handler acquires the lock before running inference and releases it immediately after.
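The locking pattern described above can be sketched with Python's standard threading.RLock. This is not the content of mlx_lock.py, which is not shown here; MLX_LOCK and run_inference are hypothetical names, and time.sleep stands in for a real forward pass.

```python
import threading
import time

# One process-wide reentrant lock shared by every MLX-backed handler.
MLX_LOCK = threading.RLock()

# Bookkeeping used only to show that inference never overlaps.
_active = 0
max_active = 0
_counter_guard = threading.Lock()


def run_inference(stage):
    """Stand-in for an STT, LLM, or TTS forward pass (hypothetical)."""
    global _active, max_active
    with MLX_LOCK:  # acquire before inference
        with _counter_guard:
            _active += 1
            max_active = max(max_active, _active)
        time.sleep(0.01)  # placeholder for the actual MLX call
        with _counter_guard:
            _active -= 1
    # The lock is released as soon as the with-block exits.
    return f"{stage} done"


threads = [
    threading.Thread(target=run_inference, args=(stage,))
    for stage in ("stt", "llm", "tts")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(max_active)  # never more than one handler inside the lock
```

A reentrant lock lets a handler that already holds it call into another locked helper on the same thread without deadlocking, which an ordinary Lock would not allow.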
When you run more than one pipeline in parallel with --num_pipelines > 1, the progressive STT path (live transcription) competes heavily for the same MLX lock. This produces a flood of contention warnings without affecting final transcripts. The pipeline detects this situation at startup and automatically disables live transcription on macOS when --num_pipelines > 1:
To keep live transcription, use --num_pipelines 1 (the default). If you need multiple concurrent sessions, accept that live transcription will be disabled:
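A startup check of the kind described in this section might look like the sketch below. The function name resolve_live_transcription, its parameters, and the warning text are assumptions; only the rule itself (macOS plus more than one pipeline disables live transcription) comes from the text above.

```python
import logging
import platform

logger = logging.getLogger("pipeline_startup")


def resolve_live_transcription(requested, num_pipelines, system=None):
    """Decide whether live transcription stays enabled.

    `system` defaults to platform.system(), which is "Darwin" on macOS.
    """
    system = system or platform.system()
    if requested and system == "Darwin" and num_pipelines > 1:
        # Hypothetical message; the pipeline's real log line may differ.
        logger.warning(
            "Disabling live transcription: %d pipelines would contend "
            "for the global MLX lock on macOS.",
            num_pipelines,
        )
        return False
    return requested


print(resolve_live_transcription(True, 1, system="Darwin"))
print(resolve_live_transcription(True, 2, system="Darwin"))
```

Passing `system` explicitly makes the rule easy to exercise on any platform.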
Running in realtime mode on Apple Silicon
Realtime mode exposes an OpenAI Realtime-compatible WebSocket endpoint at /v1/realtime. Connect to it from any OpenAI Realtime-compatible client:
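As a rough illustration, the sketch below opens a WebSocket to the endpoint using the third-party websockets package and prints the type of the first event the server sends, if any. The host, the port 8765, and both function names are assumptions; only the /v1/realtime path comes from this page, and which events the server emits on connect is not documented here.

```python
import asyncio
import json


def realtime_url(host="localhost", port=8765, secure=False):
    """Build the endpoint URL. Host and port are assumed defaults."""
    scheme = "wss" if secure else "ws"
    return f"{scheme}://{host}:{port}/v1/realtime"


async def print_first_event(url, timeout=5.0):
    """Connect and print the type of the first server event, if any."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(url) as ws:
        try:
            raw = await asyncio.wait_for(ws.recv(), timeout=timeout)
        except asyncio.TimeoutError:
            print("connected, but no event arrived before the timeout")
            return
        event = json.loads(raw)
        print(event.get("type"))
```

With the server running, calling asyncio.run(print_first_event(realtime_url())) is enough to confirm that the endpoint is reachable.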
Benchmarking Qwen3-TTS MLX quantization variants
Use benchmark_tts.py to measure latency and real-time factor (RTF) for each quantization level on your specific hardware before committing to a setting in production:
Results are saved to tts_benchmark_results.json for further analysis.
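For reference, RTF is the time spent synthesizing divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below shows that arithmetic only; it does not reproduce benchmark_tts.py or the schema of tts_benchmark_results.json, and the numbers are made-up inputs rather than measurements.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def audio_duration(num_samples, sample_rate):
    """Duration in seconds of a mono clip with the given sample count."""
    return num_samples / sample_rate


# Made-up example: 0.5 s of compute for 48000 samples at 24 kHz (2 s of audio).
duration = audio_duration(48000, 24000)
rtf = real_time_factor(0.5, duration)
print(f"duration={duration:.2f}s rtf={rtf:.2f}")
```

When comparing quantization levels, lower RTF at acceptable audio quality is the figure to look for.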