What is WhisperKit?
WhisperKit brings powerful speech processing capabilities directly to Apple devices:
- WhisperKit: Speech-to-text powered by OpenAI’s Whisper models, optimized for on-device inference
- TTSKit: Text-to-speech using Qwen3 TTS models for natural-sounding voice synthesis
Key Features
WhisperKit (Speech Recognition)
- Real-time streaming transcription with live audio input
- Word-level timestamps for precise alignment
- Voice activity detection (VAD) to optimize processing
- Multilingual support with automatic language detection
- Flexible model selection from tiny to large-v3 variants
- Custom model deployment via HuggingFace integration
- OpenAI-compatible API server for easy integration
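As a minimal sketch of the file-transcription flow (the `WhisperKit()` initializer and `transcribe(audioPath:)` follow recent releases of the package; check the repository for the current signatures):

```swift
import WhisperKit

// Load a recommended model for this device, then transcribe a file.
// The model downloads on first use; later runs load it from the local cache.
Task {
    do {
        let pipe = try await WhisperKit()
        let results = try await pipe.transcribe(audioPath: "path/to/audio.wav")
        // Each result carries the transcribed text along with timing metadata.
        print(results.map(\.text).joined(separator: " "))
    } catch {
        print("Transcription failed: \(error)")
    }
}
```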
TTSKit (Text-to-Speech)
- Real-time streaming playback as audio generates
- 9 built-in voices with natural prosody
- Support for 10 languages, including English, Chinese, Japanese, and Korean
- Style instructions for controlling speech characteristics (1.7B model)
- Automatic text chunking for long-form content
- Concurrent generation for improved throughput
- Prompt caching for 90% faster subsequent generations
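The shape of a TTS call is sketched below; the type and method names (`TTSKit`, `speak(_:voice:)`, the model enum) are illustrative assumptions, not the package's confirmed API:

```swift
import TTSKit  // module name assumed for illustration

// Illustrative sketch only: the initializer and method below are
// hypothetical names standing in for the package's real API.
Task {
    do {
        let tts = try await TTSKit(model: .qwen3TTS_0_6b)  // hypothetical initializer
        // Streaming playback: audio starts as soon as the first chunk is generated.
        try await tts.speak("Hello from on-device speech synthesis.", voice: "default")
    } catch {
        print("TTS failed: \(error)")
    }
}
```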
Use Cases
Live Transcription
Real-time meeting transcription, lecture notes, accessibility features
Voice Commands
On-device voice control with low latency and full privacy
Content Creation
Generate narration, audiobooks, and voiceovers without cloud services
Accessibility
Text-to-speech for screen readers, speech-to-text for hearing assistance
Platform Support
WhisperKit
- iOS 16.0+
- macOS 13.0+
- watchOS 10.0+
- visionOS 1.0+
TTSKit
- iOS 18.0+
- macOS 15.0+
Architecture
WhisperKit uses a modular, protocol-based architecture where each component can be swapped.
Model Variants
WhisperKit Models
Models are automatically recommended based on your device capabilities:
- tiny: Fastest, lowest accuracy (~150 MB)
- base: Good balance for mobile devices (~290 MB)
- small: Better accuracy, still fast (~967 MB)
- medium: High accuracy (~3.1 GB)
- large-v3: Best accuracy (~6.2 GB)
- distil-large-v3: Distilled version with similar accuracy to large-v3 but faster
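To pin a specific variant instead of relying on the device-based recommendation, pass a model name at initialization (the `WhisperKitConfig(model:)` form follows recent releases; treat the exact signature as an assumption):

```swift
import WhisperKit

Task {
    // "distil-large-v3" trades a small accuracy hit for large-v3-class
    // quality at lower latency, as listed in the variant table above.
    let config = WhisperKitConfig(model: "distil-large-v3")
    let pipe = try? await WhisperKit(config)
}
```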
TTSKit Models
- qwen3TTS_0_6b: Fast, runs on all platforms (~1 GB)
- qwen3TTS_1_7b: Higher quality, macOS only, supports style instructions (~2.2 GB)
Performance
WhisperKit is optimized for Apple Silicon with:
- Neural Engine acceleration for maximum performance
- Automatic compute unit selection based on device capabilities
- Model specialization for reduced latency on first run
- Efficient memory usage with streaming processing
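Compute-unit selection can also be overridden manually. The sketch below assumes the `ModelComputeOptions` type and `computeOptions:` parameter from recent releases; verify the names against the current API:

```swift
import WhisperKit

Task {
    // Prefer the Neural Engine for the heavy models; Core ML falls back
    // to GPU/CPU automatically when a requested unit is unavailable.
    let computeOptions = ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    )
    let config = WhisperKitConfig(model: "base", computeOptions: computeOptions)
    let pipe = try? await WhisperKit(config)
}
```

Forcing the Neural Engine triggers model specialization on first load, which is why the first run carries extra latency that later runs avoid.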
For production deployments requiring real-time transcription and speaker diarization at scale, consider Argmax Pro SDK with 9x faster models and advanced features.
Open Source
WhisperKit is open source under the MIT License:
- GitHub Repository
- Python Tools for model conversion
- Benchmarks & Device Support
- Discord Community
Next Steps
Installation
Add WhisperKit to your Xcode project
Quick Start
Get started with your first transcription and TTS generation