WhisperKit Overview
WhisperKit is a Swift framework that brings OpenAI’s Whisper speech recognition models to Apple devices. It leverages Core ML for efficient on-device inference, enabling privacy-focused, low-latency speech-to-text capabilities.
Key Features
On-Device Processing
All transcription happens locally on the device, ensuring user privacy and enabling offline functionality.
Multiple Model Sizes
Choose from tiny, base, small, medium, and large variants to balance accuracy and performance.
Multilingual Support
Supports 99+ languages with automatic language detection for multilingual models.
Real-time Streaming
Stream audio from the microphone and receive transcriptions in real time.
Core Components
WhisperKit consists of several key components:
WhisperKit Class
The main entry point for speech-to-text functionality. It orchestrates all the components needed for transcription.
Audio Processing Pipeline
- AudioProcessor - Captures and processes audio input
- FeatureExtractor - Converts audio to mel spectrograms
- AudioEncoder - Encodes audio features using the Whisper encoder model
- TextDecoder - Decodes encoded features into text tokens
- Tokenizer - Converts tokens to readable text
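The order in which these components hand data to one another can be sketched as a chain of stages. This is a minimal illustration only: `PipelineSketch` and its closure properties are hypothetical names that mirror the component roles listed above, not WhisperKit's actual types, and the toy stages stand in for real models.

```swift
// Hypothetical sketch: stage names mirror the component roles above,
// but none of these types are WhisperKit's real API.
struct PipelineSketch {
    let extractFeatures: ([Float]) -> [[Float]]  // FeatureExtractor role: samples -> mel frames
    let encode: ([[Float]]) -> [Float]           // AudioEncoder role: mel frames -> encoded features
    let decode: ([Float]) -> [Int]               // TextDecoder role: encoded features -> token IDs
    let detokenize: ([Int]) -> String            // Tokenizer role: token IDs -> readable text

    // Runs the stages in the same order as the list above.
    func transcribe(samples: [Float]) -> String {
        let features = extractFeatures(samples)
        let encoded = encode(features)
        let tokens = decode(encoded)
        return detokenize(tokens)
    }
}

// Toy stand-ins so the sketch runs without any model files.
let vocabulary = ["hello", "world"]
let toyPipeline = PipelineSketch(
    extractFeatures: { samples in samples.map { [$0] } },  // one fake "frame" per sample
    encode: { frames in frames.map { $0[0] } },            // flatten the fake frames
    decode: { encoded in encoded.map { Int($0) } },        // treat each value as a token ID
    detokenize: { tokens in tokens.map { vocabulary[$0] }.joined(separator: " ") }
)

let toyText = toyPipeline.transcribe(samples: [0, 1])
print(toyText)
```

Keeping each stage behind a narrow function type makes it easy to swap one stage out (for example, a different tokenizer) without touching the others.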
Model Architecture
WhisperKit uses Core ML models downloaded from the Hugging Face Hub:
- MelSpectrogram.mlmodelc - Audio feature extraction
- AudioEncoder.mlmodelc - Audio encoding
- TextDecoder.mlmodelc - Text decoding
- TextDecoderContextPrefill.mlmodelc - Optional prefill optimization
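If you manage model folders yourself, a quick check that the expected bundles are present can catch incomplete downloads early. The sketch below uses only Foundation and the file names listed above; the function names and the folder path are hypothetical, and it assumes all bundles sit directly inside one folder.

```swift
import Foundation

// File names come from the list above; the function names are hypothetical.
let requiredModels = ["MelSpectrogram.mlmodelc", "AudioEncoder.mlmodelc", "TextDecoder.mlmodelc"]
let optionalPrefillModel = "TextDecoderContextPrefill.mlmodelc"

/// Returns the required model bundles that are missing from `folder`.
func missingModels(in folder: URL) -> [String] {
    requiredModels.filter { name in
        !FileManager.default.fileExists(atPath: folder.appendingPathComponent(name).path)
    }
}

/// True when the optional prefill model is present in `folder`.
func hasPrefillModel(in folder: URL) -> Bool {
    FileManager.default.fileExists(
        atPath: folder.appendingPathComponent(optionalPrefillModel).path
    )
}

// Placeholder path: replace with the folder that holds your downloaded models.
let modelFolder = URL(fileURLWithPath: "path/to/model/folder")
print("Missing:", missingModels(in: modelFolder))
print("Prefill available:", hasPrefillModel(in: modelFolder))
```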
Quick Start
Basic Initialization
Custom Configuration
Model States
WhisperKit models progress through several states:
- unloaded - Models not yet loaded
- prewarming - Models being specialized for the device (optional)
- prewarmed - Specialization complete
- loading - Models being loaded into memory
- loaded - Ready for transcription
- unloading - Models being removed from memory
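One way to reason about these states is as a small state machine. In the sketch below the case names follow the list above, but the enum name `ModelStateSketch` and the transition table are assumptions made for illustration; WhisperKit's actual transition rules may differ.

```swift
// Case names follow the list above; the transition table is an assumption
// for illustration and may not match WhisperKit's actual behavior.
enum ModelStateSketch: String {
    case unloaded, prewarming, prewarmed, loading, loaded, unloading

    // States this sketch allows as the next step.
    var allowedNext: Set<ModelStateSketch> {
        switch self {
        case .unloaded:   return [.prewarming, .loading]  // prewarming is optional
        case .prewarming: return [.prewarmed, .unloaded]
        case .prewarmed:  return [.loading, .unloaded]
        case .loading:    return [.loaded, .unloaded]
        case .loaded:     return [.unloading]
        case .unloading:  return [.unloaded]
        }
    }

    // Only the loaded state is ready for transcription.
    var isReady: Bool { self == .loaded }
}

func canTransition(from: ModelStateSketch, to: ModelStateSketch) -> Bool {
    from.allowedNext.contains(to)
}

print(canTransition(from: .unloaded, to: .loading))
print(canTransition(from: .loaded, to: .prewarming))
```

Gating transcription calls on a check like `isReady` avoids sending audio to models that are still loading or already being torn down.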
Sample Rates and Constants
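For orientation, the reference Whisper implementation works on 16 kHz audio, cut into 30-second windows, with a spectrogram hop of 160 samples. The sketch below shows how those numbers relate to each other. The values are Whisper's published audio parameters; the names `WhisperAudioSketch` and `windowCount` are hypothetical, and WhisperKit may expose its constants under different names.

```swift
// Values are the reference Whisper audio parameters; the names are hypothetical
// and WhisperKit may expose them differently.
enum WhisperAudioSketch {
    static let sampleRate = 16_000                           // samples per second (Hz)
    static let chunkSeconds = 30                             // length of one model input window
    static let hopLength = 160                               // samples between spectrogram frames
    static let windowSamples = sampleRate * chunkSeconds     // 480_000 samples per window
    static let framesPerWindow = windowSamples / hopLength   // 3_000 mel frames per window
}

/// Number of 30-second windows needed to cover `sampleCount` samples
/// (the last window would be padded if it is only partly filled).
func windowCount(forSampleCount sampleCount: Int) -> Int {
    (sampleCount + WhisperAudioSketch.windowSamples - 1) / WhisperAudioSketch.windowSamples
}

// A 45-second recording at 16 kHz spans two windows.
let recordingSamples = 45 * WhisperAudioSketch.sampleRate
print(windowCount(forSampleCount: recordingSamples))
```

Audio recorded at another rate (for example 44.1 kHz) has to be resampled to 16 kHz before these window and frame counts apply.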
WhisperKit uses fixed audio parameters matching the Whisper model requirements.
Next Steps
Transcription
Learn how to transcribe audio files
Streaming
Real-time audio transcription
Model Selection
Choose the right model for your needs
Configuration
Advanced configuration options