WhisperKit Models
All WhisperKit models are hosted on HuggingFace in CoreML format, optimized for the Apple Neural Engine.
Model Repository
WhisperKit CoreML Models
Browse all available models on HuggingFace
Standard Whisper Models
Five standard sizes are available: tiny, base, small, medium, and large-v3.
Tiny
Model ID: openai_whisper-tiny
Best for:
- Quick testing and prototyping
- Resource-constrained devices
- When speed is more important than accuracy
- iPhone 13 and earlier devices
Performance:
- Real-time on all supported devices
- WER (Word Error Rate): ~15-20% on English
- RTF (real-time factor) < 0.2 on most devices
Usage:
```swift
let pipe = try await WhisperKit(WhisperKitConfig(model: "tiny"))
```
Base
Model ID: openai_whisper-base
Best for:
- Balance of speed and accuracy
- General-purpose transcription
- iPhone 13 and newer
- Apps requiring low latency
Performance:
- Real-time on all supported devices
- WER: ~10-15% on English
- RTF < 0.3 on most devices
Usage:
```swift
let pipe = try await WhisperKit(WhisperKitConfig(model: "base"))
```
Small
Model ID: openai_whisper-small
Best for:
- Good accuracy with reasonable speed
- iPhone 14 and newer
- M1 Macs and above
- Production applications
Performance:
- Real-time on iPhone 14 Pro and newer
- WER: ~8-12% on English
- RTF ~0.4-0.6 on modern devices
Usage:
```swift
let pipe = try await WhisperKit(WhisperKitConfig(model: "small"))
```
Medium
Model ID: openai_whisper-medium
Best for:
- High accuracy requirements
- iPhone 15 Pro and newer
- M1 Macs and above
- Offline transcription tasks
Performance:
- Real-time on iPhone 15 Pro, M1 Mac+
- WER: ~6-9% on English
- RTF ~0.7-1.0 on modern devices
Usage:
```swift
let pipe = try await WhisperKit(WhisperKitConfig(model: "medium"))
```
Large V3
Model ID: openai_whisper-large-v3
Best for:
- Maximum accuracy
- Desktop/server applications
- Mac Studio, MacBook Pro (M3 Pro+)
- Offline high-quality transcription
Performance:
- Real-time on M2 Pro and above
- WER: ~4-6% on English
- RTF ~1.2-2.0 depending on device
- Best multilingual support
Usage:
```swift
let pipe = try await WhisperKit(WhisperKitConfig(model: "large-v3"))
```
Distilled Models
Distilled models provide significant performance improvements with minimal accuracy loss through knowledge distillation.
Distil-Large-V3
Model ID: distil-whisper_distil-large-v3
Compared with large-v3: roughly 50% smaller and about 2x faster.
Advantages:
- Significantly faster than large-v3
- Much smaller download and memory footprint
- Near-identical accuracy to large-v3
- Real-time on iPhone 15 Pro
- Recommended for most use cases
Performance:
- WER: ~5-7% on English
- RTF ~0.6-0.9 on modern devices
- Runs well on iPhone 14 Pro and newer
Usage:
```swift
let pipe = try await WhisperKit(WhisperKitConfig(model: "distil*large-v3"))
// Glob pattern matches distil-whisper_distil-large-v3
```
Other Distilled Models
Several other distilled variants are available in the model repository:
- distil-whisper_distil-medium.en
- distil-whisper_distil-small.en
These are English-only models optimized for even faster inference.
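Following the pattern of the usage snippets above, an English-only distilled model can be requested by its full repository ID. A minimal sketch (the ID lookup is assumed to work the same way as the glob-style match shown for distil-large-v3):

```swift
// Load the English-only small distilled model by its full ID.
// Matching by full repository ID is an assumption based on the
// glob example shown for distil-large-v3 above.
let pipe = try await WhisperKit(WhisperKitConfig(model: "distil-whisper_distil-small.en"))
```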
Model Selection Guide
Recommendations are grouped below by device, by use case, and by language.
iPhone
| Device | Recommended | Real-Time |
|---|---|---|
| iPhone 15 Pro | distil-large-v3, medium | large-v3 |
| iPhone 14 Pro | medium, small | medium |
| iPhone 13 Pro | small, base | small |
| iPhone 12/13 | base, tiny | base |
iPad
| Device | Recommended | Real-Time |
|---|---|---|
| iPad Pro (M1+) | large-v3, distil-large-v3 | large-v3 |
| iPad Air (M1+) | medium, distil-large-v3 | medium |
| iPad (A14+) | small, base | small |
Mac
| Device | Recommended | Real-Time |
|---|---|---|
| Mac Studio (Ultra) | large-v3 | All models |
| MacBook Pro (M3 Pro+) | large-v3 | large-v3 |
| MacBook Air (M1+) | distil-large-v3, medium | medium |
| Mac mini (M1+) | medium, small | small |
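The device tables above can be collapsed into a small helper. This is an illustrative sketch only: the device labels are plain strings copied from the tables, not values returned by any system API, and a real app would map from hardware identifiers instead.

```swift
// Illustrative: pick a default model name from the recommendation tables above.
// Device labels are plain strings, not values from a system API.
func recommendedModel(for device: String) -> String {
    switch device {
    case "iPhone 15 Pro":
        return "distil-large-v3"
    case "iPhone 14 Pro":
        return "medium"
    case "iPhone 13 Pro":
        return "small"
    case "iPhone 12", "iPhone 13":
        return "base"
    case "iPad Pro (M1+)", "MacBook Pro (M3 Pro+)", "Mac Studio (Ultra)":
        return "large-v3"
    case "iPad Air (M1+)", "MacBook Air (M1+)", "Mac mini (M1+)":
        return "medium"
    default:
        return "base" // conservative fallback for unknown hardware
    }
}
```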
Real-Time Streaming
- Best: tiny, base, small
- Good: medium (on powerful devices)
- Use: distil-large-v3 for best accuracy/speed balance
Offline Transcription
- Best: large-v3, distil-large-v3
- Good: medium
- When speed matters: small, base
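For the offline case, a recorded file can be transcribed in a single call. A minimal sketch, assuming the transcribe(audioPath:) API and a placeholder file path ("recording.wav" is not a real asset):

```swift
// Sketch: offline transcription of a local file with a high-accuracy model.
// "recording.wav" is a placeholder path; transcribe(audioPath:) is assumed
// to return an array of transcription results with a .text property.
let pipe = try await WhisperKit(WhisperKitConfig(model: "distil-large-v3"))
let results = try await pipe.transcribe(audioPath: "recording.wav")
let text = results.map(\.text).joined(separator: " ")
print(text)
```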
Multilingual
- Best: large-v3 (supports 99+ languages)
- Good: medium (good multilingual support)
- Acceptable: small (limited multilingual)
- Not recommended: tiny, base (poor multilingual)
Low-Resource Devices
- Best: tiny, base
- Alternative: distil-small.en (English only)
- Consider: Server-based transcription for accuracy
High Accuracy
- Best: large-v3
- Nearly as good: distil-large-v3 (much faster)
- Good: medium
English Only
- Best accuracy: large-v3, distil-large-v3
- Best speed: distil-small.en, distil-medium.en
- Balanced: small, medium
Multilingual (99+ languages)
All standard models support multiple languages:
- Best: large-v3
- Good: medium
- Acceptable: small
- Limited: base, tiny
Large models perform better on:
- Chinese (Mandarin, Cantonese)
- Japanese
- Arabic
- Korean
- Indian languages
Small models are sufficient for:
- Spanish
- French
- German
- Italian
- Portuguese
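When the spoken language is known up front, it can be pinned at decode time rather than left to auto-detection. A sketch, assuming a DecodingOptions type with a language field and a decodeOptions parameter on transcribe (both names are assumptions; check the WhisperKit API):

```swift
// Sketch: force Spanish decoding instead of auto-detect.
// DecodingOptions(language:) and decodeOptions: are assumed parameter names.
let pipe = try await WhisperKit(WhisperKitConfig(model: "large-v3"))
let options = DecodingOptions(language: "es")
let results = try await pipe.transcribe(audioPath: "spanish.wav", decodeOptions: options)
```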
Custom Models
Creating Custom Models
Fine-tune Whisper
Use whisperkittools to fine-tune on your dataset:
```shell
python -m whisperkittools.train \
    --model large-v3 \
    --dataset your_dataset \
    --output-dir custom_model
```
Convert to CoreML
Convert the fine-tuned model to CoreML:
```shell
python -m whisperkittools.convert \
    --model custom_model \
    --output-dir coreml_model
```
Upload to HuggingFace
Upload to your HuggingFace repository:
```shell
huggingface-cli upload username/model-repo coreml_model
```
Use in WhisperKit
Load your custom model:
```swift
let config = WhisperKitConfig(
    model: "large-v3",
    modelRepo: "username/model-repo"
)
let pipe = try await WhisperKit(config)
```
Use Cases for Custom Models
- Domain-specific vocabulary (medical, legal, technical)
- Accents and dialects
- Background noise handling
- Custom wake words
- Language variants
TTSKit Models
Qwen3 TTS 0.6B
Model ID: qwen3TTS_0_6b
Features:
- 9 voices
- 10 languages
- Real-time streaming
- Runs on all platforms
Performance:
- Generates ~2-3s audio per second on M1
- Suitable for real-time playback
- Lower memory requirements
Usage:
```swift
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))
let result = try await tts.generate(text: "Hello!")
```
Qwen3 TTS 1.7B
Model ID: qwen3TTS_1_7b
Features:
- 9 voices (same as 0.6B)
- 10 languages
- Style instructions (unique to 1.7B)
- Better prosody and naturalness
Performance:
- Generates ~1-2s audio per second on M1
- Requires more memory (~4 GB)
- macOS 15.0+ required
Usage:
```swift
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_1_7b))
var options = GenerationOptions()
options.instruction = "Speak warmly and slowly."
let result = try await tts.generate(
    text: "Hello!",
    options: options
)
```
TTSKit Voices
Both TTSKit models support the same 9 voices:
| Voice | Style | Best For |
|---|---|---|
| .ryan | Clear, professional | Business, narration |
| .aiden | Warm, friendly | Customer service |
| .onoAnna | Bright, energetic | Announcements |
| .sohee | Calm, soothing | Meditation, audiobooks |
| .eric | Deep, authoritative | News, presentations |
| .dylan | Young, casual | Social media, gaming |
| .serena | Elegant, refined | Luxury brands |
| .vivian | Confident, dynamic | Fitness, motivation |
| .uncleFu | Wise, mature | Storytelling, teaching |
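Voice selection presumably goes through the generation options. A sketch assuming a voice field on GenerationOptions (the field name is an assumption; only the voice identifiers come from the table above):

```swift
// Sketch: generate with a specific voice.
// options.voice is an assumed field name; .sohee is a voice ID from the table above.
let tts = try await TTSKit(TTSKitConfig(model: .qwen3TTS_0_6b))
var options = GenerationOptions()
options.voice = .sohee // calm, soothing: suits audiobook narration
let result = try await tts.generate(
    text: "Chapter one.",
    options: options
)
```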
TTSKit Languages
- English
- Chinese (Mandarin)
- Japanese
- Korean
- German
- French
- Russian
- Portuguese
- Spanish
- Italian
Model Download
Automatic Download
WhisperKit automatically downloads the recommended model on first use:
```swift
// Downloads default model for device
let pipe = try await WhisperKit()
```
Manual Download
Download specific models via CLI:
```shell
# Download single model
make download-model MODEL=large-v3

# Download all models
make download-models
```
Model Caching
Models are cached at:
- macOS: ~/.cache/whisperkit/
- iOS: the app's cache directory

To clear the cache:
```shell
rm -rf ~/.cache/whisperkit/
```
View Detailed Benchmarks
Compare performance across devices and models
Next Steps
Supported Devices
Check device compatibility
Benchmarks
Run performance tests
Quick Start
Start transcribing
Custom Models
Create fine-tuned models