Optimizing WhisperKit performance involves balancing speed, accuracy, and resource usage. This guide covers compute units, model selection, decoding options, and platform-specific optimizations.
Compute Units
CoreML models can target different hardware accelerators on Apple devices:
- **Neural Engine**: specialized ML accelerator (fastest, most efficient)
- **GPU**: graphics processor (good balance)
- **CPU**: central processor (most compatible)
ModelComputeOptions
Configure compute units for each model component:
```swift
import WhisperKit
import CoreML

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,                   // Mel spectrogram extraction
    audioEncoderCompute: .cpuAndNeuralEngine, // Audio encoder
    textDecoderCompute: .cpuAndNeuralEngine,  // Text decoder
    prefillCompute: .cpuOnly                  // KV cache prefill
)

let config = WhisperKitConfig(
    model: "large-v3",
    computeOptions: computeOptions
)

let pipe = try await WhisperKit(config)
```
Available Compute Units
- `.cpuOnly`: CPU only, most compatible, slowest
- `.cpuAndGPU`: CPU and GPU, good for macOS < 14
- `.cpuAndNeuralEngine`: CPU and Neural Engine, recommended for macOS 14+ and iOS 17+
- `.all`: all available compute units, lets CoreML decide
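Putting the guidance above together, one way to pick options at runtime is an availability check. This is a sketch using the `ModelComputeOptions` initializer shown in this guide, not a prescribed API pattern:

```swift
import WhisperKit

// Sketch: select compute units based on OS availability.
// Assumes the ModelComputeOptions initializer shown above.
func preferredComputeOptions() -> ModelComputeOptions {
    if #available(macOS 14.0, iOS 17.0, *) {
        // Neural Engine path for recent OS releases
        return ModelComputeOptions(
            melCompute: .cpuAndGPU,
            audioEncoderCompute: .cpuAndNeuralEngine,
            textDecoderCompute: .cpuAndNeuralEngine,
            prefillCompute: .cpuOnly
        )
    } else {
        // Older systems: keep the audio encoder on CPU+GPU
        return ModelComputeOptions(
            melCompute: .cpuAndGPU,
            audioEncoderCompute: .cpuAndGPU,
            textDecoderCompute: .cpuAndNeuralEngine,
            prefillCompute: .cpuOnly
        )
    }
}
```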
Recommended Configurations
**macOS 14+ (Recommended)**: best performance on M1/M2/M3 Macs.

```swift
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)
```

**macOS 13 and Earlier**: Neural Engine support is limited for the audio encoder.

```swift
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndGPU,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)
```

**iOS 18+**: optimized for iPhone/iPad.

```swift
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)
```

**Low Memory**: minimize memory usage.

```swift
let computeOptions = ModelComputeOptions(
    melCompute: .cpuOnly,
    audioEncoderCompute: .cpuOnly,
    textDecoderCompute: .cpuOnly,
    prefillCompute: .cpuOnly
)
```
CLI Configuration
Set compute units via command-line:
```sh
swift run whisperkit-cli transcribe \
  --audio-path "audio.wav" \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-encoder-compute-units cpuAndNeuralEngine \
  --text-decoder-compute-units cpuAndNeuralEngine
```
Model Selection
Model size significantly impacts speed and accuracy:
| Model | Speed | WER | Size | Best for |
|-------|-------|-----|------|----------|
| tiny | ~32x real-time | ~10-15% | ~75 MB | Real-time applications, low-end devices, quick prototyping |
| base | ~16x real-time | ~8-12% | ~142 MB | Balanced speed/quality, general transcription |
| small | ~6x real-time | ~5-8% | ~466 MB | Production applications requiring accuracy |
| medium | ~2x real-time | ~4-6% | ~1.5 GB | High-accuracy requirements, offline processing |
| large-v3 | ~1x real-time | ~3-5% | ~3 GB | Maximum accuracy, multilingual, post-processing |
Distil Models
Distilled models offer 2-3x speedup with minimal accuracy loss:
```swift
let config = WhisperKitConfig(
    model: "distil*large-v3", // Glob pattern
    modelRepo: "argmaxinc/whisperkit-coreml"
)
```
Benchmarks: see the WhisperKit Benchmarks page.
Decoding Options
Temperature and Sampling
Control randomness and diversity:
```swift
var options = DecodingOptions()
options.temperature = 0.0 // Deterministic (default)
options.topK = 5          // Top-K sampling candidates
```

`temperature`:
- 0.0: deterministic (always pick the most likely token)
- 0.0-1.0: increasing randomness; higher values are more creative but less accurate

`topK`: the number of top candidates to sample from when temperature > 0.
Fallback Strategy
Automatically retry failed segments:
```swift
var options = DecodingOptions()
options.temperatureIncrementOnFallback = 0.2 // Increment per retry
options.temperatureFallbackCount = 5         // Max retries
```
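The retry schedule these two options imply can be sketched without touching the WhisperKit API. The helper below is illustrative (not a library function); it just enumerates the temperatures the decoder would try:

```swift
// Illustrative: the temperature schedule implied by a base temperature,
// temperatureIncrementOnFallback, and temperatureFallbackCount.
func fallbackTemperatures(base: Double = 0.0,
                          increment: Double = 0.2,
                          maxFallbacks: Int = 5) -> [Double] {
    // One initial attempt at `base`, then one entry per retry.
    (0...maxFallbacks).map { base + Double($0) * increment }
}

// With the defaults above, decoding is attempted at temperatures
// 0.0, 0.2, 0.4, 0.6, 0.8, 1.0 before the segment is given up on.
```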
Quality Thresholds
Detect and reject poor transcriptions:
```swift
var options = DecodingOptions()
options.compressionRatioThreshold = 2.4  // Detect repetition
options.logProbThreshold = -1.0          // Average confidence
options.firstTokenLogProbThreshold = -1.5 // First token confidence
options.noSpeechThreshold = 0.6          // Silence detection
```
Lower thresholds mean more rejections: better quality but longer processing. The default values work well for most use cases.
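To see why `compressionRatioThreshold` catches repetition: Whisper-style decoders compare the text's length against its zlib-compressed length, because repetitive output compresses unusually well. As a dependency-free stand-in (not WhisperKit's actual computation), the sketch below scores repetitiveness as total words over unique words:

```swift
// Illustrative stand-in for the compression-ratio check: highly
// repetitive text yields a high score, analogous to repetitive text
// compressing well and pushing the real ratio past ~2.4.
func repetitionScore(_ text: String) -> Double {
    let words = text.lowercased().split(separator: " ")
    guard !words.isEmpty else { return 1.0 }
    return Double(words.count) / Double(Set(words).count)
}

// repetitionScore("the quick brown fox")           -> 1.0 (no repetition)
// repetitionScore("okay okay okay okay okay okay") -> 6.0 (flag for rejection)
```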
Timestamp Control
```swift
var options = DecodingOptions()
options.withoutTimestamps = false // Include timestamps
options.wordTimestamps = true     // Word-level timing
options.maxInitialTimestamp = 1.0 // First timestamp limit
```
Word timestamps require ~10-15% more processing time.
Parallel Processing
Concurrent Workers
Process multiple audio segments in parallel:
```swift
var options = DecodingOptions()
#if os(macOS)
options.concurrentWorkerCount = 16 // Default on macOS
#else
options.concurrentWorkerCount = 4  // Default on iOS/iPadOS
#endif
```
iOS devices show regression with >4 workers. macOS handles 16+ workers efficiently.
Chunking Strategy
```swift
var options = DecodingOptions()
options.chunkingStrategy = .vad  // Voice Activity Detection
// or
options.chunkingStrategy = .none // Process entire audio
```
Voice Activity Detection (VAD):
- Automatically splits audio at silence
- Reduces unnecessary processing
- Better for long audio with pauses
- Slightly slower initialization

No Chunking:
- Processes the entire audio file at once
- Faster for short clips
- May hit token limits on very long audio
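The intuition behind VAD chunking can be shown with a minimal energy-based splitter. This is an illustrative sketch, not WhisperKit's VAD implementation:

```swift
// Illustrative energy-based VAD: a frame counts as speech when its RMS
// energy exceeds a threshold; chunk boundaries then fall in silent runs.
func speechFrames(samples: [Float], frameSize: Int, threshold: Float) -> [Bool] {
    stride(from: 0, to: samples.count, by: frameSize).map { start in
        let frame = samples[start..<min(start + frameSize, samples.count)]
        let rms = (frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count)).squareRoot()
        return rms > threshold
    }
}

// A synthetic signal: loud, silent, loud.
let audio = [Float](repeating: 0.5, count: 1600)
          + [Float](repeating: 0.0, count: 1600)
          + [Float](repeating: 0.5, count: 1600)
let flags = speechFrames(samples: audio, frameSize: 160, threshold: 0.01)
// flags: 10 speech frames, 10 silent frames, 10 speech frames
```

A real VAD smooths these per-frame decisions (hangover, minimum segment length) before splitting, but the principle is the same: cut where there is no speech energy.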
Prefill Optimization
KV Cache Prefill
Accelerate initial decoding with cached key-value pairs:
```swift
var options = DecodingOptions()
options.usePrefillCache = true  // Use prefilled KV cache (faster)
options.usePrefillPrompt = true // Force initial prompt tokens
```
Prefill reduces first-token latency by 2-3x but requires compatible models with prefill data.
Language and Task Prefill
```swift
var options = DecodingOptions(
    task: .transcribe,
    language: "en",
    usePrefillPrompt: true,
    usePrefillCache: true
)
```
Prefill is applied automatically when:
- a language is specified
- the task is `.translate`
- custom prompt tokens are provided
Memory Management
Model Prewarming
Reduce peak memory during model loading:
```swift
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true // Load-unload-load pattern
)
```
CoreML models need “specialization” on first load (the model is compiled for your specific device). This specialized cache is maintained by Apple but evicted after OS updates.

With `prewarm: true`:
- Models are loaded sequentially
- Each model is unloaded after specialization
- Lower peak memory (one model at a time)
- 2x longer load time if the cache is hit

With `prewarm: false` (default):
- Models are loaded in parallel
- Higher peak memory (all models plus compilation)
- Faster load time when the cache is hit
**Enable when**: minimizing peak memory is critical (older devices, background apps).
**Disable when**: load time is critical (real-time apps, foreground processing).
```swift
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true,
    verbose: true // See timing breakdown
)

let pipe = try await WhisperKit(config)
// Logs: prewarmLoadTime, modelLoading, encoderLoadTime, decoderLoadTime
```
See Memory Management for detailed strategies.
Real-Time Factor
```swift
let result = try await pipe.transcribe(audioPath: "audio.wav")
let rtf = result?.timings.realTimeFactor ?? 0
print("Real-time factor: \(rtf)")
// 0.25 = 4x faster than real-time
// 1.0  = same speed as audio duration
// 2.0  = 2x slower than real-time
```
Tokens Per Second
```swift
let tps = result?.timings.tokensPerSecond ?? 0
print("Tokens per second: \(tps)")
// Higher is better (typically 50-200 for large models)
```
Speed Factor
```swift
let speedFactor = result?.timings.speedFactor ?? 0
print("Speed factor: \(speedFactor)x")
// Inverse of real-time factor
// 4.0 = 4x faster than real-time
```
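The three metrics are simple functions of the same raw measurements. The helper below (names hypothetical, not the WhisperKit `timings` struct) shows how they relate:

```swift
// Illustrative: how the reported metrics relate to raw measurements.
struct TranscriptionMetrics {
    let realTimeFactor: Double  // processing time / audio duration
    let speedFactor: Double     // audio duration / processing time
    let tokensPerSecond: Double // decoded tokens / processing time
}

func metrics(audioSeconds: Double, processingSeconds: Double, tokens: Int) -> TranscriptionMetrics {
    TranscriptionMetrics(
        realTimeFactor: processingSeconds / audioSeconds,
        speedFactor: audioSeconds / processingSeconds,
        tokensPerSecond: Double(tokens) / processingSeconds
    )
}

// 60 s of audio processed in 15 s with 900 decoded tokens:
let m = metrics(audioSeconds: 60, processingSeconds: 15, tokens: 900)
// m.realTimeFactor == 0.25, m.speedFactor == 4.0, m.tokensPerSecond == 60.0
```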
macOS
**M1/M2/M3 Optimization**

```swift
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 16
decodingOptions.chunkingStrategy = .vad
```
iOS/iPadOS
**iPhone/iPad Optimization**

```swift
// Use smaller models on mobile
let config = WhisperKitConfig(
    model: "small", // or "distil*small"
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    )
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 4 // Don't exceed 4
```
watchOS
WhisperKit on watchOS requires tiny/base models with CPU-only compute units due to memory constraints.
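A watchOS configuration along those lines might look like the following sketch, using the config API shown earlier (exact feasibility depends on the watch model):

```swift
// Sketch: watchOS constraints favor the smallest model on CPU only.
let config = WhisperKitConfig(
    model: "tiny", // tiny/base only on watchOS
    computeOptions: ModelComputeOptions(
        melCompute: .cpuOnly,
        audioEncoderCompute: .cpuOnly,
        textDecoderCompute: .cpuOnly,
        prefillCompute: .cpuOnly
    )
)
```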
Benchmarking
Measure performance on your target devices and compare results across runs. See BENCHMARKS.md for details.
Optimization Checklist
- **Choose the right model**: start with `small` for balance, `tiny` for speed, `large-v3` for accuracy
- **Configure compute units**: use `.cpuAndNeuralEngine` on macOS 14+ and iOS 17+
- **Enable prefill caching**: set `usePrefillCache = true` for faster first-token generation
- **Adjust concurrent workers**: 16 on macOS, 4 on iOS for optimal parallelism
- **Use VAD chunking**: enable `chunkingStrategy = .vad` for long audio files
- **Tune quality thresholds**: lower thresholds for better quality, higher for speed
- **Monitor metrics**: track real-time factor and tokens per second
- **Test on target devices**: always benchmark on actual hardware
Next Steps
- **Memory Management**: advanced memory optimization strategies
- **Custom Models**: fine-tune models for your domain