Efficient memory management is critical when deploying large Whisper models on Apple devices. This guide covers prewarming, model caching, resource allocation, and strategies for minimizing memory footprint.
Understanding Memory Usage
WhisperKit’s memory footprint consists of:

- Model weights: 75 MB (tiny) to 3 GB (large-v3)
- KV cache: dynamic, grows during decoding
- Audio buffers: mel spectrograms and encoder features
Model Size Reference
| Model | Parameters | Disk Size | Memory (Loaded) |
|---|---|---|---|
| tiny | 39M | ~75 MB | ~150 MB |
| base | 74M | ~142 MB | ~280 MB |
| small | 244M | ~466 MB | ~900 MB |
| medium | 769M | ~1.5 GB | ~3 GB |
| large-v3 | 1550M | ~3 GB | ~6 GB |
Loaded memory includes model weights, CoreML runtime overhead, and active computation buffers.
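As a rough rule of thumb, you can map physical RAM to the largest comfortable model from the table above. The helper below is a sketch, not a WhisperKit API — the `recommendedModel` name and the thresholds (which aim to keep the loaded footprint well under half of RAM) are our own:

```swift
import Foundation

/// Rough mapping from physical RAM to the largest model whose loaded
/// footprint (per the table above) stays well under half of memory.
/// Thresholds are illustrative, not part of WhisperKit.
func recommendedModel(forMemoryBytes bytes: UInt64) -> String {
    let gb = Double(bytes) / 1_073_741_824
    switch gb {
    case ..<4:  return "tiny"     // ~150 MB loaded
    case ..<6:  return "base"     // ~280 MB loaded
    case ..<8:  return "small"    // ~900 MB loaded
    case ..<16: return "medium"   // ~3 GB loaded
    default:    return "large-v3" // ~6 GB loaded
    }
}

// Example: query the current device
let model = recommendedModel(forMemoryBytes: ProcessInfo.processInfo.physicalMemory)
print("Recommended model: \(model)")
```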
Model Prewarming
What is Prewarming?
CoreML models are downloaded as device-agnostic .mlmodelc files and must be “specialized” (compiled) for your specific device chip before use. Apple caches these specialized models, but the cache is evicted:

- After OS updates
- When the model is not used for extended periods
- When system storage is low
Prewarming triggers specialization sequentially to minimize peak memory.
Configuration
```swift
import WhisperKit

// Enable prewarming
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true
)
let pipe = try await WhisperKit(config)
```
Prewarming Workflow
1. Load model 1: load the audio encoder, triggering specialization if needed
2. Unload model 1: immediately release the audio encoder’s memory
3. Load model 2: load the text decoder, triggering specialization if needed
4. Unload model 2: release the text decoder’s memory
5. Final load: load all models together (now served from the cache)
Trade-offs
prewarm: true

- Pros: lower peak memory (one model compiled at a time); safer for background/low-memory apps; prevents crashes on older devices
- Cons: roughly 2x longer load time when the cache is already warm (models are loaded twice); unnecessary overhead if the cache is fresh

prewarm: false (default)

- Pros: faster load time when the cache is hit; models loaded in parallel; better for real-time applications
- Cons: higher peak memory during compilation; may cause memory pressure on older devices; can trigger system warnings
When to Use Prewarming
Enable prewarming (prewarm: true) when:

- Deploying on older iOS devices (iPhone 11 and earlier)
- Using large models (medium, large-v3) on mobile
- The app runs in the background or as an extension
- Memory pressure warnings occur
- Launching for the first time after an OS update

Skip prewarming when:

- Deploying on M1/M2/M3 Macs with ample RAM
- Using small models (tiny, base)
- Load time is critical (real-time apps)
- Models are loaded once and cached
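The checklists above can be folded into a small decision helper. This is our own sketch — the function name, parameters, and thresholds are assumptions for illustration, not part of WhisperKit:

```swift
import Foundation

/// Decide whether to enable `prewarm`, following the checklists above.
/// All names and thresholds here are illustrative.
func shouldPrewarm(
    model: String,
    physicalMemoryGB: Double,
    isExtension: Bool,
    isRealTime: Bool
) -> Bool {
    // App extensions and other background contexts: always prewarm.
    if isExtension { return true }
    // Real-time apps prioritize load time over peak memory.
    if isRealTime { return false }
    // Large models on memory-constrained devices: prewarm.
    let largeModels = ["medium", "large-v2", "large-v3"]
    if largeModels.contains(model) && physicalMemoryGB < 8 { return true }
    // Small models, or ample RAM: skip the extra load passes.
    return false
}
```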
CoreML Model Cache
Cache Location
Apple maintains the CoreML specialized model cache outside your app bundle:

`~/Library/Caches/com.apple.CoreML/`

This cache is:

- Managed by the OS (you cannot directly control it)
- Device-specific (different for each chip)
- Persistent across app launches
- Evicted unpredictably by the system
Checking Cache Status
No official API exists, but you can measure load time:
```swift
import WhisperKit

let start = Date()
let config = WhisperKitConfig(model: "large-v3", verbose: true)
let pipe = try await WhisperKit(config)
let elapsed = Date().timeIntervalSince(start)
print("Model load time: \(elapsed) s")
// < 2 s  = cache hit
// > 10 s = cache miss (compilation)
```
Prefilling KV Cache
Accelerate decoding with prefilled key-value cache:
```swift
var options = DecodingOptions(
    language: "en",
    task: .transcribe,
    usePrefillCache: true,  // Enable prefill
    usePrefillPrompt: true  // Force initial tokens
)

let result = try await pipe.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
```
KV cache prefill reduces first-token latency by 2-3x by preloading common decoder states.
Resource Allocation
Compute Units and Memory
Different compute units have different memory characteristics:

CPU only (.cpuOnly)

- Memory: uses system RAM
- Overhead: 200-400 MB
- Best for: extreme memory constraints

CPU + GPU (.cpuAndGPU)

- Memory: uses unified memory (shared with the CPU)
- Overhead: 300-600 MB
- Best for: balanced performance

CPU + Neural Engine (.cpuAndNeuralEngine)

- Memory: uses dedicated ANE memory plus system RAM
- Overhead: 400-800 MB
- Best for: maximum speed on supported devices
Optimizing Compute Units for Memory
```swift
import CoreML
import WhisperKit

// Minimal-memory configuration
let computeOptions = ModelComputeOptions(
    melCompute: .cpuOnly,          // Lightest
    audioEncoderCompute: .cpuOnly, // Most memory-efficient
    textDecoderCompute: .cpuOnly,
    prefillCompute: .cpuOnly
)

let config = WhisperKitConfig(
    model: "tiny",                 // Smallest model
    computeOptions: computeOptions,
    prewarm: true                  // Lower peak memory
)
```
Memory Monitoring
Runtime Memory Tracking
```swift
import WhisperKit
import os

func logMemoryUsage(_ label: String) {
    var info = mach_task_basic_info()
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size) / 4
    let result = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: 1) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    if result == KERN_SUCCESS {
        let usedMB = Double(info.resident_size) / 1024 / 1024
        print("\(label): \(String(format: "%.2f", usedMB)) MB")
    }
}

// Usage
logMemoryUsage("Before init")
let pipe = try await WhisperKit()
logMemoryUsage("After init")
let result = try await pipe.transcribe(audioPath: "audio.wav")
logMemoryUsage("After transcription")
```
Xcode Instruments
1. Open Instruments: Product → Profile (⌘I) in Xcode
2. Select Allocations: choose the “Allocations” template
3. Record a session: run your app and transcribe audio
4. Analyze: look for peak memory usage, memory growth over time, and allocation backtraces
Strategies for Large Models
Model Splitting
Load encoder and decoder separately:
```swift
import WhisperKit

// Load only the encoder first
let encoderConfig = WhisperKitConfig(
    modelFolder: "models/large-v3",
    load: false // Don't load automatically
)
let pipe = try await WhisperKit(encoderConfig)

// Manually load the encoder
try await pipe.loadModels() // Load encoder only

// Process audio
let features = try await pipe.audioEncoder.encodeFeatures(...)

// Later, load the decoder when needed
try await pipe.loadTextDecoder()
```
Lazy Loading
Defer model loading until needed:
```swift
import WhisperKit

let config = WhisperKitConfig(
    model: "large-v3",
    load: false,   // Don't load on init
    download: true // But download if missing
)
let pipe = try await WhisperKit(config)
// Models are downloaded but not loaded

// Load when the user triggers transcription
button.action = {
    Task {
        try await pipe.loadModels()
        let result = try await pipe.transcribe(audioPath: "audio.wav")
    }
}
```
Unloading Models
Free memory after transcription:
```swift
import WhisperKit

var pipe: WhisperKit? = try await WhisperKit()

// Use the model
let result = try await pipe?.transcribe(audioPath: "audio.wav")

// Release memory
pipe = nil

// Or unload explicitly if keeping the reference
// pipe?.unloadModels() // If available in your WhisperKit version
```
Concurrent Processing
Worker Count and Memory
More workers = more memory:
```swift
var options = DecodingOptions()

// High memory available
options.concurrentWorkerCount = 16 // macOS with >16 GB RAM

// Medium memory
options.concurrentWorkerCount = 8  // macOS with 8-16 GB RAM

// Low memory
options.concurrentWorkerCount = 4  // iOS devices

// Minimal memory
options.concurrentWorkerCount = 1  // Sequential processing
```
Each concurrent worker may hold its own audio buffer and KV cache state. Start with defaults and adjust based on memory pressure.
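The tiers above can be derived from physical RAM with a small helper. This is a sketch — the `workerCount` function and its thresholds are our own, not a WhisperKit API:

```swift
import Foundation

/// Map physical RAM to a conservative worker count, following
/// the tiers above. Illustrative thresholds only.
func workerCount(forMemoryBytes bytes: UInt64) -> Int {
    let gb = Double(bytes) / 1_073_741_824
    switch gb {
    case ..<6:  return 1   // Minimal memory: sequential
    case ..<8:  return 4   // iOS-class devices
    case ..<16: return 8   // 8-16 GB Macs
    default:    return 16  // >16 GB Macs
    }
}

let workers = workerCount(forMemoryBytes: ProcessInfo.processInfo.physicalMemory)
print("Suggested concurrentWorkerCount: \(workers)")
```

The result would then be assigned to `options.concurrentWorkerCount` as shown in the snippet above.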
Sequential Processing
```swift
import WhisperKit

let pipe = try await WhisperKit(WhisperKitConfig(model: "large-v3"))
let audioFiles = ["file1.wav", "file2.wav", "file3.wav"]

// Process one file at a time
for file in audioFiles {
    let result = try await pipe.transcribe(audioPath: file)
    print("\(file): \(result?.text ?? "")")
    // Optional: explicit cleanup
    // autoreleasepool { ... }
}
```
Audio Buffer Management
Chunking Strategy
VAD chunking reduces memory by processing smaller segments:
```swift
var options = DecodingOptions()
options.chunkingStrategy = .vad // Voice Activity Detection

// VAD splits long audio into manageable chunks;
// smaller chunks mean less memory per segment
```
Clip Timestamps
Manually segment long audio:
```swift
var options = DecodingOptions()

// Split 10-minute audio into 2-minute segments
options.clipTimestamps = [
    0.0,   // Start
    120.0, // 2 min
    240.0, // 4 min
    360.0, // 6 min
    480.0, // 8 min
    600.0  // 10 min (end)
]

let result = try await pipe.transcribe(
    audioPath: "long_audio.wav",
    decodeOptions: options
)
```
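Writing the boundaries by hand gets tedious for long recordings. A small helper can generate evenly spaced clip timestamps — this is our own sketch, not a WhisperKit API:

```swift
import Foundation

/// Generate clip timestamps at fixed intervals, always including
/// the start (0.0) and the end of the audio. Illustrative helper.
func clipTimestamps(duration: Double, segmentLength: Double) -> [Double] {
    guard duration > 0, segmentLength > 0 else { return [] }
    var stamps = Array(stride(from: 0.0, to: duration, by: segmentLength))
    stamps.append(duration) // Close the final segment
    return stamps
}

// 10 minutes split into 2-minute segments
let stamps = clipTimestamps(duration: 600, segmentLength: 120)
// [0.0, 120.0, 240.0, 360.0, 480.0, 600.0]
```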
iOS Memory Limits
iOS has stricter memory limits than macOS:
iPhone 11 and earlier:

```swift
// Use tiny or base models only
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndGPU,
        textDecoderCompute: .cpuOnly
    ),
    prewarm: true
)
```

iPhone 12-14:

```swift
// Small models work well
let config = WhisperKitConfig(
    model: "small",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    ),
    prewarm: true
)
```

iPhone 15+:

```swift
// Can handle medium models
let config = WhisperKitConfig(
    model: "medium",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    ),
    prewarm: false // Sufficient memory
)
```
macOS Memory Guidelines
8 GB Macs:

- Recommended: small or distil-medium
- Max: medium with prewarming
- Avoid: large-v3 (may cause swapping)

16 GB Macs:

- Recommended: medium or distil-large-v3
- Max: large-v3 (comfortably)
- Concurrent workers: 8-16

32 GB+ Macs:

- Use any model, including large-v3
- Multiple instances possible
- Concurrent workers: 16+
Background Execution
App Extensions
```swift
// In an app extension (widget, share extension, etc.)
import WhisperKit

// Use the smallest model and enable prewarming
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        melCompute: .cpuOnly,
        audioEncoderCompute: .cpuOnly,
        textDecoderCompute: .cpuOnly,
        prefillCompute: .cpuOnly
    ),
    prewarm: true
)
let pipe = try await WhisperKit(config)
```
Background Tasks
```swift
import BackgroundTasks
import WhisperKit

func scheduleTranscription() {
    let request = BGProcessingTaskRequest(identifier: "com.app.transcribe")
    request.requiresNetworkConnectivity = false
    request.requiresExternalPower = false // Battery-friendly
    try? BGTaskScheduler.shared.submit(request)
}

BGTaskScheduler.shared.register(forTaskWithIdentifier: "com.app.transcribe", using: nil) { task in
    var pipe: WhisperKit?
    task.expirationHandler = {
        // Clean up if the system cuts the task short
        pipe = nil
    }
    Task {
        // Use a small model for background work
        pipe = try await WhisperKit(WhisperKitConfig(
            model: "tiny",
            prewarm: true
        ))
        let result = try await pipe?.transcribe(audioPath: "audio.wav")
        // Save the result, then mark the task complete
        task.setTaskCompleted(success: true)
    }
}
```
Troubleshooting Memory Issues
App crashes on model load. Solutions:

- Enable prewarming: prewarm: true
- Use a smaller model: tiny or base
- Switch to CPU-only: computeOptions with .cpuOnly
- Close other apps to free memory

Memory warnings during transcription. Solutions:

- Reduce concurrent workers: concurrentWorkerCount = 1
- Enable VAD chunking: chunkingStrategy = .vad
- Process files sequentially instead of in parallel
- Use clip timestamps to segment long audio

Slow loads after the first transcription, or models recompiling frequently. The specialization cache is likely being evicted:

- Check available disk space (specialization requires ~2x the model size)
- Verify the OS version (cache behavior varies across releases)
- Consider bundling pre-compiled models (advanced)
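The first suggestion can be applied programmatically by comparing free disk space against roughly twice the model's on-disk size before loading. This is a sketch — the function names are ours, and the 2x factor mirrors the note above:

```swift
import Foundation

/// CoreML specialization needs scratch space of roughly 2x the
/// model's on-disk size (per the note above). Illustrative check.
func hasRoomToSpecialize(modelDiskBytes: Int64, freeDiskBytes: Int64) -> Bool {
    freeDiskBytes > modelDiskBytes * 2
}

// On Apple platforms, free space can be obtained via
// URL.resourceValues(forKeys: [.volumeAvailableCapacityForImportantUsageKey]).
// large-v3 is ~3 GB on disk, so require ~6 GB free before loading:
let ok = hasRoomToSpecialize(modelDiskBytes: 3_000_000_000,
                             freeDiskBytes: 7_000_000_000)
```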
Best Practices Summary
1. Profile first: use Xcode Instruments to establish a baseline memory usage
2. Choose an appropriate model: match model size to device capabilities and requirements
3. Enable prewarming on mobile: prefer prewarm: true on iOS devices, especially older ones
4. Optimize compute units: balance performance and memory with appropriate compute units
5. Limit concurrency: don't exceed the recommended worker counts for your platform
6. Use chunking: enable VAD for long audio files
7. Monitor in production: log memory metrics and watch for pressure warnings
8. Test on the oldest devices: ensure the app works on minimum supported hardware
Next Steps
- Performance Optimization: optimize speed and quality
- Custom Models: deploy optimized custom models