Efficient memory management is critical when deploying large Whisper models on Apple devices. This guide covers prewarming, model caching, resource allocation, and strategies for minimizing memory footprint.

Understanding Memory Usage

WhisperKit’s memory footprint consists of:
  • Model Weights: 75 MB (tiny) to 3 GB (large-v3)
  • KV Cache: grows dynamically during decoding
  • Audio Buffers: mel spectrograms and extracted features

Model Size Reference

Model      Parameters   Disk Size   Memory (Loaded)
tiny       39M          ~75 MB      ~150 MB
base       74M          ~142 MB     ~280 MB
small      244M         ~466 MB     ~900 MB
medium     769M         ~1.5 GB     ~3 GB
large-v3   1550M        ~3 GB       ~6 GB
Loaded memory includes model weights, CoreML runtime overhead, and active computation buffers.
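
As a sketch of how to apply this table, a small helper can pick the largest model whose estimated footprint fits within a RAM budget. The names `approximateLoadedMB` and `largestModelThatFits` are illustrative, not part of the WhisperKit API, and the numbers are the estimates from the table above:

```swift
import Foundation

// Approximate loaded-memory footprints from the table above, in MB.
// These are estimates; actual usage varies by device and OS version.
let approximateLoadedMB: [String: UInt64] = [
    "tiny": 150, "base": 280, "small": 900, "medium": 3_000, "large-v3": 6_000
]

/// Picks the largest model whose estimated footprint fits within a fraction
/// of physical RAM. Hypothetical helper, not part of WhisperKit.
func largestModelThatFits(physicalMemoryBytes: UInt64,
                          budgetFraction: Double = 0.5) -> String? {
    let budgetMB = Double(physicalMemoryBytes) / 1024 / 1024 * budgetFraction
    return approximateLoadedMB
        .filter { Double($0.value) <= budgetMB }
        .max { $0.value < $1.value }?
        .key
}
```

At runtime you would pass `ProcessInfo.processInfo.physicalMemory`; the 50% default budget leaves headroom for the KV cache, audio buffers, and the rest of your app.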

Model Prewarming

What is Prewarming?

CoreML models are downloaded as device-agnostic .mlmodelc files and must be “specialized” (compiled) for your specific device chip before use. Apple caches these specialized models, but the cache is evicted:
  • After OS updates
  • When not used for extended periods
  • When system storage is low
Prewarming triggers this specialization ahead of time, loading one model at a time to minimize peak memory.

Configuration

import WhisperKit

// Enable prewarming
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true
)

let pipe = try await WhisperKit(config)

Prewarming Workflow

1. Load Model 1: load the audio encoder, specializing it if needed
2. Unload Model 1: immediately release the audio encoder’s memory
3. Load Model 2: load the text decoder, specializing it if needed
4. Unload Model 2: release the text decoder’s memory
5. Final Load: load all models together (now served from the cache)
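
WhisperKit performs these steps for you when `prewarm: true` is set; the sequential pattern itself can be sketched generically. The `PrewarmStage` type and `prewarmSequentially` function below are illustrative, not WhisperKit API:

```swift
// Illustrative sketch of the sequential prewarm pattern above.
struct PrewarmStage {
    let name: String
    let load: () -> Void    // triggers specialization on first load
    let unload: () -> Void  // releases memory before the next stage
}

func prewarmSequentially(_ stages: [PrewarmStage]) {
    // Load and unload one model at a time so only one occupies memory.
    for stage in stages {
        stage.load()
        stage.unload()
    }
    // Final pass: load everything together, now served from the cache.
    for stage in stages {
        stage.load()
    }
}
```

The key point is that peak memory never exceeds one model plus runtime overhead until the final pass, by which point specialization is already cached.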

Trade-offs

Pros:
  • Lower peak memory (1 model at a time)
  • Safer for background/low-memory apps
  • Prevents crashes on older devices
Cons:
  • Roughly 2x load time when the cache is already warm (each model loads twice)
  • Unnecessary overhead on repeat launches with a fresh cache

When to Use Prewarming

Use prewarming when:
  • Deploying on older iOS devices (iPhone 11 and earlier)
  • Using large models (medium, large-v3) on mobile
  • The app runs in the background or as an extension
  • Memory pressure warnings occur
  • It is the first launch after an OS update
Skip prewarming when:
  • Deploying on M1/M2/M3 Macs with ample RAM
  • Using small models (tiny, base)
  • Load time is critical (real-time apps)
  • Models are loaded once and stay cached

CoreML Model Cache

Cache Location

Apple maintains CoreML specialized model cache outside your app bundle:
~/Library/Caches/com.apple.CoreML/
This cache is:
  • Managed by the OS (you cannot directly control it)
  • Device-specific (different for each chip)
  • Persistent across app launches
  • Evicted unpredictably by the system

Checking Cache Status

No official API exists, but you can measure load time:
import WhisperKit

let start = Date()
let config = WhisperKitConfig(model: "large-v3", verbose: true)
let pipe = try await WhisperKit(config)
let elapsed = Date().timeIntervalSince(start)

print("Model load time: \(elapsed)s")
// < 2s = cache hit
// > 10s = cache miss (compilation)
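
The thresholds in the comments can be folded into a small heuristic. This is a sketch: the exact cutoffs vary by device and model size, and `cacheStatus` is a hypothetical helper, not WhisperKit API:

```swift
import Foundation

/// Rough cache-status heuristic using the load-time thresholds above.
enum SpecializationCacheStatus { case likelyHit, likelyMiss, inconclusive }

func cacheStatus(loadTime: TimeInterval) -> SpecializationCacheStatus {
    switch loadTime {
    case ..<2:   return .likelyHit    // fast: specialized model was reused
    case 10...:  return .likelyMiss   // slow: compilation likely ran
    default:     return .inconclusive // in between: no clear signal
    }
}
```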

Prefilling KV Cache

Accelerate decoding with prefilled key-value cache:
var options = DecodingOptions(
    language: "en",
    task: .transcribe,
    usePrefillCache: true,   // Enable prefill
    usePrefillPrompt: true   // Force initial tokens
)

let result = try await pipe.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
KV cache prefill reduces first-token latency by 2-3x by preloading common decoder states.

Resource Allocation

Compute Units and Memory

Different compute units have different memory characteristics:
CPU Only (.cpuOnly)
  • Memory: uses system RAM
  • Overhead: 200-400 MB
  • Best for: extreme memory constraints
GPU (.cpuAndGPU)
  • Memory: uses unified memory (shared with the CPU)
  • Overhead: 300-600 MB
  • Best for: balanced performance
Neural Engine (.cpuAndNeuralEngine)
  • Memory: uses dedicated ANE memory plus system RAM
  • Overhead: 400-800 MB
  • Best for: maximum speed on supported devices

Optimizing Compute Units for Memory

import CoreML
import WhisperKit

// Minimal memory configuration
let computeOptions = ModelComputeOptions(
    melCompute: .cpuOnly,           // Lightest
    audioEncoderCompute: .cpuOnly,  // Most memory-efficient
    textDecoderCompute: .cpuOnly,
    prefillCompute: .cpuOnly
)

let config = WhisperKitConfig(
    model: "tiny",  // Smallest model
    computeOptions: computeOptions,
    prewarm: true   // Lower peak memory
)

Memory Monitoring

Runtime Memory Tracking

import WhisperKit
import os

func logMemoryUsage(_ label: String) {
    var info = mach_task_basic_info()
    // Size of the struct in integer_t words (natural_t is 4 bytes)
    var count = mach_msg_type_number_t(
        MemoryLayout<mach_task_basic_info>.size / MemoryLayout<natural_t>.size)

    let result = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: Int(count)) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }

    if result == KERN_SUCCESS {
        let usedMB = Double(info.resident_size) / 1024 / 1024
        print("\(label): \(String(format: "%.2f", usedMB)) MB")
    }
}

// Usage
logMemoryUsage("Before init")
let pipe = try await WhisperKit()
logMemoryUsage("After init")

let result = try await pipe.transcribe(audioPath: "audio.wav")
logMemoryUsage("After transcription")

Xcode Instruments

1. Open Instruments: Product → Profile (⌘I) in Xcode
2. Select Allocations: choose the “Allocations” template
3. Record a session: run your app and transcribe audio
4. Analyze: look for peak memory usage, memory growth over time, and allocation backtraces

Strategies for Large Models

Model Splitting

Load encoder and decoder separately:
import WhisperKit

// Load only encoder first
let encoderConfig = WhisperKitConfig(
    modelFolder: "models/large-v3",
    load: false  // Don't load automatically
)
let pipe = try await WhisperKit(encoderConfig)

// Manually load encoder
try await pipe.loadModels()  // Load encoder only

// Process audio
let features = try await pipe.audioEncoder.encodeFeatures(...)

// Later, load decoder when needed
try await pipe.loadTextDecoder()

Lazy Loading

Defer model loading until needed:
import WhisperKit

let config = WhisperKitConfig(
    model: "large-v3",
    load: false,      // Don't load on init
    download: true    // But download if missing
)

let pipe = try await WhisperKit(config)
// Models downloaded but not loaded

// Load when user triggers transcription
button.action = {
    try await pipe.loadModels()
    let result = try await pipe.transcribe(audioPath: "audio.wav")
}

Unloading Models

Free memory after transcription:
import WhisperKit

var pipe: WhisperKit? = try await WhisperKit()

// Use model
let result = try await pipe?.transcribe(audioPath: "audio.wav")

// Release memory
pipe = nil
// Or explicit unload if keeping reference
// pipe.unloadModels()  // If implemented

Concurrent Processing

Worker Count and Memory

More workers = more memory:
var options = DecodingOptions()

// High memory available
options.concurrentWorkerCount = 16  // macOS with >16GB RAM

// Medium memory
options.concurrentWorkerCount = 8   // macOS with 8-16GB RAM

// Low memory
options.concurrentWorkerCount = 4   // iOS devices

// Minimal memory
options.concurrentWorkerCount = 1   // Sequential processing
Each concurrent worker may hold its own audio buffer and KV cache state. Start with defaults and adjust based on memory pressure.
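
The tiers above can be captured in a small helper. Treat this as a sketch with starting-point numbers, not WhisperKit API; validate the counts under Instruments on your target devices:

```swift
/// Suggested worker counts for the memory tiers above (hypothetical helper).
func suggestedWorkerCount(ramGB: Int, isMobile: Bool) -> Int {
    if isMobile { return 4 }     // iOS devices: keep concurrency low
    switch ramGB {
    case ..<8:   return 1        // minimal memory: sequential processing
    case 8...16: return 8        // medium memory
    default:     return 16       // macOS with more than 16 GB RAM
    }
}
```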

Sequential Processing

import WhisperKit

let pipe = try await WhisperKit(WhisperKitConfig(model: "large-v3"))

let audioFiles = ["file1.wav", "file2.wav", "file3.wav"]

// Process one at a time
for file in audioFiles {
    let result = try await pipe.transcribe(audioPath: file)
    print("\(file): \(result?.text ?? "")")
    
    // Optional: explicit cleanup
    // autoreleasepool { ... }
}

Audio Buffer Management

Chunking Strategy

VAD chunking reduces memory by processing smaller segments:
var options = DecodingOptions()
options.chunkingStrategy = .vad  // Voice Activity Detection

// VAD splits long audio into manageable chunks
// Smaller chunks = less memory per segment

Clip Timestamps

Manually segment long audio:
var options = DecodingOptions()
// Split 10-minute audio into 2-minute segments
options.clipTimestamps = [
    0.0,    // Start
    120.0,  // 2 min
    240.0,  // 4 min
    360.0,  // 6 min
    480.0,  // 8 min
    600.0   // 10 min (end)
]

let result = try await pipe.transcribe(
    audioPath: "long_audio.wav",
    decodeOptions: options
)
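
Hand-writing timestamp arrays gets tedious for long recordings; a small helper (illustrative, not part of WhisperKit) can generate evenly spaced clips:

```swift
/// Builds evenly spaced clip timestamps covering `duration` seconds,
/// producing the same layout as the example above. Hypothetical helper.
func evenClipTimestamps(duration: Float, segmentLength: Float) -> [Float] {
    guard duration > 0, segmentLength > 0 else { return [] }
    var stamps = Array(stride(from: Float(0), to: duration, by: segmentLength))
    stamps.append(duration)  // close the final segment at the end of the audio
    return stamps
}
```

For the example above, `evenClipTimestamps(duration: 600, segmentLength: 120)` yields the same six timestamps, and the result can be assigned directly to `options.clipTimestamps`.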

Platform-Specific Guidance

iOS Memory Limits

iOS has stricter memory limits than macOS:
// Use tiny or base models only
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndGPU,
        textDecoderCompute: .cpuOnly
    ),
    prewarm: true
)

macOS Memory Guidelines

8 GB RAM (M1/M2 base)
  • Recommended: small or distil-medium
  • Max: medium with prewarming
  • Avoid: large-v3 (may cause swapping)
16 GB RAM (M1 Pro/M2 Pro)
  • Recommended: medium or distil-large-v3
  • Max: large-v3 runs comfortably
  • Concurrent workers: 8-16
32+ GB RAM (M1 Max/M2 Max/M3 Max)
  • Any model, including large-v3
  • Multiple instances possible
  • Concurrent workers: 16+

Background Execution

App Extensions

// In app extension (widget, share extension, etc.)
import WhisperKit

// Use smallest model and enable prewarming
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        melCompute: .cpuOnly,
        audioEncoderCompute: .cpuOnly,
        textDecoderCompute: .cpuOnly,
        prefillCompute: .cpuOnly
    ),
    prewarm: true
)

let pipe = try await WhisperKit(config)

Background Tasks

import BackgroundTasks
import WhisperKit

var pipe: WhisperKit?

func scheduleTranscription() {
    let request = BGProcessingTaskRequest(identifier: "com.app.transcribe")
    request.requiresNetworkConnectivity = false
    request.requiresExternalPower = false  // Battery-friendly
    
    try? BGTaskScheduler.shared.submit(request)
}

// Register before the app finishes launching
BGTaskScheduler.shared.register(forTaskWithIdentifier: "com.app.transcribe", using: nil) { task in
    task.expirationHandler = {
        // Release model memory if the task is cut short
        pipe = nil
    }
    
    Task {
        // Use the smallest model for background work
        pipe = try await WhisperKit(WhisperKitConfig(
            model: "tiny",
            prewarm: true
        ))
        
        let result = try await pipe?.transcribe(audioPath: "audio.wav")
        // Save the result, then release the model
        pipe = nil
        
        task.setTaskCompleted(success: true)
    }
}

Troubleshooting Memory Issues

Crashes or memory warnings during model load:
  • Enable prewarming: prewarm: true
  • Use a smaller model: tiny or base
  • Switch to CPU-only: computeOptions with .cpuOnly
  • Close other apps to free memory
Memory grows during transcription:
  • Reduce concurrent workers: concurrentWorkerCount = 1
  • Enable VAD chunking: chunkingStrategy = .vad
  • Process files sequentially instead of in parallel
  • Use clip timestamps to segment long audio
Transcription slows down over time (likely memory pressure causing throttling):
  • Monitor with Xcode Instruments
  • Explicitly release unused references
  • Consider a smaller model or a reduced worker count
Models recompile on load (cache being evicted):
  • Check available disk space (the cache requires roughly 2x the model size)
  • Verify the OS version (cache behavior varies)
  • Consider bundling pre-compiled models (advanced)
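
As a sketch of the disk-space check, using standard Foundation APIs (the 2x multiplier follows the cache note above; the helper name is illustrative):

```swift
import Foundation

// Check free space on the home volume before relying on the
// CoreML specialization cache surviving eviction.
func freeDiskSpaceBytes() -> Int64? {
    let home = URL(fileURLWithPath: NSHomeDirectory())
    let values = try? home.resourceValues(
        forKeys: [.volumeAvailableCapacityForImportantUsageKey])
    return values?.volumeAvailableCapacityForImportantUsage
}

// large-v3 is ~3 GB on disk; allow roughly 2x for the compiled cache.
let requiredBytes: Int64 = 2 * 3 * 1024 * 1024 * 1024
if let free = freeDiskSpaceBytes(), free < requiredBytes {
    print("Low disk space: the specialized model cache may be evicted")
}
```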

Best Practices Summary

1. Profile first: use Xcode Instruments to establish a baseline for memory usage
2. Choose an appropriate model: match model size to device capabilities and requirements
3. Enable prewarming on mobile: use prewarm: true on iOS devices
4. Optimize compute units: balance performance and memory with appropriate compute units
5. Limit concurrency: don’t exceed the recommended worker counts for your platform
6. Use chunking: enable VAD for long audio files
7. Monitor in production: log memory metrics and watch for pressure warnings
8. Test on the oldest devices: ensure the app works on minimum supported hardware

Next Steps

  • Performance Optimization: optimize speed and quality
  • Custom Models: deploy optimized custom models
