Efficient memory management is critical when deploying large Whisper models on Apple devices. This guide covers prewarming, model caching, resource allocation, and strategies for minimizing memory footprint.
Understanding Memory Usage
WhisperKit’s memory footprint consists of:

- Model weights: 75 MB (tiny) to 3 GB (large-v3)
- KV cache: dynamic, grows during decoding
- Audio buffers: mel spectrograms and encoder features
Model Size Reference
| Model | Parameters | Disk Size | Memory (Loaded) |
|---|---|---|---|
| tiny | 39M | ~75 MB | ~150 MB |
| base | 74M | ~142 MB | ~280 MB |
| small | 244M | ~466 MB | ~900 MB |
| medium | 769M | ~1.5 GB | ~3 GB |
| large-v3 | 1550M | ~3 GB | ~6 GB |
Loaded memory includes model weights, CoreML runtime overhead, and active computation buffers.
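As a rough rule of thumb, you can map physical RAM to the largest comfortable model from the table above. The helper below is a sketch, not a WhisperKit API — the `recommendedModel` name and the thresholds (which aim to keep the loaded footprint well under half of RAM) are our own:

```swift
import Foundation

/// Rough mapping from physical RAM to the largest model whose loaded
/// footprint (per the table above) stays well under half of memory.
/// Thresholds are illustrative, not part of WhisperKit.
func recommendedModel(forMemoryBytes bytes: UInt64) -> String {
    let gb = Double(bytes) / 1_073_741_824
    switch gb {
    case ..<4:  return "tiny"     // ~150 MB loaded
    case ..<6:  return "base"     // ~280 MB loaded
    case ..<8:  return "small"    // ~900 MB loaded
    case ..<16: return "medium"   // ~3 GB loaded
    default:    return "large-v3" // ~6 GB loaded
    }
}

// Example: query the current device
let model = recommendedModel(forMemoryBytes: ProcessInfo.processInfo.physicalMemory)
print("Recommended model: \(model)")
```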
Model Prewarming
What is Prewarming?
CoreML models are downloaded as device-agnostic .mlmodelc files and must be “specialized” (compiled) for your specific device chip before use. Apple caches these specialized models, but the cache is evicted:

- After OS updates
- When the model is not used for extended periods
- When system storage is low
Prewarming triggers specialization sequentially to minimize peak memory.
Configuration
```swift
import WhisperKit

// Enable prewarming
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true
)
let pipe = try await WhisperKit(config)
```
Prewarming Workflow
1. Load model 1: load the audio encoder, triggering specialization if needed
2. Unload model 1: immediately release the audio encoder’s memory
3. Load model 2: load the text decoder, triggering specialization if needed
4. Unload model 2: release the text decoder’s memory
5. Final load: load all models together (now served from the cache)
Trade-offs
prewarm: true

- Pros: lower peak memory (one model compiled at a time); safer for background/low-memory apps; prevents crashes on older devices
- Cons: roughly 2x longer load time when the cache is already warm (models are loaded twice); unnecessary overhead if the cache is fresh

prewarm: false (default)

- Pros: faster load time when the cache is hit; models loaded in parallel; better for real-time applications
- Cons: higher peak memory during compilation; may cause memory pressure on older devices; can trigger system warnings
When to Use Prewarming
Enable prewarming (prewarm: true) when:

- Deploying on older iOS devices (iPhone 11 and earlier)
- Using large models (medium, large-v3) on mobile
- The app runs in the background or as an extension
- Memory pressure warnings occur
- Launching for the first time after an OS update

Skip prewarming when:

- Deploying on M1/M2/M3 Macs with ample RAM
- Using small models (tiny, base)
- Load time is critical (real-time apps)
- Models are loaded once and cached
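The checklists above can be folded into a small decision helper. This is our own sketch — the function name, parameters, and thresholds are assumptions for illustration, not part of WhisperKit:

```swift
import Foundation

/// Decide whether to enable `prewarm`, following the checklists above.
/// All names and thresholds here are illustrative.
func shouldPrewarm(
    model: String,
    physicalMemoryGB: Double,
    isExtension: Bool,
    isRealTime: Bool
) -> Bool {
    // App extensions and other background contexts: always prewarm.
    if isExtension { return true }
    // Real-time apps prioritize load time over peak memory.
    if isRealTime { return false }
    // Large models on memory-constrained devices: prewarm.
    let largeModels = ["medium", "large-v2", "large-v3"]
    if largeModels.contains(model) && physicalMemoryGB < 8 { return true }
    // Small models, or ample RAM: skip the extra load passes.
    return false
}
```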
CoreML Model Cache
Cache Location
Apple maintains the CoreML specialized model cache outside your app bundle:

`~/Library/Caches/com.apple.CoreML/`

This cache is:

- Managed by the OS (you cannot directly control it)
- Device-specific (different for each chip)
- Persistent across app launches
- Evicted unpredictably by the system
Checking Cache Status
No official API exists, but you can measure load time:
```swift
import WhisperKit

let start = Date()
let config = WhisperKitConfig(model: "large-v3", verbose: true)
let pipe = try await WhisperKit(config)
let elapsed = Date().timeIntervalSince(start)
print("Model load time: \(elapsed) s")
// < 2 s  = cache hit
// > 10 s = cache miss (compilation)
```
Prefilling KV Cache
Accelerate decoding with prefilled key-value cache:
```swift
var options = DecodingOptions(
    language: "en",
    task: .transcribe,
    usePrefillCache: true,  // Enable prefill
    usePrefillPrompt: true  // Force initial tokens
)

let result = try await pipe.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
```
KV cache prefill reduces first-token latency by 2-3x by preloading common decoder states.
Resource Allocation
Compute Units and Memory
Different compute units have different memory characteristics:

CPU only (.cpuOnly)

- Memory: uses system RAM
- Overhead: 200-400 MB
- Best for: extreme memory constraints

CPU + GPU (.cpuAndGPU)

- Memory: uses unified memory (shared with the CPU)
- Overhead: 300-600 MB
- Best for: balanced performance

CPU + Neural Engine (.cpuAndNeuralEngine)

- Memory: uses dedicated ANE memory plus system RAM
- Overhead: 400-800 MB
- Best for: maximum speed on supported devices
Optimizing Compute Units for Memory
```swift
import CoreML
import WhisperKit

// Minimal-memory configuration
let computeOptions = ModelComputeOptions(
    melCompute: .cpuOnly,          // Lightest
    audioEncoderCompute: .cpuOnly, // Most memory-efficient
    textDecoderCompute: .cpuOnly,
    prefillCompute: .cpuOnly
)

let config = WhisperKitConfig(
    model: "tiny",                 // Smallest model
    computeOptions: computeOptions,
    prewarm: true                  // Lower peak memory
)
```
Memory Monitoring
Runtime Memory Tracking
```swift
import WhisperKit
import os

func logMemoryUsage(_ label: String) {
    var info = mach_task_basic_info()
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size) / 4
    let result = withUnsafeMutablePointer(to: &info) {
        $0.withMemoryRebound(to: integer_t.self, capacity: 1) {
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), $0, &count)
        }
    }
    if result == KERN_SUCCESS {
        let usedMB = Double(info.resident_size) / 1024 / 1024
        print("\(label): \(String(format: "%.2f", usedMB)) MB")
    }
}

// Usage
logMemoryUsage("Before init")
let pipe = try await WhisperKit()
logMemoryUsage("After init")
let result = try await pipe.transcribe(audioPath: "audio.wav")
logMemoryUsage("After transcription")
```
Xcode Instruments
1. Open Instruments: Product → Profile (⌘I) in Xcode
2. Select Allocations: choose the “Allocations” template
3. Record a session: run your app and transcribe audio
4. Analyze: look for peak memory usage, memory growth over time, and allocation backtraces
Strategies for Large Models
Model Splitting
Load encoder and decoder separately:
```swift
import WhisperKit

// Load only the encoder first
let encoderConfig = WhisperKitConfig(
    modelFolder: "models/large-v3",
    load: false // Don't load automatically
)
let pipe = try await WhisperKit(encoderConfig)

// Manually load the encoder
try await pipe.loadModels() // Load encoder only

// Process audio
let features = try await pipe.audioEncoder.encodeFeatures(...)

// Later, load the decoder when needed
try await pipe.loadTextDecoder()
```
Lazy Loading
Defer model loading until needed:
```swift
import WhisperKit

let config = WhisperKitConfig(
    model: "large-v3",
    load: false,   // Don't load on init
    download: true // But download if missing
)
let pipe = try await WhisperKit(config)
// Models are downloaded but not loaded

// Load when the user triggers transcription
button.action = {
    Task {
        try await pipe.loadModels()
        let result = try await pipe.transcribe(audioPath: "audio.wav")
    }
}
```
Unloading Models
Free memory after transcription:
```swift
import WhisperKit

var pipe: WhisperKit? = try await WhisperKit()

// Use the model
let result = try await pipe?.transcribe(audioPath: "audio.wav")

// Release memory
pipe = nil

// Or unload explicitly if keeping the reference
// pipe?.unloadModels() // If available in your WhisperKit version
```
Concurrent Processing
Worker Count and Memory
More workers = more memory:
```swift
var options = DecodingOptions()

// High memory available
options.concurrentWorkerCount = 16 // macOS with >16 GB RAM

// Medium memory
options.concurrentWorkerCount = 8  // macOS with 8-16 GB RAM

// Low memory
options.concurrentWorkerCount = 4  // iOS devices

// Minimal memory
options.concurrentWorkerCount = 1  // Sequential processing
```
Each concurrent worker may hold its own audio buffer and KV cache state. Start with defaults and adjust based on memory pressure.
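The tiers above can be derived from physical RAM with a small helper. This is a sketch — the `workerCount` function and its thresholds are our own, not a WhisperKit API:

```swift
import Foundation

/// Map physical RAM to a conservative worker count, following
/// the tiers above. Illustrative thresholds only.
func workerCount(forMemoryBytes bytes: UInt64) -> Int {
    let gb = Double(bytes) / 1_073_741_824
    switch gb {
    case ..<6:  return 1   // Minimal memory: sequential
    case ..<8:  return 4   // iOS-class devices
    case ..<16: return 8   // 8-16 GB Macs
    default:    return 16  // >16 GB Macs
    }
}

let workers = workerCount(forMemoryBytes: ProcessInfo.processInfo.physicalMemory)
print("Suggested concurrentWorkerCount: \(workers)")
```

The result would then be assigned to `options.concurrentWorkerCount` as shown in the snippet above.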
Sequential Processing
```swift
import WhisperKit

let pipe = try await WhisperKit(WhisperKitConfig(model: "large-v3"))
let audioFiles = ["file1.wav", "file2.wav", "file3.wav"]

// Process one file at a time
for file in audioFiles {
    let result = try await pipe.transcribe(audioPath: file)
    print("\(file): \(result?.text ?? "")")
    // Optional: explicit cleanup
    // autoreleasepool { ... }
}
```
Audio Buffer Management
Chunking Strategy
VAD chunking reduces memory by processing smaller segments:
```swift
var options = DecodingOptions()
options.chunkingStrategy = .vad // Voice Activity Detection

// VAD splits long audio into manageable chunks;
// smaller chunks mean less memory per segment
```
Clip Timestamps
Manually segment long audio:
```swift
var options = DecodingOptions()

// Split 10-minute audio into 2-minute segments
options.clipTimestamps = [
    0.0,   // Start
    120.0, // 2 min
    240.0, // 4 min
    360.0, // 6 min
    480.0, // 8 min
    600.0  // 10 min (end)
]

let result = try await pipe.transcribe(
    audioPath: "long_audio.wav",
    decodeOptions: options
)
```
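Writing the boundaries by hand gets tedious for long recordings. A small helper can generate evenly spaced clip timestamps — this is our own sketch, not a WhisperKit API:

```swift
import Foundation

/// Generate clip timestamps at fixed intervals, always including
/// the start (0.0) and the end of the audio. Illustrative helper.
func clipTimestamps(duration: Double, segmentLength: Double) -> [Double] {
    guard duration > 0, segmentLength > 0 else { return [] }
    var stamps = Array(stride(from: 0.0, to: duration, by: segmentLength))
    stamps.append(duration) // Close the final segment
    return stamps
}

// 10 minutes split into 2-minute segments
let stamps = clipTimestamps(duration: 600, segmentLength: 120)
// [0.0, 120.0, 240.0, 360.0, 480.0, 600.0]
```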
iOS Memory Limits
iOS has stricter memory limits than macOS:
iPhone 11 and earlier:

```swift
// Use tiny or base models only
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndGPU,
        textDecoderCompute: .cpuOnly
    ),
    prewarm: true
)
```

iPhone 12-14:

```swift
// Small models work well
let config = WhisperKitConfig(
    model: "small",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    ),
    prewarm: true
)
```

iPhone 15+:

```swift
// Can handle medium models
let config = WhisperKitConfig(
    model: "medium",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    ),
    prewarm: false // Sufficient memory
)
```
macOS Memory Guidelines
8 GB Macs:

- Recommended: small or distil-medium
- Max: medium with prewarming
- Avoid: large-v3 (may cause swapping)

16 GB Macs:

- Recommended: medium or distil-large-v3
- Max: large-v3 (comfortably)
- Concurrent workers: 8-16

32 GB+ Macs:

- Use any model, including large-v3
- Multiple instances possible
- Concurrent workers: 16+
Background Execution
App Extensions
```swift
// In an app extension (widget, share extension, etc.)
import WhisperKit

// Use the smallest model and enable prewarming
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        melCompute: .cpuOnly,
        audioEncoderCompute: .cpuOnly,
        textDecoderCompute: .cpuOnly,
        prefillCompute: .cpuOnly
    ),
    prewarm: true
)
let pipe = try await WhisperKit(config)
```
Background Tasks
```swift
import BackgroundTasks
import WhisperKit

func scheduleTranscription() {
    let request = BGProcessingTaskRequest(identifier: "com.app.transcribe")
    request.requiresNetworkConnectivity = false
    request.requiresExternalPower = false // Battery-friendly
    try? BGTaskScheduler.shared.submit(request)
}

BGTaskScheduler.shared.register(forTaskWithIdentifier: "com.app.transcribe", using: nil) { task in
    var pipe: WhisperKit?
    task.expirationHandler = {
        // Clean up if the system cuts the task short
        pipe = nil
    }
    Task {
        // Use a small model for background work
        pipe = try await WhisperKit(WhisperKitConfig(
            model: "tiny",
            prewarm: true
        ))
        let result = try await pipe?.transcribe(audioPath: "audio.wav")
        // Save the result, then mark the task complete
        task.setTaskCompleted(success: true)
    }
}
```
Troubleshooting Memory Issues
App crashes on model load. Solutions:

- Enable prewarming: prewarm: true
- Use a smaller model: tiny or base
- Switch to CPU-only: computeOptions with .cpuOnly
- Close other apps to free memory

Memory warnings during transcription. Solutions:

- Reduce concurrent workers: concurrentWorkerCount = 1
- Enable VAD chunking: chunkingStrategy = .vad
- Process files sequentially instead of in parallel
- Use clip timestamps to segment long audio

Slow loads after the first transcription, or models recompiling frequently. The specialization cache is likely being evicted:

- Check available disk space (specialization requires ~2x the model size)
- Verify the OS version (cache behavior varies across releases)
- Consider bundling pre-compiled models (advanced)
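The first suggestion can be applied programmatically by comparing free disk space against roughly twice the model's on-disk size before loading. This is a sketch — the function names are ours, and the 2x factor mirrors the note above:

```swift
import Foundation

/// CoreML specialization needs scratch space of roughly 2x the
/// model's on-disk size (per the note above). Illustrative check.
func hasRoomToSpecialize(modelDiskBytes: Int64, freeDiskBytes: Int64) -> Bool {
    freeDiskBytes > modelDiskBytes * 2
}

// On Apple platforms, free space can be obtained via
// URL.resourceValues(forKeys: [.volumeAvailableCapacityForImportantUsageKey]).
// large-v3 is ~3 GB on disk, so require ~6 GB free before loading:
let ok = hasRoomToSpecialize(modelDiskBytes: 3_000_000_000,
                             freeDiskBytes: 7_000_000_000)
```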
Best Practices Summary
1. Profile first: use Xcode Instruments to establish a baseline memory usage
2. Choose an appropriate model: match model size to device capabilities and requirements
3. Enable prewarming on mobile: prefer prewarm: true on iOS devices, especially older ones
4. Optimize compute units: balance performance and memory with appropriate compute units
5. Limit concurrency: don't exceed the recommended worker counts for your platform
6. Use chunking: enable VAD for long audio files
7. Monitor in production: log memory metrics and watch for pressure warnings
8. Test on the oldest devices: ensure the app works on minimum supported hardware
Next Steps
- Performance Optimization: optimize speed and quality
- Custom Models: deploy optimized custom models