Optimizing WhisperKit performance involves balancing speed, accuracy, and resource usage. This guide covers compute units, model selection, decoding options, and platform-specific optimizations.

Compute Units

CoreML models can target different hardware accelerators on Apple devices:

Neural Engine

Specialized ML accelerator (fastest, most efficient)

GPU

Graphics processor (good balance)

CPU

Central processor (most compatible)

ModelComputeOptions

Configure compute units for each model component:
import WhisperKit
import CoreML

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,              // Mel spectrogram extraction
    audioEncoderCompute: .cpuAndNeuralEngine,  // Audio encoder
    textDecoderCompute: .cpuAndNeuralEngine,   // Text decoder
    prefillCompute: .cpuOnly             // KV cache prefill
)

let config = WhisperKitConfig(
    model: "large-v3",
    computeOptions: computeOptions
)

let pipe = try await WhisperKit(config)

Available Compute Units

All values are of type MLComputeUnits:

.cpuOnly: CPU only; most compatible, slowest
.cpuAndGPU: CPU and GPU; good for macOS < 14
.cpuAndNeuralEngine: CPU and Neural Engine; recommended for macOS 14+ and iOS 17+
.all: All available compute units; lets CoreML decide
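To tie the table together, one way to pick compute units at runtime is to branch on OS availability. This is a sketch; the helper name and the deployment-target checks are assumptions, but `MLComputeUnits` and `#available` are standard:

```swift
import CoreML

// Prefer the Neural Engine on OS versions where it is recommended,
// otherwise fall back to CPU + GPU.
func preferredComputeUnits() -> MLComputeUnits {
    if #available(macOS 14.0, iOS 17.0, *) {
        return .cpuAndNeuralEngine
    } else {
        return .cpuAndGPU
    }
}
```

The returned value can then be passed to the `ModelComputeOptions` fields shown above.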

CLI Configuration

Set compute units via command-line:
swift run whisperkit-cli transcribe \
  --audio-path "audio.wav" \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-encoder-compute-units cpuAndNeuralEngine \
  --text-decoder-compute-units cpuAndNeuralEngine

Model Selection

Model size significantly impacts speed and accuracy:
tiny (39M parameters)
Speed: ~32x real-time | WER: ~10-15% | Size: ~75 MB
Best for: Real-time applications, low-end devices, quick prototyping

base (74M parameters)
Speed: ~16x real-time | WER: ~8-12% | Size: ~142 MB
Best for: Balanced speed/quality, general transcription

small (244M parameters)
Speed: ~6x real-time | WER: ~5-8% | Size: ~466 MB
Best for: Production applications requiring accuracy

medium (769M parameters)
Speed: ~2x real-time | WER: ~4-6% | Size: ~1.5 GB
Best for: High-accuracy requirements, offline processing

large-v3 (1550M parameters)
Speed: ~1x real-time | WER: ~3-5% | Size: ~3 GB
Best for: Maximum accuracy, multilingual, post-processing

Distil Models

Distilled models offer 2-3x speedup with minimal accuracy loss:
let config = WhisperKitConfig(
    model: "distil*large-v3",  // Glob pattern
    modelRepo: "argmaxinc/whisperkit-coreml"
)
Benchmarks: WhisperKit Benchmarks

Decoding Options

Temperature and Sampling

Control randomness and diversity:
var options = DecodingOptions()
options.temperature = 0.0  // Deterministic (default)
options.topK = 5           // Top-K sampling candidates
temperature (Float, default: 0.0)
  • 0.0: Deterministic (always pick the most likely token)
  • 0.0 - 1.0: Increasing randomness
  • Higher values = more creative but less accurate

topK (Int, default: 5)
  Number of top candidates to sample from when temperature > 0

Fallback Strategy

Automatically retry failed segments:
var options = DecodingOptions()
options.temperatureIncrementOnFallback = 0.2  // Increment per retry
options.temperatureFallbackCount = 5          // Max retries

Quality Thresholds

Detect and reject poor transcriptions:
var options = DecodingOptions()
options.compressionRatioThreshold = 2.4       // Detect repetition
options.logProbThreshold = -1.0               // Average confidence
options.firstTokenLogProbThreshold = -1.5     // First token confidence
options.noSpeechThreshold = 0.6               // Silence detection
Lower thresholds = more rejections = better quality but longer processing. The default values work well for most use cases.

Timestamp Control

var options = DecodingOptions()
options.withoutTimestamps = false      // Include timestamps
options.wordTimestamps = true          // Word-level timing
options.maxInitialTimestamp = 1.0      // First timestamp limit
Word timestamps require ~10-15% more processing time.
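With word timestamps enabled, the per-word timing can be read off the transcription result. This is a sketch; the exact field names (`segments`, `words`, `start`, `end`) are assumptions based on WhisperKit's result types, so check them against your WhisperKit version:

```swift
import WhisperKit

// Sketch: print word-level timing from a transcription result,
// assuming wordTimestamps was enabled in DecodingOptions.
let result = try await pipe.transcribe(audioPath: "audio.wav", decodeOptions: options)
for segment in result?.segments ?? [] {
    for word in segment.words ?? [] {
        print("\(word.word): \(word.start)s - \(word.end)s")
    }
}
```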

Parallel Processing

Concurrent Workers

Process multiple audio segments in parallel:
var options = DecodingOptions()

#if os(macOS)
options.concurrentWorkerCount = 16  // Default on macOS
#else
options.concurrentWorkerCount = 4   // Default on iOS/iPadOS
#endif
iOS devices show performance regressions with more than 4 workers; macOS handles 16+ workers efficiently.

Chunking Strategy

var options = DecodingOptions()
options.chunkingStrategy = .vad  // Voice Activity Detection
// or
options.chunkingStrategy = .none  // Process entire audio
Voice Activity Detection (VAD):
  • Automatically splits audio at silence
  • Reduces unnecessary processing
  • Better for long audio with pauses
  • Slightly slower initialization
No Chunking:
  • Process entire audio file
  • Faster for short clips
  • May hit token limits on very long audio

Prefill Optimization

KV Cache Prefill

Accelerate initial decoding with cached key-value pairs:
var options = DecodingOptions()
options.usePrefillCache = true   // Use prefilled KV cache (faster)
options.usePrefillPrompt = true  // Force initial prompt tokens
Prefill reduces first-token latency by 2-3x but requires compatible models with prefill data.

Language and Task Prefill

var options = DecodingOptions(
    task: .transcribe,
    language: "en",
    usePrefillPrompt: true,
    usePrefillCache: true
)
Automatic prefill when:
  • Language is specified
  • Task is .translate
  • Custom prompt tokens provided

Memory Management

Model Prewarming

Reduce peak memory during model loading:
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true  // Load-unload-load pattern
)
CoreML models need “specialization” on first load (compiling for your device). This specialized cache is maintained by Apple but evicted after OS updates.

With prewarm=true:
  • Models loaded sequentially
  • Each model unloaded after specialization
  • Lower peak memory (1 model at a time)
  • 2x longer load time if cache is hit
With prewarm=false (default):
  • Models loaded in parallel
  • Higher peak memory (all models + compilation)
  • Faster load time when cache is hit
Enable when: Minimizing peak memory is critical (older devices, background apps)
Disable when: Load time is critical (real-time apps, foreground processing)
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true,
    verbose: true  // See timing breakdown
)

let pipe = try await WhisperKit(config)
// Logs: prewarmLoadTime, modelLoading, encoderLoadTime, decoderLoadTime
See Memory Management for detailed strategies.

Performance Metrics

Real-Time Factor

let result = try await pipe.transcribe(audioPath: "audio.wav")
let rtf = result?.timings.realTimeFactor ?? 0

print("Real-time factor: \(rtf)")
// 0.25 = 4x faster than real-time
// 1.0 = same speed as audio duration
// 2.0 = 2x slower than real-time

Tokens Per Second

let tps = result?.timings.tokensPerSecond ?? 0
print("Tokens per second: \(tps)")
// Higher is better (typically 50-200 for large models)

Speed Factor

let speedFactor = result?.timings.speedFactor ?? 0
print("Speed factor: \(speedFactor)x")
// Inverse of real-time factor
// 4.0 = 4x faster than real-time

Platform-Specific Tips

macOS

M1/M2/M3 Optimization

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 16
decodingOptions.chunkingStrategy = .vad

iOS/iPadOS

iPhone/iPad Optimization

// Use smaller models on mobile
let config = WhisperKitConfig(
    model: "small",  // or "distil*small"
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    )
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 4  // Don't exceed 4

watchOS

WhisperKit on watchOS requires tiny/base models with CPU-only compute units due to memory constraints.
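Following that constraint, a minimal watchOS-friendly configuration might look like the sketch below. The model and compute choices simply apply the limits stated above; this is not an officially documented watchOS recipe:

```swift
import WhisperKit

// watchOS sketch: smallest model, CPU-only compute for all components
// to stay within the platform's memory constraints.
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        melCompute: .cpuOnly,
        audioEncoderCompute: .cpuOnly,
        textDecoderCompute: .cpuOnly,
        prefillCompute: .cpuOnly
    )
)
```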

Benchmarking

Measure performance on your target devices:
make benchmark-devices
Results are documented in BENCHMARKS.md.

Optimization Checklist

1. Choose the right model: Start with small for balance, tiny for speed, large-v3 for accuracy
2. Configure compute units: Use .cpuAndNeuralEngine on macOS 14+ and iOS 17+
3. Enable prefill caching: Set usePrefillCache = true for faster first-token generation
4. Adjust concurrent workers: 16 on macOS, 4 on iOS for optimal parallelism
5. Use VAD chunking: Enable chunkingStrategy = .vad for long audio files
6. Tune quality thresholds: Lower thresholds for better quality, higher for speed
7. Monitor metrics: Track real-time factor and tokens per second
8. Test on target devices: Always benchmark on actual hardware
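As one way to combine the checklist on macOS, the sketch below applies items 1-5 and 7 using only the options shown earlier in this guide (the specific model choice and audio path are illustrative):

```swift
import WhisperKit

// Sketch: a macOS configuration applying the checklist above.
let config = WhisperKitConfig(
    model: "small",  // item 1: balanced starting point
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,  // item 2
        textDecoderCompute: .cpuAndNeuralEngine
    )
)
let pipe = try await WhisperKit(config)

var options = DecodingOptions()
options.usePrefillCache = true           // item 3
options.concurrentWorkerCount = 16       // item 4: macOS default
options.chunkingStrategy = .vad          // item 5

let result = try await pipe.transcribe(audioPath: "audio.wav", decodeOptions: options)
print("RTF: \(result?.timings.realTimeFactor ?? 0)")  // item 7
```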

Next Steps

Memory Management

Advanced memory optimization strategies

Custom Models

Fine-tune models for your domain
