Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/argmaxinc/WhisperKit/llms.txt

Use this file to discover all available pages before exploring further.

Optimizing WhisperKit performance involves balancing speed, accuracy, and resource usage. This guide covers compute units, model selection, decoding options, and platform-specific optimizations.

Compute Units

CoreML models can target different hardware accelerators on Apple devices:

Neural Engine

Specialized ML accelerator (fastest, most efficient)

GPU

Graphics processor (good balance)

CPU

Central processor (most compatible)

ModelComputeOptions

Configure compute units for each model component:
import WhisperKit
import CoreML

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,              // Mel spectrogram extraction
    audioEncoderCompute: .cpuAndNeuralEngine,  // Audio encoder
    textDecoderCompute: .cpuAndNeuralEngine,   // Text decoder
    prefillCompute: .cpuOnly             // KV cache prefill
)

let config = WhisperKitConfig(
    model: "large-v3",
    computeOptions: computeOptions
)

let pipe = try await WhisperKit(config)

Available Compute Units

.cpuOnly
MLComputeUnits
CPU only - most compatible, slowest
.cpuAndGPU
MLComputeUnits
CPU and GPU - good for macOS < 14
.cpuAndNeuralEngine
MLComputeUnits
CPU and Neural Engine - recommended for macOS 14+, iOS 17+
.all
MLComputeUnits
All available compute units - lets CoreML decide

CLI Configuration

Set compute units via command-line:
swift run whisperkit-cli transcribe \
  --audio-path "audio.wav" \
  --model-path "Models/whisperkit-coreml/openai_whisper-large-v3" \
  --audio-encoder-compute-units cpuAndNeuralEngine \
  --text-decoder-compute-units cpuAndNeuralEngine

Model Selection

Model size significantly impacts speed and accuracy:
tiny
39M parameters
Speed: ~32x real-time | WER: ~10-15% | Size: ~75 MBBest for: Real-time applications, low-end devices, quick prototyping
base
74M parameters
Speed: ~16x real-time | WER: ~8-12% | Size: ~142 MBBest for: Balanced speed/quality, general transcription
small
244M parameters
Speed: ~6x real-time | WER: ~5-8% | Size: ~466 MBBest for: Production applications requiring accuracy
medium
769M parameters
Speed: ~2x real-time | WER: ~4-6% | Size: ~1.5 GBBest for: High-accuracy requirements, offline processing
large-v3
1550M parameters
Speed: ~1x real-time | WER: ~3-5% | Size: ~3 GBBest for: Maximum accuracy, multilingual, post-processing

Distil Models

Distilled models offer 2-3x speedup with minimal accuracy loss:
let config = WhisperKitConfig(
    model: "distil*large-v3",  // Glob pattern
    modelRepo: "argmaxinc/whisperkit-coreml"
)
Benchmarks: WhisperKit Benchmarks

Decoding Options

Temperature and Sampling

Control randomness and diversity:
var options = DecodingOptions()
options.temperature = 0.0  // Deterministic (default)
options.topK = 5           // Top-K sampling candidates
temperature
Float
default:"0.0"
  • 0.0: Deterministic (always pick most likely token)
  • 0.0 - 1.0: Increasing randomness
  • Higher values = more creative but less accurate
topK
Int
default:"5"
Number of top candidates to sample from when temperature > 0

Fallback Strategy

Automatically retry failed segments:
var options = DecodingOptions()
options.temperatureIncrementOnFallback = 0.2  // Increment per retry
options.temperatureFallbackCount = 5          // Max retries

Quality Thresholds

Detect and reject poor transcriptions:
var options = DecodingOptions()
options.compressionRatioThreshold = 2.4       // Detect repetition
options.logProbThreshold = -1.0               // Average confidence
options.firstTokenLogProbThreshold = -1.5     // First token confidence
options.noSpeechThreshold = 0.6               // Silence detection
Lower thresholds = more rejections = better quality but longer processingDefault values work well for most use cases.

Timestamp Control

var options = DecodingOptions()
options.withoutTimestamps = false      // Include timestamps
options.wordTimestamps = true          // Word-level timing
options.maxInitialTimestamp = 1.0      // First timestamp limit
Word timestamps require ~10-15% more processing time.

Parallel Processing

Concurrent Workers

Process multiple audio segments in parallel:
var options = DecodingOptions()

#if os(macOS)
options.concurrentWorkerCount = 16  // Default on macOS
#else
options.concurrentWorkerCount = 4   // Default on iOS/iPadOS
#endif
iOS devices show regression with >4 workers. macOS handles 16+ workers efficiently.

Chunking Strategy

var options = DecodingOptions()
options.chunkingStrategy = .vad  // Voice Activity Detection
// or
options.chunkingStrategy = .none  // Process entire audio
Voice Activity Detection (VAD):
  • Automatically splits audio at silence
  • Reduces unnecessary processing
  • Better for long audio with pauses
  • Slightly slower initialization
No Chunking:
  • Process entire audio file
  • Faster for short clips
  • May hit token limits on very long audio

Prefill Optimization

KV Cache Prefill

Accelerate initial decoding with cached key-value pairs:
var options = DecodingOptions()
options.usePrefillCache = true   // Use prefilled KV cache (faster)
options.usePrefillPrompt = true  // Force initial prompt tokens
Prefill reduces first-token latency by 2-3x but requires compatible models with prefill data.

Language and Task Prefill

var options = DecodingOptions(
    task: .transcribe,
    language: "en",
    usePrefillPrompt: true,
    usePrefillCache: true
)
Automatic prefill when:
  • Language is specified
  • Task is .translate
  • Custom prompt tokens provided

Memory Management

Model Prewarming

Reduce peak memory during model loading:
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true  // Load-unload-load pattern
)
CoreML models need “specialization” on first load (compiling for your device). This specialized cache is maintained by Apple but evicted after OS updates.With prewarm=true:
  • Models loaded sequentially
  • Each model unloaded after specialization
  • Lower peak memory (1 model at a time)
  • 2x longer load time if cache is hit
With prewarm=false (default):
  • Models loaded in parallel
  • Higher peak memory (all models + compilation)
  • Faster load time when cache is hit
Enable when: Minimizing peak memory is critical (older devices, background apps)Disable when: Load time is critical (real-time apps, foreground processing)
let config = WhisperKitConfig(
    model: "large-v3",
    prewarm: true,
    verbose: true  // See timing breakdown
)

let pipe = try await WhisperKit(config)
// Logs: prewarmLoadTime, modelLoading, encoderLoadTime, decoderLoadTime
See Memory Management for detailed strategies.

Performance Metrics

Real-Time Factor

let result = try await pipe.transcribe(audioPath: "audio.wav")
let rtf = result?.timings.realTimeFactor ?? 0

print("Real-time factor: \(rtf)")
// 0.25 = 4x faster than real-time
// 1.0 = same speed as audio duration
// 2.0 = 2x slower than real-time

Tokens Per Second

let tps = result?.timings.tokensPerSecond ?? 0
print("Tokens per second: \(tps)")
// Higher is better (typically 50-200 for large models)

Speed Factor

let speedFactor = result?.timings.speedFactor ?? 0
print("Speed factor: \(speedFactor)x")
// Inverse of real-time factor
// 4.0 = 4x faster than real-time

Platform-Specific Tips

macOS

M1/M2/M3 Optimization

let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine,
    prefillCompute: .cpuOnly
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 16
decodingOptions.chunkingStrategy = .vad

iOS/iPadOS

iPhone/iPad Optimization

// Use smaller models on mobile
let config = WhisperKitConfig(
    model: "small",  // or "distil*small"
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine
    )
)

var decodingOptions = DecodingOptions()
decodingOptions.concurrentWorkerCount = 4  // Don't exceed 4

watchOS

WhisperKit on watchOS requires tiny/base models with CPU-only compute units due to memory constraints.

Benchmarking

Measure performance on your target devices:
make benchmark-devices
View results: See BENCHMARKS.md for details.

Optimization Checklist

1

Choose the right model

Start with small for balance, tiny for speed, large-v3 for accuracy
2

Configure compute units

Use .cpuAndNeuralEngine on macOS 14+ and iOS 17+
3

Enable prefill caching

Set usePrefillCache = true for faster first-token generation
4

Adjust concurrent workers

16 on macOS, 4 on iOS for optimal parallelism
5

Use VAD chunking

Enable chunkingStrategy = .vad for long audio files
6

Tune quality thresholds

Lower thresholds for better quality, higher for speed
7

Monitor metrics

Track real-time factor and tokens per second
8

Test on target devices

Always benchmark on actual hardware

Next Steps

Memory Management

Advanced memory optimization strategies

Custom Models

Fine-tune models for your domain

Build docs developers (and LLMs) love