Overview
The `WhisperKitConfig` class provides a comprehensive configuration interface for initializing WhisperKit. It allows you to customize model selection, download settings, compute options, audio processing, and logging behavior.
Class Definition
Initializer
Properties
Model Selection
Name of the Whisper model variant to use. Options include:
- "tiny" - Smallest, fastest model
- "tiny.en" - English-only tiny model
- "base" - Base model
- "base.en" - English-only base model
- "small" - Small model
- "small.en" - English-only small model
- "medium" - Medium model
- "medium.en" - English-only medium model
- "large" - Large model (multilingual)
- "large-v2" - Large v2 model
- "large-v3" - Large v3 model
Model Download Configuration
Base URL for downloading models. If not specified, uses the default Hugging Face Hub endpoint.
Repository identifier for downloading models. Default is "argmaxinc/whisperkit-coreml".
Authentication token for accessing the model repository, if required.
Custom Hugging Face Hub compatible endpoint URL for model downloads.
Local file system path to a folder containing pre-downloaded model files. If specified, WhisperKit will use these models instead of downloading.
Local file system path to a folder containing tokenizer files.
Compute Configuration
Configuration for ML compute units. Allows you to specify which hardware (CPU, GPU, Neural Engine) to use for each model component:
Audio Configuration
Configuration for audio input processing, including channel mode settings:
Custom Components
Custom audio processor implementation. If not provided, uses the default `AudioProcessor`.
Custom feature extractor implementation for converting audio to mel spectrograms.
Custom audio encoder implementation for encoding mel spectrograms to embeddings.
Custom text decoder implementation for generating text from embeddings.
Array of custom logits filters to apply during text decoding.
Custom segment seeker implementation for managing audio window processing.
Voice activity detector for intelligent audio chunking. See VoiceActivityDetector for details.
Logging Configuration
Enable verbose logging output for debugging and monitoring.
Maximum log level to display. Options:
- .debug - Detailed debug information
- .info - General information
- .error - Only errors
- .none - No logging
Performance Configuration
Enable model prewarming for reduced peak memory usage during initialization.
What is Prewarming?
WhisperKit uses Core ML models that need to be “specialized” to your device’s chip before use. This specialization happens automatically on first load. The resulting specialized models are cached by Core ML.
Trade-offs:
- ✅ Pro: Reduces peak memory usage by loading models sequentially
- ❌ Con: Doubles load time (~2x) when cache is hit and specialization isn’t needed
When to use:
- Enable when minimizing peak memory is critical
- Disable if you cannot afford the ~2x longer load time
Whether to load models immediately during initialization. If nil, models are loaded automatically when modelFolder is provided.
Download Configuration
Whether to download models automatically if they are not available locally.
Use a background URLSession for model downloads. Useful for downloading large models that may take significant time.
Example Usage
Basic Configuration
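A minimal setup uses all defaults: WhisperKit selects a model variant suited to the device and downloads it from the default repository on first use. This is a sketch; exact defaults may vary by WhisperKit version:

```swift
import WhisperKit

// Default configuration: auto-selected model, default download repo.
let config = WhisperKitConfig()
let whisperKit = try await WhisperKit(config)

let results = try await whisperKit.transcribe(audioPath: "audio.wav")
```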
Specify Model
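To pin a specific variant from the list above, pass its name to the initializer:

```swift
import WhisperKit

// Use the English-only base model instead of auto-selection.
let config = WhisperKitConfig(model: "base.en")
let whisperKit = try await WhisperKit(config)
```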
Local Model
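When model files are already on disk, point WhisperKit at them so nothing is fetched from the network. The folder paths below are placeholders, and the parameter types shown (string path vs. URL) are assumptions:

```swift
import WhisperKit

// Use pre-downloaded model and tokenizer files.
let config = WhisperKitConfig(
    modelFolder: "/path/to/models/openai_whisper-base",
    tokenizerFolder: URL(fileURLWithPath: "/path/to/tokenizers"),
    download: false
)
let whisperKit = try await WhisperKit(config)
```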
Custom Compute Options
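Each model component can be routed to specific hardware via `ModelComputeOptions`, which wraps Core ML's `MLComputeUnits`. The parameter names here follow WhisperKit's compute-options type but should be checked against your version:

```swift
import CoreML
import WhisperKit

// Run the mel-spectrogram stage on CPU+GPU and the encoder/decoder
// on CPU+Neural Engine.
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine
)
let config = WhisperKitConfig(computeOptions: computeOptions)
```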
With Voice Activity Detection
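A voice activity detector can be supplied for intelligent chunking; WhisperKit ships an energy-based implementation, used here as an illustrative default:

```swift
import WhisperKit

// Use an energy-based VAD to decide where to split long audio.
let config = WhisperKitConfig(voiceActivityDetector: EnergyVAD())
let whisperKit = try await WhisperKit(config)
```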
Memory-Optimized Configuration
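To minimize peak memory, enable prewarming so models are specialized and loaded sequentially, accepting the longer load time described in the trade-offs above:

```swift
import WhisperKit

// Trade ~2x load time for lower peak memory during initialization.
let config = WhisperKitConfig(prewarm: true, load: true)
let whisperKit = try await WhisperKit(config)
```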
Custom Repository
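Models can be pulled from any Hugging Face Hub compatible repository; the repo identifier and token below are placeholders:

```swift
import WhisperKit

// Download from a private or mirrored model repository.
let config = WhisperKitConfig(
    modelRepo: "your-org/your-whisper-models",
    modelToken: "hf_..." // only needed if the repo requires auth
)
```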
Multi-Channel Audio
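The audio input configuration controls how multi-channel input is reduced to the mono signal Whisper expects. The `AudioInputConfig` initializer and the channel-mode case shown here are assumptions based on the description above; check your WhisperKit version for the exact API:

```swift
import WhisperKit

// Sum all input channels into a single mono stream before processing.
let audioInputConfig = AudioInputConfig(channelMode: .sumChannels(nil))
let config = WhisperKitConfig(audioInputConfig: audioInputConfig)
```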
DecodingOptions
While `WhisperKitConfig` configures the WhisperKit instance, `DecodingOptions` configures individual transcription requests.
Common Parameters
Display decoding progress and details
Either .transcribe (X→X) or .translate (X→English).
Language code (e.g., “en”, “es”, “fr”). If nil, language is auto-detected.
Sampling temperature (0.0 = deterministic, higher = more random)
Temperature increment when decoding fails and needs retry
Maximum number of temperature fallback attempts
Disable timestamp prediction
Enable word-level timestamps (requires more computation)
Array of timestamps (in seconds) to split audio into segments
Token IDs to use as conditioning prompt
Suppress blank tokens during decoding
Threshold for detecting repetitive text (triggers fallback)
Minimum average log probability (triggers fallback if below)
Probability threshold for detecting silence
Number of concurrent workers for parallel transcription (default: 16 on macOS, 4 on iOS)
Strategy for chunking long audio:
- .none - No chunking
- .vad - Voice activity detection based chunking
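Putting the parameters above together, a per-request configuration might look like the following sketch (only a subset of parameters is set; the rest keep their defaults):

```swift
import WhisperKit

// Per-request decoding options, passed alongside the audio.
let options = DecodingOptions(
    verbose: true,
    task: .transcribe,
    language: "en",          // nil would enable auto-detection
    temperature: 0.0,        // deterministic decoding
    wordTimestamps: true,    // word-level timing, at extra cost
    chunkingStrategy: .vad   // VAD-based splitting of long audio
)

let whisperKit = try await WhisperKit(WhisperKitConfig(model: "base"))
let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
```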