Overview
The WhisperKit class is the main entry point for performing speech-to-text transcription using Apple’s Core ML framework. It manages model loading and audio processing, and provides both synchronous and asynchronous transcription methods.
Class Definition
Initializers
init(_:)
Initializes WhisperKit with a configuration object.
Configuration object for WhisperKit initialization. See WhisperKitConfig for details.
Convenience Initializer
Initializes WhisperKit with individual parameters.
Name of the Whisper model variant to use (e.g., “tiny”, “base”, “small”, “medium”, “large”)
Base URL for downloading models
Repository name for downloading models (default: “argmaxinc/whisperkit-coreml”)
Local folder path containing pre-downloaded models
Folder containing tokenizer files
Options for ML compute units (CPU, GPU, Neural Engine)
Custom audio processor implementation
Custom feature extractor implementation
Custom audio encoder implementation
Custom text decoder implementation
Array of logits filters to apply during decoding
Custom segment seeker implementation
Enable verbose logging
Maximum log level to display
Enable model prewarming to reduce peak memory during initialization
Whether to load models immediately
Download models if not available locally
Use background download session for model downloads
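A minimal initialization sketch. The configuration parameter names below (`model`, `verbose`, `prewarm`, `load`, `download`) follow the options listed above but should be checked against the WhisperKitConfig definition in your library version; the model variant is illustrative:

```swift
import WhisperKit

Task {
    // Build a configuration object, then hand it to the initializer.
    let config = WhisperKitConfig(
        model: "base",   // model variant to load (illustrative)
        verbose: true,   // enable verbose logging
        prewarm: true,   // prewarm models to reduce peak memory
        load: true,      // load models immediately
        download: true   // download the model if not available locally
    )
    let whisperKit = try await WhisperKit(config)
    print("Model state: \(whisperKit.modelState)")
}
```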
Properties
Model State
Currently loaded model variant (tiny, base, small, medium, large, etc.)
Current state of the model (unloaded, loading, loaded, prewarming, etc.)
Compute options for the loaded models
The tokenizer used for encoding/decoding text
Processing Components
Audio processor for handling audio input and preprocessing
Feature extractor for converting audio to mel spectrograms
Audio encoder for encoding mel spectrograms to embeddings
Text decoder for generating text from audio embeddings
Segment seeker for managing audio window processing
Optional voice activity detector for chunking audio
Configuration
Configuration for audio input processing
Path to the folder containing model files
Path to the folder containing tokenizer files
Progress and Callbacks
Timing information for the current/last transcription
Progress object for tracking transcription progress
Callback invoked when new transcription segments are discovered
Callback invoked when model state changes
Callback invoked when transcription state changes
Constants
Sample rate used for audio processing (16 kHz)
Hop length for mel spectrogram computation
Duration in seconds represented by each time token (20ms)
Static Methods
deviceName()
Returns the device identifier string.
Device identifier (e.g., “iPhone15,2”)
recommendedModels()
Returns recommended models for the current device.
Model support information including default and supported model variants
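A short sketch of querying device-appropriate models, assuming a synchronous signature and that the returned model-support value exposes `default` and `supported` fields as described above:

```swift
import WhisperKit

// Ask WhisperKit which model variants this device can run well.
let support = WhisperKit.recommendedModels()
print("Default model: \(support.default)")
print("Supported models: \(support.supported)")
```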
recommendedRemoteModels(from:downloadBase:token:remoteConfigName:endpoint:)
Fetches recommended models from a remote repository.
Repository to fetch model configuration from
Base URL for downloads
Authentication token for the repository
Name of the remote configuration file
API endpoint for the repository
Model support information from the remote repository
fetchAvailableModels(from:matching:downloadBase:token:remoteConfigName:endpoint:)
Fetches the list of available models from a remote repository.
Repository to fetch models from
Glob patterns to filter model names
Array of available model names
download(variant:downloadBase:useBackgroundSession:from:token:endpoint:progressCallback:)
Downloads a specific model variant.
Model variant to download (e.g., “tiny”, “base”, “small”)
Optional callback for download progress updates
Local URL of the downloaded model folder
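A sketch of downloading a model ahead of time, assuming the parameter labels shown in the signature above; the progress callback receives a standard `Progress` object:

```swift
import WhisperKit

Task {
    // Download the "tiny" variant, reporting fractional progress as it goes.
    let modelFolder = try await WhisperKit.download(
        variant: "tiny",
        useBackgroundSession: false,
        progressCallback: { progress in
            print("Download progress: \(progress.fractionCompleted)")
        }
    )
    print("Model downloaded to \(modelFolder.path)")
}
```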
Instance Methods
loadModels(prewarmMode:)
Loads the models into memory.
If true, loads models in prewarm mode to reduce peak memory usage
prewarmModels()
Prewarms the models by loading them sequentially.
unloadModels()
Unloads all models from memory.
clearState()
Clears the current transcription state.
detectLanguage(audioPath:)
Detects the language of audio from a file path.
Path to the audio file
Tuple containing detected language code and probability distribution over all languages
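A sketch of file-based language detection, assuming the tuple return described above (a language code plus a probability distribution keyed by language code); the audio path is a placeholder:

```swift
import WhisperKit

Task {
    let whisperKit = try await WhisperKit(WhisperKitConfig(model: "base"))
    // Detect the spoken language without running a full transcription.
    let (language, probabilities) = try await whisperKit.detectLanguage(audioPath: "path/to/audio.wav")
    print("Detected language: \(language)")
    // Show the three most likely candidates from the distribution.
    let top = probabilities.sorted { $0.value > $1.value }.prefix(3)
    print("Top candidates: \(top)")
}
```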
detectLangauge(audioArray:)
Detects the language of audio from a sample array.
Array of 16kHz audio samples
Tuple containing detected language code and probability distribution
transcribe(audioPath:decodeOptions:callback:)
Transcribes audio from a file path.
Path to the audio file to transcribe
Options for transcription (language, task, temperature, etc.)
Optional callback for progress updates during transcription
Array of transcription results. See TranscriptionResult for details.
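A basic transcription sketch. The `DecodingOptions` labels (`task`, `language`, `temperature`) follow the option categories named above but should be verified against your library version; the audio path is a placeholder:

```swift
import WhisperKit

Task {
    let whisperKit = try await WhisperKit(WhisperKitConfig(model: "base"))
    // Decode options control language, task, temperature, and more.
    let options = DecodingOptions(task: .transcribe, language: "en", temperature: 0.0)
    let results = try await whisperKit.transcribe(
        audioPath: "path/to/audio.wav",
        decodeOptions: options
    )
    // Each result covers one audio window; join the text for a full transcript.
    let transcript = results.map(\.text).joined(separator: " ")
    print(transcript)
}
```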
transcribe(audioArray:decodeOptions:callback:segmentCallback:)
Transcribes audio from a sample array.
Array of 16kHz mono audio samples
Options for transcription
Optional callback for progress updates
Optional callback invoked when segments are discovered
Array of transcription results
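A sketch of transcribing raw samples with a progress callback. It assumes the callback receives a progress value exposing the partial `text` and can return an optional Bool to continue or stop early; a silent buffer stands in for real audio:

```swift
import WhisperKit

Task {
    let whisperKit = try await WhisperKit(WhisperKitConfig(model: "base"))
    // Samples must be 16 kHz mono Float values; a 5-second silent buffer here.
    let audioSamples = [Float](repeating: 0.0, count: 16000 * 5)
    let results = try await whisperKit.transcribe(
        audioArray: audioSamples,
        decodeOptions: DecodingOptions(),
        callback: { progress in
            // Inspect the partial text; return nil to continue decoding.
            print("Current text: \(progress.text)")
            return nil
        }
    )
    print(results.map(\.text).joined(separator: " "))
}
```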
transcribe(audioPaths:decodeOptions:callback:)
Transcribes multiple audio files.
Array of audio file paths to transcribe
Array of optional transcription result arrays (nil if transcription failed for that file)
transcribe(audioArrays:decodeOptions:callback:)
Transcribes multiple audio sample arrays.
Array of audio sample arrays to transcribe
Array of optional transcription result arrays
loggingCallback(_:)
Sets a custom logging callback.
Custom logging callback function