AudioProcessing

Handles audio loading, recording, and preprocessing.

Required Properties
Stores the audio samples to be transcribed.
A measure of the current buffer's energy in dB, normalized from 0 to 1 based on the quietest buffer's energy in a specified window.
How many past buffers of audio to use when calculating relative energy. The lowest average energy value within this window is used as the silence baseline.
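As a rough illustration of how such a relative-energy measure could be computed, here is a simplified sketch; the function name, normalization constants, and windowing are assumptions, not WhisperKit's actual implementation:

```swift
import Foundation

// Sketch: dB-based energy of the current buffer, normalized 0-1 against the
// quietest buffer in a history window (the silence baseline). Illustrative only.
func relativeEnergy(current: [Float], history: [[Float]]) -> Float {
    // Energy of a buffer in dB, derived from its root-mean-square amplitude.
    func energyDB(_ buffer: [Float]) -> Float {
        let rms = sqrt(buffer.map { $0 * $0 }.reduce(0, +) / Float(buffer.count))
        return 20 * log10(max(rms, 1e-10)) // clamp to avoid log10(0)
    }
    let currentDB = energyDB(current)
    // The lowest-energy buffer in the window serves as the silence baseline.
    let minDB = history.map(energyDB).min() ?? currentDB
    let maxDB: Float = 0 // 0 dB corresponds to a full-scale signal
    guard maxDB > minDB else { return 1 }
    // Normalize to 0-1 relative to the baseline, clamped to the valid range.
    return min(max((currentDB - minDB) / (maxDB - minDB), 0), 1)
}
```

A buffer matching the baseline maps to 0; a full-scale buffer maps to 1.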
Required Methods
Loads audio data from a specified file path.

Parameters:
- audioFilePath: String - The file path of the audio file
- channelMode: ChannelMode - How to handle multi-channel audio
- startTime: Double? - Optional start time in seconds
- endTime: Double? - Optional end time in seconds
- maxReadFrameSize: AVAudioFrameCount? - Maximum frames to read at once
Returns: AVAudioPCMBuffer containing the audio data

Loads and converts audio data from multiple file paths.

Parameters:
- audioPaths: [String] - Array of file paths
- channelMode: ChannelMode - How to handle multi-channel audio
Returns: Result<[Float], Error> for each file

Pads or trims audio data to the desired length.

Parameters:
- audioArray: [Float] - Audio frames to process
- startIndex: Int - Index to start at
- frameLength: Int - Desired length in frames
- saveSegment: Bool - Whether to save for debugging
Returns: MLMultiArray? containing the processed audio

Instance method to pad or trim audio data.

Returns: AudioProcessorOutputType?

Empties the audio samples array, keeping the last N samples.
Starts recording audio from the specified input device, resetting previous state.

Parameters:
- inputDeviceID: DeviceID? - Input device (macOS only)
- callback: (([Float]) -> Void)? - Called with each audio buffer
Starts live audio recording with an async stream.

Returns: Tuple of AsyncThrowingStream<[Float], Error> and its continuation

Pauses the current recording.
Stops recording and cleans up resources.
Resumes recording audio after a pause, appending to the continuous audioArray.
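The pad-or-trim behavior described earlier can be sketched on a plain [Float] buffer; this is a simplified illustration of the semantics, while the real method returns an MLMultiArray and takes additional arguments:

```swift
// Sketch of pad-or-trim semantics: slice from startIndex, then zero-pad
// (silence) up to the requested frame length. Illustrative, not the real API.
func padOrTrim(_ audioArray: [Float], startIndex: Int = 0, frameLength: Int) -> [Float] {
    let slice = Array(audioArray.dropFirst(startIndex).prefix(frameLength))
    return slice + Array(repeating: 0, count: max(0, frameLength - slice.count))
}
```

For example, a full 30-second window at 16 kHz would use a frameLength of 480,000.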
FeatureExtracting

Extracts mel spectrogram features from audio.

Properties
Number of mel frequency bins (typically 80 or 128).
Number of audio samples per window (typically 480,000 for 30 seconds at 16kHz).
Methods
Converts audio samples to log mel spectrogram features.

Parameters:
- inputAudio: AudioProcessorOutputType - Processed audio samples

Returns: FeatureExtractorOutputType? - Mel spectrogram features

Throws: WhisperError if extraction fails

AudioEncoding

Encodes audio features into embeddings.

Properties
Size of the embedding dimension produced by the encoder.
Methods
Encodes audio features into embeddings.

Parameters:
- features: FeatureExtractorOutputType - Mel spectrogram features

Returns: AudioEncoderOutputType? - Encoded audio embeddings

Throws: WhisperError if encoding fails

TextDecoding

Decodes audio embeddings into text.

Properties
Tokenizer for encoding/decoding text.
Optional prefill model for KV cache initialization.
Whether the model supports multiple languages.
Whether the model can generate word-level timestamps.
Size of the vocabulary (number of possible tokens).
Array of filters applied to logits before sampling.
Embedding dimension for key-value cache.
Maximum sequence length for KV cache.
Size of the attention window.
Size of encoder output embeddings.
Methods
Predicts logits for the next token.

Parameters:
- inputs: TextDecoderInputType - Decoder inputs including tokens and caches

Returns: TextDecoderOutputType? - Logits and updated caches

Prepares decoder inputs with an initial prompt.

Parameters:
- initialPrompt: [Int] - Array of prompt token IDs
Returns: DecodingInputsType - Initialized decoder inputs

Throws: WhisperError if preparation fails

Prefills decoder inputs with language and task tokens.

Parameters:
- decoderInputs: DecodingInputsType - Inputs to prefill
- options: DecodingOptions? - Decoding configuration
Returns: DecodingInputsType - Prefilled inputs

Prefills the key-value cache using the prefill model.

Parameters:
- task: MLMultiArray - Task token (transcribe/translate)
- language: MLMultiArray - Language token
Returns: DecodingCache? - Prefilled cache data

Decodes audio embeddings into text.

Parameters:
- encoderOutput: AudioEncoderOutputType - Encoded audio
- decoderInputs: DecodingInputsType - Decoder state
- tokenSampler: TokenSampling - Token sampling strategy
- decoderOptions: DecodingOptions - Decoding configuration
- callback: TranscriptionCallback - Progress callback
Returns: DecodingResult - Decoded text and metadata

Detects the language of the audio.

Parameters:
- encoderOutput: AudioEncoderOutputType - Encoded audio
- decoderInputs: DecodingInputsType - Decoder state
- tokenSampler: TokenSampling - Token sampling strategy
- options: DecodingOptions - Decoding configuration
- temperature: FloatType - Sampling temperature
Returns: DecodingResult - Detected language and probabilities

Updates the key-value cache with new values.

Parameters:
- keyTensor: MLMultiArray - Key cache tensor
- keySlice: MLMultiArray - New key values
- valueTensor: MLMultiArray - Value cache tensor
- valueSlice: MLMultiArray - New values to insert
- index: Int - Position to insert
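The decode step can be sketched as a greedy (argmax) sampling loop, with plain [Float] logits standing in for the real tensor types; this is an illustrative sketch, not WhisperKit's implementation, and predictLogits here is a caller-supplied stub:

```swift
// Sketch of a greedy decoding loop: predict logits for the current token
// sequence, pick the highest-scoring token, append it, and stop at endToken
// or after maxTokens new tokens.
func greedyDecode(prompt: [Int], endToken: Int, maxTokens: Int,
                  predictLogits: ([Int]) -> [Float]) -> [Int] {
    var tokens = prompt
    for _ in 0..<maxTokens {
        let logits = predictLogits(tokens)
        // Greedy sampling: take the argmax of the (possibly filtered) logits.
        guard let next = logits.indices.max(by: { logits[$0] < logits[$1] }) else { break }
        tokens.append(next)
        if next == endToken { break }
    }
    return tokens
}
```

In the real protocol this loop also applies logit filters before sampling, maintains the key-value cache between steps, and reports progress through the callback.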
LogitsFiltering

Filters model logits before token sampling.

Methods
Filters the logits based on current tokens and rules.

Parameters:
- logits: MLMultiArray - Raw model logits
- tokens: [Int] - Currently generated tokens

Returns: MLMultiArray - Filtered logits

Built-in Filters
- SuppressTokensFilter - Suppresses specific token IDs
- SuppressBlankFilter - Suppresses blank tokens at segment start
- TimestampRulesFilter - Enforces timestamp pairing rules
- LanguageLogitsFilter - Retains only language tokens
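A suppress-style filter of this shape can be sketched with [Float] standing in for MLMultiArray; this is illustrative only, since the protocol's actual requirement operates on MLMultiArray:

```swift
// Sketch of a suppress-tokens filter: set the scores of banned token IDs to
// -infinity so the sampler can never select them. Illustrative stand-in type.
struct SuppressTokensSketch {
    let suppressTokens: [Int]

    func filterLogits(_ logits: [Float], withTokens tokens: [Int]) -> [Float] {
        var filtered = logits
        for id in suppressTokens where id < filtered.count {
            filtered[id] = -.infinity
        }
        return filtered
    }
}
```

Because filters take the currently generated tokens as input, rules like timestamp pairing can condition on what has been decoded so far.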
SegmentSeeking

Manages audio segmentation and word-level timestamps.

Methods
findSeekPointAndSegments(decodingResult:options:allSegmentsCount:currentSeek:segmentSize:sampleRate:timeToken:specialToken:tokenizer:)

Finds the next seek point and creates transcription segments.

Returns: Tuple of (Int, [TranscriptionSegment]?) - next seek position and segments

addWordTimestamps(segments:alignmentWeights:tokenizer:seek:segmentSize:prependPunctuations:appendPunctuations:lastSpeechTimestamp:options:timings:)

Adds word-level timestamps to segments using alignment weights.

Returns: [TranscriptionSegment]? - Segments with word timestamps

Throws: WhisperError if timestamp alignment fails

WhisperTokenizer

Tokenizes and detokenizes text for Whisper models.

Properties
Special token IDs used by the model (start, end, language tokens, etc.).
Set of all language token IDs supported by the model.
Methods
Encodes text into token IDs.

Returns: [Int] - Array of token IDs

Decodes token IDs into text.

Returns: String - Decoded text

Converts a token string to its ID.

Returns: Int? - Token ID, or nil if not found

Converts a token ID to its string representation.

Returns: String? - Token string, or nil if not found

Splits token IDs into words and their constituent tokens.

Returns: Tuple of (words: [String], wordTokens: [[Int]])

WhisperMLModel

Base protocol for Core ML model wrappers.

Properties
The underlying Core ML model instance.
Methods
Loads a Core ML model from disk.

Parameters:
- modelPath: URL - Path to the .mlmodelc file
- computeUnits: MLComputeUnits - Compute units to use
- prewarmMode: Bool - Whether to load in prewarm mode
Unloads the model from memory.
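The load/unload lifecycle can be sketched with a generic resource standing in for the Core ML model instance; this is a simplified illustration, and the reading of prewarm mode (load to warm caches, then release) is an assumption rather than the library's documented behavior:

```swift
// Sketch of a WhisperMLModel-style lifecycle. A String stands in for the
// underlying MLModel instance; names and prewarm semantics are illustrative.
final class ModelWrapperSketch {
    private(set) var model: String? // stand-in for the Core ML model instance

    // One plausible reading of prewarm mode: load the model so compilation
    // caches are populated, then release it so memory is not held.
    func loadModel(at path: String, prewarmMode: Bool) {
        model = "loaded:\(path)"
        if prewarmMode { unloadModel() }
    }

    // Unloads the model from memory.
    func unloadModel() { model = nil }
}
```

A full load keeps the model resident; a prewarm pass leaves the wrapper empty but ready for a fast subsequent load.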