WhisperKit’s transcription pipeline is built from modular, protocol-based components. This allows you to customize or replace individual parts of the pipeline while maintaining compatibility with the rest of the system.

AudioProcessing

Handles audio loading, recording, and preprocessing.

Required Properties

audioSamples
ContiguousArray<Float>
Stores the audio samples to be transcribed.
relativeEnergy
[Float]
A measure of the current buffer's energy in dB, normalized from 0 to 1 based on the quietest buffer's energy within a specified window.
relativeEnergyWindow
Int
How many past buffers of audio to use when calculating relative energy. The lowest average energy value within this window is used as the silence baseline.
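To make these three properties concrete, here is a self-contained sketch of how a relative-energy value could be derived. The function names and the 0 dBFS ceiling are illustrative assumptions, not WhisperKit's actual implementation.

```swift
import Foundation

// Energy of one buffer in dB (mean-square power, floored to avoid log10(0)).
func energyInDB(_ buffer: [Float]) -> Float {
    let meanSquare = buffer.reduce(0) { $0 + $1 * $1 } / Float(max(buffer.count, 1))
    return 10 * log10(max(meanSquare, .leastNormalMagnitude))
}

// Normalize the current buffer's energy against the quietest buffer
// (the silence baseline) within the last `window` buffers.
func relativeEnergy(of buffer: [Float], history: [Float], window: Int) -> Float {
    let silenceBaseline = history.suffix(window).min() ?? -90 // dB fallback
    let ceiling: Float = 0 // 0 dBFS assumed as the loud end of the scale
    let current = energyInDB(buffer)
    return min(max((current - silenceBaseline) / (ceiling - silenceBaseline), 0), 1)
}
```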

Required Methods

loadAudio(fromPath:channelMode:startTime:endTime:maxReadFrameSize:)
static func
Loads audio data from a specified file path.
Parameters:
  • audioFilePath: String - The file path of the audio file
  • channelMode: ChannelMode - How to handle multi-channel audio
  • startTime: Double? - Optional start time in seconds
  • endTime: Double? - Optional end time in seconds
  • maxReadFrameSize: AVAudioFrameCount? - Maximum frames to read at once
Returns: AVAudioPCMBuffer containing the audio data
loadAudio(at:channelMode:)
static func
Loads and converts audio data from multiple file paths.
Parameters:
  • audioPaths: [String] - Array of file paths
  • channelMode: ChannelMode - How to handle multi-channel audio
Returns: Array of Result<[Float], Error> for each file
padOrTrimAudio(fromArray:startAt:toLength:saveSegment:)
static func
Pads or trims audio data to the desired length.
Parameters:
  • audioArray: [Float] - Audio frames to process
  • startIndex: Int - Index to start at
  • frameLength: Int - Desired length in frames
  • saveSegment: Bool - Whether to save for debugging
Returns: MLMultiArray? containing the processed audio
padOrTrim(fromArray:startAt:toLength:)
func
Instance method to pad or trim audio data.
Returns: AudioProcessorOutputType?
purgeAudioSamples(keepingLast:)
func
Empties the audio samples array, keeping the last N samples.
startRecordingLive(inputDeviceID:callback:)
func
Starts recording audio from the specified input device, resetting previous state.
Parameters:
  • inputDeviceID: DeviceID? - Input device (macOS only)
  • callback: (([Float]) -> Void)? - Called with each audio buffer
startStreamingRecordingLive(inputDeviceID:)
func
Starts live audio recording with an async stream.
Returns: Tuple of AsyncThrowingStream<[Float], Error> and its continuation
pauseRecording()
func
Pauses the current recording.
stopRecording()
func
Stops recording and cleans up resources.
resumeRecordingLive(inputDeviceID:callback:)
func
Resumes recording audio, appending to continuous audioArray after pause.
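The recording lifecycle above can be sketched as follows. This assumes `AudioProcessor` is the default conforming type and that the recording methods throw; treat it as a sketch rather than the definitive call sequence.

```swift
let processor: any AudioProcessing = AudioProcessor()

// Start live capture; the callback receives each new buffer of samples.
try processor.startRecordingLive(inputDeviceID: nil) { buffer in
    print("Received \(buffer.count) samples")
}

// Pause and resume without losing the accumulated audioSamples.
processor.pauseRecording()
try processor.resumeRecordingLive(inputDeviceID: nil, callback: nil)

// Stop and trim the buffer to the last second of 16 kHz audio.
processor.stopRecording()
processor.purgeAudioSamples(keepingLast: 16_000)
```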

FeatureExtracting

Extracts mel spectrogram features from audio.

Properties

melCount
Int?
Number of mel frequency bins (typically 80 or 128).
windowSamples
Int?
Number of audio samples per window (typically 480,000 for 30 seconds at 16kHz).

Methods

logMelSpectrogram(fromAudio:)
async func
Converts audio samples to log mel spectrogram features.
Parameters:
  • inputAudio: AudioProcessorOutputType - Processed audio samples
Returns: FeatureExtractorOutputType? - Mel spectrogram features
Throws: WhisperError if extraction fails
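Putting AudioProcessing and FeatureExtracting together, a single extraction step might look like this. The wrapper function is hypothetical; it assumes conforming `audioProcessor` and `featureExtractor` instances (e.g. from a WhisperKit instance).

```swift
import CoreML

// Sketch: pad/trim one window of audio and extract its mel features.
func extractFeatures(
    audioProcessor: any AudioProcessing,
    featureExtractor: any FeatureExtracting
) async throws -> FeatureExtractorOutputType? {
    let samples = Array(audioProcessor.audioSamples)
    // windowSamples is typically 480,000 (30 seconds at 16 kHz)
    guard let windowSamples = featureExtractor.windowSamples,
          let input = audioProcessor.padOrTrim(fromArray: samples, startAt: 0, toLength: windowSamples)
    else { return nil }
    return try await featureExtractor.logMelSpectrogram(fromAudio: input)
}
```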

AudioEncoding

Encodes audio features into embeddings.

Properties

embedSize
Int?
Size of the embedding dimension produced by the encoder.

Methods

encodeFeatures(_:)
async func
Encodes audio features into embeddings.
Parameters:
  • features: FeatureExtractorOutputType - Mel spectrogram features
Returns: AudioEncoderOutputType? - Encoded audio embeddings
Throws: WhisperError if encoding fails
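Encoding is a single call once features are available. A sketch, assuming `audioEncoder` conforms to AudioEncoding and `melFeatures` came from a FeatureExtracting implementation:

```swift
// Sketch: encode mel features into audio embeddings.
if let embeddings = try await audioEncoder.encodeFeatures(melFeatures) {
    // The feature dimension of `embeddings` matches embedSize (if reported).
    print("embedSize:", audioEncoder.embedSize ?? -1)
}
```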

TextDecoding

Decodes audio embeddings into text.

Properties

tokenizer
WhisperTokenizer?
Tokenizer for encoding/decoding text.
prefillData
WhisperMLModel?
Optional prefill model for KV cache initialization.
isModelMultilingual
Bool
Whether the model supports multiple languages.
supportsWordTimestamps
Bool
Whether the model can generate word-level timestamps.
logitsSize
Int?
Size of the vocabulary (number of possible tokens).
logitsFilters
[LogitsFiltering]?
Array of filters applied to logits before sampling.
kvCacheEmbedDim
Int?
Embedding dimension for key-value cache.
kvCacheMaxSequenceLength
Int?
Maximum sequence length for KV cache.
windowSize
Int?
Size of the attention window.
embedSize
Int?
Size of encoder output embeddings.

Methods

predictLogits(_:)
async func
Predicts logits for the next token.
Parameters:
  • inputs: TextDecoderInputType - Decoder inputs including tokens and caches
Returns: TextDecoderOutputType? - Logits and updated caches
prepareDecoderInputs(withPrompt:)
func
Prepares decoder inputs with an initial prompt.
Parameters:
  • initialPrompt: [Int] - Array of prompt token IDs
Returns: DecodingInputsType - Initialized decoder inputs
Throws: WhisperError if preparation fails
prefillDecoderInputs(_:withOptions:)
async func
Prefills decoder inputs with language and task tokens.
Parameters:
  • decoderInputs: DecodingInputsType - Inputs to prefill
  • options: DecodingOptions? - Decoding configuration
Returns: DecodingInputsType - Prefilled inputs
prefillKVCache(withTask:andLanguage:)
async func
Prefills the key-value cache using the prefill model.
Parameters:
  • task: MLMultiArray - Task token (transcribe/translate)
  • language: MLMultiArray - Language token
Returns: DecodingCache? - Prefilled cache data
decodeText(from:using:sampler:options:callback:)
async func
Decodes audio embeddings into text.
Parameters:
  • encoderOutput: AudioEncoderOutputType - Encoded audio
  • decoderInputs: DecodingInputsType - Decoder state
  • tokenSampler: TokenSampling - Token sampling strategy
  • decoderOptions: DecodingOptions - Decoding configuration
  • callback: TranscriptionCallback - Progress callback
Returns: DecodingResult - Decoded text and metadata
detectLanguage(from:using:sampler:options:temperature:)
async func
Detects the language of the audio.
Parameters:
  • encoderOutput: AudioEncoderOutputType - Encoded audio
  • decoderInputs: DecodingInputsType - Decoder state
  • tokenSampler: TokenSampling - Token sampling strategy
  • options: DecodingOptions - Decoding configuration
  • temperature: FloatType - Sampling temperature
Returns: DecodingResult - Detected language and probabilities
updateKVCache(keyTensor:keySlice:valueTensor:valueSlice:insertAtIndex:)
static func
Updates the key-value cache with new values.
Parameters:
  • keyTensor: MLMultiArray - Key cache tensor
  • keySlice: MLMultiArray - New key values
  • valueTensor: MLMultiArray - Value cache tensor
  • valueSlice: MLMultiArray - New values for the value cache
  • index: Int - Position to insert
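A typical decode pass wires these methods together in order: prepare inputs, prefill, then decode. This is a sketch under assumptions: `GreedyTokenSampler`, `initialPromptTokens`, `eotToken`, and `encoderOutput` are stand-ins, and the exact sampler initializer may differ.

```swift
// Sketch of one decoding pass, assuming `textDecoder` conforms to TextDecoding.
var inputs = try textDecoder.prepareDecoderInputs(withPrompt: initialPromptTokens)
let options = DecodingOptions()
inputs = try await textDecoder.prefillDecoderInputs(inputs, withOptions: options)

let result = try await textDecoder.decodeText(
    from: encoderOutput,          // AudioEncoderOutputType from the encoder
    using: inputs,                // DecodingInputsType prepared above
    sampler: GreedyTokenSampler(temperature: 0, eotToken: eotToken, decodingOptions: options),
    options: options,
    callback: nil                 // optional progress callback
)
```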

LogitsFiltering

Filters model logits before token sampling.

Methods

filterLogits(_:withTokens:)
func
Filters the logits based on current tokens and rules.
Parameters:
  • logits: MLMultiArray - Raw model logits
  • tokens: [Int] - Currently generated tokens
Returns: MLMultiArray - Filtered logits

Built-in Filters

  • SuppressTokensFilter - Suppresses specific token IDs
  • SuppressBlankFilter - Suppresses blank tokens at segment start
  • TimestampRulesFilter - Enforces timestamp pairing rules
  • LanguageLogitsFilter - Retains only language tokens
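A custom filter only needs to implement `filterLogits(_:withTokens:)`. Here is a minimal sketch (the `BannedTokensFilter` type is hypothetical) that suppresses a fixed set of token IDs by setting their logits to negative infinity:

```swift
import CoreML

// Sketch of a custom LogitsFiltering conformer.
struct BannedTokensFilter: LogitsFiltering {
    let bannedTokens: [Int]

    func filterLogits(_ logits: MLMultiArray, withTokens tokens: [Int]) -> MLMultiArray {
        // Setting a logit to -inf gives the token zero probability after softmax.
        for token in bannedTokens {
            logits[token] = NSNumber(value: -Float.infinity)
        }
        return logits
    }
}
```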

SegmentSeeking

Manages audio segmentation and word-level timestamps.

Methods

findSeekPointAndSegments(decodingResult:options:allSegmentsCount:currentSeek:segmentSize:sampleRate:timeToken:specialToken:tokenizer:)
func
Finds the next seek point and creates transcription segments.
Returns: Tuple of (Int, [TranscriptionSegment]?) - next seek position and segments
addWordTimestamps(segments:alignmentWeights:tokenizer:seek:segmentSize:prependPunctuations:appendPunctuations:lastSpeechTimestamp:options:timings:)
func
Adds word-level timestamps to segments using alignment weights.
Returns: [TranscriptionSegment]? - Segments with word timestamps
Throws: WhisperError if timestamp alignment fails

WhisperTokenizer

Tokenizes and detokenizes text for Whisper models.

Properties

specialTokens
SpecialTokens
Special token IDs used by the model (start, end, language tokens, etc.).
allLanguageTokens
Set<Int>
Set of all language token IDs supported by the model.

Methods

encode(text:)
func
Encodes text into token IDs.
Returns: [Int] - Array of token IDs
decode(tokens:)
func
Decodes token IDs into text.
Returns: String - Decoded text
convertTokenToId(_:)
func
Converts a token string to its ID.
Returns: Int? - Token ID, or nil if not found
convertIdToToken(_:)
func
Converts a token ID to its string representation.
Returns: String? - Token string, or nil if not found
splitToWordTokens(tokenIds:)
func
Splits token IDs into words and their constituent tokens.
Returns: Tuple of (words: [String], wordTokens: [[Int]])
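An encode/decode round trip with a tokenizer looks like this. The `endToken` property name on SpecialTokens is an assumption; check the SpecialTokens definition for the exact accessor.

```swift
// Sketch: round-trip text through a WhisperTokenizer.
let ids = tokenizer.encode(text: "Hello world")
let text = tokenizer.decode(tokens: ids)

// Inspect special tokens and word groupings.
let eot = tokenizer.specialTokens.endToken
let (words, wordTokens) = tokenizer.splitToWordTokens(tokenIds: ids)
```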

WhisperMLModel

Base protocol for Core ML model wrappers.

Properties

model
MLModel?
The underlying Core ML model instance.

Methods

loadModel(at:computeUnits:prewarmMode:)
async func
Loads a Core ML model from disk.
Parameters:
  • modelPath: URL - Path to the .mlmodelc file
  • computeUnits: MLComputeUnits - Compute units to use
  • prewarmMode: Bool - Whether to load in prewarm mode
unloadModel()
func
Unloads the model from memory.
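Loading and unloading via WhisperMLModel can be sketched as follows; `myModel` and the model path are placeholders for any conforming type and its compiled `.mlmodelc` location.

```swift
import CoreML

// Sketch: load a compiled Core ML model, then release it when done.
let modelURL = URL(fileURLWithPath: "Models/AudioEncoder.mlmodelc")
try await myModel.loadModel(
    at: modelURL,
    computeUnits: .cpuAndNeuralEngine, // MLComputeUnits selection
    prewarmMode: false                 // true to load for prewarming only
)
defer { myModel.unloadModel() }
```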

Usage Example

// Custom audio processor
class MyAudioProcessor: AudioProcessing {
    var audioSamples: ContiguousArray<Float> = []
    var relativeEnergy: [Float] = []
    var relativeEnergyWindow: Int = 20
    
    // Implement required methods...
}

// Use custom processor
let config = WhisperKitConfig(
    model: "openai_whisper-base",
    audioProcessor: MyAudioProcessor()
)

let whisperKit = try await WhisperKit(config)
