WhisperKit’s transcription pipeline is built from modular, protocol-based components. This allows you to customize or replace individual parts of the pipeline while maintaining compatibility with the rest of the system.

AudioProcessing

Handles audio loading, recording, and preprocessing.

Required Properties

audioSamples
ContiguousArray<Float>
Stores the audio samples to be transcribed.
relativeEnergy
[Float]
A measure of the current buffer's energy in dB, normalized from 0 to 1 based on the quietest buffer's energy within a specified window.
relativeEnergyWindow
Int
How many past buffers of audio to use when calculating relative energy. The lowest average energy value within this window is used as the silence baseline.
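To make these three properties concrete, here is a self-contained sketch of how a relative-energy value could be derived. The function names and the 0 dBFS ceiling are illustrative assumptions, not WhisperKit's actual implementation.

```swift
import Foundation

// Energy of one buffer in dB (mean-square power, floored to avoid log10(0)).
func energyInDB(_ buffer: [Float]) -> Float {
    let meanSquare = buffer.reduce(0) { $0 + $1 * $1 } / Float(max(buffer.count, 1))
    return 10 * log10(max(meanSquare, .leastNormalMagnitude))
}

// Normalize the current buffer's energy against the quietest buffer
// (the silence baseline) within the last `window` buffers.
func relativeEnergy(of buffer: [Float], history: [Float], window: Int) -> Float {
    let silenceBaseline = history.suffix(window).min() ?? -90 // dB fallback
    let ceiling: Float = 0 // 0 dBFS assumed as the loud end of the scale
    let current = energyInDB(buffer)
    return min(max((current - silenceBaseline) / (ceiling - silenceBaseline), 0), 1)
}
```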

Required Methods

loadAudio(fromPath:channelMode:startTime:endTime:maxReadFrameSize:)
static func
Loads audio data from a specified file path.
Parameters:
  • audioFilePath: String - The file path of the audio file
  • channelMode: ChannelMode - How to handle multi-channel audio
  • startTime: Double? - Optional start time in seconds
  • endTime: Double? - Optional end time in seconds
  • maxReadFrameSize: AVAudioFrameCount? - Maximum frames to read at once
Returns: AVAudioPCMBuffer containing the audio data
loadAudio(at:channelMode:)
static func
Loads and converts audio data from multiple file paths.
Parameters:
  • audioPaths: [String] - Array of file paths
  • channelMode: ChannelMode - How to handle multi-channel audio
Returns: Array of Result<[Float], Error> for each file
padOrTrimAudio(fromArray:startAt:toLength:saveSegment:)
static func
Pads or trims audio data to the desired length.
Parameters:
  • audioArray: [Float] - Audio frames to process
  • startIndex: Int - Index to start at
  • frameLength: Int - Desired length in frames
  • saveSegment: Bool - Whether to save for debugging
Returns: MLMultiArray? containing the processed audio
padOrTrim(fromArray:startAt:toLength:)
func
Instance method to pad or trim audio data.
Returns: AudioProcessorOutputType?
purgeAudioSamples(keepingLast:)
func
Empties the audio samples array, keeping the last N samples.
startRecordingLive(inputDeviceID:callback:)
func
Starts recording audio from the specified input device, resetting previous state.
Parameters:
  • inputDeviceID: DeviceID? - Input device (macOS only)
  • callback: (([Float]) -> Void)? - Called with each audio buffer
startStreamingRecordingLive(inputDeviceID:)
func
Starts live audio recording with an async stream.
Returns: Tuple of AsyncThrowingStream<[Float], Error> and its continuation
pauseRecording()
func
Pauses the current recording.
stopRecording()
func
Stops recording and cleans up resources.
resumeRecordingLive(inputDeviceID:callback:)
func
Resumes recording audio, appending to continuous audioArray after pause.
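The recording lifecycle above can be sketched as follows. This assumes `AudioProcessor` is the default conforming type and that the recording methods throw; treat it as a sketch rather than the definitive call sequence.

```swift
let processor: any AudioProcessing = AudioProcessor()

// Start live capture; the callback receives each new buffer of samples.
try processor.startRecordingLive(inputDeviceID: nil) { buffer in
    print("Received \(buffer.count) samples")
}

// Pause and resume without losing the accumulated audioSamples.
processor.pauseRecording()
try processor.resumeRecordingLive(inputDeviceID: nil, callback: nil)

// Stop and trim the buffer to the last second of 16 kHz audio.
processor.stopRecording()
processor.purgeAudioSamples(keepingLast: 16_000)
```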

FeatureExtracting

Extracts mel spectrogram features from audio.

Properties

melCount
Int?
Number of mel frequency bins (typically 80 or 128).
windowSamples
Int?
Number of audio samples per window (typically 480,000 for 30 seconds at 16kHz).

Methods

logMelSpectrogram(fromAudio:)
async func
Converts audio samples to log mel spectrogram features.
Parameters:
  • inputAudio: AudioProcessorOutputType - Processed audio samples
Returns: FeatureExtractorOutputType? - Mel spectrogram features
Throws: WhisperError if extraction fails
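Putting AudioProcessing and FeatureExtracting together, a single extraction step might look like this. The wrapper function is hypothetical; it assumes conforming `audioProcessor` and `featureExtractor` instances (e.g. from a WhisperKit instance).

```swift
import CoreML

// Sketch: pad/trim one window of audio and extract its mel features.
func extractFeatures(
    audioProcessor: any AudioProcessing,
    featureExtractor: any FeatureExtracting
) async throws -> FeatureExtractorOutputType? {
    let samples = Array(audioProcessor.audioSamples)
    // windowSamples is typically 480,000 (30 seconds at 16 kHz)
    guard let windowSamples = featureExtractor.windowSamples,
          let input = audioProcessor.padOrTrim(fromArray: samples, startAt: 0, toLength: windowSamples)
    else { return nil }
    return try await featureExtractor.logMelSpectrogram(fromAudio: input)
}
```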

AudioEncoding

Encodes audio features into embeddings.

Properties

embedSize
Int?
Size of the embedding dimension produced by the encoder.

Methods

encodeFeatures(_:)
async func
Encodes audio features into embeddings.
Parameters:
  • features: FeatureExtractorOutputType - Mel spectrogram features
Returns: AudioEncoderOutputType? - Encoded audio embeddings
Throws: WhisperError if encoding fails
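Encoding is a single call once features are available. A sketch, assuming `audioEncoder` conforms to AudioEncoding and `melFeatures` came from a FeatureExtracting implementation:

```swift
// Sketch: encode mel features into audio embeddings.
if let embeddings = try await audioEncoder.encodeFeatures(melFeatures) {
    // The feature dimension of `embeddings` matches embedSize (if reported).
    print("embedSize:", audioEncoder.embedSize ?? -1)
}
```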

TextDecoding

Decodes audio embeddings into text.

Properties

tokenizer
WhisperTokenizer?
Tokenizer for encoding/decoding text.
prefillData
WhisperMLModel?
Optional prefill model for KV cache initialization.
isModelMultilingual
Bool
Whether the model supports multiple languages.
supportsWordTimestamps
Bool
Whether the model can generate word-level timestamps.
logitsSize
Int?
Size of the vocabulary (number of possible tokens).
logitsFilters
[LogitsFiltering]?
Array of filters applied to logits before sampling.
kvCacheEmbedDim
Int?
Embedding dimension for key-value cache.
kvCacheMaxSequenceLength
Int?
Maximum sequence length for KV cache.
windowSize
Int?
Size of the attention window.
embedSize
Int?
Size of encoder output embeddings.

Methods

predictLogits(_:)
async func
Predicts logits for the next token.
Parameters:
  • inputs: TextDecoderInputType - Decoder inputs including tokens and caches
Returns: TextDecoderOutputType? - Logits and updated caches
prepareDecoderInputs(withPrompt:)
func
Prepares decoder inputs with an initial prompt.
Parameters:
  • initialPrompt: [Int] - Array of prompt token IDs
Returns: DecodingInputsType - Initialized decoder inputs
Throws: WhisperError if preparation fails
prefillDecoderInputs(_:withOptions:)
async func
Prefills decoder inputs with language and task tokens.
Parameters:
  • decoderInputs: DecodingInputsType - Inputs to prefill
  • options: DecodingOptions? - Decoding configuration
Returns: DecodingInputsType - Prefilled inputs
prefillKVCache(withTask:andLanguage:)
async func
Prefills the key-value cache using the prefill model.
Parameters:
  • task: MLMultiArray - Task token (transcribe/translate)
  • language: MLMultiArray - Language token
Returns: DecodingCache? - Prefilled cache data
decodeText(from:using:sampler:options:callback:)
async func
Decodes audio embeddings into text.
Parameters:
  • encoderOutput: AudioEncoderOutputType - Encoded audio
  • decoderInputs: DecodingInputsType - Decoder state
  • tokenSampler: TokenSampling - Token sampling strategy
  • decoderOptions: DecodingOptions - Decoding configuration
  • callback: TranscriptionCallback - Progress callback
Returns: DecodingResult - Decoded text and metadata
detectLanguage(from:using:sampler:options:temperature:)
async func
Detects the language of the audio.
Parameters:
  • encoderOutput: AudioEncoderOutputType - Encoded audio
  • decoderInputs: DecodingInputsType - Decoder state
  • tokenSampler: TokenSampling - Token sampling strategy
  • options: DecodingOptions - Decoding configuration
  • temperature: FloatType - Sampling temperature
Returns: DecodingResult - Detected language and probabilities
updateKVCache(keyTensor:keySlice:valueTensor:valueSlice:insertAtIndex:)
static func
Updates the key-value cache with new values.
Parameters:
  • keyTensor: MLMultiArray - Key cache tensor
  • keySlice: MLMultiArray - New key values
  • valueTensor: MLMultiArray - Value cache tensor
  • valueSlice: MLMultiArray - New values for the value cache
  • index: Int - Position to insert
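A typical decode pass wires these methods together in order: prepare inputs, prefill, then decode. This is a sketch under assumptions: `GreedyTokenSampler`, `initialPromptTokens`, `eotToken`, and `encoderOutput` are stand-ins, and the exact sampler initializer may differ.

```swift
// Sketch of one decoding pass, assuming `textDecoder` conforms to TextDecoding.
var inputs = try textDecoder.prepareDecoderInputs(withPrompt: initialPromptTokens)
let options = DecodingOptions()
inputs = try await textDecoder.prefillDecoderInputs(inputs, withOptions: options)

let result = try await textDecoder.decodeText(
    from: encoderOutput,          // AudioEncoderOutputType from the encoder
    using: inputs,                // DecodingInputsType prepared above
    sampler: GreedyTokenSampler(temperature: 0, eotToken: eotToken, decodingOptions: options),
    options: options,
    callback: nil                 // optional progress callback
)
```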

LogitsFiltering

Filters model logits before token sampling.

Methods

filterLogits(_:withTokens:)
func
Filters the logits based on current tokens and rules.
Parameters:
  • logits: MLMultiArray - Raw model logits
  • tokens: [Int] - Currently generated tokens
Returns: MLMultiArray - Filtered logits

Built-in Filters

  • SuppressTokensFilter - Suppresses specific token IDs
  • SuppressBlankFilter - Suppresses blank tokens at segment start
  • TimestampRulesFilter - Enforces timestamp pairing rules
  • LanguageLogitsFilter - Retains only language tokens
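A custom filter only needs to implement `filterLogits(_:withTokens:)`. Here is a minimal sketch (the `BannedTokensFilter` type is hypothetical) that suppresses a fixed set of token IDs by setting their logits to negative infinity:

```swift
import CoreML

// Sketch of a custom LogitsFiltering conformer.
struct BannedTokensFilter: LogitsFiltering {
    let bannedTokens: [Int]

    func filterLogits(_ logits: MLMultiArray, withTokens tokens: [Int]) -> MLMultiArray {
        // Setting a logit to -inf gives the token zero probability after softmax.
        for token in bannedTokens {
            logits[token] = NSNumber(value: -Float.infinity)
        }
        return logits
    }
}
```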

SegmentSeeking

Manages audio segmentation and word-level timestamps.

Methods

findSeekPointAndSegments(decodingResult:options:allSegmentsCount:currentSeek:segmentSize:sampleRate:timeToken:specialToken:tokenizer:)
func
Finds the next seek point and creates transcription segments.
Returns: Tuple of (Int, [TranscriptionSegment]?) - next seek position and segments
addWordTimestamps(segments:alignmentWeights:tokenizer:seek:segmentSize:prependPunctuations:appendPunctuations:lastSpeechTimestamp:options:timings:)
func
Adds word-level timestamps to segments using alignment weights.
Returns: [TranscriptionSegment]? - Segments with word timestamps
Throws: WhisperError if timestamp alignment fails

WhisperTokenizer

Tokenizes and detokenizes text for Whisper models.

Properties

specialTokens
SpecialTokens
Special token IDs used by the model (start, end, language tokens, etc.).
allLanguageTokens
Set<Int>
Set of all language token IDs supported by the model.

Methods

encode(text:)
func
Encodes text into token IDs.
Returns: [Int] - Array of token IDs
decode(tokens:)
func
Decodes token IDs into text.
Returns: String - Decoded text
convertTokenToId(_:)
func
Converts a token string to its ID.
Returns: Int? - Token ID, or nil if not found
convertIdToToken(_:)
func
Converts a token ID to its string representation.
Returns: String? - Token string, or nil if not found
splitToWordTokens(tokenIds:)
func
Splits token IDs into words and their constituent tokens.
Returns: Tuple of (words: [String], wordTokens: [[Int]])
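An encode/decode round trip with a tokenizer looks like this. The `endToken` property name on SpecialTokens is an assumption; check the SpecialTokens definition for the exact accessor.

```swift
// Sketch: round-trip text through a WhisperTokenizer.
let ids = tokenizer.encode(text: "Hello world")
let text = tokenizer.decode(tokens: ids)

// Inspect special tokens and word groupings.
let eot = tokenizer.specialTokens.endToken
let (words, wordTokens) = tokenizer.splitToWordTokens(tokenIds: ids)
```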

WhisperMLModel

Base protocol for Core ML model wrappers.

Properties

model
MLModel?
The underlying Core ML model instance.

Methods

loadModel(at:computeUnits:prewarmMode:)
async func
Loads a Core ML model from disk.
Parameters:
  • modelPath: URL - Path to the .mlmodelc file
  • computeUnits: MLComputeUnits - Compute units to use
  • prewarmMode: Bool - Whether to load in prewarm mode
unloadModel()
func
Unloads the model from memory.
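Loading and unloading via WhisperMLModel can be sketched as follows; `myModel` and the model path are placeholders for any conforming type and its compiled `.mlmodelc` location.

```swift
import CoreML

// Sketch: load a compiled Core ML model, then release it when done.
let modelURL = URL(fileURLWithPath: "Models/AudioEncoder.mlmodelc")
try await myModel.loadModel(
    at: modelURL,
    computeUnits: .cpuAndNeuralEngine, // MLComputeUnits selection
    prewarmMode: false                 // true to load for prewarming only
)
defer { myModel.unloadModel() }
```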

Usage Example

// Custom audio processor
class MyAudioProcessor: AudioProcessing {
    var audioSamples: ContiguousArray<Float> = []
    var relativeEnergy: [Float] = []
    var relativeEnergyWindow: Int = 20
    
    // Implement required methods...
}

// Use custom processor
let config = WhisperKitConfig(
    model: "openai_whisper-base",
    audioProcessor: MyAudioProcessor()
)

let whisperKit = try await WhisperKit(config)
