WhisperKit Overview

WhisperKit is a Swift framework that brings OpenAI’s Whisper speech recognition models to Apple devices. It leverages Core ML for efficient on-device inference, enabling privacy-focused, low-latency speech-to-text capabilities.

Key Features

On-Device Processing

All transcription happens locally on the device, ensuring user privacy and enabling offline functionality.

Multiple Model Sizes

Choose from tiny, base, small, medium, and large variants to balance accuracy and performance.

Multilingual Support

Supports 99+ languages with automatic language detection for multilingual models.

Real-time Streaming

Stream audio from the microphone and receive transcriptions in real time.
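As a rough sketch, the loop below accumulates microphone samples and re-transcribes them once per second. It assumes `AudioProcessor.requestRecordPermission()`, `startRecordingLive(callback:)`, the `audioSamples` buffer, and `transcribe(audioArray:)` are available in your WhisperKit version; check the current API before relying on these names.

```swift
import WhisperKit

let whisperKit = try await WhisperKit()

// Assumption: AudioProcessor exposes a permission helper.
guard await AudioProcessor.requestRecordPermission() else {
    fatalError("Microphone permission denied")
}

// Assumption: startRecordingLive appends incoming samples to
// audioProcessor.audioSamples as they arrive.
try whisperKit.audioProcessor.startRecordingLive { _ in
    // Buffer callback; samples accumulate in audioProcessor.audioSamples.
}

// Periodically transcribe everything captured so far.
while true {
    try await Task.sleep(nanoseconds: 1_000_000_000) // once per second
    let samples = Array(whisperKit.audioProcessor.audioSamples)
    let results = try await whisperKit.transcribe(audioArray: samples)
    print(results.map(\.text).joined())
}
```

A production app would transcribe only the new tail of the buffer and confirm earlier segments, rather than re-transcribing the full recording each pass.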

Core Components

WhisperKit consists of several key components:

WhisperKit Class

The main entry point for speech-to-text functionality. It orchestrates all the components needed for transcription.
let whisperKit = try await WhisperKit()
See WhisperKit.swift:12

Audio Processing Pipeline

  1. AudioProcessor - Captures and processes audio input
  2. FeatureExtractor - Converts audio to mel spectrograms
  3. AudioEncoder - Encodes audio features using the Whisper encoder model
  4. TextDecoder - Decodes encoded features into text tokens
  5. Tokenizer - Converts tokens to readable text
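Conceptually, the pipeline threads one value through each stage in order. The function names below are hypothetical stand-ins for the components above, shown only to illustrate the data flow, not the real API:

```swift
// Hypothetical signatures illustrating the data flow between stages.
let audio: [Float] = captureAudio()               // 1. AudioProcessor: raw PCM samples
let mel = extractMelSpectrogram(audio)            // 2. FeatureExtractor: audio -> mel frames
let features = encodeAudio(mel)                   // 3. AudioEncoder: mel -> encoded features
let tokens = decodeTokens(features)               // 4. TextDecoder: features -> token IDs
let text = detokenize(tokens)                     // 5. Tokenizer: token IDs -> readable text
```

In practice you never call these stages directly; `WhisperKit.transcribe` orchestrates them for you.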

Model Architecture

WhisperKit uses Core ML models downloaded from Hugging Face Hub:
  • MelSpectrogram.mlmodelc - Audio feature extraction
  • AudioEncoder.mlmodelc - Audio encoding
  • TextDecoder.mlmodelc - Text decoding
  • TextDecoderContextPrefill.mlmodelc - Optional prefill optimization

Quick Start

Basic Initialization

import WhisperKit

// Initialize with default model
let whisperKit = try await WhisperKit()

// Transcribe an audio file
let results = try await whisperKit.transcribe(audioPath: "path/to/audio.wav")
for result in results {
    print(result.text)
}

Custom Configuration

// Initialize with specific model and options
let config = WhisperKitConfig(
    model: "large-v3",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuAndNeuralEngine
    ),
    verbose: true,
    download: true
)

let whisperKit = try await WhisperKit(config)
See WhisperKitConfig

Model States

WhisperKit models progress through several states:
  • unloaded - Models not yet loaded
  • prewarming - Models being specialized for the device (optional)
  • prewarmed - Specialization complete
  • loading - Models being loaded into memory
  • loaded - Ready for transcription
  • unloading - Models being removed from memory
See WhisperKit.swift:15-19
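For example, assuming the `modelState` property reflects the lifecycle states listed above (verify against your WhisperKit version), you can gate transcription on readiness:

```swift
import WhisperKit

let whisperKit = try await WhisperKit()

// Assumption: `modelState` exposes the lifecycle states listed above.
if whisperKit.modelState == .loaded {
    let results = try await whisperKit.transcribe(audioPath: "path/to/audio.wav")
    print(results.map(\.text).joined())
} else {
    print("Models not ready: \(whisperKit.modelState)")
}
```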

Sample Rates and Constants

WhisperKit uses fixed audio parameters matching the Whisper model requirements:
WhisperKit.sampleRate // 16000 Hz
WhisperKit.hopLength // 160 samples
WhisperKit.secondsPerTimeToken // 0.02 seconds
See WhisperKit.swift:39-41
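These constants imply 100 mel frames per second of audio (16000 / 160), and each timestamp token advances by 0.02 s. A small sketch of the arithmetic, using literal values that mirror the constants above:

```swift
// Literal values mirroring WhisperKit.sampleRate, hopLength,
// and secondsPerTimeToken from the snippet above.
let sampleRate = 16_000          // samples per second
let hopLength = 160              // samples per mel frame
let secondsPerTimeToken = 0.02   // Whisper timestamp resolution

let framesPerSecond = sampleRate / hopLength   // 100 frames per second

// Convert a timestamp token index to seconds of audio:
let tokenIndex = 150
let seconds = Double(tokenIndex) * secondsPerTimeToken // 3.0 seconds
```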

Next Steps

Transcription

Learn how to transcribe audio files

Streaming

Real-time audio transcription

Model Selection

Choose the right model for your needs

Configuration

Advanced configuration options
