Overview

The WhisperKitConfig class provides a comprehensive configuration interface for initializing WhisperKit. It allows you to customize model selection, download settings, compute options, audio processing, and logging behavior.

Class Definition

open class WhisperKitConfig

Initializer

public init(
    model: String? = nil,
    downloadBase: URL? = nil,
    modelRepo: String? = nil,
    modelToken: String? = nil,
    modelEndpoint: String? = nil,
    modelFolder: String? = nil,
    tokenizerFolder: URL? = nil,
    computeOptions: ModelComputeOptions? = nil,
    audioInputConfig: AudioInputConfig? = nil,
    audioProcessor: (any AudioProcessing)? = nil,
    featureExtractor: (any FeatureExtracting)? = nil,
    audioEncoder: (any AudioEncoding)? = nil,
    textDecoder: (any TextDecoding)? = nil,
    logitsFilters: [any LogitsFiltering]? = nil,
    segmentSeeker: (any SegmentSeeking)? = nil,
    voiceActivityDetector: VoiceActivityDetector? = nil,
    verbose: Bool = true,
    logLevel: Logging.LogLevel = .info,
    prewarm: Bool? = nil,
    load: Bool? = nil,
    download: Bool = true,
    useBackgroundDownloadSession: Bool = false
)

Properties

Model Selection

model
String?
Name of the Whisper model variant to use. Options include:
  • "tiny" - Smallest, fastest model
  • "tiny.en" - English-only tiny model
  • "base" - Base model
  • "base.en" - English-only base model
  • "small" - Small model
  • "small.en" - English-only small model
  • "medium" - Medium model
  • "medium.en" - English-only medium model
  • "large" - Large model (multilingual)
  • "large-v2" - Large v2 model
  • "large-v3" - Large v3 model
If not specified, WhisperKit will automatically select a recommended model based on your device.
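If you want to inspect the automatic choice rather than hard-code a variant, a minimal sketch is below. The `recommendedModels()` helper is an assumption about the WhisperKit API surface; verify it exists in your WhisperKit version before relying on it.

```swift
import WhisperKit

// Ask WhisperKit which model it would pick for the current device.
// NOTE: `recommendedModels()` is assumed here; check your WhisperKit version.
let recommendation = WhisperKit.recommendedModels()
print("Recommended model: \(recommendation)")

// Leaving `model` as nil is equivalent to accepting the recommendation.
let config = WhisperKitConfig(model: nil)
let whisperKit = try await WhisperKit(config)
```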

Model Download Configuration

downloadBase
URL?
Base URL for downloading models. If not specified, uses the default Hugging Face Hub endpoint.
modelRepo
String?
Repository identifier for downloading models. Default is "argmaxinc/whisperkit-coreml".
modelToken
String?
Authentication token for accessing the model repository, if required.
modelEndpoint
String?
Custom Hugging Face Hub compatible endpoint URL for model downloads.
modelFolder
String?
Local file system path to a folder containing pre-downloaded model files. If specified, WhisperKit will use these models instead of downloading.
tokenizerFolder
URL?
Local file system path to a folder containing tokenizer files.

Compute Configuration

computeOptions
ModelComputeOptions?
Configuration for ML compute units. Allows you to specify which hardware (CPU, GPU, Neural Engine) to use for each model component:
ModelComputeOptions(
    melCompute: .cpuAndGPU,              // Mel spectrogram computation
    audioEncoderCompute: .cpuAndNeuralEngine,  // Audio encoder
    textDecoderCompute: .cpuAndNeuralEngine,   // Text decoder
    prefillCompute: .cpuOnly              // Cache prefill
)

Audio Configuration

audioInputConfig
AudioInputConfig?
Configuration for audio input processing, including channel mode settings:
AudioInputConfig(
    channelMode: .sumChannels(nil)  // Mix all channels or select specific ones
)

Custom Components

audioProcessor
(any AudioProcessing)?
Custom audio processor implementation. If not provided, uses the default AudioProcessor.
featureExtractor
(any FeatureExtracting)?
Custom feature extractor implementation for converting audio to mel spectrograms.
audioEncoder
(any AudioEncoding)?
Custom audio encoder implementation for encoding mel spectrograms to embeddings.
textDecoder
(any TextDecoding)?
Custom text decoder implementation for generating text from embeddings.
logitsFilters
[any LogitsFiltering]?
Array of custom logits filters to apply during text decoding.
segmentSeeker
(any SegmentSeeking)?
Custom segment seeker implementation for managing audio window processing.
voiceActivityDetector
VoiceActivityDetector?
Voice activity detector for intelligent audio chunking. See VoiceActivityDetector for details.

Logging Configuration

verbose
Bool
Default: true
Enable verbose logging output for debugging and monitoring.
logLevel
Logging.LogLevel
Default: .info
Maximum log level to display. Options:
  • .debug - Detailed debug information
  • .info - General information
  • .error - Only errors
  • .none - No logging
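For example, a minimal sketch of a quiet configuration that surfaces only errors:

```swift
import WhisperKit

// Disable verbose output and raise the threshold to errors only.
let config = WhisperKitConfig(
    verbose: false,
    logLevel: .error
)
let whisperKit = try await WhisperKit(config)
```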

Performance Configuration

prewarm
Bool?
Enable model prewarming for reduced peak memory usage during initialization.

What is prewarming? WhisperKit uses Core ML models that need to be "specialized" to your device's chip before use. This specialization happens automatically on first load, and the resulting specialized models are cached by Core ML.

Trade-offs:
  • Pro: Reduces peak memory usage by loading models sequentially
  • Con: Roughly doubles load time (~2x) when the cache is hit and specialization isn't needed

When to use:
  • Enable when minimizing peak memory is critical
  • Disable if you cannot afford 2x longer load time
load
Bool?
Whether to load models immediately during initialization. If nil, models are loaded automatically when modelFolder is provided.

Download Configuration

download
Bool
Default: true
Whether to download models automatically if they are not available locally.
useBackgroundDownloadSession
Bool
Default: false
Use a background URLSession for model downloads. Useful for downloading large models that may take significant time.
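A minimal sketch of a configuration that downloads a large model over a background session, so the transfer can continue if the app is suspended:

```swift
import WhisperKit

// Large models can take a long time to fetch; a background
// URLSession lets the download survive app suspension.
let config = WhisperKitConfig(
    model: "large-v3",
    download: true,
    useBackgroundDownloadSession: true
)
let whisperKit = try await WhisperKit(config)
```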

Example Usage

Basic Configuration

import WhisperKit

// Use default configuration
let config = WhisperKitConfig()
let whisperKit = try await WhisperKit(config)

Specify Model

// Use a specific model
let config = WhisperKitConfig(
    model: "base",
    download: true
)
let whisperKit = try await WhisperKit(config)

Local Model

// Use locally stored models
let config = WhisperKitConfig(
    modelFolder: "/path/to/models/openai_whisper-base",
    load: true,
    download: false
)
let whisperKit = try await WhisperKit(config)

Custom Compute Options

// Optimize for specific hardware
let config = WhisperKitConfig(
    model: "small",
    computeOptions: ModelComputeOptions(
        melCompute: .cpuAndGPU,
        audioEncoderCompute: .cpuAndNeuralEngine,
        textDecoderCompute: .cpuAndNeuralEngine,
        prefillCompute: .cpuOnly
    )
)
let whisperKit = try await WhisperKit(config)

With Voice Activity Detection

// Use VAD for better chunking
let config = WhisperKitConfig(
    model: "base",
    voiceActivityDetector: EnergyVAD(),
    verbose: true
)
let whisperKit = try await WhisperKit(config)

Memory-Optimized Configuration

// Optimize for low memory usage
let config = WhisperKitConfig(
    model: "tiny",
    computeOptions: ModelComputeOptions(
        audioEncoderCompute: .cpuOnly,
        textDecoderCompute: .cpuOnly
    ),
    prewarm: true,  // Load models sequentially
    load: true
)
let whisperKit = try await WhisperKit(config)

Custom Repository

// Use a custom model repository
let config = WhisperKitConfig(
    model: "base",
    modelRepo: "myorg/custom-whisper-models",
    modelToken: "hf_...",  // If private repo
    download: true
)
let whisperKit = try await WhisperKit(config)

Multi-Channel Audio

// Configure channel processing
let config = WhisperKitConfig(
    model: "base",
    audioInputConfig: AudioInputConfig(
        channelMode: .specificChannel(0)  // Use only the first channel
    )
)
let whisperKit = try await WhisperKit(config)

DecodingOptions

While WhisperKitConfig configures the WhisperKit instance, DecodingOptions configures individual transcription requests.
public struct DecodingOptions: Codable, Sendable

Common Parameters

verbose
Bool
Default: false
Display decoding progress and details.
task
DecodingTask
Default: .transcribe
Either .transcribe (output in the audio's language) or .translate (output in English).
language
String?
Language code (e.g., "en", "es", "fr"). If nil, the language is auto-detected.
temperature
Float
Default: 0.0
Sampling temperature (0.0 = deterministic; higher values = more random).
temperatureIncrementOnFallback
Float
Default: 0.2
Amount added to the temperature each time decoding fails and is retried.
temperatureFallbackCount
Int
Default: 5
Maximum number of temperature fallback attempts.
withoutTimestamps
Bool
Default: false
Disable timestamp prediction.
wordTimestamps
Bool
Default: false
Enable word-level timestamps (requires more computation).
clipTimestamps
[Float]
Default: []
Array of timestamps (in seconds) at which to split the audio into segments.
promptTokens
[Int]?
Token IDs to use as a conditioning prompt.
suppressBlank
Bool
Default: false
Suppress blank tokens during decoding.
compressionRatioThreshold
Float?
Default: 2.4
Threshold for detecting repetitive text (triggers a fallback when exceeded).
logProbThreshold
Float?
Default: -1.0
Minimum average log probability (triggers a fallback when below).
noSpeechThreshold
Float?
Default: 0.6
Probability threshold for detecting silence.
concurrentWorkerCount
Int
Number of concurrent workers for parallel transcription (default: 16 on macOS, 4 on iOS).
chunkingStrategy
ChunkingStrategy?
Strategy for chunking long audio:
  • .none - No chunking
  • .vad - Voice activity detection based chunking

Example Usage

// Transcribe with custom options
let options = DecodingOptions(
    task: .transcribe,
    language: "en",
    wordTimestamps: true,
    temperature: 0.0,
    chunkingStrategy: .vad
)

let results = try await whisperKit.transcribe(
    audioPath: "/path/to/audio.wav",
    decodeOptions: options
)
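The returned results carry the full transcription text plus per-segment timing. A sketch of consuming them follows; the property names (`text`, `segments`, `start`, `end`) reflect WhisperKit's `TranscriptionResult` as the author understands it, so verify them against your WhisperKit version.

```swift
// Print the full text, then each segment with its time range.
for result in results {
    print(result.text)
    for segment in result.segments {
        print("[\(segment.start)s - \(segment.end)s] \(segment.text)")
    }
}
```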
