Overview
The `WhisperKitConfig` class provides a comprehensive configuration interface for initializing WhisperKit. It allows you to customize model selection, download settings, compute options, audio processing, and logging behavior.
Class Definition
Initializer
Properties
Model Selection
Name of the Whisper model variant to use. Options include:
- "tiny" - Smallest, fastest model
- "tiny.en" - English-only tiny model
- "base" - Base model
- "base.en" - English-only base model
- "small" - Small model
- "small.en" - English-only small model
- "medium" - Medium model
- "medium.en" - English-only medium model
- "large" - Large model (multilingual)
- "large-v2" - Large v2 model
- "large-v3" - Large v3 model
Model Download Configuration
Base URL for downloading models. If not specified, uses the default Hugging Face Hub endpoint.
Repository identifier for downloading models. Default is "argmaxinc/whisperkit-coreml".
Authentication token for accessing the model repository, if required.
Custom Hugging Face Hub compatible endpoint URL for model downloads.
Local file system path to a folder containing pre-downloaded model files. If specified, WhisperKit will use these models instead of downloading.
Local file system path to a folder containing tokenizer files.
Compute Configuration
Configuration for ML compute units. Allows you to specify which hardware (CPU, GPU, Neural Engine) to use for each model component:
Audio Configuration
Configuration for audio input processing, including channel mode settings:
Custom Components
Custom audio processor implementation. If not provided, uses the default `AudioProcessor`.
Custom feature extractor implementation for converting audio to mel spectrograms.
Custom audio encoder implementation for encoding mel spectrograms to embeddings.
Custom text decoder implementation for generating text from embeddings.
Array of custom logits filters to apply during text decoding.
Custom segment seeker implementation for managing audio window processing.
Voice activity detector for intelligent audio chunking. See VoiceActivityDetector for details.
Logging Configuration
Enable verbose logging output for debugging and monitoring.
Maximum log level to display. Options:
- .debug - Detailed debug information
- .info - General information
- .error - Only errors
- .none - No logging
Performance Configuration
Enable model prewarming for reduced peak memory usage during initialization.
What is Prewarming?
WhisperKit uses Core ML models that need to be “specialized” to your device’s chip before use. This specialization happens automatically on first load. The resulting specialized models are cached by Core ML.
Trade-offs:
- ✅ Pro: Reduces peak memory usage by loading models sequentially
- ❌ Con: Doubles load time (~2x) when cache is hit and specialization isn’t needed
When to use:
- Enable when minimizing peak memory is critical
- Disable if you cannot afford the ~2x longer load time
Whether to load models immediately during initialization. If nil, models are loaded automatically when modelFolder is provided.
Download Configuration
Whether to download models automatically if they are not available locally.
Use a background URLSession for model downloads. Useful for downloading large models that may take significant time.
Example Usage
Basic Configuration
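A minimal setup uses all defaults: WhisperKit selects a model variant suited to the device and downloads it from the default repository on first use. This is a sketch; exact defaults may vary by WhisperKit version:

```swift
import WhisperKit

// Default configuration: auto-selected model, default download repo.
let config = WhisperKitConfig()
let whisperKit = try await WhisperKit(config)

let results = try await whisperKit.transcribe(audioPath: "audio.wav")
```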
Specify Model
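To pin a specific variant from the list above, pass its name to the initializer:

```swift
import WhisperKit

// Use the English-only base model instead of auto-selection.
let config = WhisperKitConfig(model: "base.en")
let whisperKit = try await WhisperKit(config)
```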
Local Model
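When model files are already on disk, point WhisperKit at them so nothing is fetched from the network. The folder paths below are placeholders, and the parameter types shown (string path vs. URL) are assumptions:

```swift
import WhisperKit

// Use pre-downloaded model and tokenizer files.
let config = WhisperKitConfig(
    modelFolder: "/path/to/models/openai_whisper-base",
    tokenizerFolder: URL(fileURLWithPath: "/path/to/tokenizers"),
    download: false
)
let whisperKit = try await WhisperKit(config)
```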
Custom Compute Options
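Each model component can be routed to specific hardware via `ModelComputeOptions`, which wraps Core ML's `MLComputeUnits`. The parameter names here follow WhisperKit's compute-options type but should be checked against your version:

```swift
import CoreML
import WhisperKit

// Run the mel-spectrogram stage on CPU+GPU and the encoder/decoder
// on CPU+Neural Engine.
let computeOptions = ModelComputeOptions(
    melCompute: .cpuAndGPU,
    audioEncoderCompute: .cpuAndNeuralEngine,
    textDecoderCompute: .cpuAndNeuralEngine
)
let config = WhisperKitConfig(computeOptions: computeOptions)
```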
With Voice Activity Detection
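A voice activity detector can be supplied for intelligent chunking; WhisperKit ships an energy-based implementation, used here as an illustrative default:

```swift
import WhisperKit

// Use an energy-based VAD to decide where to split long audio.
let config = WhisperKitConfig(voiceActivityDetector: EnergyVAD())
let whisperKit = try await WhisperKit(config)
```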
Memory-Optimized Configuration
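To minimize peak memory, enable prewarming so models are specialized and loaded sequentially, accepting the longer load time described in the trade-offs above:

```swift
import WhisperKit

// Trade ~2x load time for lower peak memory during initialization.
let config = WhisperKitConfig(prewarm: true, load: true)
let whisperKit = try await WhisperKit(config)
```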
Custom Repository
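Models can be pulled from any Hugging Face Hub compatible repository; the repo identifier and token below are placeholders:

```swift
import WhisperKit

// Download from a private or mirrored model repository.
let config = WhisperKitConfig(
    modelRepo: "your-org/your-whisper-models",
    modelToken: "hf_..." // only needed if the repo requires auth
)
```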
Multi-Channel Audio
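The audio input configuration controls how multi-channel input is reduced to the mono signal Whisper expects. The `AudioInputConfig` initializer and the channel-mode case shown here are assumptions based on the description above; check your WhisperKit version for the exact API:

```swift
import WhisperKit

// Sum all input channels into a single mono stream before processing.
let audioInputConfig = AudioInputConfig(channelMode: .sumChannels(nil))
let config = WhisperKitConfig(audioInputConfig: audioInputConfig)
```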
DecodingOptions
While `WhisperKitConfig` configures the WhisperKit instance, `DecodingOptions` configures individual transcription requests.
Common Parameters
Display decoding progress and details
Either .transcribe (X→X) or .translate (X→English).
Language code (e.g., “en”, “es”, “fr”). If nil, language is auto-detected.
Sampling temperature (0.0 = deterministic, higher = more random)
Temperature increment when decoding fails and needs retry
Maximum number of temperature fallback attempts
Disable timestamp prediction
Enable word-level timestamps (requires more computation)
Array of timestamps (in seconds) to split audio into segments
Token IDs to use as conditioning prompt
Suppress blank tokens during decoding
Threshold for detecting repetitive text (triggers fallback)
Minimum average log probability (triggers fallback if below)
Probability threshold for detecting silence
Number of concurrent workers for parallel transcription (default: 16 on macOS, 4 on iOS)
Strategy for chunking long audio:
- .none - No chunking
- .vad - Voice activity detection based chunking
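Putting the parameters above together, a per-request configuration might look like the following sketch (only a subset of parameters is set; the rest keep their defaults):

```swift
import WhisperKit

// Per-request decoding options, passed alongside the audio.
let options = DecodingOptions(
    verbose: true,
    task: .transcribe,
    language: "en",          // nil would enable auto-detection
    temperature: 0.0,        // deterministic decoding
    wordTimestamps: true,    // word-level timing, at extra cost
    chunkingStrategy: .vad   // VAD-based splitting of long audio
)

let whisperKit = try await WhisperKit(WhisperKitConfig(model: "base"))
let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
```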