Audio Transcription

WhisperKit provides flexible APIs for transcribing audio from files, arrays, or live input. All transcription methods return TranscriptionResult objects containing the recognized text and detailed metadata.

Basic Transcription

Transcribe from File Path

The simplest way to transcribe audio:
let whisperKit = try await WhisperKit()

// Transcribe a single audio file
let results: [TranscriptionResult] = try await whisperKit.transcribe(
    audioPath: "path/to/audio.wav"
)

for result in results {
    print("Text: \(result.text)")
    print("Language: \(result.language)")
}
See WhisperKit.swift:840-872

Transcribe from Audio Array

For pre-loaded audio samples:
// Audio must be 16kHz mono float array
let audioArray: [Float] = loadAudioSamples()

let results = try await whisperKit.transcribe(
    audioArray: audioArray
)
See WhisperKit.swift:896-960

Decoding Options

Customize transcription behavior with DecodingOptions:
var options = DecodingOptions(
    verbose: true,
    task: .transcribe,           // or .translate for English translation
    language: "en",              // Specify language or nil for auto-detect
    temperature: 0.0,            // 0.0 for greedy, >0 for sampling
    wordTimestamps: true,        // Enable word-level timestamps
    skipSpecialTokens: true      // Remove special tokens from output
)

let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
See DecodingOptions

Key DecodingOptions Parameters

  • task (DecodingTask, default: .transcribe) - .transcribe keeps the audio's original language; .translate translates to English.
  • language (String?, default: nil) - Language code (e.g., "en", "es", "fr"). If nil, the language is auto-detected for multilingual models.
  • temperature (Float, default: 0.0) - Sampling temperature. 0.0 uses greedy decoding (deterministic); higher values increase randomness.
  • wordTimestamps (Bool, default: false) - Enable word-level timestamps in the output.
  • withoutTimestamps (Bool, default: false) - Disable all timestamps (faster decoding).
  • compressionRatioThreshold (Float?, default: 2.4) - If the text compression ratio exceeds this value, the window is marked as failed (indicates repetition or hallucination).
  • logProbThreshold (Float?, default: -1.0) - If the average log probability falls below this value, the window is marked as failed (low confidence).
  • concurrentWorkerCount (Int, default: 16 on macOS, 4 on iOS) - Number of concurrent workers when transcribing multiple audio files in parallel.
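
The two failure thresholds are often tuned together for noisy or long-form audio. A minimal sketch using only the parameters listed above (the specific values are illustrative, not recommendations):

```swift
// Stricter quality gates than the defaults (2.4 and -1.0):
// fail windows earlier when output looks repetitive or low-confidence.
var strictOptions = DecodingOptions(
    temperature: 0.0,
    compressionRatioThreshold: 2.0,  // catch repetition loops sooner
    logProbThreshold: -0.7           // catch low-confidence output sooner
)
```

Failed windows trigger the temperature fallback behavior described below.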

Batch Transcription

Transcribe Multiple Files

let audioPaths = [
    "audio1.wav",
    "audio2.wav",
    "audio3.wav"
]

let results = await whisperKit.transcribe(
    audioPaths: audioPaths,
    decodeOptions: options
)

// Results is [[TranscriptionResult]?] - one array per file
for (index, result) in results.enumerated() {
    if let transcriptions = result {
        print("File \(index): \(transcriptions.first?.text ?? "")")
    }
}
See WhisperKit.swift:599-673

Using Results for Error Handling

let results = await whisperKit.transcribeWithResults(
    audioPaths: audioPaths
)

for (index, result) in results.enumerated() {
    switch result {
    case .success(let transcriptions):
        print("Success: \(transcriptions.first?.text ?? "")")
    case .failure(let error):
        print("Error transcribing file \(index): \(error)")
    }
}
See WhisperKit.swift:624-673

TranscriptionResult Structure

Each transcription returns a TranscriptionResult with:
let result: TranscriptionResult = results.first!

// Full transcribed text
print(result.text)

// Detected language
print(result.language)

// Individual segments with timestamps
for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
    
    // Word-level timestamps (if enabled)
    if let words = segment.words {
        for word in words {
            print("  \(word.word): \(word.start)s - \(word.end)s")
        }
    }
}

// Performance timings
result.logTimings()
See TranscriptionResult

TranscriptionSegment Properties

  • id: Int - Segment identifier
  • seek: Int - Seek position in audio samples
  • start: Float - Start time in seconds
  • end: Float - End time in seconds
  • text: String - Transcribed text
  • tokens: [Int] - Token IDs
  • words: [WordTiming]? - Word-level timestamps (if enabled)
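
These properties are enough to emit simple subtitle-style output. A minimal sketch, continuing from the `result` value in the example above (the timestamp format is illustrative):

```swift
// Print each segment as "[MM:SS.mmm --> MM:SS.mmm] text".
func formatTime(_ seconds: Float) -> String {
    let minutes = Int(seconds) / 60
    let secs = Double(seconds) - Double(minutes * 60)
    return String(format: "%02d:%06.3f", minutes, secs)
}

for segment in result.segments {
    print("[\(formatTime(segment.start)) --> \(formatTime(segment.end))] \(segment.text)")
}
```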

Language Detection

Detect the language of an audio file:
let (language, probabilities) = try await whisperKit.detectLanguage(
    audioPath: "audio.wav"
)

print("Detected language: \(language)")
for (lang, prob) in probabilities.sorted(by: { $0.value > $1.value }).prefix(5) {
    print("  \(lang): \(prob)")
}
See WhisperKit.swift:533-593
Language detection only works with multilingual models. English-only models (tiny.en, base.en, etc.) will throw an error.
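
Because English-only models throw here, it can be useful to wrap detection in do/catch and fall back to a fixed language. A hedged sketch using the API shown above:

```swift
// Fall back to English if detection is unsupported (e.g., *.en models).
var detectedLanguage = "en"
do {
    let (language, _) = try await whisperKit.detectLanguage(audioPath: "audio.wav")
    detectedLanguage = language
} catch {
    print("Language detection unavailable, defaulting to en: \(error)")
}

let options = DecodingOptions(language: detectedLanguage)
```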

Progress Tracking

Monitor transcription progress with a callback:
let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options,
    callback: { progress in
        print("Window \(progress.windowId): \(progress.text)")
        print("Tokens: \(progress.tokens.count)")
        
        // Return false to cancel early
        return shouldContinue ? nil : false
    }
)

TranscriptionProgress Properties

  • windowId: Int - Current audio window being processed
  • text: String - Current decoded text
  • tokens: [Int] - Generated tokens
  • avgLogprob: Float? - Average log probability
  • timings: TranscriptionTimings - Performance metrics
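
These properties let the callback cancel low-quality runs early. A sketch using the same return convention as the example above (nil continues, false stops); the -1.5 cutoff is illustrative:

```swift
let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options,
    callback: { progress in
        // Stop decoding if confidence collapses mid-stream.
        if let avgLogprob = progress.avgLogprob, avgLogprob < -1.5 {
            return false
        }
        return nil  // continue transcribing
    }
)
```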

Advanced Features

Clip Timestamps

Transcribe specific time segments:
var options = DecodingOptions(
    clipTimestamps: [0.0, 30.0, 60.0]  // Transcribe 0-30s and 60s onwards
)

let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)

Prompt Tokens

Provide context to improve accuracy:
let promptTokens = whisperKit.tokenizer?.encode(text: "Welcome to the conference")

var options = DecodingOptions(
    promptTokens: promptTokens
)
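
Prompt tokens flow through DecodingOptions like any other setting. A sketch combining the encode step above with a transcribe call; the optional binding is needed because the tokenizer is only available after models load:

```swift
if let promptTokens = whisperKit.tokenizer?.encode(text: "Welcome to the conference") {
    let options = DecodingOptions(promptTokens: promptTokens)
    let results = try await whisperKit.transcribe(
        audioPath: "audio.wav",
        decodeOptions: options
    )
}
```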

Temperature Fallback

Automatically retry with higher temperature on failure:
var options = DecodingOptions(
    temperature: 0.0,
    temperatureIncrementOnFallback: 0.2,  // Increase by 0.2 each retry
    temperatureFallbackCount: 5            // Try up to 5 times
)
See DecodingOptions

Best Practices

Audio Format

WhisperKit expects 16kHz mono audio. Use AudioProcessor.loadAudio() to convert formats automatically.
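
A sketch of the conversion step; the exact loadAudio signature here is an assumption, so check the AudioProcessor API in your WhisperKit version:

```swift
// Assumed signature: loadAudio(fromPath:) throws -> AVAudioPCMBuffer (verify in your version).
// The loader resamples to 16kHz mono, ready to feed transcribe(audioArray:).
let buffer = try AudioProcessor.loadAudio(fromPath: "path/to/audio.m4a")
```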

Model Selection

Use smaller models (tiny, base) for real-time needs and larger models (medium, large) for accuracy.
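
A sketch of selecting a model at initialization; the model-name parameter is an assumption (the exact initializer varies by WhisperKit version), and the names shown are examples:

```swift
// Model parameter is an assumption; check your WhisperKit version's initializer.
let fastPipe = try await WhisperKit(model: "base.en")       // lower latency, real-time use
let accuratePipe = try await WhisperKit(model: "large-v3")  // higher accuracy, slower
```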

Error Handling

Use transcribeWithResults() for robust error handling when processing multiple files.

Memory Management

Unload models when not in use: await whisperKit.unloadModels()

Next Steps

Streaming

Learn real-time transcription

Voice Activity Detection

Optimize with VAD
