Audio Transcription

WhisperKit provides flexible APIs for transcribing audio from files, arrays, or live input. All transcription methods return TranscriptionResult objects containing the recognized text and detailed metadata.

Basic Transcription

Transcribe from File Path

The simplest way to transcribe audio:
let whisperKit = try await WhisperKit()

// Transcribe a single audio file
let results: [TranscriptionResult] = try await whisperKit.transcribe(
    audioPath: "path/to/audio.wav"
)

for result in results {
    print("Text: \(result.text)")
    print("Language: \(result.language)")
}
See WhisperKit.swift:840-872

Transcribe from Audio Array

For pre-loaded audio samples:
// Audio must be 16kHz mono float array
let audioArray: [Float] = loadAudioSamples()

let results = try await whisperKit.transcribe(
    audioArray: audioArray
)
See WhisperKit.swift:896-960

Decoding Options

Customize transcription behavior with DecodingOptions:
var options = DecodingOptions(
    verbose: true,
    task: .transcribe,           // or .translate for English translation
    language: "en",              // Specify language or nil for auto-detect
    temperature: 0.0,            // 0.0 for greedy, >0 for sampling
    wordTimestamps: true,        // Enable word-level timestamps
    skipSpecialTokens: true      // Remove special tokens from output
)

let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)
See DecodingOptions

Key DecodingOptions Parameters

  • task (DecodingTask, default: .transcribe) - .transcribe keeps the audio's original language; .translate translates to English.
  • language (String?, default: nil) - Language code (e.g., "en", "es", "fr"). If nil, the language is auto-detected for multilingual models.
  • temperature (Float, default: 0.0) - Sampling temperature. 0.0 uses greedy decoding (deterministic); higher values increase randomness.
  • wordTimestamps (Bool, default: false) - Enable word-level timestamps in the output.
  • withoutTimestamps (Bool, default: false) - Disable all timestamps (faster decoding).
  • compressionRatioThreshold (Float?, default: 2.4) - If the text compression ratio exceeds this value, the window is marked as failed (indicates repetition or hallucination).
  • logProbThreshold (Float?, default: -1.0) - If the average log probability falls below this value, the window is marked as failed (low confidence).
  • concurrentWorkerCount (Int, default: 16 on macOS, 4 on iOS) - Number of concurrent workers when transcribing multiple audio files in parallel.
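
The two failure thresholds are often tuned together for noisy or long-form audio. A minimal sketch using only the parameters listed above (the specific values are illustrative, not recommendations):

```swift
// Stricter quality gates than the defaults (2.4 and -1.0):
// fail windows earlier when output looks repetitive or low-confidence.
var strictOptions = DecodingOptions(
    temperature: 0.0,
    compressionRatioThreshold: 2.0,  // catch repetition loops sooner
    logProbThreshold: -0.7           // catch low-confidence output sooner
)
```

Failed windows trigger the temperature fallback behavior described below.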

Batch Transcription

Transcribe Multiple Files

let audioPaths = [
    "audio1.wav",
    "audio2.wav",
    "audio3.wav"
]

let results = await whisperKit.transcribe(
    audioPaths: audioPaths,
    decodeOptions: options
)

// Results is [[TranscriptionResult]?] - one array per file
for (index, result) in results.enumerated() {
    if let transcriptions = result {
        print("File \(index): \(transcriptions.first?.text ?? "")")
    }
}
See WhisperKit.swift:599-673

Using Results for Error Handling

let results = await whisperKit.transcribeWithResults(
    audioPaths: audioPaths
)

for (index, result) in results.enumerated() {
    switch result {
    case .success(let transcriptions):
        print("Success: \(transcriptions.first?.text ?? "")")
    case .failure(let error):
        print("Error transcribing file \(index): \(error)")
    }
}
See WhisperKit.swift:624-673

TranscriptionResult Structure

Each transcription returns a TranscriptionResult with:
let result: TranscriptionResult = results.first!

// Full transcribed text
print(result.text)

// Detected language
print(result.language)

// Individual segments with timestamps
for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
    
    // Word-level timestamps (if enabled)
    if let words = segment.words {
        for word in words {
            print("  \(word.word): \(word.start)s - \(word.end)s")
        }
    }
}

// Performance timings
result.logTimings()
See TranscriptionResult

TranscriptionSegment Properties

  • id: Int - Segment identifier
  • seek: Int - Seek position in audio samples
  • start: Float - Start time in seconds
  • end: Float - End time in seconds
  • text: String - Transcribed text
  • tokens: [Int] - Token IDs
  • words: [WordTiming]? - Word-level timestamps (if enabled)
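
These properties are enough to emit simple subtitle-style output. A minimal sketch, continuing from the `result` value in the example above (the timestamp format is illustrative):

```swift
// Print each segment as "[MM:SS.mmm --> MM:SS.mmm] text".
func formatTime(_ seconds: Float) -> String {
    let minutes = Int(seconds) / 60
    let secs = Double(seconds) - Double(minutes * 60)
    return String(format: "%02d:%06.3f", minutes, secs)
}

for segment in result.segments {
    print("[\(formatTime(segment.start)) --> \(formatTime(segment.end))] \(segment.text)")
}
```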

Language Detection

Detect the language of an audio file:
let (language, probabilities) = try await whisperKit.detectLanguage(
    audioPath: "audio.wav"
)

print("Detected language: \(language)")
for (lang, prob) in probabilities.sorted(by: { $0.value > $1.value }).prefix(5) {
    print("  \(lang): \(prob)")
}
See WhisperKit.swift:533-593
Language detection only works with multilingual models. English-only models (tiny.en, base.en, etc.) will throw an error.
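
Because English-only models throw here, it can be useful to wrap detection in do/catch and fall back to a fixed language. A hedged sketch using the API shown above:

```swift
// Fall back to English if detection is unsupported (e.g., *.en models).
var detectedLanguage = "en"
do {
    let (language, _) = try await whisperKit.detectLanguage(audioPath: "audio.wav")
    detectedLanguage = language
} catch {
    print("Language detection unavailable, defaulting to en: \(error)")
}

let options = DecodingOptions(language: detectedLanguage)
```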

Progress Tracking

Monitor transcription progress with a callback:
let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options,
    callback: { progress in
        print("Window \(progress.windowId): \(progress.text)")
        print("Tokens: \(progress.tokens.count)")
        
        // Return false to cancel early
        return shouldContinue ? nil : false
    }
)

TranscriptionProgress Properties

  • windowId: Int - Current audio window being processed
  • text: String - Current decoded text
  • tokens: [Int] - Generated tokens
  • avgLogprob: Float? - Average log probability
  • timings: TranscriptionTimings - Performance metrics
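
These properties let the callback cancel low-quality runs early. A sketch using the same return convention as the example above (nil continues, false stops); the -1.5 cutoff is illustrative:

```swift
let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options,
    callback: { progress in
        // Stop decoding if confidence collapses mid-stream.
        if let avgLogprob = progress.avgLogprob, avgLogprob < -1.5 {
            return false
        }
        return nil  // continue transcribing
    }
)
```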

Advanced Features

Clip Timestamps

Transcribe specific time segments:
var options = DecodingOptions(
    clipTimestamps: [0.0, 30.0, 60.0]  // Transcribe 0-30s and 60s onwards
)

let results = try await whisperKit.transcribe(
    audioPath: "audio.wav",
    decodeOptions: options
)

Prompt Tokens

Provide context to improve accuracy:
let promptTokens = whisperKit.tokenizer?.encode(text: "Welcome to the conference")

var options = DecodingOptions(
    promptTokens: promptTokens
)
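
Prompt tokens flow through DecodingOptions like any other setting. A sketch combining the encode step above with a transcribe call; the optional binding is needed because the tokenizer is only available after models load:

```swift
if let promptTokens = whisperKit.tokenizer?.encode(text: "Welcome to the conference") {
    let options = DecodingOptions(promptTokens: promptTokens)
    let results = try await whisperKit.transcribe(
        audioPath: "audio.wav",
        decodeOptions: options
    )
}
```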

Temperature Fallback

Automatically retry with higher temperature on failure:
var options = DecodingOptions(
    temperature: 0.0,
    temperatureIncrementOnFallback: 0.2,  // Increase by 0.2 each retry
    temperatureFallbackCount: 5            // Try up to 5 times
)
See DecodingOptions

Best Practices

Audio Format

WhisperKit expects 16kHz mono audio. Use AudioProcessor.loadAudio() to convert formats automatically.
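
A sketch of the conversion step; the exact loadAudio signature here is an assumption, so check the AudioProcessor API in your WhisperKit version:

```swift
// Assumed signature: loadAudio(fromPath:) throws -> AVAudioPCMBuffer (verify in your version).
// The loader resamples to 16kHz mono, ready to feed transcribe(audioArray:).
let buffer = try AudioProcessor.loadAudio(fromPath: "path/to/audio.m4a")
```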

Model Selection

Use smaller models (tiny, base) for real-time needs and larger models (medium, large) for accuracy.
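
A sketch of selecting a model at initialization; the model-name parameter is an assumption (the exact initializer varies by WhisperKit version), and the names shown are examples:

```swift
// Model parameter is an assumption; check your WhisperKit version's initializer.
let fastPipe = try await WhisperKit(model: "base.en")       // lower latency, real-time use
let accuratePipe = try await WhisperKit(model: "large-v3")  // higher accuracy, slower
```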

Error Handling

Use transcribeWithResults() for robust error handling when processing multiple files.

Memory Management

Unload models when not in use: await whisperKit.unloadModels()

Next Steps

Streaming

Learn real-time transcription

Voice Activity Detection

Optimize with VAD
