Overview

The TranscriptionResult class represents the output of a transcription operation. It contains the transcribed text, detailed segment information, language detection results, and performance timing data.

Class Definition

open class TranscriptionResult: Codable, @unchecked Sendable

Initializer

public init(
    text: String,
    segments: [TranscriptionSegment],
    language: String,
    timings: TranscriptionTimings,
    seekTime: Float? = nil
)
Parameters

text
String
Complete transcribed text
segments
[TranscriptionSegment]
Array of transcription segments with timestamps and metadata
language
String
Detected or specified language code
timings
TranscriptionTimings
Performance timing information
seekTime
Float?
Seek time offset in seconds (for chunked audio)

Properties

text
String
The complete transcribed text, formed by concatenating the text of all segments.
segments
[TranscriptionSegment]
Array of transcription segments, each containing:
  • Text content
  • Start and end timestamps
  • Token information
  • Quality metrics (log probabilities, compression ratio)
  • Optional word-level timestamps
language
String
ISO 639-1 language code (e.g., “en” for English, “es” for Spanish) detected or specified for this transcription.
timings
TranscriptionTimings
Detailed performance metrics including:
  • Model loading time
  • Audio processing time
  • Encoding time
  • Decoding time
  • Total pipeline duration
  • Real-time factor
seekTime
Float?
Seek time offset in seconds when this result is part of a chunked transcription.
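When stitching chunked results back together, seekTime can be used to place a chunk on the full recording's timeline. A minimal sketch, assuming segment timestamps are relative to the start of their chunk (verify this against your WhisperKit version; `result` here is one chunk's TranscriptionResult):

```swift
// Offset chunk-relative segment timestamps by the chunk's seek position.
let offset = result.seekTime ?? 0
for segment in result.segments {
    let absoluteStart = segment.start + offset
    let absoluteEnd = segment.end + offset
    print("[\(absoluteStart)s - \(absoluteEnd)s] \(segment.text)")
}
```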

Computed Properties

allWords
[WordTiming]
Flat array of all word-level timings across all segments. Only populated when wordTimestamps is enabled in DecodingOptions.

Methods

logSegments()

Logs all segments with timestamps and text to the console.
public func logSegments()
Output Format:
[Segment 0] [00:00.00 --> 00:03.50] Hello, this is a test.
[Segment 1] [00:03.50 --> 00:07.20] This is the second segment.

logTimings()

Logs detailed performance timing information to the console.
public func logTimings()
Output includes:
  • Audio loading time
  • Audio processing time
  • Mel spectrogram computation time
  • Encoding time
  • Decoding time breakdown
  • Total pipeline duration
  • Tokens per second
  • Real-time factor

TranscriptionSegment

Each segment in the segments array contains detailed information:
public struct TranscriptionSegment: Hashable, Codable, Sendable

Properties

id
Int
Unique identifier for the segment
seek
Int
Seek position in the audio (in samples)
start
Float
Start timestamp in seconds
end
Float
End timestamp in seconds
text
String
Transcribed text for this segment
tokens
[Int]
Token IDs generated for this segment
tokenLogProbs
[[Int: Float]]
Log probabilities for each token
temperature
Float
Sampling temperature used for this segment
avgLogprob
Float
Average log probability of all tokens (quality indicator)
compressionRatio
Float
Text compression ratio (detects repetitive output)
noSpeechProb
Float
Probability that this segment contains no speech
words
[WordTiming]?
Optional array of word-level timings (only when wordTimestamps is enabled)
duration
Float
Computed duration of the segment (end - start)
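The per-segment fields above can be combined for simple diagnostics. As an illustrative sketch (the 0.6 noSpeechProb threshold is an arbitrary example, not a WhisperKit recommendation), total speech time and likely-silent segments can be derived like this:

```swift
// Total duration covered by transcribed segments, in seconds.
let speechSeconds = result.segments.reduce(Float(0)) { $0 + $1.duration }

// Flag segments that are likely silence or noise.
let likelySilence = result.segments.filter { $0.noSpeechProb > 0.6 }

print("Speech: \(speechSeconds)s across \(result.segments.count) segments")
print("Possible silence: \(likelySilence.count) segments")
```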

WordTiming

When word-level timestamps are enabled, each word includes:
public struct WordTiming: Hashable, Codable, Sendable

Properties

word
String
The word text
tokens
[Int]
Token IDs that comprise this word
start
Float
Start timestamp in seconds
end
Float
End timestamp in seconds
probability
Float
Confidence probability for this word
duration
Float
Computed duration (end - start)
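The probability field makes it straightforward to surface words that may need manual review. A short sketch (the 0.5 cutoff is an arbitrary example threshold; `result` is assumed to come from a transcription with wordTimestamps enabled):

```swift
// Surface low-confidence words for manual review.
for word in result.allWords where word.probability < 0.5 {
    print("Review: \"\(word.word)\" at \(word.start)s (p = \(word.probability))")
}
```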

TranscriptionTimings

Detailed performance metrics:
public struct TranscriptionTimings: Codable, Sendable

Properties

modelLoading
TimeInterval
Total time spent loading models
audioLoading
TimeInterval
Time spent loading and converting audio
audioProcessing
TimeInterval
Time spent processing audio samples
logmels
TimeInterval
Time spent computing mel spectrograms
encoding
TimeInterval
Time spent in audio encoder
decodingPredictions
TimeInterval
Time spent in text decoder predictions
decodingLoop
TimeInterval
Total time spent in decoding loop
fullPipeline
TimeInterval
Total end-to-end pipeline duration
inputAudioSeconds
Double
Duration of the input audio in seconds
tokensPerSecond
Double
Computed: tokens generated per second
realTimeFactor
Double
Computed: ratio of processing time to audio duration (< 1.0 means faster than real-time)
speedFactor
Double
Computed: inverse of real-time factor (> 1.0 means faster than real-time)

Example Usage

Basic Transcription

import WhisperKit

let whisperKit = try await WhisperKit()
let results = try await whisperKit.transcribe(
    audioPath: "/path/to/audio.wav"
)

for result in results {
    print("Language: \(result.language)")
    print("Text: \(result.text)")
    print("Segments: \(result.segments.count)")
}

Access Segments

let result = results.first!  // e.g. from the Basic Transcription example above

for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
    print("  Confidence: \(segment.avgLogprob)")
    print("  No speech prob: \(segment.noSpeechProb)")
}

Word-Level Timestamps

let options = DecodingOptions(
    wordTimestamps: true
)

let results = try await whisperKit.transcribe(
    audioPath: "/path/to/audio.wav",
    decodeOptions: options
)

for result in results {
    for word in result.allWords {
        print("\(word.word) [\(word.start)s - \(word.end)s]")
    }
}

Performance Analysis

let result = results.first!

print("Real-time factor: \(result.timings.realTimeFactor)")
if result.timings.realTimeFactor < 1.0 {
    print("Processing faster than real-time!")
}

print("Tokens per second: \(result.timings.tokensPerSecond)")
print("Total time: \(result.timings.fullPipeline)s")
print("Audio duration: \(result.timings.inputAudioSeconds)s")

// Log detailed breakdown
result.logTimings()

Quality Filtering

// Filter out low-quality segments
let highQualitySegments = result.segments.filter { segment in
    segment.avgLogprob > -0.5 &&  // Good confidence
    segment.noSpeechProb < 0.3 &&  // Low silence probability
    segment.compressionRatio < 2.0  // Not repetitive
}

for segment in highQualitySegments {
    print(segment.text)
}

Export to SRT Format

func exportToSRT(result: TranscriptionResult) -> String {
    var srt = ""
    
    for (index, segment) in result.segments.enumerated() {
        let startTime = formatTimestamp(segment.start)
        let endTime = formatTimestamp(segment.end)
        
        srt += "\(index + 1)\n"
        srt += "\(startTime) --> \(endTime)\n"
        srt += "\(segment.text.trimmingCharacters(in: .whitespaces))\n\n"
    }
    
    return srt
}

func formatTimestamp(_ seconds: Float) -> String {
    let hours = Int(seconds) / 3600
    let minutes = (Int(seconds) % 3600) / 60
    let secs = Int(seconds) % 60
    let millis = Int((seconds - Float(Int(seconds))) * 1000)
    
    return String(format: "%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
}

// Usage
let srtContent = exportToSRT(result: results.first!)
print(srtContent)

Monitor Progress

let results = try await whisperKit.transcribe(
    audioPath: "/path/to/audio.wav"
) { progress in
    print("Current text: \(progress.text)")
    print("Tokens: \(progress.tokens.count)")
    if let temp = progress.temperature {
        print("Temperature: \(temp)")
    }
    return true  // Continue
}
