Overview

The TranscriptionResult class represents the output of a transcription operation. It contains the transcribed text, detailed segment information, language detection results, and performance timing data.

Class Definition

open class TranscriptionResult: Codable, @unchecked Sendable

Initializer

public init(
    text: String,
    segments: [TranscriptionSegment],
    language: String,
    timings: TranscriptionTimings,
    seekTime: Float? = nil
)
Parameters

text
String
Complete transcribed text
segments
[TranscriptionSegment]
Array of transcription segments with timestamps and metadata
language
String
Detected or specified language code
timings
TranscriptionTimings
Performance timing information
seekTime
Float?
Seek time offset in seconds (for chunked audio)

Properties

text
String
The complete transcribed text, formed by concatenating the text of all segments.
segments
[TranscriptionSegment]
Array of transcription segments, each containing:
  • Text content
  • Start and end timestamps
  • Token information
  • Quality metrics (log probabilities, compression ratio)
  • Optional word-level timestamps
language
String
ISO 639-1 language code (e.g., “en” for English, “es” for Spanish) detected or specified for this transcription.
timings
TranscriptionTimings
Detailed performance metrics including:
  • Model loading time
  • Audio processing time
  • Encoding time
  • Decoding time
  • Total pipeline duration
  • Real-time factor
seekTime
Float?
Seek time offset in seconds when this result is part of a chunked transcription.
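When stitching chunked results back together, seekTime can be used to place a chunk on the full recording's timeline. A minimal sketch, assuming segment timestamps are relative to the start of their chunk (verify this against your WhisperKit version; `result` here is one chunk's TranscriptionResult):

```swift
// Offset chunk-relative segment timestamps by the chunk's seek position.
let offset = result.seekTime ?? 0
for segment in result.segments {
    let absoluteStart = segment.start + offset
    let absoluteEnd = segment.end + offset
    print("[\(absoluteStart)s - \(absoluteEnd)s] \(segment.text)")
}
```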

Computed Properties

allWords
[WordTiming]
Flat array of all word-level timings across all segments. Only populated when wordTimestamps is enabled in DecodingOptions.

Methods

logSegments()

Logs all segments with timestamps and text to the console.
public func logSegments()
Output Format:
[Segment 0] [00:00.00 --> 00:03.50] Hello, this is a test.
[Segment 1] [00:03.50 --> 00:07.20] This is the second segment.

logTimings()

Logs detailed performance timing information to the console.
public func logTimings()
Output includes:
  • Audio loading time
  • Audio processing time
  • Mel spectrogram computation time
  • Encoding time
  • Decoding time breakdown
  • Total pipeline duration
  • Tokens per second
  • Real-time factor

TranscriptionSegment

Each segment in the segments array contains detailed information:
public struct TranscriptionSegment: Hashable, Codable, Sendable

Properties

id
Int
Unique identifier for the segment
seek
Int
Seek position in the audio (in samples)
start
Float
Start timestamp in seconds
end
Float
End timestamp in seconds
text
String
Transcribed text for this segment
tokens
[Int]
Token IDs generated for this segment
tokenLogProbs
[[Int: Float]]
Log probabilities for each token
temperature
Float
Sampling temperature used for this segment
avgLogprob
Float
Average log probability of all tokens (quality indicator)
compressionRatio
Float
Text compression ratio (detects repetitive output)
noSpeechProb
Float
Probability that this segment contains no speech
words
[WordTiming]?
Optional array of word-level timings (only when wordTimestamps is enabled)
duration
Float
Computed duration of the segment (end - start)
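The per-segment fields above can be combined for simple diagnostics. As an illustrative sketch (the 0.6 noSpeechProb threshold is an arbitrary example, not a WhisperKit recommendation), total speech time and likely-silent segments can be derived like this:

```swift
// Total duration covered by transcribed segments, in seconds.
let speechSeconds = result.segments.reduce(Float(0)) { $0 + $1.duration }

// Flag segments that are likely silence or noise.
let likelySilence = result.segments.filter { $0.noSpeechProb > 0.6 }

print("Speech: \(speechSeconds)s across \(result.segments.count) segments")
print("Possible silence: \(likelySilence.count) segments")
```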

WordTiming

When word-level timestamps are enabled, each word includes:
public struct WordTiming: Hashable, Codable, Sendable

Properties

word
String
The word text
tokens
[Int]
Token IDs that comprise this word
start
Float
Start timestamp in seconds
end
Float
End timestamp in seconds
probability
Float
Confidence probability for this word
duration
Float
Computed duration (end - start)
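The probability field makes it straightforward to surface words that may need manual review. A short sketch (the 0.5 cutoff is an arbitrary example threshold; `result` is assumed to come from a transcription with wordTimestamps enabled):

```swift
// Surface low-confidence words for manual review.
for word in result.allWords where word.probability < 0.5 {
    print("Review: \"\(word.word)\" at \(word.start)s (p = \(word.probability))")
}
```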

TranscriptionTimings

Detailed performance metrics:
public struct TranscriptionTimings: Codable, Sendable

Properties

modelLoading
TimeInterval
Total time spent loading models
audioLoading
TimeInterval
Time spent loading and converting audio
audioProcessing
TimeInterval
Time spent processing audio samples
logmels
TimeInterval
Time spent computing mel spectrograms
encoding
TimeInterval
Time spent in audio encoder
decodingPredictions
TimeInterval
Time spent in text decoder predictions
decodingLoop
TimeInterval
Total time spent in decoding loop
fullPipeline
TimeInterval
Total end-to-end pipeline duration
inputAudioSeconds
Double
Duration of the input audio in seconds
tokensPerSecond
Double
Computed: tokens generated per second
realTimeFactor
Double
Computed: ratio of processing time to audio duration (< 1.0 means faster than real-time)
speedFactor
Double
Computed: inverse of real-time factor (> 1.0 means faster than real-time)

Example Usage

Basic Transcription

import WhisperKit

let whisperKit = try await WhisperKit()
let results = try await whisperKit.transcribe(
    audioPath: "/path/to/audio.wav"
)

for result in results {
    print("Language: \(result.language)")
    print("Text: \(result.text)")
    print("Segments: \(result.segments.count)")
}

Access Segments

let result = results.first!  // e.g. from the Basic Transcription example above

for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
    print("  Confidence: \(segment.avgLogprob)")
    print("  No speech prob: \(segment.noSpeechProb)")
}

Word-Level Timestamps

let options = DecodingOptions(
    wordTimestamps: true
)

let results = try await whisperKit.transcribe(
    audioPath: "/path/to/audio.wav",
    decodeOptions: options
)

for result in results {
    for word in result.allWords {
        print("\(word.word) [\(word.start)s - \(word.end)s]")
    }
}

Performance Analysis

let result = results.first!

print("Real-time factor: \(result.timings.realTimeFactor)")
if result.timings.realTimeFactor < 1.0 {
    print("Processing faster than real-time!")
}

print("Tokens per second: \(result.timings.tokensPerSecond)")
print("Total time: \(result.timings.fullPipeline)s")
print("Audio duration: \(result.timings.inputAudioSeconds)s")

// Log detailed breakdown
result.logTimings()

Quality Filtering

// Filter out low-quality segments
let highQualitySegments = result.segments.filter { segment in
    segment.avgLogprob > -0.5 &&  // Good confidence
    segment.noSpeechProb < 0.3 &&  // Low silence probability
    segment.compressionRatio < 2.0  // Not repetitive
}

for segment in highQualitySegments {
    print(segment.text)
}

Export to SRT Format

func exportToSRT(result: TranscriptionResult) -> String {
    var srt = ""
    
    for (index, segment) in result.segments.enumerated() {
        let startTime = formatTimestamp(segment.start)
        let endTime = formatTimestamp(segment.end)
        
        srt += "\(index + 1)\n"
        srt += "\(startTime) --> \(endTime)\n"
        srt += "\(segment.text.trimmingCharacters(in: .whitespaces))\n\n"
    }
    
    return srt
}

func formatTimestamp(_ seconds: Float) -> String {
    let hours = Int(seconds) / 3600
    let minutes = (Int(seconds) % 3600) / 60
    let secs = Int(seconds) % 60
    let millis = Int((seconds - Float(Int(seconds))) * 1000)
    
    return String(format: "%02d:%02d:%02d,%03d", hours, minutes, secs, millis)
}

// Usage
let srtContent = exportToSRT(result: results.first!)
print(srtContent)

Monitor Progress

let results = try await whisperKit.transcribe(
    audioPath: "/path/to/audio.wav"
) { progress in
    print("Current text: \(progress.text)")
    print("Tokens: \(progress.tokens.count)")
    if let temp = progress.temperature {
        print("Temperature: \(temp)")
    }
    return true  // Continue
}
