
Streaming Transcription

WhisperKit supports real-time streaming transcription through the AudioStreamTranscriber actor. This enables live transcription from the microphone with automatic voice activity detection and segment confirmation.

AudioStreamTranscriber

The AudioStreamTranscriber actor manages the complete streaming pipeline:
  • Captures live audio from the microphone
  • Detects voice activity
  • Transcribes audio in real-time
  • Manages segment confirmation and state updates
See AudioStreamTranscriber

Basic Setup

Initialize Streaming Transcriber

import WhisperKit

let whisperKit = try await WhisperKit()

let streamTranscriber = AudioStreamTranscriber(
    audioEncoder: whisperKit.audioEncoder,
    featureExtractor: whisperKit.featureExtractor,
    segmentSeeker: whisperKit.segmentSeeker,
    textDecoder: whisperKit.textDecoder,
    tokenizer: whisperKit.tokenizer!,
    audioProcessor: whisperKit.audioProcessor,
    decodingOptions: DecodingOptions(),
    stateChangeCallback: { oldState, newState in
        print("Text: \(newState.currentText)")
        print("Confirmed segments: \(newState.confirmedSegments.count)")
    }
)
See AudioStreamTranscriber.init

Start and Stop Streaming

// Start transcription
try await streamTranscriber.startStreamTranscription()

// Transcription runs continuously...

// Stop transcription
await streamTranscriber.stopStreamTranscription()
See AudioStreamTranscriber.swift:73-93

State Management

The AudioStreamTranscriber.State tracks the current transcription state:
public struct State {
    var isRecording: Bool
    var currentFallbacks: Int
    var lastBufferSize: Int
    var lastConfirmedSegmentEndSeconds: Float
    var bufferEnergy: [Float]
    var currentText: String
    var confirmedSegments: [TranscriptionSegment]
    var unconfirmedSegments: [TranscriptionSegment]
    var unconfirmedText: [String]
}
See AudioStreamTranscriber.State

State Properties

isRecording
Bool
Whether audio is currently being recorded and transcribed.
currentText
String
The most recent transcription text (may be unconfirmed).
confirmedSegments
[TranscriptionSegment]
Segments that have been confirmed and are unlikely to change.
unconfirmedSegments
[TranscriptionSegment]
Segments that may still be refined as more audio is processed.
bufferEnergy
[Float]
Audio energy levels for voice activity detection.
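As an illustrative sketch (the helper below is not part of WhisperKit), the confirmed and unconfirmed segment texts can be stitched together for display. Whisper segment text typically carries its own leading space, so joining without a separator preserves spacing:

```swift
// Illustrative helper (not a WhisperKit API): build display strings from
// the segment texts carried by AudioStreamTranscriber.State. Whisper
// segment text usually includes its own leading space, so joining
// without a separator preserves word spacing.
func displayStrings(confirmed: [String], unconfirmed: [String]) -> (stable: String, pending: String) {
    (stable: confirmed.joined(), pending: unconfirmed.joined())
}

let out = displayStrings(
    confirmed: [" Hello", " world."],
    unconfirmed: [" This is", " pending."]
)
// out.stable  == " Hello world."
// out.pending == " This is pending."
```

In a real app, the inputs would come from `state.confirmedSegments.map { $0.text }` and `state.unconfirmedSegments.map { $0.text }`, as shown in the callback example later in this page.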

Configuration Options

Customize the streaming behavior with initialization parameters:
let streamTranscriber = AudioStreamTranscriber(
    audioEncoder: whisperKit.audioEncoder,
    featureExtractor: whisperKit.featureExtractor,
    segmentSeeker: whisperKit.segmentSeeker,
    textDecoder: whisperKit.textDecoder,
    tokenizer: whisperKit.tokenizer!,
    audioProcessor: whisperKit.audioProcessor,
    decodingOptions: DecodingOptions(
        language: "en",
        wordTimestamps: true
    ),
    requiredSegmentsForConfirmation: 2,  // Segments needed before confirming
    silenceThreshold: 0.3,               // VAD silence threshold
    compressionCheckWindow: 60,          // Token window for hallucination check
    useVAD: true,                        // Enable voice activity detection
    stateChangeCallback: { oldState, newState in
        handleStateChange(newState)
    }
)
See AudioStreamTranscriber.init

Configuration Parameters

requiredSegmentsForConfirmation
Int, default: 2
Number of segments that must be decoded before earlier segments are confirmed. Higher values provide more stability but increase latency.
silenceThreshold
Float, default: 0.3
Energy threshold for voice activity detection. Lower values are more sensitive to quiet speech.
compressionCheckWindow
Int, default: 60
Number of tokens to check for repetition/hallucination. Helps detect when the model is producing gibberish.
useVAD
Bool, default: true
Enable voice activity detection to skip silent segments. Improves performance and accuracy.
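These parameters trade latency against stability. As an illustrative starting point (the values below are not library recommendations), a low-latency setup confirms segments sooner at the cost of more text revisions:

```swift
// Illustrative low-latency tuning; the values shown are starting points,
// not library recommendations. Remaining parameters are as in the
// initialization example above; updateUI is a hypothetical app function.
let lowLatencyTranscriber = AudioStreamTranscriber(
    // ... model components as shown earlier
    decodingOptions: DecodingOptions(language: "en"),
    requiredSegmentsForConfirmation: 1,  // confirm sooner; text may still be revised
    silenceThreshold: 0.2,               // more sensitive to quiet speech
    useVAD: true,
    stateChangeCallback: { _, newState in
        updateUI(newState)
    }
)
```

For dictation-style apps where accuracy of the final transcript matters more than responsiveness, raising `requiredSegmentsForConfirmation` to 3 or 4 is the opposite trade.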

State Change Callback

The callback receives both old and new states for comparison:
let transcriber = AudioStreamTranscriber(
    // ... other parameters
    stateChangeCallback: { oldState, newState in
        // Check if new segments were confirmed
        if newState.confirmedSegments.count > oldState.confirmedSegments.count {
            let newSegments = newState.confirmedSegments.suffix(
                newState.confirmedSegments.count - oldState.confirmedSegments.count
            )
            
            for segment in newSegments {
                print("Confirmed: \(segment.text)")
                saveToDatabase(segment)
            }
        }
        
        // Update UI with current text (confirmed + unconfirmed)
        let fullText = newState.confirmedSegments.map { $0.text }.joined() +
                      newState.unconfirmedSegments.map { $0.text }.joined()
        updateUI(fullText)
    }
)
See AudioStreamTranscriberCallback

Segment Confirmation

The transcriber uses a sliding window approach:
  1. Audio is continuously buffered and transcribed
  2. New segments are added to unconfirmedSegments
  3. When segment count exceeds requiredSegmentsForConfirmation, earlier segments move to confirmedSegments
  4. Confirmed segments are unlikely to change as more audio is processed
// Example with requiredSegmentsForConfirmation = 2
// 
// Initial state: []
// After 1st transcription: unconfirmed = [seg1]
// After 2nd transcription: unconfirmed = [seg1, seg2]
// After 3rd transcription: confirmed = [seg1], unconfirmed = [seg2, seg3]
// After 4th transcription: confirmed = [seg1, seg2], unconfirmed = [seg3, seg4]
See AudioStreamTranscriber.transcribeCurrentBuffer

Voice Activity Detection

When useVAD is enabled, the transcriber skips silent segments:
// VAD checks relative energy levels
let voiceDetected = AudioProcessor.isVoiceDetected(
    in: audioProcessor.relativeEnergy,
    nextBufferInSeconds: nextBufferSeconds,
    silenceThreshold: silenceThreshold  // 0.3 by default
)

if !voiceDetected {
    // Skip transcription for this buffer
    return
}
See AudioStreamTranscriber.transcribeCurrentBuffer

Benefits of VAD

  • Reduces unnecessary computation during silence
  • Improves transcription accuracy by avoiding false positives
  • Lowers battery consumption
  • Reduces hallucinations from background noise

Early Stopping

The transcriber implements early stopping to prevent hallucinations:
private static func shouldStopEarly(
    progress: TranscriptionProgress,
    options: DecodingOptions,
    compressionCheckWindow: Int
) -> Bool? {
    // Check for high compression ratio (repetition)
    let currentTokens = progress.tokens
    if currentTokens.count > compressionCheckWindow {
        let checkTokens = Array(currentTokens.suffix(compressionCheckWindow))
        let compressionRatio = TextUtilities.compressionRatio(of: checkTokens)
        if compressionRatio > options.compressionRatioThreshold ?? 0.0 {
            return false  // Stop early: likely repetition
        }
    }
    
    // Check for low average log probability (low confidence)
    if let avgLogprob = progress.avgLogprob,
       let threshold = options.logProbThreshold,
       avgLogprob < threshold {
        return false  // Stop early: low confidence
    }
    
    return nil  // Continue decoding
}
See AudioStreamTranscriber.shouldStopEarly
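The intuition behind the compression-ratio check can be shown with a self-contained sketch. This is not WhisperKit's implementation (which compresses the token bytes with zlib); a run-length proxy stands in for a real compressor, but the principle is the same: repetitive token streams are highly compressible, so a large original-size/compressed-size ratio signals a looping model.

```swift
// Intuition sketch, not WhisperKit's implementation: a run-length proxy
// for compressibility. Repetitive token streams collapse into few runs,
// so the count/runs ratio grows when the model is looping.
func repetitionRatio(of tokens: [Int]) -> Float {
    guard !tokens.isEmpty else { return 0 }
    var runs = 1
    for (prev, next) in zip(tokens, tokens.dropFirst()) where prev != next {
        runs += 1
    }
    return Float(tokens.count) / Float(runs)
}

repetitionRatio(of: [1, 2, 3, 4, 5, 6])  // 1.0 — varied tokens, no repetition
repetitionRatio(of: [7, 7, 7, 7, 7, 7])  // 6.0 — a looping model
```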

Complete Example

import SwiftUI
import WhisperKit

class StreamingTranscriptionViewModel: ObservableObject {
    @Published var currentText = ""
    @Published var confirmedText = ""
    @Published var isRecording = false
    
    private var whisperKit: WhisperKit?
    private var streamTranscriber: AudioStreamTranscriber?
    
    func setup() async {
        do {
            whisperKit = try await WhisperKit()
            
            streamTranscriber = AudioStreamTranscriber(
                audioEncoder: whisperKit!.audioEncoder,
                featureExtractor: whisperKit!.featureExtractor,
                segmentSeeker: whisperKit!.segmentSeeker,
                textDecoder: whisperKit!.textDecoder,
                tokenizer: whisperKit!.tokenizer!,
                audioProcessor: whisperKit!.audioProcessor,
                decodingOptions: DecodingOptions(
                    language: "en",
                    task: .transcribe,
                    wordTimestamps: true
                ),
                requiredSegmentsForConfirmation: 3,
                useVAD: true,
                stateChangeCallback: { [weak self] _, newState in
                    Task { @MainActor in
                        self?.isRecording = newState.isRecording
                        self?.currentText = newState.currentText
                        self?.confirmedText = newState.confirmedSegments
                            .map { $0.text }
                            .joined()
                    }
                }
            )
        } catch {
            print("Failed to initialize: \(error)")
        }
    }
    
    func startRecording() async {
        do {
            try await streamTranscriber?.startStreamTranscription()
        } catch {
            print("Failed to start: \(error)")
        }
    }
    
    func stopRecording() async {
        await streamTranscriber?.stopStreamTranscription()
    }
}

struct StreamingView: View {
    @StateObject private var viewModel = StreamingTranscriptionViewModel()
    
    var body: some View {
        VStack {
            Text("Confirmed:")
                .font(.caption)
            Text(viewModel.confirmedText)
                .padding()
                .background(Color.green.opacity(0.1))
            
            Text("Current:")
                .font(.caption)
            Text(viewModel.currentText)
                .padding()
                .background(Color.yellow.opacity(0.1))
            
            Button(viewModel.isRecording ? "Stop" : "Start") {
                Task {
                    if viewModel.isRecording {
                        await viewModel.stopRecording()
                    } else {
                        await viewModel.startRecording()
                    }
                }
            }
            .buttonStyle(.borderedProminent)
        }
        .task {
            await viewModel.setup()
        }
    }
}

Permissions

Streaming transcription requires microphone access:
<!-- Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>We need microphone access to transcribe your speech</string>
The transcriber automatically requests permission:
guard await AudioProcessor.requestRecordPermission() else {
    print("Microphone access denied")
    return
}
See AudioStreamTranscriber.startStreamTranscription

Performance Considerations

Model Size

Use smaller models (tiny, base) for real-time streaming. Larger models may not keep up with live audio.
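For example, a small English-only model can be selected by name at initialization (the exact initializer signature may vary between WhisperKit versions):

```swift
// Select a small model for real-time streaming. Model names match the
// variants in WhisperKit's model repository; the `model:` parameter shown
// here may differ across WhisperKit versions.
let whisperKit = try await WhisperKit(model: "tiny.en")
```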

Buffer Management

The transcriber maintains an audio buffer. Long recordings consume more memory.

VAD Optimization

Enable VAD to skip silent portions and reduce computation.

Confirmation Latency

Higher requiredSegmentsForConfirmation improves stability but increases latency before segments are confirmed.

Next Steps

Voice Activity Detection

Deep dive into VAD configuration

Configuration

Advanced configuration options
