
Streaming Transcription

WhisperKit supports real-time streaming transcription through the AudioStreamTranscriber actor. This enables live transcription from the microphone with automatic voice activity detection and segment confirmation.

AudioStreamTranscriber

The AudioStreamTranscriber actor manages the complete streaming pipeline:
  • Captures live audio from the microphone
  • Detects voice activity
  • Transcribes audio in real-time
  • Manages segment confirmation and state updates
See AudioStreamTranscriber

Basic Setup

Initialize Streaming Transcriber

import WhisperKit

let whisperKit = try await WhisperKit()

let streamTranscriber = AudioStreamTranscriber(
    audioEncoder: whisperKit.audioEncoder,
    featureExtractor: whisperKit.featureExtractor,
    segmentSeeker: whisperKit.segmentSeeker,
    textDecoder: whisperKit.textDecoder,
    tokenizer: whisperKit.tokenizer!,
    audioProcessor: whisperKit.audioProcessor,
    decodingOptions: DecodingOptions(),
    stateChangeCallback: { oldState, newState in
        print("Text: \(newState.currentText)")
        print("Confirmed segments: \(newState.confirmedSegments.count)")
    }
)
See AudioStreamTranscriber.init

Start and Stop Streaming

// Start transcription
try await streamTranscriber.startStreamTranscription()

// Transcription runs continuously...

// Stop transcription
await streamTranscriber.stopStreamTranscription()
See AudioStreamTranscriber.swift:73-93

State Management

The AudioStreamTranscriber.State tracks the current transcription state:
public struct State {
    var isRecording: Bool
    var currentFallbacks: Int
    var lastBufferSize: Int
    var lastConfirmedSegmentEndSeconds: Float
    var bufferEnergy: [Float]
    var currentText: String
    var confirmedSegments: [TranscriptionSegment]
    var unconfirmedSegments: [TranscriptionSegment]
    var unconfirmedText: [String]
}
See AudioStreamTranscriber.State

State Properties

isRecording
Bool
Whether audio is currently being recorded and transcribed.
currentText
String
The most recent transcription text (may be unconfirmed).
confirmedSegments
[TranscriptionSegment]
Segments that have been confirmed and are unlikely to change.
unconfirmedSegments
[TranscriptionSegment]
Segments that may still be refined as more audio is processed.
bufferEnergy
[Float]
Audio energy levels for voice activity detection.
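As an illustrative sketch (the helper below is not part of WhisperKit), the confirmed and unconfirmed segment texts can be stitched together for display. Whisper segment text typically carries its own leading space, so joining without a separator preserves spacing:

```swift
// Illustrative helper (not a WhisperKit API): build display strings from
// the segment texts carried by AudioStreamTranscriber.State. Whisper
// segment text usually includes its own leading space, so joining
// without a separator preserves word spacing.
func displayStrings(confirmed: [String], unconfirmed: [String]) -> (stable: String, pending: String) {
    (stable: confirmed.joined(), pending: unconfirmed.joined())
}

let out = displayStrings(
    confirmed: [" Hello", " world."],
    unconfirmed: [" This is", " pending."]
)
// out.stable  == " Hello world."
// out.pending == " This is pending."
```

In a real app, the inputs would come from `state.confirmedSegments.map { $0.text }` and `state.unconfirmedSegments.map { $0.text }`, as shown in the callback example later in this page.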

Configuration Options

Customize the streaming behavior with initialization parameters:
let streamTranscriber = AudioStreamTranscriber(
    audioEncoder: whisperKit.audioEncoder,
    featureExtractor: whisperKit.featureExtractor,
    segmentSeeker: whisperKit.segmentSeeker,
    textDecoder: whisperKit.textDecoder,
    tokenizer: whisperKit.tokenizer!,
    audioProcessor: whisperKit.audioProcessor,
    decodingOptions: DecodingOptions(
        language: "en",
        wordTimestamps: true
    ),
    requiredSegmentsForConfirmation: 2,  // Segments needed before confirming
    silenceThreshold: 0.3,               // VAD silence threshold
    compressionCheckWindow: 60,          // Token window for hallucination check
    useVAD: true,                        // Enable voice activity detection
    stateChangeCallback: { oldState, newState in
        handleStateChange(newState)
    }
)
See AudioStreamTranscriber.init

Configuration Parameters

requiredSegmentsForConfirmation
Int, default: 2
Number of segments that must be decoded before earlier segments are confirmed. Higher values provide more stability but increase latency.
silenceThreshold
Float, default: 0.3
Energy threshold for voice activity detection. Lower values are more sensitive to quiet speech.
compressionCheckWindow
Int, default: 60
Number of tokens to check for repetition/hallucination. Helps detect when the model is producing gibberish.
useVAD
Bool, default: true
Enable voice activity detection to skip silent segments. Improves performance and accuracy.
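These parameters trade latency against stability. As an illustrative starting point (the values below are not library recommendations), a low-latency setup confirms segments sooner at the cost of more text revisions:

```swift
// Illustrative low-latency tuning; the values shown are starting points,
// not library recommendations. Remaining parameters are as in the
// initialization example above; updateUI is a hypothetical app function.
let lowLatencyTranscriber = AudioStreamTranscriber(
    // ... model components as shown earlier
    decodingOptions: DecodingOptions(language: "en"),
    requiredSegmentsForConfirmation: 1,  // confirm sooner; text may still be revised
    silenceThreshold: 0.2,               // more sensitive to quiet speech
    useVAD: true,
    stateChangeCallback: { _, newState in
        updateUI(newState)
    }
)
```

For dictation-style apps where accuracy of the final transcript matters more than responsiveness, raising `requiredSegmentsForConfirmation` to 3 or 4 is the opposite trade.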

State Change Callback

The callback receives both old and new states for comparison:
let transcriber = AudioStreamTranscriber(
    // ... other parameters
    stateChangeCallback: { oldState, newState in
        // Check if new segments were confirmed
        if newState.confirmedSegments.count > oldState.confirmedSegments.count {
            let newSegments = newState.confirmedSegments.suffix(
                newState.confirmedSegments.count - oldState.confirmedSegments.count
            )
            
            for segment in newSegments {
                print("Confirmed: \(segment.text)")
                saveToDatabase(segment)
            }
        }
        
        // Update UI with current text (confirmed + unconfirmed)
        let fullText = newState.confirmedSegments.map { $0.text }.joined() +
                      newState.unconfirmedSegments.map { $0.text }.joined()
        updateUI(fullText)
    }
)
See AudioStreamTranscriberCallback

Segment Confirmation

The transcriber uses a sliding window approach:
  1. Audio is continuously buffered and transcribed
  2. New segments are added to unconfirmedSegments
  3. When segment count exceeds requiredSegmentsForConfirmation, earlier segments move to confirmedSegments
  4. Confirmed segments are unlikely to change as more audio is processed
// Example with requiredSegmentsForConfirmation = 2
// 
// Initial state: []
// After 1st transcription: unconfirmed = [seg1]
// After 2nd transcription: unconfirmed = [seg1, seg2]
// After 3rd transcription: confirmed = [seg1], unconfirmed = [seg2, seg3]
// After 4th transcription: confirmed = [seg1, seg2], unconfirmed = [seg3, seg4]
See AudioStreamTranscriber.transcribeCurrentBuffer

Voice Activity Detection

When useVAD is enabled, the transcriber skips silent segments:
// VAD checks relative energy levels
let voiceDetected = AudioProcessor.isVoiceDetected(
    in: audioProcessor.relativeEnergy,
    nextBufferInSeconds: nextBufferSeconds,
    silenceThreshold: silenceThreshold  // 0.3 by default
)

if !voiceDetected {
    // Skip transcription for this buffer
    return
}
See AudioStreamTranscriber.transcribeCurrentBuffer

Benefits of VAD

  • Reduces unnecessary computation during silence
  • Improves transcription accuracy by avoiding false positives
  • Lowers battery consumption
  • Reduces hallucinations from background noise

Early Stopping

The transcriber implements early stopping to prevent hallucinations:
private static func shouldStopEarly(
    progress: TranscriptionProgress,
    options: DecodingOptions,
    compressionCheckWindow: Int
) -> Bool? {
    // Check for high compression ratio (repetition)
    let currentTokens = progress.tokens
    if currentTokens.count > compressionCheckWindow {
        let checkTokens = Array(currentTokens.suffix(compressionCheckWindow))
        let compressionRatio = TextUtilities.compressionRatio(of: checkTokens)
        if compressionRatio > options.compressionRatioThreshold ?? 0.0 {
            return false  // Stop early: likely repetition
        }
    }
    
    // Check for low average log probability (low confidence)
    if let avgLogprob = progress.avgLogprob,
       let threshold = options.logProbThreshold,
       avgLogprob < threshold {
        return false  // Stop early: low confidence
    }
    
    return nil  // Continue decoding
}
See AudioStreamTranscriber.shouldStopEarly
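The intuition behind the compression-ratio check can be shown with a self-contained sketch. This is not WhisperKit's implementation (which compresses the token bytes with zlib); a run-length proxy stands in for a real compressor, but the principle is the same: repetitive token streams are highly compressible, so a large original-size/compressed-size ratio signals a looping model.

```swift
// Intuition sketch, not WhisperKit's implementation: a run-length proxy
// for compressibility. Repetitive token streams collapse into few runs,
// so the count/runs ratio grows when the model is looping.
func repetitionRatio(of tokens: [Int]) -> Float {
    guard !tokens.isEmpty else { return 0 }
    var runs = 1
    for (prev, next) in zip(tokens, tokens.dropFirst()) where prev != next {
        runs += 1
    }
    return Float(tokens.count) / Float(runs)
}

repetitionRatio(of: [1, 2, 3, 4, 5, 6])  // 1.0 — varied tokens, no repetition
repetitionRatio(of: [7, 7, 7, 7, 7, 7])  // 6.0 — a looping model
```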

Complete Example

import SwiftUI
import WhisperKit

class StreamingTranscriptionViewModel: ObservableObject {
    @Published var currentText = ""
    @Published var confirmedText = ""
    @Published var isRecording = false
    
    private var whisperKit: WhisperKit?
    private var streamTranscriber: AudioStreamTranscriber?
    
    func setup() async {
        do {
            whisperKit = try await WhisperKit()
            
            streamTranscriber = AudioStreamTranscriber(
                audioEncoder: whisperKit!.audioEncoder,
                featureExtractor: whisperKit!.featureExtractor,
                segmentSeeker: whisperKit!.segmentSeeker,
                textDecoder: whisperKit!.textDecoder,
                tokenizer: whisperKit!.tokenizer!,
                audioProcessor: whisperKit!.audioProcessor,
                decodingOptions: DecodingOptions(
                    language: "en",
                    task: .transcribe,
                    wordTimestamps: true
                ),
                requiredSegmentsForConfirmation: 3,
                useVAD: true,
                stateChangeCallback: { [weak self] _, newState in
                    Task { @MainActor in
                        self?.isRecording = newState.isRecording
                        self?.currentText = newState.currentText
                        self?.confirmedText = newState.confirmedSegments
                            .map { $0.text }
                            .joined()
                    }
                }
            )
        } catch {
            print("Failed to initialize: \(error)")
        }
    }
    
    func startRecording() async {
        do {
            try await streamTranscriber?.startStreamTranscription()
        } catch {
            print("Failed to start: \(error)")
        }
    }
    
    func stopRecording() async {
        await streamTranscriber?.stopStreamTranscription()
    }
}

struct StreamingView: View {
    @StateObject private var viewModel = StreamingTranscriptionViewModel()
    
    var body: some View {
        VStack {
            Text("Confirmed:")
                .font(.caption)
            Text(viewModel.confirmedText)
                .padding()
                .background(Color.green.opacity(0.1))
            
            Text("Current:")
                .font(.caption)
            Text(viewModel.currentText)
                .padding()
                .background(Color.yellow.opacity(0.1))
            
            Button(viewModel.isRecording ? "Stop" : "Start") {
                Task {
                    if viewModel.isRecording {
                        await viewModel.stopRecording()
                    } else {
                        await viewModel.startRecording()
                    }
                }
            }
            .buttonStyle(.borderedProminent)
        }
        .task {
            await viewModel.setup()
        }
    }
}

Permissions

Streaming transcription requires microphone access:
<!-- Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>We need microphone access to transcribe your speech</string>
The transcriber automatically requests permission:
guard await AudioProcessor.requestRecordPermission() else {
    print("Microphone access denied")
    return
}
See AudioStreamTranscriber.startStreamTranscription

Performance Considerations

Model Size

Use smaller models (tiny, base) for real-time streaming. Larger models may not keep up with live audio.
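For example, a small English-only model can be selected by name at initialization (the exact initializer signature may vary between WhisperKit versions):

```swift
// Select a small model for real-time streaming. Model names match the
// variants in WhisperKit's model repository; the `model:` parameter shown
// here may differ across WhisperKit versions.
let whisperKit = try await WhisperKit(model: "tiny.en")
```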

Buffer Management

The transcriber maintains an audio buffer. Long recordings consume more memory.

VAD Optimization

Enable VAD to skip silent portions and reduce computation.

Confirmation Latency

Higher requiredSegmentsForConfirmation improves stability but increases latency before segments are confirmed.

Next Steps

Voice Activity Detection

Deep dive into VAD configuration

Configuration

Advanced configuration options
