Streaming Transcription
WhisperKit supports real-time streaming transcription using the AudioStreamTranscriber class. This enables live transcription from the microphone with automatic voice activity detection and segment confirmation.
AudioStreamTranscriber
The AudioStreamTranscriber actor manages the complete streaming pipeline:
Captures live audio from the microphone
Detects voice activity
Transcribes audio in real-time
Manages segment confirmation and state updates
See AudioStreamTranscriber
Basic Setup
Initialize Streaming Transcriber
import WhisperKit
let whisperKit = try await WhisperKit()

let streamTranscriber = AudioStreamTranscriber(
    audioEncoder: whisperKit.audioEncoder,
    featureExtractor: whisperKit.featureExtractor,
    segmentSeeker: whisperKit.segmentSeeker,
    textDecoder: whisperKit.textDecoder,
    tokenizer: whisperKit.tokenizer!,
    audioProcessor: whisperKit.audioProcessor,
    decodingOptions: DecodingOptions(),
    stateChangeCallback: { oldState, newState in
        print("Text: \(newState.currentText)")
        print("Confirmed segments: \(newState.confirmedSegments.count)")
    }
)
See AudioStreamTranscriber.init
Start and Stop Streaming
// Start transcription
try await streamTranscriber.startStreamTranscription()

// Transcription runs continuously...

// Stop transcription
await streamTranscriber.stopStreamTranscription()
See AudioStreamTranscriber.swift:73-93
State Management
The AudioStreamTranscriber.State tracks the current transcription state:
public struct State {
    var isRecording: Bool
    var currentFallbacks: Int
    var lastBufferSize: Int
    var lastConfirmedSegmentEndSeconds: Float
    var bufferEnergy: [Float]
    var currentText: String
    var confirmedSegments: [TranscriptionSegment]
    var unconfirmedSegments: [TranscriptionSegment]
    var unconfirmedText: [String]
}
See AudioStreamTranscriber.State
State Properties
isRecording: Whether audio is currently being recorded and transcribed.
currentText: The most recent transcription text (may be unconfirmed).
confirmedSegments: Segments that have been confirmed and are unlikely to change.
unconfirmedSegments: Segments that may still be refined as more audio is processed.
bufferEnergy: Audio energy levels used for voice activity detection.
Configuration Options
Customize the streaming behavior with initialization parameters:
let streamTranscriber = AudioStreamTranscriber(
    audioEncoder: whisperKit.audioEncoder,
    featureExtractor: whisperKit.featureExtractor,
    segmentSeeker: whisperKit.segmentSeeker,
    textDecoder: whisperKit.textDecoder,
    tokenizer: whisperKit.tokenizer!,
    audioProcessor: whisperKit.audioProcessor,
    decodingOptions: DecodingOptions(
        language: "en",
        wordTimestamps: true
    ),
    requiredSegmentsForConfirmation: 2, // Segments needed before confirming
    silenceThreshold: 0.3,              // VAD silence threshold
    compressionCheckWindow: 60,         // Token window for hallucination check
    useVAD: true,                       // Enable voice activity detection
    stateChangeCallback: { oldState, newState in
        handleStateChange(newState)
    }
)
See AudioStreamTranscriber.init
Configuration Parameters
requiredSegmentsForConfirmation
Number of segments that must be decoded before earlier segments are confirmed. Higher values provide more stability but increase latency.
silenceThreshold
Energy threshold for voice activity detection. Lower values are more sensitive to quiet speech.
compressionCheckWindow
Number of tokens to check for repetition/hallucination. Helps detect when the model is producing gibberish.
useVAD
Enables voice activity detection to skip silent segments. Improves performance and accuracy.
State Change Callback
The callback receives both old and new states for comparison:
let transcriber = AudioStreamTranscriber(
    // ... other parameters
    stateChangeCallback: { oldState, newState in
        // Check if new segments were confirmed
        if newState.confirmedSegments.count > oldState.confirmedSegments.count {
            let newSegments = newState.confirmedSegments.suffix(
                newState.confirmedSegments.count - oldState.confirmedSegments.count
            )
            for segment in newSegments {
                print("Confirmed: \(segment.text)")
                saveToDatabase(segment)
            }
        }

        // Update UI with current text (confirmed + unconfirmed)
        let fullText = newState.confirmedSegments.map { $0.text }.joined() +
            newState.unconfirmedSegments.map { $0.text }.joined()
        updateUI(fullText)
    }
)
See AudioStreamTranscriberCallback
Segment Confirmation
The transcriber uses a sliding window approach:
1. Audio is continuously buffered and transcribed
2. New segments are added to unconfirmedSegments
3. When the segment count exceeds requiredSegmentsForConfirmation, earlier segments move to confirmedSegments
4. Confirmed segments are unlikely to change as more audio is processed
// Example with requiredSegmentsForConfirmation = 2
//
// Initial state: []
// After 1st transcription: unconfirmed = [seg1]
// After 2nd transcription: unconfirmed = [seg1, seg2]
// After 3rd transcription: confirmed = [seg1], unconfirmed = [seg2, seg3]
// After 4th transcription: confirmed = [seg1, seg2], unconfirmed = [seg3, seg4]
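The promotion step can be sketched in isolation. The `Segment` struct and `confirmIfReady` function below are illustrative stand-ins, not WhisperKit API; they only model the sliding-window behavior shown in the walkthrough above:

```swift
// Simplified model of the sliding-window confirmation logic.
struct Segment {
    let text: String
}

struct StreamState {
    var confirmedSegments: [Segment] = []
    var unconfirmedSegments: [Segment] = []
}

// When more than `required` segments are pending,
// the oldest ones are promoted to confirmed.
func confirmIfReady(_ state: inout StreamState, required: Int) {
    while state.unconfirmedSegments.count > required {
        state.confirmedSegments.append(state.unconfirmedSegments.removeFirst())
    }
}

var state = StreamState()
for i in 1...4 {
    state.unconfirmedSegments.append(Segment(text: "seg\(i)"))
    confirmIfReady(&state, required: 2)
}
// After 4 segments with required = 2:
// confirmed = [seg1, seg2], unconfirmed = [seg3, seg4]
```

Note how confirmation always lags behind by `required` segments, which is the source of the stability/latency trade-off described above.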
See AudioStreamTranscriber.transcribeCurrentBuffer
Voice Activity Detection
When useVAD is enabled, the transcriber skips silent segments:
// VAD checks relative energy levels
let voiceDetected = AudioProcessor.isVoiceDetected(
    in: audioProcessor.relativeEnergy,
    nextBufferInSeconds: nextBufferSeconds,
    silenceThreshold: silenceThreshold // 0.3 by default
)

if !voiceDetected {
    // Skip transcription for this buffer
    return
}
See AudioStreamTranscriber.transcribeCurrentBuffer
Benefits of VAD
Reduces unnecessary computation during silence
Improves transcription accuracy by avoiding false positives
Lowers battery consumption
Reduces hallucinations from background noise
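A minimal energy-threshold check conveys the idea behind VAD. This is an illustrative simplification, not the actual `AudioProcessor.isVoiceDetected` implementation; the function name and parameters below are assumptions for the sketch:

```swift
// Simplified energy-based voice activity check:
// a buffer counts as "voice" if any recent relative-energy
// value exceeds the silence threshold.
func voiceDetected(relativeEnergy: [Float],
                   samplesToCheck: Int,
                   silenceThreshold: Float) -> Bool {
    let recent = relativeEnergy.suffix(samplesToCheck)
    return recent.contains { $0 > silenceThreshold }
}

let speech: [Float] = [0.05, 0.10, 0.45, 0.20]
let silence: [Float] = [0.05, 0.08, 0.10, 0.06]

let speechResult = voiceDetected(relativeEnergy: speech,
                                 samplesToCheck: 4,
                                 silenceThreshold: 0.3)  // true: 0.45 > 0.3
let silenceResult = voiceDetected(relativeEnergy: silence,
                                  samplesToCheck: 4,
                                  silenceThreshold: 0.3) // false: all below 0.3
```

Lowering the threshold makes the check more sensitive to quiet speech, at the cost of more false positives from background noise.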
Early Stopping
The transcriber implements early stopping to prevent hallucinations:
private static func shouldStopEarly(
    progress: TranscriptionProgress,
    options: DecodingOptions,
    compressionCheckWindow: Int
) -> Bool? {
    let currentTokens = progress.tokens

    // Check for high compression ratio (repetition)
    if currentTokens.count > compressionCheckWindow {
        let checkTokens = currentTokens.suffix(compressionCheckWindow)
        let compressionRatio = TextUtilities.compressionRatio(of: checkTokens)
        if compressionRatio > options.compressionRatioThreshold ?? 0.0 {
            return false // Stop early
        }
    }

    // Check for low log probability (low confidence)
    if let avgLogprob = progress.avgLogprob,
       let threshold = options.logProbThreshold {
        if avgLogprob < threshold {
            return false // Stop early
        }
    }

    return nil // Continue
}
See AudioStreamTranscriber.shouldStopEarly
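The intuition behind the compression-ratio check is that repetitive token streams compress very well. The sketch below uses a naive run-length measure rather than WhisperKit's zlib-based `TextUtilities.compressionRatio`; `repetitionRatio` is a hypothetical helper, shown only to illustrate why a high ratio signals looping output:

```swift
// Naive stand-in for a compression-ratio check:
// ratio of the original length to the number of distinct
// consecutive runs. Repetitive sequences yield a high ratio.
func repetitionRatio(of tokens: [Int]) -> Double {
    guard !tokens.isEmpty else { return 0 }
    var runs = 1
    for i in 1..<tokens.count where tokens[i] != tokens[i - 1] {
        runs += 1
    }
    return Double(tokens.count) / Double(runs)
}

let varied = [1, 2, 3, 4, 5, 6]   // normal decoding
let looping = [7, 7, 7, 7, 7, 7]  // model stuck repeating a token

let variedRatio = repetitionRatio(of: varied)   // 1.0: no repetition
let loopingRatio = repetitionRatio(of: looping) // 6.0: fully repetitive
```

A decoder whose recent window scores well above the configured threshold is likely hallucinating, so the stream transcriber aborts that pass rather than emitting gibberish.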
Complete Example
import SwiftUI
import WhisperKit

class StreamingTranscriptionViewModel: ObservableObject {
    @Published var currentText = ""
    @Published var confirmedText = ""
    @Published var isRecording = false

    private var whisperKit: WhisperKit?
    private var streamTranscriber: AudioStreamTranscriber?

    func setup() async {
        do {
            whisperKit = try await WhisperKit()
            streamTranscriber = AudioStreamTranscriber(
                audioEncoder: whisperKit!.audioEncoder,
                featureExtractor: whisperKit!.featureExtractor,
                segmentSeeker: whisperKit!.segmentSeeker,
                textDecoder: whisperKit!.textDecoder,
                tokenizer: whisperKit!.tokenizer!,
                audioProcessor: whisperKit!.audioProcessor,
                decodingOptions: DecodingOptions(
                    task: .transcribe,
                    language: "en",
                    wordTimestamps: true
                ),
                requiredSegmentsForConfirmation: 3,
                useVAD: true,
                stateChangeCallback: { [weak self] _, newState in
                    Task { @MainActor in
                        self?.isRecording = newState.isRecording
                        self?.currentText = newState.currentText
                        self?.confirmedText = newState.confirmedSegments
                            .map { $0.text }
                            .joined()
                    }
                }
            )
        } catch {
            print("Failed to initialize: \(error)")
        }
    }

    func startRecording() async {
        do {
            try await streamTranscriber?.startStreamTranscription()
        } catch {
            print("Failed to start: \(error)")
        }
    }

    func stopRecording() async {
        await streamTranscriber?.stopStreamTranscription()
    }
}
struct StreamingView: View {
    @StateObject private var viewModel = StreamingTranscriptionViewModel()

    var body: some View {
        VStack {
            Text("Confirmed:")
                .font(.caption)
            Text(viewModel.confirmedText)
                .padding()
                .background(Color.green.opacity(0.1))

            Text("Current:")
                .font(.caption)
            Text(viewModel.currentText)
                .padding()
                .background(Color.yellow.opacity(0.1))

            Button(viewModel.isRecording ? "Stop" : "Start") {
                Task {
                    if viewModel.isRecording {
                        await viewModel.stopRecording()
                    } else {
                        await viewModel.startRecording()
                    }
                }
            }
            .buttonStyle(.borderedProminent)
        }
        .task {
            await viewModel.setup()
        }
    }
}
Permissions
Streaming transcription requires microphone access:
<!-- Info.plist -->
<key>NSMicrophoneUsageDescription</key>
<string>We need microphone access to transcribe your speech</string>
The transcriber automatically requests permission:
guard await AudioProcessor.requestRecordPermission() else {
    print("Microphone access denied")
    return
}
See AudioStreamTranscriber.startStreamTranscription
Performance Considerations
Model Size: Use smaller models (tiny, base) for real-time streaming; larger models may not keep up with live audio.
Buffer Management: The transcriber maintains an audio buffer, so long recordings consume more memory.
VAD Optimization: Enable VAD to skip silent portions and reduce computation.
Confirmation Latency: Higher requiredSegmentsForConfirmation improves stability but increases latency before segments are confirmed.
Next Steps
Voice Activity Detection: Deep dive into VAD configuration
Configuration: Advanced configuration options