
Overview

The VoiceActivityDetector class is an abstract base class for implementing Voice Activity Detection (VAD). VAD is used to identify segments of audio that contain human speech versus silence or background noise. WhisperKit uses VAD to intelligently chunk long audio files for more efficient transcription.

Class Definition

open class VoiceActivityDetector

Initializer

public init(
    sampleRate: Int = 16000,
    frameLengthSamples: Int,
    frameOverlapSamples: Int = 0
)
• sampleRate (Int, default: 16000): Sample rate of the audio signal in Hz
• frameLengthSamples (Int): Length of each analysis frame in samples
• frameOverlapSamples (Int, default: 0): Number of samples overlapping between consecutive frames

Properties

• sampleRate (Int): The sample rate of the audio signal in samples per second (typically 16000 Hz for WhisperKit)
• frameLengthSamples (Int): The length of each analysis frame in samples
• frameOverlapSamples (Int): The number of samples overlapping between consecutive frames; useful for catching speech at frame boundaries

Methods

voiceActivity(in:)

Analyzes audio waveform to detect voice activity. Must be implemented by subclasses.
open func voiceActivity(in waveform: [Float]) -> [Bool]
• waveform ([Float]): Array of audio samples to analyze
• Returns ([Bool]): Array where true indicates voice activity and false indicates silence; each element corresponds to one frame
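For example, a 10-second clip analyzed with 100 ms frames should yield one Bool per frame; the sketch below shows the expected result length, assuming a trailing partial frame still counts as a frame:

```swift
// Expected VAD result length: one Bool per frame.
// Assumption: a trailing partial frame is still analyzed (ceiling division).
let sampleCount = 160_000          // 10 s at 16 kHz
let frameLengthSamples = 1_600     // 100 ms frames at 16 kHz
let expectedFrames = (sampleCount + frameLengthSamples - 1) / frameLengthSamples
print(expectedFrames)  // 100
```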

voiceActivityAsync(in:)

Async version of voice activity detection.
open func voiceActivityAsync(in waveform: [Float]) async throws -> [Bool]
• waveform ([Float]): Array of audio samples to analyze
• Returns ([Bool]): Array indicating voice activity per frame

calculateActiveChunks(in:)

Identifies continuous segments of audio containing voice activity.
public func calculateActiveChunks(
    in waveform: [Float]
) -> [(startIndex: Int, endIndex: Int)]
• waveform ([Float]): Array of audio samples
• Returns ([(startIndex: Int, endIndex: Int)]): Array of tuples containing start and end sample indices for each active segment

voiceActivityIndexToAudioSampleIndex(_:)

Converts a voice activity frame index to an audio sample index.
public func voiceActivityIndexToAudioSampleIndex(_ index: Int) -> Int
• index (Int): Voice activity frame index
• Returns (Int): Corresponding audio sample index

voiceActivityIndexToSeconds(_:)

Converts a voice activity frame index to time in seconds.
public func voiceActivityIndexToSeconds(_ index: Int) -> Float
• index (Int): Voice activity frame index
• Returns (Float): Corresponding time in seconds
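For non-overlapping frames these conversions reduce to simple arithmetic; the sketch below illustrates the math with standalone helpers (these are illustrations of the mapping, not the WhisperKit API):

```swift
// Conceptual mapping: frame index -> sample index -> seconds.
// Assumption: consecutive frames are laid out frameLengthSamples apart (no overlap).
let sampleRate = 16_000
let frameLengthSamples = 1_600  // 100 ms at 16 kHz

func sampleIndex(forFrame index: Int) -> Int {
    index * frameLengthSamples
}

func seconds(forFrame index: Int) -> Float {
    Float(sampleIndex(forFrame: index)) / Float(sampleRate)
}

// Frame 5 starts 8_000 samples in, i.e. 0.5 s.
print(sampleIndex(forFrame: 5), seconds(forFrame: 5))
```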

findLongestSilence(in:)

Finds the longest continuous period of silence.
public func findLongestSilence(
    in vadResult: [Bool]
) -> (startIndex: Int, endIndex: Int)?
• vadResult ([Bool]): Voice activity detection results
• Returns ((startIndex: Int, endIndex: Int)?): Start and end indices of the longest silence, or nil if no silence is found

voiceActivityClipTimestamps(in:)

Generates timestamp pairs for active audio segments.
func voiceActivityClipTimestamps(in waveform: [Float]) -> [Float]
• Returns ([Float]): Flat array of timestamps: [start1, end1, start2, end2, …]
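For illustration, a flat timestamp array of this shape can be derived from sample-index chunks as follows (the chunk values here are hypothetical):

```swift
// Flatten (start, end) sample chunks into [s1, e1, s2, e2, ...] in seconds.
let sampleRate: Float = 16_000
let chunks = [(startIndex: 0, endIndex: 16_000),
              (startIndex: 32_000, endIndex: 48_000)]
let flat = chunks.flatMap {
    [Float($0.startIndex) / sampleRate, Float($0.endIndex) / sampleRate]
}
print(flat)  // [0.0, 1.0, 2.0, 3.0]
```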

calculateNonSilentSeekClips(in:)

Calculates seek positions for non-silent segments.
func calculateNonSilentSeekClips(
    in waveform: [Float]
) -> [(start: Int, end: Int)]
• Returns ([(start: Int, end: Int)]): Array of seek clip start/end positions in samples

calculateSeekTimestamps(in:)

Calculates timestamp pairs for seeking through active segments.
func calculateSeekTimestamps(
    in waveform: [Float]
) -> [(startTime: Float, endTime: Float)]
• Returns ([(startTime: Float, endTime: Float)]): Array of start/end timestamp pairs in seconds

Built-in Implementation: EnergyVAD

WhisperKit includes a built-in energy-based VAD implementation:
let vad = EnergyVAD(
    sampleRate: 16000,
    frameLengthSamples: 1600,  // 100ms at 16kHz
    frameOverlapSamples: 160,   // 10ms overlap
    energyThreshold: 0.022
)

How EnergyVAD Works

EnergyVAD detects voice activity by:
  1. Dividing the audio into frames
  2. Calculating the RMS energy of each frame
  3. Comparing that energy against a threshold
  4. Marking frames above the threshold as containing voice
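The steps above can be sketched as a standalone function (a simplified illustration, not WhisperKit's exact implementation):

```swift
import Foundation

// Simplified energy-based VAD: RMS energy per frame compared to a threshold.
// frameLength and threshold defaults mirror the EnergyVAD example above.
func energyVAD(_ waveform: [Float],
               frameLength: Int = 1_600,
               threshold: Float = 0.022) -> [Bool] {
    guard frameLength > 0, !waveform.isEmpty else { return [] }
    // Round up so a trailing partial frame is still analyzed
    let frameCount = (waveform.count + frameLength - 1) / frameLength
    return (0..<frameCount).map { i in
        let start = i * frameLength
        let end = min(start + frameLength, waveform.count)
        let frame = waveform[start..<end]
        // RMS energy of the frame
        let rms = sqrt(frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count))
        return rms > threshold
    }
}

// One quiet frame followed by one loud frame: expect [false, true]
let quiet = [Float](repeating: 0.001, count: 1_600)
let loud = [Float](repeating: 0.5, count: 1_600)
print(energyVAD(quiet + loud))
```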

Custom Implementation

To create a custom VAD, subclass VoiceActivityDetector and implement voiceActivity(in:):
class CustomVAD: VoiceActivityDetector {
    override func voiceActivity(in waveform: [Float]) -> [Bool] {
        // Round up so a trailing partial frame is still analyzed
        let frameCount = (waveform.count + frameLengthSamples - 1) / frameLengthSamples
        var result = [Bool]()
        
        for i in 0..<frameCount {
            let start = i * frameLengthSamples
            let end = min(start + frameLengthSamples, waveform.count)
            let frame = Array(waveform[start..<end])
            
            // Implement your VAD logic here
            let hasVoice = detectVoiceInFrame(frame)
            result.append(hasVoice)
        }
        
        return result
    }
    
    private func detectVoiceInFrame(_ frame: [Float]) -> Bool {
        // Your custom voice detection logic
        return true
    }
}

Example Usage

Basic Voice Activity Detection

import WhisperKit

// Load audio
let audioArray = try AudioProcessor.loadAudioAsFloatArray(
    fromPath: "/path/to/audio.wav"
)

// Create VAD instance
let vad = EnergyVAD(
    sampleRate: 16000,
    frameLengthSamples: 1600  // 100ms frames
)

// Detect voice activity
let vadResult = vad.voiceActivity(in: audioArray)
print("Voice activity: \(vadResult)")

Find Active Segments

let vad = EnergyVAD()
let audioArray = try AudioProcessor.loadAudioAsFloatArray(
    fromPath: "/path/to/audio.wav"
)

let activeChunks = vad.calculateActiveChunks(in: audioArray)

for (index, chunk) in activeChunks.enumerated() {
    let startTime = Float(chunk.startIndex) / 16000.0
    let endTime = Float(chunk.endIndex) / 16000.0
    let duration = endTime - startTime
    
    print("Segment \(index): \(startTime)s - \(endTime)s (\(duration)s)")
}

Use VAD with WhisperKit

// Configure WhisperKit with VAD-based chunking
let config = WhisperKitConfig(
    model: "base",
    voiceActivityDetector: EnergyVAD()
)

let whisperKit = try await WhisperKit(config)

// Transcribe with VAD chunking
let options = DecodingOptions(
    chunkingStrategy: .vad
)

let results = try await whisperKit.transcribe(
    audioPath: "/path/to/long_audio.wav",
    decodeOptions: options
)

Visualize Voice Activity

func visualizeVAD(audioArray: [Float], vad: VoiceActivityDetector) {
    let vadResult = vad.voiceActivity(in: audioArray)
    
    print("Voice Activity Visualization:")
    print("Time (s) | Activity")
    print("---------|----------")
    
    for (index, hasVoice) in vadResult.enumerated() {
        let time = vad.voiceActivityIndexToSeconds(index)
        let marker = hasVoice ? "█████████" : "         "
        print(String(format: "%6.2f   | %@", time, marker))
    }
}

let vad = EnergyVAD()
let audio = try AudioProcessor.loadAudioAsFloatArray(
    fromPath: "/path/to/audio.wav"
)

visualizeVAD(audioArray: audio, vad: vad)

Find Silence for Split Points

let vad = EnergyVAD()
let audio = try AudioProcessor.loadAudioAsFloatArray(
    fromPath: "/path/to/audio.wav"
)

let vadResult = vad.voiceActivity(in: audio)

if let silence = vad.findLongestSilence(in: vadResult) {
    let startTime = vad.voiceActivityIndexToSeconds(silence.startIndex)
    let endTime = vad.voiceActivityIndexToSeconds(silence.endIndex)
    
    print("Longest silence: \(startTime)s - \(endTime)s")
    print("Duration: \(endTime - startTime)s")
    
    // Use this to split audio at a natural break
}

Custom Energy Threshold

// More sensitive (detects quieter speech)
let sensitiveVAD = EnergyVAD(
    energyThreshold: 0.01
)

// Less sensitive (only loud speech)
let conservativeVAD = EnergyVAD(
    energyThreshold: 0.05
)

// Compare results
let audio = try AudioProcessor.loadAudioAsFloatArray(
    fromPath: "/path/to/audio.wav"
)

let sensitiveResult = sensitiveVAD.voiceActivity(in: audio)
let conservativeResult = conservativeVAD.voiceActivity(in: audio)

let sensitiveCount = sensitiveResult.filter { $0 }.count
let conservativeCount = conservativeResult.filter { $0 }.count

print("Sensitive VAD: \(sensitiveCount) active frames")
print("Conservative VAD: \(conservativeCount) active frames")

Process Live Audio with VAD

let processor = AudioProcessor()
let vad = EnergyVAD()

try processor.startRecordingLive { audioBuffer in
    // Accumulate enough samples for VAD analysis
    let minSamples = vad.frameLengthSamples * 10  // 10 frames
    
    if processor.audioSamples.count >= minSamples {
        let samples = Array(processor.audioSamples.suffix(minSamples))
        let vadResult = vad.voiceActivity(in: samples)
        
        let voiceFrames = vadResult.filter { $0 }.count
        let totalFrames = vadResult.count
        let voiceRatio = Float(voiceFrames) / Float(totalFrames)
        
        if voiceRatio > 0.5 {
            print("Speech detected (\(Int(voiceRatio * 100))% active)")
        } else {
            print("Mostly silence (\(Int(voiceRatio * 100))% active)")
        }
    }
}

Best Practices

  1. Frame Length Selection: Use 20-100ms frames (320-1600 samples at 16kHz). Shorter frames are more responsive but produce noisier results.
  2. Threshold Tuning: Adjust energy threshold based on your audio:
    • Clean recordings: 0.01-0.02
    • Noisy environments: 0.03-0.05
    • Test with representative samples
  3. Overlap for Continuity: Use frame overlap (10-20ms) to avoid missing speech at frame boundaries.
  4. Post-Processing: Apply smoothing to reduce false positives:
    // Remove isolated voice frames
    func smoothVAD(_ vad: [Bool]) -> [Bool] {
        guard vad.count > 2 else { return vad }
        var smoothed = vad
        for i in 1..<(vad.count - 1) {
            if vad[i] && !vad[i-1] && !vad[i+1] {
                smoothed[i] = false  // Remove single frame
            }
        }
        return smoothed
    }
    
  5. Chunking Strategy: When using .vad chunking strategy with WhisperKit, ensure your VAD is tuned to avoid creating too many tiny segments.
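One simple way to avoid tiny segments is to drop active chunks below a minimum duration before transcribing. A sketch (the helper and thresholds are illustrative, not part of WhisperKit):

```swift
// Filter out active chunks shorter than a minimum length in samples.
// 1_600 samples = 100 ms at 16 kHz; tune to your audio.
func filterShortChunks(_ chunks: [(startIndex: Int, endIndex: Int)],
                       minSamples: Int) -> [(startIndex: Int, endIndex: Int)] {
    chunks.filter { $0.endIndex - $0.startIndex >= minSamples }
}

// A 50 ms blip and a 2.4 s segment: only the second survives.
let chunks = [(startIndex: 0, endIndex: 800),
              (startIndex: 2_000, endIndex: 40_000)]
print(filterShortChunks(chunks, minSamples: 1_600).count)  // 1
```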
