Overview
The VoiceActivityDetector class is an abstract base class for implementing Voice Activity Detection (VAD). VAD is used to identify segments of audio that contain human speech versus silence or background noise. WhisperKit uses VAD to intelligently chunk long audio files for more efficient transcription.
Class Definition
open class VoiceActivityDetector
Initializer
public init(
sampleRate: Int = 16000,
frameLengthSamples: Int,
frameOverlapSamples: Int = 0
)
- sampleRate: Sample rate of the audio signal in Hz (default: 16000)
- frameLengthSamples: Length of each analysis frame in samples
- frameOverlapSamples: Number of samples overlapping between consecutive frames (default: 0)
Properties
The sample rate of the audio signal in samples per second (typically 16000 Hz for WhisperKit)
The length of each analysis frame in samples
The number of samples overlapping between consecutive frames. Useful for catching speech at frame boundaries.
Methods
voiceActivity(in:)
Analyzes audio waveform to detect voice activity. Must be implemented by subclasses.
open func voiceActivity(in waveform: [Float]) -> [Bool]
- waveform: Array of audio samples to analyze
- Returns: Array where true indicates voice activity and false indicates silence. Each element corresponds to one frame.
voiceActivityAsync(in:)
Async version of voice activity detection.
open func voiceActivityAsync(in waveform: [Float]) async throws -> [Bool]
- waveform: Array of audio samples to analyze
- Returns: Array indicating voice activity per frame
calculateActiveChunks(in:)
Identifies continuous segments of audio containing voice activity.
public func calculateActiveChunks(
in waveform: [Float]
) -> [(startIndex: Int, endIndex: Int)]
- Returns: [(startIndex: Int, endIndex: Int)], an array of tuples containing start and end sample indices for each active segment
voiceActivityIndexToAudioSampleIndex(_:)
Converts a voice activity frame index to an audio sample index.
public func voiceActivityIndexToAudioSampleIndex(_ index: Int) -> Int
- index: Voice activity frame index
- Returns: Corresponding audio sample index
voiceActivityIndexToSeconds(_:)
Converts a voice activity frame index to time in seconds.
public func voiceActivityIndexToSeconds(_ index: Int) -> Float
- index: Voice activity frame index
- Returns: Corresponding time in seconds
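Both conversions reduce to simple arithmetic on the frame length; a minimal sketch of the idea, assuming frames advance by frameLengthSamples with no overlap adjustment (the helper names here are illustrative, not the library's):

```swift
let sampleRate = 16000
let frameLengthSamples = 1600 // 100ms frames

// Frame index -> first audio sample of that frame
func frameToSample(_ index: Int) -> Int {
    index * frameLengthSamples
}

// Frame index -> seconds, via the sample index
func frameToSeconds(_ index: Int) -> Float {
    Float(frameToSample(index)) / Float(sampleRate)
}

frameToSample(5)  // 8000
frameToSeconds(5) // 0.5
```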
findLongestSilence(in:)
Finds the longest continuous period of silence.
public func findLongestSilence(
in vadResult: [Bool]
) -> (startIndex: Int, endIndex: Int)?
- vadResult: Voice activity detection results
- Returns: (startIndex: Int, endIndex: Int)?, the start and end indices of the longest silence, or nil if no silence is found
voiceActivityClipTimestamps(in:)
Generates timestamp pairs for active audio segments.
func voiceActivityClipTimestamps(in waveform: [Float]) -> [Float]
- Returns: Flat array of timestamps: [start1, end1, start2, end2, …]
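As a sketch of what that flat layout looks like, here is how start/end chunks (values hypothetical, shaped like calculateActiveChunks(in:) output) flatten into alternating timestamps in seconds, assuming a 16 kHz sample rate:

```swift
let sampleRate: Float = 16000

// Hypothetical active chunks in samples
let chunks = [(startIndex: 0, endIndex: 16000), (startIndex: 48000, endIndex: 80000)]

// Flatten to [start1, end1, start2, end2, ...] in seconds
let clipTimestamps = chunks.flatMap {
    [Float($0.startIndex) / sampleRate, Float($0.endIndex) / sampleRate]
}
// clipTimestamps == [0.0, 1.0, 3.0, 5.0]
```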
calculateNonSilentSeekClips(in:)
Calculates seek positions for non-silent segments.
func calculateNonSilentSeekClips(
in waveform: [Float]
) -> [(start: Int, end: Int)]
- Returns: Array of seek clip start/end positions in samples
calculateSeekTimestamps(in:)
Calculates timestamp pairs for seeking through active segments.
func calculateSeekTimestamps(
in waveform: [Float]
) -> [(startTime: Float, endTime: Float)]
- Returns: [(startTime: Float, endTime: Float)], an array of start/end timestamp pairs in seconds
Built-in Implementation: EnergyVAD
WhisperKit includes a built-in energy-based VAD implementation:
let vad = EnergyVAD(
sampleRate: 16000,
frameLengthSamples: 1600, // 100ms at 16kHz
frameOverlapSamples: 160, // 10ms overlap
energyThreshold: 0.022
)
How EnergyVAD Works
EnergyVAD detects voice activity by:
- Dividing audio into frames
- Calculating RMS energy for each frame
- Comparing energy against a threshold
- Frames above threshold are marked as containing voice
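Those steps can be sketched as a standalone function (a simplified illustration of the energy approach, not the library's exact implementation):

```swift
import Foundation

// Mark a frame as active when its root-mean-square (RMS) energy exceeds a threshold
func energyVAD(_ waveform: [Float],
               frameLength: Int = 1600,
               threshold: Float = 0.022) -> [Bool] {
    stride(from: 0, to: waveform.count, by: frameLength).map { start in
        let frame = waveform[start..<min(start + frameLength, waveform.count)]
        let meanSquare = frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count)
        return meanSquare.squareRoot() > threshold
    }
}

// One frame of silence followed by one frame of a 0.5-amplitude tone
let silence = [Float](repeating: 0, count: 1600)
let tone = (0..<1600).map { 0.5 * Float(sin(Double($0) * 0.1)) }
energyVAD(silence + tone) // [false, true]
```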
Custom Implementation
To create a custom VAD, subclass VoiceActivityDetector and implement voiceActivity(in:):
class CustomVAD: VoiceActivityDetector {
override func voiceActivity(in waveform: [Float]) -> [Bool] {
// Round up so a trailing partial frame is still analyzed
let frameCount = (waveform.count + frameLengthSamples - 1) / frameLengthSamples
var result = [Bool]()
for i in 0..<frameCount {
let start = i * frameLengthSamples
let end = min(start + frameLengthSamples, waveform.count)
let frame = Array(waveform[start..<end])
// Implement your VAD logic here
let hasVoice = detectVoiceInFrame(frame)
result.append(hasVoice)
}
return result
}
private func detectVoiceInFrame(_ frame: [Float]) -> Bool {
// Your custom voice detection logic
return true
}
}
Example Usage
Basic Voice Activity Detection
import WhisperKit
// Load audio
let audioArray = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
// Create VAD instance
let vad = EnergyVAD(
sampleRate: 16000,
frameLengthSamples: 1600 // 100ms frames
)
// Detect voice activity
let vadResult = vad.voiceActivity(in: audioArray)
print("Voice activity: \(vadResult)")
Find Active Segments
let vad = EnergyVAD()
let audioArray = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
let activeChunks = vad.calculateActiveChunks(in: audioArray)
for (index, chunk) in activeChunks.enumerated() {
let startTime = Float(chunk.startIndex) / 16000.0
let endTime = Float(chunk.endIndex) / 16000.0
let duration = endTime - startTime
print("Segment \(index): \(startTime)s - \(endTime)s (\(duration)s)")
}
Use VAD with WhisperKit
// Configure WhisperKit with VAD-based chunking
let config = WhisperKitConfig(
model: "base",
voiceActivityDetector: EnergyVAD()
)
let whisperKit = try await WhisperKit(config)
// Transcribe with VAD chunking
let options = DecodingOptions(
chunkingStrategy: .vad
)
let results = try await whisperKit.transcribe(
audioPath: "/path/to/long_audio.wav",
decodeOptions: options
)
Visualize Voice Activity
func visualizeVAD(audioArray: [Float], vad: VoiceActivityDetector) {
let vadResult = vad.voiceActivity(in: audioArray)
print("Voice Activity Visualization:")
print("Time (s) | Activity")
print("---------|----------")
for (index, hasVoice) in vadResult.enumerated() {
let time = vad.voiceActivityIndexToSeconds(index)
let marker = hasVoice ? "█████████" : " "
print(String(format: "%6.2f | %@", time, marker))
}
}
let vad = EnergyVAD()
let audio = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
visualizeVAD(audioArray: audio, vad: vad)
Find Silence for Split Points
let vad = EnergyVAD()
let audio = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
let vadResult = vad.voiceActivity(in: audio)
if let silence = vad.findLongestSilence(in: vadResult) {
let startTime = vad.voiceActivityIndexToSeconds(silence.startIndex)
let endTime = vad.voiceActivityIndexToSeconds(silence.endIndex)
print("Longest silence: \(startTime)s - \(endTime)s")
print("Duration: \(endTime - startTime)s")
// Use this to split audio at a natural break
}
Custom Energy Threshold
// More sensitive (detects quieter speech)
let sensitiveVAD = EnergyVAD(
energyThreshold: 0.01
)
// Less sensitive (only loud speech)
let conservativeVAD = EnergyVAD(
energyThreshold: 0.05
)
// Compare results
let audio = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
let sensitiveResult = sensitiveVAD.voiceActivity(in: audio)
let conservativeResult = conservativeVAD.voiceActivity(in: audio)
let sensitiveCount = sensitiveResult.filter { $0 }.count
let conservativeCount = conservativeResult.filter { $0 }.count
print("Sensitive VAD: \(sensitiveCount) active frames")
print("Conservative VAD: \(conservativeCount) active frames")
Process Live Audio with VAD
let processor = AudioProcessor()
let vad = EnergyVAD()
try processor.startRecordingLive { audioBuffer in
// Accumulate enough samples for VAD analysis
let minSamples = vad.frameLengthSamples * 10 // 10 frames
if processor.audioSamples.count >= minSamples {
let samples = Array(processor.audioSamples.suffix(minSamples))
let vadResult = vad.voiceActivity(in: samples)
let voiceFrames = vadResult.filter { $0 }.count
let totalFrames = vadResult.count
let voiceRatio = Float(voiceFrames) / Float(totalFrames)
if voiceRatio > 0.5 {
print("Speech detected (\(Int(voiceRatio * 100))% active)")
} else {
print("Mostly silence (\(Int(voiceRatio * 100))% active)")
}
}
}
Best Practices
- Frame Length Selection: Use 20-100ms frames (320-1600 samples at 16kHz). Shorter frames = more responsive but noisier.
- Threshold Tuning: Adjust the energy threshold based on your audio:
  - Clean recordings: 0.01-0.02
  - Noisy environments: 0.03-0.05
  - Test with representative samples
- Overlap for Continuity: Use frame overlap (10-20ms) to avoid missing speech at frame boundaries.
- Post-Processing: Apply smoothing to reduce false positives:
// Remove isolated voice frames
func smoothVAD(_ vad: [Bool]) -> [Bool] {
// Guard against arrays too short to have interior frames
guard vad.count > 2 else { return vad }
var smoothed = vad
for i in 1..<(vad.count - 1) {
if vad[i] && !vad[i-1] && !vad[i+1] {
smoothed[i] = false // Remove single frame
}
}
return smoothed
}
- Chunking Strategy: When using the .vad chunking strategy with WhisperKit, ensure your VAD is tuned to avoid creating too many tiny segments.
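One way to keep segment counts reasonable is to post-process the chunk list: merge active chunks separated by short gaps, then drop chunks that are still too short. A sketch of that idea (the tuple shape mirrors calculateActiveChunks(in:); the consolidation function itself is not a WhisperKit API):

```swift
typealias Chunk = (startIndex: Int, endIndex: Int)

// Merge chunks whose gap is at most maxGap samples, then drop chunks shorter than minLength
func consolidate(_ chunks: [Chunk], maxGap: Int, minLength: Int) -> [Chunk] {
    var merged: [Chunk] = []
    for chunk in chunks {
        if let last = merged.last, chunk.startIndex - last.endIndex <= maxGap {
            merged[merged.count - 1].endIndex = chunk.endIndex
        } else {
            merged.append(chunk)
        }
    }
    return merged.filter { $0.endIndex - $0.startIndex >= minLength }
}

// At 16 kHz: two chunks 50ms apart, plus an isolated 25ms blip
let raw: [Chunk] = [(0, 8000), (8800, 24000), (60000, 60400)]
consolidate(raw, maxGap: 1600, minLength: 3200) // [(0, 24000)]
```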