Overview
The VoiceActivityDetector class is an abstract base class for implementing Voice Activity Detection (VAD). VAD is used to identify segments of audio that contain human speech versus silence or background noise. WhisperKit uses VAD to intelligently chunk long audio files for more efficient transcription.
Class Definition
open class VoiceActivityDetector
Initializer
public init(
sampleRate: Int = 16000,
frameLengthSamples: Int,
frameOverlapSamples: Int = 0
)
- sampleRate: Sample rate of the audio signal in Hz (default: 16000)
- frameLengthSamples: Length of each analysis frame in samples
- frameOverlapSamples: Number of samples overlapping between consecutive frames (default: 0)
Properties
The sample rate of the audio signal in samples per second (typically 16000 Hz for WhisperKit)
The length of each analysis frame in samples
The number of samples overlapping between consecutive frames. Useful for catching speech at frame boundaries.
Methods
voiceActivity(in:)
Analyzes audio waveform to detect voice activity. Must be implemented by subclasses.
open func voiceActivity(in waveform: [Float]) -> [Bool]
- waveform: Array of audio samples to analyze
- Returns: Array where true indicates voice activity and false indicates silence. Each element corresponds to one frame.
voiceActivityAsync(in:)
Async version of voice activity detection.
open func voiceActivityAsync(in waveform: [Float]) async throws -> [Bool]
- waveform: Array of audio samples to analyze
- Returns: Array indicating voice activity per frame
calculateActiveChunks(in:)
Identifies continuous segments of audio containing voice activity.
public func calculateActiveChunks(
in waveform: [Float]
) -> [(startIndex: Int, endIndex: Int)]
- Returns: [(startIndex: Int, endIndex: Int)], an array of tuples containing start and end sample indices for each active segment
voiceActivityIndexToAudioSampleIndex(_:)
Converts a voice activity frame index to an audio sample index.
public func voiceActivityIndexToAudioSampleIndex(_ index: Int) -> Int
- index: Voice activity frame index
- Returns: Corresponding audio sample index
voiceActivityIndexToSeconds(_:)
Converts a voice activity frame index to time in seconds.
public func voiceActivityIndexToSeconds(_ index: Int) -> Float
- index: Voice activity frame index
- Returns: Corresponding time in seconds
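Both conversions reduce to simple arithmetic on the frame length; a minimal sketch of the idea, assuming frames advance by frameLengthSamples with no overlap adjustment (the helper names here are illustrative, not the library's):

```swift
let sampleRate = 16000
let frameLengthSamples = 1600 // 100ms frames

// Frame index -> first audio sample of that frame
func frameToSample(_ index: Int) -> Int {
    index * frameLengthSamples
}

// Frame index -> seconds, via the sample index
func frameToSeconds(_ index: Int) -> Float {
    Float(frameToSample(index)) / Float(sampleRate)
}

frameToSample(5)  // 8000
frameToSeconds(5) // 0.5
```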
findLongestSilence(in:)
Finds the longest continuous period of silence.
public func findLongestSilence(
in vadResult: [Bool]
) -> (startIndex: Int, endIndex: Int)?
- vadResult: Voice activity detection results
- Returns: (startIndex: Int, endIndex: Int)?, the start and end indices of the longest silence, or nil if no silence is found
voiceActivityClipTimestamps(in:)
Generates timestamp pairs for active audio segments.
func voiceActivityClipTimestamps(in waveform: [Float]) -> [Float]
- Returns: Flat array of timestamps: [start1, end1, start2, end2, …]
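As a sketch of what that flat layout looks like, here is how start/end chunks (values hypothetical, shaped like calculateActiveChunks(in:) output) flatten into alternating timestamps in seconds, assuming a 16 kHz sample rate:

```swift
let sampleRate: Float = 16000

// Hypothetical active chunks in samples
let chunks = [(startIndex: 0, endIndex: 16000), (startIndex: 48000, endIndex: 80000)]

// Flatten to [start1, end1, start2, end2, ...] in seconds
let clipTimestamps = chunks.flatMap {
    [Float($0.startIndex) / sampleRate, Float($0.endIndex) / sampleRate]
}
// clipTimestamps == [0.0, 1.0, 3.0, 5.0]
```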
calculateNonSilentSeekClips(in:)
Calculates seek positions for non-silent segments.
func calculateNonSilentSeekClips(
in waveform: [Float]
) -> [(start: Int, end: Int)]
- Returns: Array of seek clip start/end positions in samples
calculateSeekTimestamps(in:)
Calculates timestamp pairs for seeking through active segments.
func calculateSeekTimestamps(
in waveform: [Float]
) -> [(startTime: Float, endTime: Float)]
- Returns: [(startTime: Float, endTime: Float)], an array of start/end timestamp pairs in seconds
Built-in Implementation: EnergyVAD
WhisperKit includes a built-in energy-based VAD implementation:
let vad = EnergyVAD(
sampleRate: 16000,
frameLengthSamples: 1600, // 100ms at 16kHz
frameOverlapSamples: 160, // 10ms overlap
energyThreshold: 0.022
)
How EnergyVAD Works
EnergyVAD detects voice activity by:
- Dividing audio into frames
- Calculating RMS energy for each frame
- Comparing energy against a threshold
- Frames above threshold are marked as containing voice
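Those steps can be sketched as a standalone function (a simplified illustration of the energy approach, not the library's exact implementation):

```swift
import Foundation

// Mark a frame as active when its root-mean-square (RMS) energy exceeds a threshold
func energyVAD(_ waveform: [Float],
               frameLength: Int = 1600,
               threshold: Float = 0.022) -> [Bool] {
    stride(from: 0, to: waveform.count, by: frameLength).map { start in
        let frame = waveform[start..<min(start + frameLength, waveform.count)]
        let meanSquare = frame.reduce(0) { $0 + $1 * $1 } / Float(frame.count)
        return meanSquare.squareRoot() > threshold
    }
}

// One frame of silence followed by one frame of a 0.5-amplitude tone
let silence = [Float](repeating: 0, count: 1600)
let tone = (0..<1600).map { 0.5 * Float(sin(Double($0) * 0.1)) }
energyVAD(silence + tone) // [false, true]
```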
Custom Implementation
To create a custom VAD, subclass VoiceActivityDetector and implement voiceActivity(in:):
class CustomVAD: VoiceActivityDetector {
override func voiceActivity(in waveform: [Float]) -> [Bool] {
// Round up so a trailing partial frame is still analyzed
let frameCount = (waveform.count + frameLengthSamples - 1) / frameLengthSamples
var result = [Bool]()
for i in 0..<frameCount {
let start = i * frameLengthSamples
let end = min(start + frameLengthSamples, waveform.count)
let frame = Array(waveform[start..<end])
// Implement your VAD logic here
let hasVoice = detectVoiceInFrame(frame)
result.append(hasVoice)
}
return result
}
private func detectVoiceInFrame(_ frame: [Float]) -> Bool {
// Your custom voice detection logic
return true
}
}
Example Usage
Basic Voice Activity Detection
import WhisperKit
// Load audio
let audioArray = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
// Create VAD instance
let vad = EnergyVAD(
sampleRate: 16000,
frameLengthSamples: 1600 // 100ms frames
)
// Detect voice activity
let vadResult = vad.voiceActivity(in: audioArray)
print("Voice activity: \(vadResult)")
Find Active Segments
let vad = EnergyVAD()
let audioArray = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
let activeChunks = vad.calculateActiveChunks(in: audioArray)
for (index, chunk) in activeChunks.enumerated() {
let startTime = Float(chunk.startIndex) / 16000.0
let endTime = Float(chunk.endIndex) / 16000.0
let duration = endTime - startTime
print("Segment \(index): \(startTime)s - \(endTime)s (\(duration)s)")
}
Use VAD with WhisperKit
// Configure WhisperKit with VAD-based chunking
let config = WhisperKitConfig(
model: "base",
voiceActivityDetector: EnergyVAD()
)
let whisperKit = try await WhisperKit(config)
// Transcribe with VAD chunking
let options = DecodingOptions(
chunkingStrategy: .vad
)
let results = try await whisperKit.transcribe(
audioPath: "/path/to/long_audio.wav",
decodeOptions: options
)
Visualize Voice Activity
func visualizeVAD(audioArray: [Float], vad: VoiceActivityDetector) {
let vadResult = vad.voiceActivity(in: audioArray)
print("Voice Activity Visualization:")
print("Time (s) | Activity")
print("---------|----------")
for (index, hasVoice) in vadResult.enumerated() {
let time = vad.voiceActivityIndexToSeconds(index)
let marker = hasVoice ? "█████████" : " "
print(String(format: "%6.2f | %@", time, marker))
}
}
let vad = EnergyVAD()
let audio = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
visualizeVAD(audioArray: audio, vad: vad)
Find Silence for Split Points
let vad = EnergyVAD()
let audio = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
let vadResult = vad.voiceActivity(in: audio)
if let silence = vad.findLongestSilence(in: vadResult) {
let startTime = vad.voiceActivityIndexToSeconds(silence.startIndex)
let endTime = vad.voiceActivityIndexToSeconds(silence.endIndex)
print("Longest silence: \(startTime)s - \(endTime)s")
print("Duration: \(endTime - startTime)s")
// Use this to split audio at a natural break
}
Custom Energy Threshold
// More sensitive (detects quieter speech)
let sensitiveVAD = EnergyVAD(
energyThreshold: 0.01
)
// Less sensitive (only loud speech)
let conservativeVAD = EnergyVAD(
energyThreshold: 0.05
)
// Compare results
let audio = try AudioProcessor.loadAudioAsFloatArray(
fromPath: "/path/to/audio.wav"
)
let sensitiveResult = sensitiveVAD.voiceActivity(in: audio)
let conservativeResult = conservativeVAD.voiceActivity(in: audio)
let sensitiveCount = sensitiveResult.filter { $0 }.count
let conservativeCount = conservativeResult.filter { $0 }.count
print("Sensitive VAD: \(sensitiveCount) active frames")
print("Conservative VAD: \(conservativeCount) active frames")
Process Live Audio with VAD
let processor = AudioProcessor()
let vad = EnergyVAD()
try processor.startRecordingLive { audioBuffer in
// Accumulate enough samples for VAD analysis
let minSamples = vad.frameLengthSamples * 10 // 10 frames
if processor.audioSamples.count >= minSamples {
let samples = Array(processor.audioSamples.suffix(minSamples))
let vadResult = vad.voiceActivity(in: samples)
let voiceFrames = vadResult.filter { $0 }.count
let totalFrames = vadResult.count
let voiceRatio = Float(voiceFrames) / Float(totalFrames)
if voiceRatio > 0.5 {
print("Speech detected (\(Int(voiceRatio * 100))% active)")
} else {
print("Mostly silence (\(Int(voiceRatio * 100))% active)")
}
}
}
Best Practices
- Frame Length Selection: Use 20-100ms frames (320-1600 samples at 16kHz). Shorter frames = more responsive but noisier.
- Threshold Tuning: Adjust the energy threshold based on your audio:
  - Clean recordings: 0.01-0.02
  - Noisy environments: 0.03-0.05
  - Test with representative samples
- Overlap for Continuity: Use frame overlap (10-20ms) to avoid missing speech at frame boundaries.
- Post-Processing: Apply smoothing to reduce false positives:
// Remove isolated voice frames
func smoothVAD(_ vad: [Bool]) -> [Bool] {
// Guard against arrays too short to have interior frames
guard vad.count > 2 else { return vad }
var smoothed = vad
for i in 1..<(vad.count - 1) {
if vad[i] && !vad[i-1] && !vad[i+1] {
smoothed[i] = false // Remove single frame
}
}
return smoothed
}
- Chunking Strategy: When using the .vad chunking strategy with WhisperKit, ensure your VAD is tuned to avoid creating too many tiny segments.
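One way to keep segment counts reasonable is to post-process the chunk list: merge active chunks separated by short gaps, then drop chunks that are still too short. A sketch of that idea (the tuple shape mirrors calculateActiveChunks(in:); the consolidation function itself is not a WhisperKit API):

```swift
typealias Chunk = (startIndex: Int, endIndex: Int)

// Merge chunks whose gap is at most maxGap samples, then drop chunks shorter than minLength
func consolidate(_ chunks: [Chunk], maxGap: Int, minLength: Int) -> [Chunk] {
    var merged: [Chunk] = []
    for chunk in chunks {
        if let last = merged.last, chunk.startIndex - last.endIndex <= maxGap {
            merged[merged.count - 1].endIndex = chunk.endIndex
        } else {
            merged.append(chunk)
        }
    }
    return merged.filter { $0.endIndex - $0.startIndex >= minLength }
}

// At 16 kHz: two chunks 50ms apart, plus an isolated 25ms blip
let raw: [Chunk] = [(0, 8000), (8800, 24000), (60000, 60400)]
consolidate(raw, maxGap: 1600, minLength: 3200) // [(0, 24000)]
```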