
This feature is coming in version 0.7.0 and is not yet available in the current release.

Overview

Voice Activity Detection (VAD) will enable real-time detection of speech vs. silence in audio streams. This is essential for:
  • Automatic silence removal in recordings
  • Speech segmentation before transcription
  • Reducing unnecessary processing during silent periods
  • Triggering speech recognition only when needed

Planned Features

Real-time Detection

Detect voice activity as audio streams in

Low Latency

Minimal processing delay for responsive apps

Silence Removal

Automatically skip non-speech segments

Speech Segmentation

Split audio into speech and non-speech regions

Expected API (Preview)

The API is not finalized, but the expected interface looks like this:
import { createVAD } from 'react-native-sherpa-onnx/vad';

// Create VAD engine
const vad = await createVAD({
  modelPath: { type: 'asset', path: 'models/silero-vad' },
  sampleRate: 16000,
  windowSize: 512,
});

// Process audio chunks
const isSpeech = await vad.detectSpeech(samples);

if (isSpeech) {
  // Forward to STT or other processing
  processAudio(samples);
}

// Cleanup
await vad.destroy();

Use Cases

1. Efficient Recording

Only save or process audio segments containing speech:
// Planned API
const recorder = startRecording();

recorder.on('chunk', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  
  if (isSpeech) {
    // Only process speech segments
    await processAudioChunk(samples);
  }
});

2. Pre-processing for STT

Segment continuous audio before transcription:
// Planned API
const segments = await vad.segmentAudio(audioFile);

for (const segment of segments) {
  if (segment.isSpeech) {
    const result = await stt.transcribeSamples(
      segment.samples,
      segment.sampleRate
    );
    console.log(result.text);
  }
}

3. Wake Word Detection

Trigger STT only when speech is detected:
// Planned API
const stream = await createAudioStream();

stream.on('data', async (samples) => {
  const isSpeech = await vad.detectSpeech(samples);
  
  if (isSpeech) {
    // Start transcription
    await sttStream.acceptWaveform(samples, 16000);
  }
});

Planned Configuration

// Expected configuration options
interface VADConfig {
  modelPath: ModelPathConfig;
  sampleRate: number;        // 8000, 16000 (default), 32000, 48000
  windowSize: number;        // Samples per window (e.g., 512, 1024)
  threshold: number;         // Speech confidence threshold (0..1)
  minSpeechDuration: number; // Minimum speech length (ms)
  minSilenceDuration: number; // Minimum silence to split (ms)
}
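
The two duration options can be illustrated with a plain function that post-processes per-window speech flags: silence gaps shorter than minSilenceDuration do not split a segment, and speech runs shorter than minSpeechDuration are discarded. This is a sketch of the intended semantics only, not the library's implementation:

```typescript
interface Segment { startMs: number; endMs: number; }

// Turn per-window speech/non-speech flags into segments, applying the
// minSpeechDuration / minSilenceDuration semantics described above.
function flagsToSegments(
  flags: boolean[],   // one flag per analysis window
  windowMs: number,   // duration of one window in milliseconds
  minSpeechMs: number,
  minSilenceMs: number
): Segment[] {
  // 1. Collect raw runs of consecutive speech windows.
  const raw: Segment[] = [];
  for (let i = 0; i < flags.length; i++) {
    if (!flags[i]) continue;
    let j = i;
    while (j < flags.length && flags[j]) j++;
    raw.push({ startMs: i * windowMs, endMs: j * windowMs });
    i = j;
  }
  // 2. Merge segments separated by silence shorter than minSilenceMs.
  const merged: Segment[] = [];
  for (const seg of raw) {
    const last = merged[merged.length - 1];
    if (last && seg.startMs - last.endMs < minSilenceMs) {
      last.endMs = seg.endMs;
    } else {
      merged.push({ ...seg });
    }
  }
  // 3. Drop segments shorter than minSpeechMs.
  return merged.filter((s) => s.endMs - s.startMs >= minSpeechMs);
}
```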

Expected Models

Likely model support:
  • Silero VAD - Lightweight, efficient, ONNX-based
  • WebRTC VAD - Classic algorithm
  • Custom models - Via sherpa-onnx framework

Timeline

VAD support is planned for:
  1. Version 0.7.0 - Initial VAD implementation with basic detection
  2. Future versions - Advanced features like adaptive thresholds and multi-language support

Stay Updated

To track progress or contribute, follow the project repository.

Current Workarounds

While VAD is not available, you can:
  1. Use streaming STT with endpoint detection - The streaming STT API already includes basic endpoint detection
  2. External libraries - Use JavaScript audio analysis libraries
  3. Manual silence detection - Implement simple amplitude-based detection

Simple Amplitude Detection

function detectSilence(samples: number[], threshold: number = 0.01): boolean {
  const rms = Math.sqrt(
    samples.reduce((sum, val) => sum + val * val, 0) / samples.length
  );
  return rms < threshold;
}

// Usage
const samples = getPcmSamples();
const isSilent = detectSilence(samples);

if (!isSilent) {
  // Process audio
}
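
Per-chunk RMS decisions are noisy at speech boundaries: a brief pause mid-sentence reads as silence and can clip audio. A small stateful wrapper with a hangover period (holding the speech state for a few silent chunks) makes the amplitude workaround more usable. This is a sketch; hangoverChunks is an illustrative parameter, not a library option:

```typescript
// Stateful speech gate: once speech is detected, keep reporting speech
// until `hangoverChunks` consecutive silent chunks have passed.
class SpeechGate {
  private silentRun = 0;
  private speaking = false;

  constructor(
    private threshold = 0.01,
    private hangoverChunks = 5
  ) {}

  process(samples: number[]): boolean {
    const rms = Math.sqrt(
      samples.reduce((sum, v) => sum + v * v, 0) / samples.length
    );
    if (rms >= this.threshold) {
      this.speaking = true;
      this.silentRun = 0;
    } else if (this.speaking && ++this.silentRun >= this.hangoverChunks) {
      this.speaking = false;
    }
    return this.speaking;
  }
}
```

The gate replaces the per-chunk detectSilence call: feed each chunk through process() and only forward chunks while it returns true.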

Related Pages

  • Streaming STT - Real-time transcription with endpoint detection
  • Speech Enhancement - Noise reduction (coming in v0.5.0)
