Voice Activity Detection (VAD) is the first stage of every pipeline run. The handler continuously receives raw 16-bit PCM audio, feeds it through Silero VAD v5, and emits audio segments only when genuine speech is detected. This keeps the downstream STT and LLM stages quiet during silence, cuts transcription latency, and enables the barge-in / speculative turn-reopening flows that make the realtime mode feel natural.
The VADHandler wraps a VADIterator (a stateful sliding-window wrapper around the Silero model) and adds several pipeline-specific mechanisms on top: a speech-start hysteresis threshold, progressive audio release for live transcription, short-segment stitching, optional DeepFilterNet audio enhancement, and the speculative turn-reopening protocol used by realtime mode.
How VAD gates the pipeline
- Each audio chunk (raw bytes from the socket, WebSocket, or local mic) is converted to float32 and passed to VADIterator.
- When the iterator transitions from silent to triggered, the handler starts accumulating the speech buffer.
- Speech is confirmed and SpeechStartedEvent is emitted only once the accumulated active speech (samples above the VAD threshold) reaches min_speech_ms. This suppresses short noise bursts.
- When the iterator transitions back to silent for at least min_silence_ms, the accumulated buffer is yielded as a VADAudio message to the STT queue.
- Audio is padded with speech_pad_ms milliseconds of pre-speech context so the STT model hears the beginning of each utterance clearly.
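The gating steps above can be sketched as a small state machine. This is an illustration under stated assumptions, not the actual VADHandler code: the SpeechGate class, its event tuples, the 32 ms chunk duration, and the min_silence_ms and speech_pad_ms values are hypothetical, and a per-chunk speech probability stands in for the Silero model output.

```python
from collections import deque


class SpeechGate:
    """Toy hysteresis gate. An illustration only, not the real VADHandler."""

    def __init__(self, threshold=0.6, chunk_ms=32, min_speech_ms=384,
                 min_silence_ms=256, speech_pad_ms=64):
        self.threshold = threshold            # VAD confidence threshold
        self.chunk_ms = chunk_ms              # assumed duration of one chunk
        self.min_speech_ms = min_speech_ms    # active speech needed to confirm
        self.min_silence_ms = min_silence_ms  # silence needed to end a segment
        # Ring buffer of the most recent silent chunks (pre-speech padding).
        self._pre = deque(maxlen=speech_pad_ms // chunk_ms)
        self._reset()

    def _reset(self):
        self._pre.clear()
        self._buffer = []
        self._active_ms = 0
        self._silence_ms = 0
        self._triggered = False
        self._confirmed = False

    def feed(self, chunk, prob):
        """Consume one chunk and its speech probability; return new events."""
        events = []
        is_speech = prob >= self.threshold
        if not self._triggered:
            if not is_speech:
                self._pre.append(chunk)       # remember context, stay silent
                return events
            # Silent -> triggered: start the buffer with the padding chunks.
            self._triggered = True
            self._buffer = list(self._pre) + [chunk]
            self._active_ms = self.chunk_ms
        else:
            self._buffer.append(chunk)
            if is_speech:
                self._active_ms += self.chunk_ms
                self._silence_ms = 0
            else:
                self._silence_ms += self.chunk_ms
        # Confirm speech only after enough active speech has accumulated.
        if not self._confirmed and self._active_ms >= self.min_speech_ms:
            self._confirmed = True
            events.append(("speech_started", None))
        # End the segment after enough continuous silence.
        if self._silence_ms >= self.min_silence_ms:
            if self._confirmed:
                events.append(("segment", list(self._buffer)))
            self._reset()                     # unconfirmed bursts are dropped
        return events


gate = SpeechGate()
probs = [0.1] * 3 + [0.9] * 12 + [0.1] * 8   # silence, speech, silence
events = [e for i, p in enumerate(probs) for e in gate.feed(i, p)]
```

In this sketch a two-chunk noise burst never reaches min_speech_ms, so it produces no events at all, which is the behavior the hysteresis threshold is meant to provide.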
Live / progressive transcription
When --enable_realtime_transcription is set, the handler also yields VADAudio(mode="progressive") chunks at the interval set by --realtime_processing_pause. The STT stage transcribes these in parallel and emits PartialTranscription messages, which the realtime server forwards as conversation.item.input_audio_transcription.delta events.
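A rough sketch of interval-based progressive release follows. The function name progressive_snapshots is hypothetical, the 32 ms chunk duration is illustrative, and the sketch assumes each progressive emission carries all audio accumulated so far; the source does not say whether the real handler sends cumulative audio or deltas, and the automatic back-off for long segments is not modeled here.

```python
def progressive_snapshots(chunks, chunk_ms, pause_s):
    """Yield (elapsed_ms, audio_so_far) roughly every pause_s seconds of speech.

    Assumption: each snapshot contains the whole buffer accumulated so far.
    """
    pause_ms = round(pause_s * 1000)
    buffer = []
    next_emit_ms = pause_ms
    for count, chunk in enumerate(chunks, start=1):
        buffer.append(chunk)
        elapsed_ms = count * chunk_ms
        if elapsed_ms >= next_emit_ms:
            yield elapsed_ms, list(buffer)    # copy, so later appends do not leak
            next_emit_ms = elapsed_ms + pause_ms


# 20 chunks of 32 ms each, with a 0.2 s pause between progressive releases.
snapshots = list(progressive_snapshots(range(20), chunk_ms=32, pause_s=0.2))
```

A smaller pause_s produces more snapshots for the same speech, which mirrors the trade-off between update frequency and the number of STT calls.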
Speculative turn-reopening
Realtime mode keeps a soft-ended turn reopenable for speculative_reopen_ms after the user stops speaking. If the user resumes within that window (without the assistant having committed a response), the new speech is appended to the same turn rather than starting a new one. This eliminates false cuts caused by brief pauses mid-sentence.
For turns that have not yet received any assistant output, the window is extended to unanswered_reopen_ms (default 7 s), so a long thinking pause does not orphan the user’s question.
Continuation speech (within the reopen window) only needs to pass the lower min_speech_continuation_ms bar, not the full min_speech_ms bar, reducing latency for short trailing fragments.
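The window and hysteresis rules described in this section can be expressed as two small functions. This is a sketch under assumptions: the function names are hypothetical, the 1500 ms value used for speculative_reopen_ms is purely illustrative (the source does not state its default), and the clamping follows the [100, min_speech_ms] range given in the configuration reference.

```python
def reopen_window_ms(has_assistant_output, speculative_reopen_ms,
                     unanswered_reopen_ms=7000):
    """Return how long (ms) a soft-ended turn stays reopenable."""
    if has_assistant_output:
        return speculative_reopen_ms
    # Unanswered turns get the longer window; the cap never shrinks it.
    return max(speculative_reopen_ms, unanswered_reopen_ms)


def required_speech_ms(is_continuation, min_speech_ms,
                       min_speech_continuation_ms):
    """Return the active-speech bar (ms) that new speech must cross."""
    if not is_continuation or min_speech_continuation_ms == 0:
        return min_speech_ms                  # 0 disables the split
    # Clamp the continuation bar to [100, min_speech_ms].
    return min(max(100, min_speech_continuation_ms), min_speech_ms)


def classify_resumed_speech(ms_since_soft_end, has_assistant_output,
                            speculative_reopen_ms, unanswered_reopen_ms=7000):
    """Decide whether resumed speech continues the old turn or starts a new one."""
    window = reopen_window_ms(has_assistant_output, speculative_reopen_ms,
                              unanswered_reopen_ms)
    return "continue_turn" if ms_since_soft_end <= window else "new_turn"


# Illustrative values only: a 1500 ms speculative window, 7 s unanswered window.
decision = classify_resumed_speech(3000, has_assistant_output=False,
                                   speculative_reopen_ms=1500)
bar = required_speech_ms(True, min_speech_ms=384, min_speech_continuation_ms=192)
```

With these illustrative numbers, a 3-second pause on an unanswered turn still continues the same turn, while the same pause after the assistant has started responding would begin a new one.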
Barge-in / interruption
When the user starts speaking while the assistant is generating audio, the VAD emits a SpeechStartedEvent with interrupt_response=True. The pipeline’s cancel scope marks the current generation stale, the TTS handler drops in-flight audio, and the should_listen event is re-set so the next VAD segment is picked up normally.
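One way to picture the interruption flow is a generation counter that acts as the cancel scope. The sketch below is an assumption-laden illustration, not the pipeline's real implementation: CancelScope, on_speech_started, and play_if_fresh are hypothetical names, and the SpeechStartedEvent dataclass here is a stand-in that only mirrors the interrupt_response field mentioned above.

```python
import threading
from dataclasses import dataclass


@dataclass
class SpeechStartedEvent:
    """Stand-in for the real event; only the field named in the text."""
    interrupt_response: bool = False


class CancelScope:
    """Hypothetical cancel scope: bumping the counter marks older work stale."""

    def __init__(self):
        self._generation = 0
        self._lock = threading.Lock()

    def current(self):
        with self._lock:
            return self._generation

    def cancel(self):
        with self._lock:
            self._generation += 1

    def is_stale(self, generation):
        return generation != self.current()


def on_speech_started(event, scope, should_listen):
    """React to a VAD speech start; cancel the response on barge-in."""
    if event.interrupt_response:
        scope.cancel()                 # current generation becomes stale
    should_listen.set()                # keep picking up VAD segments


def play_if_fresh(chunk, generation, scope, played):
    """Append the TTS chunk to `played` unless its generation was cancelled."""
    if scope.is_stale(generation):
        return False                   # in-flight audio is dropped
    played.append(chunk)
    return True


scope = CancelScope()
should_listen = threading.Event()
played = []
generation = scope.current()
play_if_fresh(b"first", generation, scope, played)      # played normally
on_speech_started(SpeechStartedEvent(interrupt_response=True),
                  scope, should_listen)                  # user barges in
play_if_fresh(b"second", generation, scope, played)     # dropped as stale
```

Tagging each audio chunk with the generation it belongs to lets the playback side discard stale audio without having to coordinate directly with the stage that produced it.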
Configuration reference
All VAD parameters are in VADHandlerArguments and can be passed directly on the CLI.
The confidence threshold for Silero VAD. Values range from 0 to 1; higher values require the model to be more certain before triggering. The default of 0.6 works well for most microphone setups.
Sample rate of the incoming audio in Hz. Silero VAD is trained at 16 kHz; change this only if you are resampling before the VAD stage.
Minimum length of silence (in milliseconds) before the current speech segment is considered ended. Lower values cut sentences faster and improve barge-in responsiveness; higher values let the user pause briefly without triggering a cut.
Minimum active speech duration (ms) required to confirm a new utterance. Segments shorter than this are discarded as noise. For barge-in, this full bar must be crossed.
Lower hysteresis bar (ms) for speech that continues a soft-ended, uncommitted turn within the reopen window. Set to 0 to disable the split and always use min_speech_ms. Clamped to the range [100, min_speech_ms]. The recommended pairing is --min_speech_ms 384 --min_speech_continuation_ms 192.
Maximum continuous speech duration before a forced segment split. Default is infinite (no forced splits). Useful in server deployments to bound memory use for very long utterances.
Amount of audio (ms) retained before the VAD trigger and prepended to each detected speech segment. This pre-speech context helps the STT model hear the very beginning of each utterance without cutting off the first phoneme.
When True, applies DeepFilterNet noise reduction, equalization, and echo cancellation to each speech segment before passing it to the STT stage. Requires pip install deepfilternet (not compatible with Pocket TTS; see note below).
Enable progressive audio release during speech so the STT stage can emit partial transcripts in real time. Required for the conversation.item.input_audio_transcription.delta event stream in realtime mode.
Interval in seconds between progressive audio chunk emissions during speech. Smaller values give more frequent partial updates at the cost of more STT calls. Automatically backed off for long speech segments.
How long (ms, measured on the audio clock) a soft-ended turn stays reopenable once the assistant has started responding. Resumed speech within this window continues the same turn instead of starting a new one.
Sanity cap (ms) for reopening a soft-ended turn that has not yet received any assistant output. Extends the reopen window for unanswered turns so a pause longer than speculative_reopen_ms does not orphan a question the model has not yet replied to. Has no effect below speculative_reopen_ms.
When greater than 0, adjacent VAD segments that are each shorter than min_speech_ms are held and stitched together for up to this many milliseconds before being discarded. Fragments shorter than 100 ms of active speech are never held. Useful when min_silence_ms is very low for snappier barge-in detection.
Example: tuning VAD for low-latency barge-in
Example: conservative settings for noisy environments
--audio_enhancement requires pip install deepfilternet and is incompatible with Pocket TTS (which requires numpy>=2, while DeepFilterNet requires numpy<2). Install DeepFilterNet only in environments that do not use Pocket TTS.