RealtimeSTT is a Python speech-to-text library designed for applications that need fast, low-latency voice recognition without sacrificing flexibility. It targets developers building voice assistants, dictation tools, browser streaming servers, and any prototype that needs to turn spoken words into text with only a few lines of code. Key differentiators include a dual-layer voice activity detection (VAD) pipeline using WebRTC VAD and Silero VAD, support for multiple transcription backends through a simple extras system, optional wake word activation, and the ability to ingest audio from any source, not just a microphone.
Key Features
Voice Activity Detection
Dual-layer VAD combining WebRTC VAD for low-latency onset detection with
Silero VAD for accurate speech boundary decisions. Configurable sensitivity
and silence thresholds let you tune for your acoustic environment.
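As a rough illustration of that tuning, the sketch below builds two sets of recorder keyword arguments. The parameter names webrtc_sensitivity, silero_sensitivity, and post_speech_silence_duration are taken from the upstream project's README rather than from this page, so treat them as assumptions and check them against your installed version; the values are illustrative, not recommendations.

```python
def vad_options(noisy_room: bool) -> dict:
    """Return illustrative VAD keyword arguments for a quiet or a noisy room."""
    if noisy_room:
        return {
            # Assumed range 0-3; higher is more aggressive at rejecting non-speech.
            "webrtc_sensitivity": 3,
            # Assumed range 0-1; lower makes Silero harder to trigger.
            "silero_sensitivity": 0.2,
            # Seconds of silence that close an utterance.
            "post_speech_silence_duration": 0.6,
        }
    return {
        "webrtc_sensitivity": 2,
        "silero_sensitivity": 0.5,
        "post_speech_silence_duration": 0.4,
    }


options = vad_options(noisy_room=True)
print(options)
```

If the names match your version, the dictionary can be passed straight to the constructor as AudioToTextRecorder(**options).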
Real-time Transcription
Stream interim text to your UI while speech is still happening, using a
fast lightweight model for live updates and a larger accurate model for the
final transcript — all from one recorder instance.
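One way to consume interim and final text side by side is sketched below. LiveCaption and realtime_options are hypothetical helpers written for this example. The keyword names model, realtime_model_type, and on_realtime_transcription_update come from the upstream README and are assumptions here; only enable_realtime_transcription appears on this page. The sketch also assumes each update carries the full interim text so far.

```python
class LiveCaption:
    """Hypothetical holder for the latest interim text and finished sentences."""

    def __init__(self):
        self.interim = ""
        self.final = []

    def on_update(self, text: str) -> None:
        # Assumption: each update replaces the previous interim text.
        self.interim = text

    def on_final(self, text: str) -> None:
        # A finished utterance clears the interim line.
        self.final.append(text)
        self.interim = ""


def realtime_options(caption: LiveCaption) -> dict:
    """Keyword arguments pairing a small live model with a larger final model."""
    return {
        "model": "large-v2",            # assumed name of the final-pass model
        "realtime_model_type": "tiny",  # assumed name of the live model
        "enable_realtime_transcription": True,
        "on_realtime_transcription_update": caption.on_update,
    }


# Simulate what the recorder would deliver during one utterance.
caption = LiveCaption()
caption.on_update("hello")
caption.on_update("hello wor")
caption.on_final("Hello world.")
print(caption.final, repr(caption.interim))
```

With the library installed, the wiring would look roughly like recorder = AudioToTextRecorder(**realtime_options(caption)) followed by caption.on_final(recorder.text()).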
Wake Word Activation
Keep the recorder idle until a trigger phrase is spoken. Supports
Porcupine (via pvporcupine) and OpenWakeWord (via openwakeword) as
pluggable backends, selectable at construction time.
External Audio Input
Disable the microphone and feed raw 16-bit mono PCM chunks directly into
the recorder from a file, websocket, process pipeline, or any other audio
source. Automatic resampling is available when the source sample rate
differs from 16 kHz.
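A minimal sketch of file-based feeding, using only the standard library to read a WAV file in chunks. It assumes the recorder was created with use_microphone=False and exposes feed_audio(chunk, original_sample_rate=...), as the upstream README describes; neither name appears on this page. StandInRecorder is a hypothetical placeholder so the example runs without the library or a model.

```python
import io
import wave


def pcm_chunks(wav_file, frames_per_chunk=1024):
    """Yield (pcm_bytes, sample_rate) pairs from a 16-bit mono WAV file."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data, rate


def feed_wav(recorder, wav_file):
    """Push every chunk into the recorder and return the number of bytes fed."""
    total = 0
    for data, rate in pcm_chunks(wav_file):
        # Passing the source rate lets the recorder resample when needed.
        recorder.feed_audio(data, original_sample_rate=rate)
        total += len(data)
    return total


class StandInRecorder:
    """Hypothetical stand-in that only records what it was fed."""

    def __init__(self):
        self.calls = []

    def feed_audio(self, chunk, original_sample_rate=16000):
        self.calls.append((len(chunk), original_sample_rate))


# Build 2500 frames of 8 kHz silence in memory instead of reading a real file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 2500)
buffer.seek(0)

stand_in = StandInRecorder()
fed = feed_wav(stand_in, buffer)
print(fed, stand_in.calls)
```

The same loop applies to a websocket or a process pipe: replace pcm_chunks with whatever yields raw PCM bytes and the sample rate they were captured at.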
Multiple Transcription Engines
Ships adapters for faster-whisper, whisper.cpp, OpenAI Whisper,
Moonshine, sherpa-onnx, Parakeet NeMo, Meta Omnilingual ASR, Granite,
Qwen, Cohere Transcribe, and Kroko-ONNX. Switch engines by changing one
constructor parameter.
FastAPI Streaming Server
A reference browser streaming server in example_fastapi_server provides
multi-user session isolation, shared inference resources, metrics, and
health endpoints, ready to clone and extend for production deployments.
Architecture Overview
Audio flows through RealtimeSTT in a straightforward pipeline. Microphone input (or application-fed PCM chunks) enters a ring buffer managed by a background thread. WebRTC VAD continuously inspects each 10 ms frame to detect speech onset with minimal latency, while Silero VAD applies a more accurate neural model to confirm and refine speech boundaries. Once a complete utterance is detected, the buffered audio is handed off to the selected transcription engine (faster-whisper by default), which runs in a separate process to avoid blocking your main thread. The final transcript is returned from text() or delivered to your callback, and optional realtime updates fire on each interim chunk if enable_realtime_transcription is enabled.
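The buffering and boundary logic described above can be sketched in a few lines. This is a conceptual illustration, not the library's implementation: segment_utterances, pre_roll, and hangover are hypothetical names, frames are plain integers standing in for audio energy, and two threshold functions stand in for WebRTC VAD and Silero VAD.

```python
from collections import deque


def segment_utterances(frames, fast_vad, accurate_vad, pre_roll=2, hangover=2):
    """Group frames into utterances using a cheap check plus a confirming check."""
    ring = deque(maxlen=pre_roll)  # last few frames seen before speech onset
    current = None                 # frames of the utterance in progress
    silent = 0                     # consecutive non-speech frames seen
    utterances = []
    for frame in frames:
        if current is None:
            ring.append(frame)
            # Onset: the fast check fires first, the accurate check confirms it.
            if fast_vad(frame) and accurate_vad(frame):
                current = list(ring)  # keep the pre-roll so the start is not clipped
                silent = 0
        else:
            current.append(frame)
            silent = 0 if accurate_vad(frame) else silent + 1
            if silent >= hangover:
                # Enough trailing silence: hand the utterance to the engine.
                utterances.append(current)
                current = None
                ring.clear()
    if current is not None:
        utterances.append(current)
    return utterances


frames = [0, 0, 5, 80, 90, 70, 0, 0, 0, 60, 0, 0]
result = segment_utterances(
    frames,
    fast_vad=lambda f: f > 10,      # stand-in for the low-latency detector
    accurate_vad=lambda f: f > 50,  # stand-in for the neural detector
)
print(result)
```

In the real pipeline each completed utterance would go to the transcription process; here the function simply returns the grouped frames so the boundary behaviour is easy to inspect.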
Minimal Usage Example
The simplest possible use: record one utterance from the microphone and print the transcript.
Creating an AudioToTextRecorder in a with statement starts the recorder,
text() blocks until a full utterance is detected, and the context manager
cleanly shuts everything down on exit. Always use the
if __name__ == "__main__": guard: RealtimeSTT uses multiprocessing
internally and the guard is required for correct behaviour, especially on
Windows.
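The pattern described above can be sketched as follows. The original listing is not reproduced on this page, so this is a reconstruction from the description rather than the author's exact code; transcribe_one_utterance and its recorder_cls parameter are hypothetical, added so the function can be exercised without a microphone or a model.

```python
def transcribe_one_utterance(recorder_cls=None):
    """Record a single utterance and return its transcript."""
    if recorder_cls is None:
        # Imported lazily so this module loads even where the library is absent.
        from RealtimeSTT import AudioToTextRecorder
        recorder_cls = AudioToTextRecorder
    # Entering the context starts the recorder; leaving it shuts it down.
    with recorder_cls() as recorder:
        # Blocks until a full utterance has been detected and transcribed.
        return recorder.text()


# In a script, call it under the guard the text above requires:
#
#     if __name__ == "__main__":
#         print(transcribe_one_utterance())
```

Passing the recorder class as an argument is a design choice for testability only; in an application the default path, which imports AudioToTextRecorder, is the one that matters.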
Public API
RealtimeSTT exposes six public names from its top-level package:

| Export | Description |
|---|---|
| AudioToTextRecorder | Main recorder class. Manages VAD, buffering, transcription, and callbacks. Accepts microphone or application-fed audio. |
| AudioToTextRecorderClient | Thin client that connects to a running stt-server process instead of running a local model. |
| AudioInput | Low-level audio input abstraction used internally and available for custom audio source integrations. |
| RealtimeSpeechBoundaryDetector | Core VAD and boundary detection logic, exposed for use outside the full recorder when only boundary detection is needed. |
| SpeechBoundaryEvent | Enum of boundary events emitted by RealtimeSpeechBoundaryDetector. |
| SpeechBoundaryResult | Dataclass returned by boundary detection calls, carrying timing and confidence information. |
Next Steps
Installation
Set up your Python environment, install the right extras for your platform,
and verify the install with a quick smoke test.
Quickstart
Walk through the four core recording patterns: single utterance, continuous
dictation, real-time interim text, and external audio feeding.
