RealtimeSTT is a Python speech-to-text library designed for applications that need fast, low-latency voice recognition without sacrificing flexibility. It targets developers building voice assistants, dictation tools, browser streaming servers, and any prototype that needs to turn spoken words into text with only a few lines of code. Key differentiators include a dual-layer voice activity detection (VAD) pipeline using WebRTC VAD and Silero VAD, support for multiple transcription backends through a simple extras system, optional wake word activation, and the ability to ingest audio from any source — not just a microphone.

Key Features

Voice Activity Detection

Dual-layer VAD combining WebRTC VAD for low-latency onset detection with Silero VAD for accurate speech boundary decisions. Configurable sensitivity and silence thresholds let you tune for your acoustic environment.
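
As a rough illustration of tuning, the sketch below collects VAD settings into keyword arguments for the recorder. The parameter names silero_sensitivity, webrtc_sensitivity, and post_speech_silence_duration do not appear on this page; they are assumed from the upstream RealtimeSTT README, and the helper name and the specific values are hypothetical. Check them against your installed version before relying on them.

```python
def vad_settings(noisy_room: bool) -> dict:
    """Return keyword arguments for AudioToTextRecorder.

    Parameter names are assumptions taken from the upstream README;
    the values are illustrative starting points, not recommendations.
    """
    if noisy_room:
        # Stricter detection: fewer false triggers from background noise.
        return {
            "silero_sensitivity": 0.2,            # assumed 0.0-1.0, lower = less sensitive
            "webrtc_sensitivity": 3,              # assumed 0-3, higher = more aggressive filtering
            "post_speech_silence_duration": 0.8,  # seconds of silence that end an utterance
        }
    # Quiet room: react to softer speech and close utterances sooner.
    return {
        "silero_sensitivity": 0.6,
        "webrtc_sensitivity": 2,
        "post_speech_silence_duration": 0.4,
    }
```

Under these assumptions the result would be passed as AudioToTextRecorder(**vad_settings(noisy_room=True)).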

Real-time Transcription

Stream interim text to your UI while speech is still happening, using a fast lightweight model for live updates and a larger accurate model for the final transcript — all from one recorder instance.
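
A minimal sketch of interim updates follows. enable_realtime_transcription is named later on this page; the callback parameter on_realtime_transcription_update is assumed from the upstream README and may differ in your version. The --live flag is a convention of this sketch only, so that importing the file never opens the microphone.

```python
import sys

interim_updates: list[str] = []


def on_interim(text: str) -> None:
    """Receive partial text while speech is still in progress."""
    interim_updates.append(text)
    # Overwrite the current console line so the display updates in place.
    print(f"\r{text}", end="", flush=True)


def main() -> None:
    # Imported lazily so the callback can be tested without the library installed.
    from RealtimeSTT import AudioToTextRecorder

    with AudioToTextRecorder(
        enable_realtime_transcription=True,
        on_realtime_transcription_update=on_interim,  # assumed parameter name
    ) as recorder:
        final_text = recorder.text()  # blocks until the utterance is complete
    print(f"\nFinal: {final_text}")


if __name__ == "__main__" and "--live" in sys.argv:
    main()
```

The callback is kept small on purpose: it likely runs off the main thread, so heavy UI work is better handed to a queue.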

Wake Word Activation

Keep the recorder idle until a trigger phrase is spoken. Supports Porcupine (via pvporcupine) and OpenWakeWord (openwakeword) as pluggable backends, selectable at construction time.
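
Selecting a backend at construction time might look like the sketch below. The parameter names wake_words and wakeword_backend are assumed from the upstream README, the accepted backend strings are a guess based on the package names above, and wake_word_kwargs is a hypothetical helper. OpenWakeWord may also need model files configured separately.

```python
import sys

# Backend identifiers based on the package names in the text above; whether the
# constructor accepts exactly these strings is an assumption.
KNOWN_BACKENDS = {"pvporcupine", "openwakeword"}


def wake_word_kwargs(phrase: str, backend: str = "pvporcupine") -> dict:
    """Build keyword arguments that keep the recorder idle until `phrase` is heard."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown wake word backend: {backend!r}")
    return {"wake_words": phrase, "wakeword_backend": backend}


def main() -> None:
    from RealtimeSTT import AudioToTextRecorder  # imported lazily

    with AudioToTextRecorder(**wake_word_kwargs("jarvis")) as recorder:
        print('Say "jarvis", then speak.')
        print(recorder.text())


if __name__ == "__main__" and "--live" in sys.argv:
    main()
```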

External Audio Input

Disable the microphone and feed raw 16-bit mono PCM chunks directly into the recorder from a file, websocket, process pipeline, or any other audio source. Automatic resampling is available when the source sample rate differs from 16 kHz.
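
The sketch below feeds a WAV file to the recorder in chunks. The chunking uses only the standard library; the recorder calls (use_microphone=False and feed_audio with original_sample_rate) are assumed from the upstream README rather than stated on this page. silence_wav is a hypothetical helper for dry runs without a real recording.

```python
import io
import sys
import wave
from typing import Iterator


def silence_wav(frames: int, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory 16-bit mono WAV of silence, handy for dry runs."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * frames)
    buffer.seek(0)
    return buffer


def pcm_chunks(source, frames_per_chunk: int = 1024) -> Iterator[bytes]:
    """Yield raw 16-bit mono PCM chunks from a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


def main(path: str) -> None:
    from RealtimeSTT import AudioToTextRecorder  # imported lazily

    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    with AudioToTextRecorder(use_microphone=False) as recorder:  # assumed parameter
        for chunk in pcm_chunks(path):
            # Assumed signature: the recorder resamples when rate != 16000.
            recorder.feed_audio(chunk, original_sample_rate=rate)
        print(recorder.text())


if __name__ == "__main__" and "--live" in sys.argv:
    main(sys.argv[-1])
```

If the file ends without trailing silence, text() may keep waiting for the end of the utterance; feeding from a separate thread or appending a short stretch of silence are two possible workarounds.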

Multiple Transcription Engines

Ships adapters for faster-whisper, whisper.cpp, OpenAI Whisper, Moonshine, sherpa-onnx, Parakeet NeMo, Meta Omnilingual ASR, Granite, Qwen, Cohere Transcribe, and Kroko-ONNX. Switch engines by changing one constructor parameter.

FastAPI Streaming Server

A reference browser streaming server in example_fastapi_server provides multi-user session isolation, shared inference resources, metrics, and health endpoints — ready to clone and extend for production deployments.

Architecture Overview

Audio flows through RealtimeSTT in a straightforward pipeline. Microphone input (or application-fed PCM chunks) enters a ring buffer managed by a background thread. WebRTC VAD continuously inspects each 10 ms frame to detect speech onset with minimal latency, while Silero VAD applies a more accurate neural model to confirm and refine speech boundaries. Once a complete utterance is detected, the buffered audio is handed off to the selected transcription engine — faster-whisper by default — which runs in a separate process to avoid blocking your main thread. The final transcript is returned from text() or delivered to your callback, and optional realtime updates fire on each interim chunk if enable_realtime_transcription is enabled.
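
The flow described above can be sketched in plain Python. This is not RealtimeSTT's implementation: fast_detector and accurate_detector are hypothetical stand-ins for WebRTC VAD and Silero VAD (simple energy thresholds here), the frame counts are arbitrary, and the real library's rules for confirming and closing an utterance may differ.

```python
from collections import deque
from typing import Iterable, Iterator, List

Frame = List[int]  # one frame of 16-bit PCM samples


def energy(frame: Frame) -> float:
    """Mean absolute amplitude of a frame."""
    return sum(abs(s) for s in frame) / len(frame) if frame else 0.0


def fast_detector(frame: Frame) -> bool:
    """Cheap first-stage check (stand-in for WebRTC VAD)."""
    return energy(frame) > 500


def accurate_detector(frame: Frame) -> bool:
    """Stricter second-stage check (stand-in for Silero VAD)."""
    return energy(frame) > 1000


def utterances(
    frames: Iterable[Frame], pre_roll: int = 2, end_silence: int = 3
) -> Iterator[List[Frame]]:
    """Group frames into utterances with a two-stage gate.

    An utterance starts only when both detectors accept a frame, and ends
    after `end_silence` consecutive frames rejected by the fast detector.
    """
    ring: deque = deque(maxlen=pre_roll)  # most recent idle frames (pre-roll)
    current: List[Frame] = []
    silent = 0
    recording = False
    for frame in frames:
        if not recording:
            if fast_detector(frame) and accurate_detector(frame):
                recording = True
                current = list(ring) + [frame]  # keep audio just before onset
                ring.clear()
                silent = 0
            else:
                ring.append(frame)
        else:
            current.append(frame)
            silent = 0 if fast_detector(frame) else silent + 1
            if silent >= end_silence:
                yield current  # hand the utterance to a transcriber here
                recording = False
                current = []
    if recording and current:
        yield current  # flush an utterance cut off by end of input
```

The pre-roll buffer is the notable design choice: because onset is only confirmed once a frame passes both checks, keeping a few earlier frames avoids clipping the first syllable.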

Minimal Usage Example

The simplest possible use: record one utterance from the microphone and print the transcript.

from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    with AudioToTextRecorder() as recorder:
        print(recorder.text())

The with statement starts the recorder, text() blocks until a full utterance is detected, and the context manager cleanly shuts everything down on exit. Always use the if __name__ == "__main__": guard. RealtimeSTT uses multiprocessing internally, and the guard is required for correct behaviour, especially on Windows.

Public API

RealtimeSTT exposes six public names from its top-level package:

| Export | Description |
| --- | --- |
| AudioToTextRecorder | Main recorder class. Manages VAD, buffering, transcription, and callbacks. Accepts microphone or application-fed audio. |
| AudioToTextRecorderClient | Thin client that connects to a running stt-server process instead of running a local model. |
| AudioInput | Low-level audio input abstraction used internally and available for custom audio source integrations. |
| RealtimeSpeechBoundaryDetector | Core VAD and boundary detection logic, exposed for use outside the full recorder when only boundary detection is needed. |
| SpeechBoundaryEvent | Enum of boundary events emitted by RealtimeSpeechBoundaryDetector. |
| SpeechBoundaryResult | Dataclass returned by boundary detection calls, carrying timing and confidence information. |

Next Steps

Installation

Set up your Python environment, install the right extras for your platform, and verify the install with a quick smoke test.

Quickstart

Walk through the four core recording patterns: single utterance, continuous dictation, real-time interim text, and external audio feeding.
