RealtimeSTT: Fast Speech-to-Text for Python Applications

RealtimeSTT is a Python speech-to-text library designed for applications that need fast, low-latency voice recognition without sacrificing flexibility. It targets developers building voice assistants, dictation tools, browser streaming servers, and any prototype that needs to turn spoken words into text with only a few lines of code. Key differentiators include a dual-layer voice activity detection (VAD) pipeline using WebRTC VAD and Silero VAD, support for multiple transcription backends through a simple extras system, optional wake word activation, and the ability to ingest audio from any source — not just a microphone.

Key Features

Voice Activity Detection

Dual-layer VAD combining WebRTC VAD for low-latency onset detection with Silero VAD for accurate speech boundary decisions. Configurable sensitivity and silence thresholds let you tune for your acoustic environment.
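
As a rough sketch of what that tuning might look like, the settings below collect the sensitivity and silence knobs in one place. The keyword names (`webrtc_sensitivity`, `silero_sensitivity`, `post_speech_silence_duration`, `min_length_of_recording`) are assumptions to check against the `AudioToTextRecorder` signature in your installed version, and the values are illustrative rather than recommended defaults. `make_recorder` is a hypothetical helper, not part of the library.

```python
# Illustrative VAD settings. Names and values are assumptions to verify
# against the AudioToTextRecorder signature in your installed version.
VAD_SETTINGS = {
    "webrtc_sensitivity": 3,              # 0 (least aggressive) to 3 (most aggressive)
    "silero_sensitivity": 0.4,            # 0.0 to 1.0
    "post_speech_silence_duration": 0.6,  # seconds of silence that end an utterance
    "min_length_of_recording": 0.5,       # minimum recording length in seconds
}


def make_recorder(**overrides):
    """Build a recorder from VAD_SETTINGS, with optional per-call overrides."""
    # Imported lazily so this module still loads where RealtimeSTT is absent.
    from RealtimeSTT import AudioToTextRecorder

    return AudioToTextRecorder(**{**VAD_SETTINGS, **overrides})
```

Call `make_recorder()` from inside an `if __name__ == "__main__":` guard, for example `make_recorder(post_speech_silence_duration=1.0)` for a speaker who pauses often.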

Real-time Transcription

Stream interim text to your UI while speech is still happening, using a fast lightweight model for live updates and a larger accurate model for the final transcript — all from one recorder instance.
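
One way a UI could consume both streams is sketched below: interim text is treated as provisional and overwritten on every update, while final text is appended. `LiveTranscript` and `make_realtime_recorder` are hypothetical helpers. Apart from `enable_realtime_transcription`, the keyword names (`model`, `realtime_model_type`, `on_realtime_transcription_update`) and the model sizes are assumptions to verify against your installed version.

```python
class LiveTranscript:
    """Holds the latest interim text and the committed final sentences."""

    def __init__(self):
        self.interim = ""
        self.final = []

    def on_interim(self, text):
        # Interim text is provisional: each update replaces the previous one.
        self.interim = text.strip()

    def on_final(self, text):
        # The final transcript supersedes whatever interim text was shown.
        self.final.append(text.strip())
        self.interim = ""

    def render(self):
        """Return committed sentences followed by any pending interim text."""
        pending = [self.interim] if self.interim else []
        return " ".join(self.final + pending)


def make_realtime_recorder(view):
    """Wire a LiveTranscript into a recorder (keyword names are assumptions)."""
    # Imported lazily so this module still loads where RealtimeSTT is absent.
    from RealtimeSTT import AudioToTextRecorder

    return AudioToTextRecorder(
        model="small",                       # assumed: model for the final pass
        realtime_model_type="tiny",          # assumed: lightweight live model
        enable_realtime_transcription=True,
        on_realtime_transcription_update=view.on_interim,
    )
```

Assuming `text()` accepts a completion callback, passing `view.on_final` to it would commit each finished utterance; otherwise call `view.on_final(recorder.text())`.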

Wake Word Activation

Keep the recorder idle until a trigger phrase is spoken. Supports Porcupine (via pvporcupine) and OpenWakeWord (openwakeword) as pluggable backends, selectable at construction time.
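
Selecting a backend at construction time might look like the sketch below. `wake_word_kwargs` is a hypothetical helper, and the keyword names (`wakeword_backend`, `wake_words`, `wake_words_sensitivity`) and the backend identifiers `"pvporcupine"` and `"oww"` are assumptions to confirm against your installed version. OpenWakeWord may also need model files to be supplied, which this sketch leaves out.

```python
def wake_word_kwargs(backend, phrase, sensitivity=0.6):
    """Return constructor keyword arguments for one wake word backend.

    The keyword names and backend identifiers are assumptions; check them
    against the AudioToTextRecorder signature before relying on them.
    """
    if backend not in ("pvporcupine", "oww"):
        raise ValueError(f"unknown wake word backend: {backend!r}")
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0.0 and 1.0")
    return {
        "wakeword_backend": backend,
        "wake_words": phrase,
        "wake_words_sensitivity": sensitivity,
    }
```

The result would then be unpacked into the constructor, for example `AudioToTextRecorder(**wake_word_kwargs("pvporcupine", "jarvis"))`.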

External Audio Input

Disable the microphone and feed raw 16-bit mono PCM chunks directly into the recorder from a file, websocket, process pipeline, or any other audio source. Automatic resampling is available when the source sample rate differs from 16 kHz.
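
A minimal sketch of the file case, using only the standard library to cut a WAV file into raw 16-bit mono PCM chunks. It assumes the recorder was created with the microphone disabled (for instance via a `use_microphone=False` argument) and exposes a `feed_audio(chunk)` method; both names are assumptions to verify. `make_test_wav` and `feed_wav` are hypothetical helpers, and the sketch requires 16 kHz input rather than relying on resampling.

```python
import io
import math
import struct
import wave


def make_test_wav(seconds=0.5, rate=16000, freq=440.0):
    """Return an in-memory 16-bit mono WAV holding a sine tone."""
    n = int(seconds * rate)
    samples = (int(12000 * math.sin(2 * math.pi * freq * i / rate)) for i in range(n))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    buf.seek(0)
    return buf


def feed_wav(recorder, wav_file, frames_per_chunk=512):
    """Feed a 16 kHz, 16-bit mono WAV into `recorder`; return the chunk count."""
    chunks = 0
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        if wav.getframerate() != 16000:
            raise ValueError("this sketch expects 16 kHz audio")
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            recorder.feed_audio(chunk)  # assumed method name
            chunks += 1
    return chunks
```

Because `feed_wav` only needs an object with a `feed_audio` method, the same loop can be exercised with a stub before wiring in a real recorder, and the chunk source can be swapped for a websocket or pipe.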

Multiple Transcription Engines

Ships adapters for faster-whisper, whisper.cpp, OpenAI Whisper, Moonshine, sherpa-onnx, Parakeet NeMo, Meta Omnilingual ASR, Granite, Qwen, Cohere Transcribe, and Kroko-ONNX. Switch engines by changing one constructor parameter.

FastAPI Streaming Server

A reference browser streaming server in example_fastapi_server provides multi-user session isolation, shared inference resources, metrics, and health endpoints — ready to clone and extend for production deployments.

Architecture Overview

Audio flows through RealtimeSTT in a straightforward pipeline. Microphone input (or application-fed PCM chunks) enters a ring buffer managed by a background thread. WebRTC VAD continuously inspects each 10 ms frame to detect speech onset with minimal latency, while Silero VAD applies a more accurate neural model to confirm and refine speech boundaries. Once a complete utterance is detected, the buffered audio is handed off to the selected transcription engine — faster-whisper by default — which runs in a separate process to avoid blocking your main thread. The final transcript is returned from text() or delivered to your callback, and optional realtime updates fire on each interim chunk if enable_realtime_transcription is enabled.
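
The control flow of that pipeline can be illustrated with a small, self-contained sketch. It is not the library's implementation: a plain RMS level check stands in for WebRTC VAD (the cheap onset stage), a stricter RMS threshold stands in for Silero VAD (the confirmation stage), and nothing runs in a separate thread or process. All names and threshold values here are hypothetical.

```python
from collections import deque


def rms(frame):
    """Root-mean-square level of a frame of integer samples."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5 if frame else 0.0


def segment_utterances(frames, onset_level=500, confirm_level=1500,
                       silence_frames=3, preroll_frames=2):
    """Yield utterances (lists of frames) found in a stream of frames."""
    preroll = deque(maxlen=preroll_frames)  # ring buffer of recent idle frames
    current, quiet, confirmed = None, 0, False
    for frame in frames:
        level = rms(frame)
        if current is None:
            if level >= onset_level:  # stage 1: cheap onset check
                current = list(preroll) + [frame]
                confirmed = level >= confirm_level
                quiet = 0
                preroll.clear()
            else:
                preroll.append(frame)
            continue
        current.append(frame)
        # Stage 2: a stricter check must pass at least once per utterance.
        confirmed = confirmed or level >= confirm_level
        quiet = quiet + 1 if level < onset_level else 0
        if quiet >= silence_frames:  # trailing silence closes the utterance
            if confirmed:
                yield current
            current = None
    if current is not None and confirmed:
        yield current
```

Each yielded utterance includes the pre-roll frames captured before onset, which is why a ring buffer is useful: the first syllable is not clipped while the detector is still deciding. Segments that trip the onset check but never pass the stricter check are dropped instead of being sent for transcription.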

Minimal Usage Example

The simplest possible use: record one utterance from the microphone and print the transcript.

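from RealtimeSTT import AudioToTextRecorder

# The guard keeps child processes from re-running this block on import.
if __name__ == "__main__":
    with AudioToTextRecorder() as recorder:
        # Blocks until one complete utterance has been transcribed.
        print(recorder.text())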

The with statement starts the recorder, text() blocks until a full utterance is detected, and the context manager cleanly shuts everything down on exit. Always use the if __name__ == "__main__": guard — RealtimeSTT uses multiprocessing internally and the guard is required for correct behaviour, especially on Windows.

Public API

RealtimeSTT exposes six public names from its top-level package:

| Export | Description |
| --- | --- |
| `AudioToTextRecorder` | Main recorder class. Manages VAD, buffering, transcription, and callbacks. Accepts microphone or application-fed audio. |
| `AudioToTextRecorderClient` | Thin client that connects to a running `stt-server` process instead of running a local model. |
| `AudioInput` | Low-level audio input abstraction used internally and available for custom audio source integrations. |
| `RealtimeSpeechBoundaryDetector` | Core VAD and boundary detection logic, exposed for use outside the full recorder when only boundary detection is needed. |
| `SpeechBoundaryEvent` | Enum of boundary events emitted by `RealtimeSpeechBoundaryDetector`. |
| `SpeechBoundaryResult` | Dataclass returned by boundary detection calls, carrying timing and confidence information. |

Next Steps

Installation

Set up your Python environment, install the right extras for your platform, and verify the install with a quick smoke test.

Quickstart

Walk through the four core recording patterns: single utterance, continuous dictation, real-time interim text, and external audio feeding.
