This guide covers the four main usage patterns you will reach for when building with RealtimeSTT: capturing a single utterance, running a continuous dictation loop, streaming real-time interim text, and feeding audio from an external source instead of a microphone. Each pattern builds on the previous one, so read through in order if you are new to the library.
Note: Always wrap runnable scripts in an if __name__ == "__main__": guard. RealtimeSTT uses multiprocessing to run model inference in a separate process, and Python needs this guard to start child processes correctly. This matters most on Windows and macOS, where the default start method is spawn.
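
To check which start method applies on your platform, a minimal standard-library sketch can report it from inside the guard. It makes no RealtimeSTT calls, and describe_start_method is an illustrative name rather than part of the library:

```python
import multiprocessing


def describe_start_method():
    """Return the name of the start method Python uses for child processes."""
    # Possible values are "fork", "spawn", and "forkserver".
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    # Under "spawn", each child process re-imports this file, so any code
    # outside this guard would run again in every child.
    print("start method:", describe_start_method())
```

If this prints spawn, every top-level statement in your script runs again in each child process, which is why recorder construction belongs inside the guard.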

Single Utterance

1. Install RealtimeSTT

Install the library with the faster-whisper backend, which is the recommended default for local Whisper transcription:
pip install "RealtimeSTT[faster-whisper]"

2. Import the recorder

AudioToTextRecorder is the only import you need for microphone-based recording:
from RealtimeSTT import AudioToTextRecorder

3. Speak into your microphone

Open the recorder with a context manager. RealtimeSTT starts listening for voice activity immediately. Call text() and it blocks until a full utterance is detected and transcribed:
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    with AudioToTextRecorder() as recorder:
        print("Speak now")
        print(recorder.text())

4. Get the final transcript

text() returns a string containing the complete transcript once the utterance ends. The context manager calls shutdown() automatically when the with block exits, cleaning up audio and model resources.

Continuous Dictation Loop

For applications that need to keep listening across multiple utterances, pass a callback to text() instead of collecting the return value directly:
from RealtimeSTT import AudioToTextRecorder


def process_text(text):
    print(text)


if __name__ == "__main__":
    recorder = AudioToTextRecorder()

    while True:
        recorder.text(process_text)
When a callback is supplied, text() still waits for an utterance to be recorded, but it delivers the transcript to process_text asynchronously instead of returning it, so the while True loop can go back to listening sooner. This is the preferred form for continuous dictation because the recorder keeps buffering incoming audio while the previous utterance is being handled, so no speech is lost between utterances. Without a callback, text() blocks until the transcript is ready and the loop pauses until it returns.
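
Because the callback is dispatched asynchronously, a thread-safe queue is one way to hand transcripts to the rest of an application. The following is a sketch under that assumption, not the library's prescribed pattern. make_enqueue_callback and drain are hypothetical helpers, and only text() and shutdown() are taken from this guide:

```python
import queue
import threading


def make_enqueue_callback(transcripts):
    """Build a callback that pushes each non-empty transcript onto a queue."""
    def on_text(text):
        if text and text.strip():
            transcripts.put(text.strip())
    return on_text


def drain(transcripts, handle, stop_event, poll_seconds=0.1):
    """Consume transcripts until stop_event is set and the queue is empty."""
    while not (stop_event.is_set() and transcripts.empty()):
        try:
            item = transcripts.get(timeout=poll_seconds)
        except queue.Empty:
            continue
        handle(item)


def main():
    # Lazy import: the helpers above have no RealtimeSTT dependency.
    from RealtimeSTT import AudioToTextRecorder

    transcripts = queue.Queue()
    stop_event = threading.Event()
    worker = threading.Thread(
        target=drain, args=(transcripts, print, stop_event), daemon=True
    )
    worker.start()

    recorder = AudioToTextRecorder()
    on_text = make_enqueue_callback(transcripts)
    try:
        while True:
            recorder.text(on_text)
    except KeyboardInterrupt:
        pass
    finally:
        stop_event.set()
        recorder.shutdown()
```

Keeping the callback to a single put() call means slow downstream work, such as writing to disk or calling another service, happens on the worker thread rather than in the callback itself.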

Real-time Interim Transcription

Set enable_realtime_transcription=True to receive live text updates while the user is still speaking. A fast, lightweight model handles the interim updates; a larger, more accurate model produces the final result once the utterance ends:
from RealtimeSTT import AudioToTextRecorder


def update(text):
    print("live:", text)


if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        enable_realtime_transcription=True,
        on_realtime_transcription_update=update,
        realtime_model_type="tiny.en",
        model="small.en",
    )

    while True:
        print("final:", recorder.text())
on_realtime_transcription_update fires repeatedly as each new interim chunk is produced. text() still returns (or delivers to a callback) the single authoritative final transcript once the utterance is complete. Using a smaller realtime_model_type than model keeps the interim updates fast without sacrificing accuracy in the final result.
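
Printing every interim update on its own line gets noisy quickly. One possible approach, sketched below, redraws a single console line and skips updates that did not change. InterimDisplay is a hypothetical helper, and the recorder arguments are the ones shown above:

```python
import sys


class InterimDisplay:
    """Track the latest interim text and format single-line console updates."""

    def __init__(self):
        self._last = ""

    def render(self, text):
        """Return the string to write for an update, or None if unchanged."""
        text = text.strip()
        if text == self._last:
            return None  # skip redundant redraws
        # Pad with spaces so a shorter update fully overwrites the previous one.
        padding = " " * max(0, len(self._last) - len(text))
        self._last = text
        return "\r" + text + padding

    def finish(self, final_text):
        """Return the string that replaces the interim line with the final text."""
        padding = " " * max(0, len(self._last) - len(final_text))
        self._last = ""
        return "\r" + final_text + padding + "\n"


def main():
    from RealtimeSTT import AudioToTextRecorder

    display = InterimDisplay()

    def update(text):
        line = display.render(text)
        if line is not None:
            sys.stdout.write(line)
            sys.stdout.flush()

    recorder = AudioToTextRecorder(
        enable_realtime_transcription=True,
        on_realtime_transcription_update=update,
        realtime_model_type="tiny.en",
        model="small.en",
    )
    while True:
        sys.stdout.write(display.finish(recorder.text()))
        sys.stdout.flush()
```

The carriage return moves the cursor back to the start of the line, so each interim update overwrites the last one and the final transcript replaces the interim text in place.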

External Audio (No Microphone)

Set use_microphone=False when audio arrives from a file, websocket, process pipeline, or any other non-microphone source. Feed raw 16-bit mono PCM chunks with feed_audio():
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(use_microphone=False)

    with open("audio_chunk.pcm", "rb") as audio_file:
        recorder.feed_audio(audio_file.read(), original_sample_rate=16000)

    print(recorder.text())
    recorder.shutdown()
The example passes original_sample_rate=16000 because the file is assumed to already contain 16 kHz audio. When your source uses a different rate, pass that rate instead and RealtimeSTT resamples to 16 kHz internally before processing. When not using the context manager, call recorder.shutdown() explicitly to release audio and model resources.
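
For sources that arrive incrementally, you would typically feed audio in chunks rather than in one read. As a sketch, the standard-library wave module can read a 16-bit mono WAV file in fixed-size chunks and report its sample rate. iter_pcm_chunks and the file name speech.wav are hypothetical; feed_audio() and original_sample_rate are used as shown above:

```python
import wave


def iter_pcm_chunks(wav_path, frames_per_chunk=1024):
    """Yield (pcm_bytes, sample_rate) pairs from a 16-bit mono WAV file."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        sample_rate = wav.getframerate()
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk, sample_rate


def main(wav_path="speech.wav"):  # hypothetical file name
    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(use_microphone=False)
    try:
        for chunk, sample_rate in iter_pcm_chunks(wav_path):
            recorder.feed_audio(chunk, original_sample_rate=sample_rate)
        print(recorder.text())
    finally:
        recorder.shutdown()
```

Reading the rate from the file header avoids hard-coding 16000. If the audio ends without a pause, text() may keep waiting for the end of the utterance, so test with a recording that has some trailing silence.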

Context Manager vs. Manual Shutdown

Both forms are equivalent; choose based on whether the recorder lifetime matches a single code block.

Context manager: use when the recorder starts and stops in the same block. Shutdown is automatic and exception-safe:
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    with AudioToTextRecorder() as recorder:
        print(recorder.text())
    # recorder.shutdown() called automatically here
Manual shutdown: use when the recorder is a long-lived object, passed between functions, or conditionally reused:
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder()
    recorder.start()
    input("Press Enter to stop recording...")
    recorder.stop()
    print(recorder.text())
    recorder.shutdown()
start() and stop() let your application explicitly control when recording begins and ends, rather than relying on automatic VAD-triggered onset detection.

Common Configuration Parameters

The table below covers the parameters most useful during initial development. A full reference for every constructor parameter is in the configuration guide.
| Parameter | Default | Effect |
| --- | --- | --- |
| model | "tiny" | Whisper model size for final transcription (tiny, base, small, medium, large-v2, etc.). Smaller models are faster but less accurate. |
| language | "" (auto-detect) | ISO 639-1 language code (e.g. "en", "de", "fr"). Set explicitly to skip auto-detection overhead. |
| enable_realtime_transcription | False | Enables live interim text updates via on_realtime_transcription_update. |
| post_speech_silence_duration | 0.6 | Seconds of silence after speech ends before the utterance is considered complete and transcription begins. Lower values feel more responsive; higher values reduce false end-of-speech cuts. |
| silero_sensitivity | 0.4 | Silero VAD sensitivity (0.0–1.0). Higher values are more sensitive and trigger recording more readily; lower values require more confident speech. |
| transcription_engine | "faster_whisper" | Selects the transcription backend. Other values include "whisper_cpp", "openai_whisper", "moonshine", "kroko_onnx", etc. |
A quick example combining several of these:
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        model="small.en",
        language="en",
        enable_realtime_transcription=True,
        post_speech_silence_duration=0.4,
        silero_sensitivity=0.5,
    )

    while True:
        print(recorder.text())
The full parameter reference — including all VAD timing knobs, callback hooks, wake word options, logging controls, and executor injection — is documented in the configuration reference.
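
When several parts of an application construct recorders, grouping these parameters into named presets can keep the settings consistent. This is a sketch only: PRESETS, the preset names, and recorder_kwargs are hypothetical, and the values are illustrative rather than recommended. Every key is a parameter from the table above:

```python
# Hypothetical presets built only from parameters listed in the table above.
PRESETS = {
    "responsive": {
        "model": "tiny.en",
        "language": "en",
        "post_speech_silence_duration": 0.4,
    },
    "accurate": {
        "model": "small.en",
        "language": "en",
        "post_speech_silence_duration": 0.6,
    },
}


def recorder_kwargs(preset, **overrides):
    """Return constructor keyword arguments for a preset, overrides applied last."""
    if preset not in PRESETS:
        raise KeyError(f"unknown preset: {preset!r}")
    return {**PRESETS[preset], **overrides}


def main():
    from RealtimeSTT import AudioToTextRecorder

    kwargs = recorder_kwargs("responsive", silero_sensitivity=0.5)
    with AudioToTextRecorder(**kwargs) as recorder:
        print(recorder.text())
```

Because overrides are merged last, a caller can start from a preset and change a single value without copying the whole dictionary.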
