By default, RealtimeSTT delivers a single final transcription after the speaker finishes talking. Real-time transcription mode adds a second stream of interim updates that arrive while the utterance is still in progress, which is useful for live captioning, streaming dictation UIs, and any application that should display text before the speaker pauses. Interim results are approximate; the final transcription is authoritative.
Basic Setup
Set enable_realtime_transcription=True and supply at least one of the two
realtime callbacks, on_realtime_transcription_update or
on_realtime_transcription_stabilized.
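A minimal sketch of that setup follows. The two keyword arguments come from this page; the AudioToTextRecorder class, its use as a context manager, and the blocking text() call are not described in this section and are assumed here from general RealtimeSTT usage. The helper names (show_interim, basic_realtime_kwargs, run) are illustrative only.

```python
def show_interim(text: str) -> None:
    """Print the latest interim transcript, overwriting the previous one."""
    print(f"\r{text}", end="", flush=True)


def basic_realtime_kwargs() -> dict:
    """Keyword arguments that switch on interim updates with one callback."""
    return {
        "enable_realtime_transcription": True,
        "on_realtime_transcription_update": show_interim,
    }


def run() -> None:
    """Open the microphone and print interim and final text until Ctrl+C."""
    # Imported lazily so the helpers above work without audio hardware.
    # Assumption: AudioToTextRecorder is the recorder class and supports `with`.
    from RealtimeSTT import AudioToTextRecorder

    with AudioToTextRecorder(**basic_realtime_kwargs()) as recorder:
        try:
            while True:
                # Assumption: text() blocks until the utterance ends and
                # returns the final (authoritative) transcription.
                print(f"\nFinal: {recorder.text()}")
        except KeyboardInterrupt:
            pass
```

Call run() from under an `if __name__ == "__main__":` guard in your own script, since libraries that start worker processes generally need the entry point protected that way.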
Using a Separate Realtime Model
By default, RealtimeSTT loads a lightweight "tiny" Whisper model for interim
updates so that the main transcription model stays free for final work. You can
change the realtime model independently through realtime_model_type.
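One way to express such a configuration is sketched below. realtime_model_type and realtime_processing_pause are taken from this page; the model keyword for the main model and the model names "small" and "base" are assumptions chosen for illustration, and the helper name is hypothetical.

```python
def separate_model_kwargs(
    main_model: str = "small",
    realtime_model: str = "tiny",
    pause_seconds: float = 0.2,
) -> dict:
    """Build recorder keyword arguments with distinct final and interim models."""
    if pause_seconds <= 0:
        raise ValueError("pause_seconds must be positive")
    return {
        # Assumption: `model` selects the main (final) transcription model.
        "model": main_model,
        "enable_realtime_transcription": True,
        # Model name or path used only for interim updates.
        "realtime_model_type": realtime_model,
        # Seconds between realtime inference attempts.
        "realtime_processing_pause": pause_seconds,
        # At least one realtime callback is needed; print is the simplest.
        "on_realtime_transcription_update": print,
    }


# Example: a slightly larger interim model that runs twice as often.
faster_updates = separate_model_kwargs(realtime_model="base", pause_seconds=0.1)
```

The resulting dictionary is meant to be unpacked into the recorder constructor, for example `AudioToTextRecorder(**faster_updates)`.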
realtime_processing_pause controls how often the realtime model runs. Lower
values produce more frequent updates at the cost of higher CPU/GPU usage.
Stabilized vs Raw Updates
Two callbacks expose different levels of interim text processing:

| Callback | What you receive |
|---|---|
| on_realtime_transcription_update | Raw interim transcript, updated every realtime_processing_pause seconds. Can change significantly between calls as context builds. |
| on_realtime_transcription_stabilized | Smoothed output that changes more conservatively. Earlier portions of the text are “locked in” as confidence grows, so the display flickers less. |

Use on_realtime_transcription_update when you want maximum immediacy.
Use on_realtime_transcription_stabilized when display stability matters more
than raw latency.
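The two streams can also be combined in one display. The sketch below keeps the latest value from each callback and renders the stabilized text first, followed by any extra raw text in brackets. The CaptionState class and the bracket convention are illustrative choices, not part of RealtimeSTT, and the code does not assume the stabilized text is always a prefix of the raw text.

```python
from dataclasses import dataclass


@dataclass
class CaptionState:
    """Holds the most recent raw and stabilized interim transcripts."""

    raw: str = ""
    stabilized: str = ""

    def on_raw(self, text: str) -> None:
        # Intended for on_realtime_transcription_update.
        self.raw = text

    def on_stabilized(self, text: str) -> None:
        # Intended for on_realtime_transcription_stabilized.
        self.stabilized = text

    def render(self) -> str:
        """Stable text, plus the tentative raw tail in brackets when it extends it."""
        if self.raw.startswith(self.stabilized):
            tail = self.raw[len(self.stabilized):]
            return f"{self.stabilized}[{tail}]" if tail else self.stabilized
        # The raw stream diverged from the stable text: prefer the stable one.
        return self.stabilized or self.raw

    def recorder_kwargs(self) -> dict:
        """Keyword arguments that wire both callbacks to this state object."""
        return {
            "enable_realtime_transcription": True,
            "on_realtime_transcription_update": self.on_raw,
            "on_realtime_transcription_stabilized": self.on_stabilized,
        }
```

A UI would call render() whenever either callback fires, showing the bracketed tail in a lighter style so viewers can tell which words may still change.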
Syllable Boundary Scheduling
By default, interim transcription fires on a fixed timer (realtime_processing_pause). Setting realtime_transcription_use_syllable_boundaries=True
fires updates at acoustically natural pause points instead, with the timer kept
as a fallback cadence. This reduces wasted inference runs when the speaker is in
the middle of a word.
realtime_boundary_detector_sensitivity controls how readily a pause is
classified as a syllable boundary. Higher values trigger more frequent updates;
lower values are more conservative and reduce false boundaries in fast speech.
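A configuration along these lines might look like the following sketch. The parameter names and the defaults of 0.6 and 0.2 come from the Key Parameters table on this page; the helper name and the decision to reject sensitivities outside the documented 0 to 1 range are assumptions, since the library's own handling of out-of-range values is not described here.

```python
def boundary_kwargs(sensitivity: float = 0.6, fallback_pause: float = 0.2) -> dict:
    """Build recorder keyword arguments for syllable boundary scheduling."""
    # Assumption: reject values outside the documented 0..1 range up front.
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0 and 1")
    return {
        "enable_realtime_transcription": True,
        "realtime_transcription_use_syllable_boundaries": True,
        # 0 is conservative, 1 is eager.
        "realtime_boundary_detector_sensitivity": sensitivity,
        # With boundary detection on, the timer acts as a fallback cadence.
        "realtime_processing_pause": fallback_pause,
        "on_realtime_transcription_stabilized": print,
    }


# Example: fewer, more certain boundaries for a fast speaker.
conservative = boundary_kwargs(sensitivity=0.3)
```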
Two-Engine Setup
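You can use entirely different backends for final and realtime transcription. This is common when a high-accuracy model handles final results and a faster, lighter model handles interim updates. Set realtime_transcription_engine to choose the realtime backend; when it is left as None, it inherits transcription_engine.
Key Parameters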
| Parameter | Default | Description |
|---|---|---|
| enable_realtime_transcription | False | Enables interim transcription while recording is active. |
| realtime_model_type | "tiny" | Model name or path used for interim transcription. |
| realtime_processing_pause | 0.2 | Seconds between realtime inference attempts. When realtime_transcription_use_syllable_boundaries is True, this becomes a fallback cadence. |
| init_realtime_after_seconds | 0.2 | Delay after recording starts before the first interim update fires. |
| realtime_batch_size | 16 | Batch size for realtime inference. |
| beam_size_realtime | 3 | Beam size for realtime inference where supported. Lower values are faster. |
| realtime_transcription_use_syllable_boundaries | False | Fires realtime updates at detected acoustic boundaries instead of (only) on a fixed timer. |
| realtime_boundary_detector_sensitivity | 0.6 | Boundary detector sensitivity from 0 (conservative) to 1 (eager). |
| use_main_model_for_realtime | False | Reuse the main model for realtime work rather than loading a second model. |
| realtime_transcription_engine | None | Backend for realtime transcription. None inherits transcription_engine. |
| realtime_transcription_engine_options | None | Engine-specific options for the realtime backend. |
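
As a rough illustration of how the engine parameters in this table fit the Two-Engine Setup section, the sketch below assembles the relevant keyword arguments. Valid engine names are not listed on this page, so the function takes them as arguments and the example uses the placeholders "engine_a" and "engine_b", which are not real backend names. Treating realtime_transcription_engine_options as a dictionary is also an assumption.

```python
from typing import Any, Dict, Optional


def two_engine_kwargs(
    final_engine: str,
    realtime_engine: Optional[str] = None,
    realtime_options: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build recorder keyword arguments with separate final and realtime backends."""
    return {
        "enable_realtime_transcription": True,
        "transcription_engine": final_engine,
        # None means "inherit transcription_engine", per the table above.
        "realtime_transcription_engine": realtime_engine,
        # Assumption: engine-specific options are passed as a dict.
        "realtime_transcription_engine_options": realtime_options,
        "on_realtime_transcription_update": print,
    }


def effective_realtime_engine(kwargs: Dict[str, Any]) -> str:
    """Mirror the documented inheritance rule for the realtime backend."""
    return kwargs["realtime_transcription_engine"] or kwargs["transcription_engine"]


# Placeholder engine names; substitute backends your installation supports.
split = two_engine_kwargs("engine_a", realtime_engine="engine_b")
shared = two_engine_kwargs("engine_a")
```

Here effective_realtime_engine(split) resolves to the separate realtime backend, while effective_realtime_engine(shared) falls back to the main engine.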
