RealtimeSTT’s default mode reads audio from the local microphone. When audio comes from somewhere else — a file, a WebSocket connection, a browser stream, a telephony server, another process, or a test fixture — set use_microphone=False and push audio into the recorder yourself by calling feed_audio(). The recorder queues each chunk, runs VAD, and produces transcriptions exactly as it would with a live microphone.

Basic Setup

Construct the recorder with use_microphone=False, then call feed_audio() with raw PCM bytes. Call recorder.text() to block until the next final utterance is ready, and call recorder.shutdown() when the stream is finished.
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(use_microphone=False)

    with open("audio_chunk.pcm", "rb") as audio_file:
        recorder.feed_audio(audio_file.read(), original_sample_rate=16000)

    print(recorder.text())
    recorder.shutdown()

Audio Format Requirements

feed_audio() expects raw PCM audio in the following format:
| Property | Required value |
| --- | --- |
| Encoding | 16-bit signed PCM |
| Channels | Mono (1 channel) |
| Sample rate | 16 000 Hz |
| Byte order | Little-endian |
recorder.feed_audio(chunk, original_sample_rate=16000)
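If the source is a WAV file rather than headerless PCM, the properties in the table above can be checked before feeding. The following is a minimal sketch using Python's standard wave module; read_wav_pcm is a hypothetical helper name and is not part of RealtimeSTT.

```python
import wave


def read_wav_pcm(path: str) -> tuple[bytes, int]:
    """Return (pcm_bytes, sample_rate) for a mono 16-bit WAV file.

    Raises ValueError if the file is not mono or not 16-bit.
    WAV PCM data is little-endian, so no byte swapping is needed.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        return wav.readframes(wav.getnframes()), wav.getframerate()
```

The returned sample rate can then be passed as original_sample_rate when the PCM bytes are fed to the recorder.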
If your source audio is not at 16 kHz, pass the actual sample rate in the original_sample_rate argument and RealtimeSTT will resample the chunk before processing it. For example, browser microphone audio is commonly 48 kHz:
recorder.feed_audio(pcm_bytes, original_sample_rate=48000)
The original_sample_rate parameter only handles sample-rate conversion. Your audio must still be mono and 16-bit PCM before calling feed_audio(). Stereo or floating-point audio will produce garbled transcriptions.
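As an illustration of the conversion that has to happen before feed_audio(), the sketch below downmixes stereo floating-point audio to mono 16-bit PCM using only the standard library. It assumes the source delivers interleaved little-endian float32 samples in the range [-1.0, 1.0]; that layout and the helper name float32_stereo_to_int16_mono are assumptions for this example, not something RealtimeSTT prescribes.

```python
import sys
from array import array


def float32_stereo_to_int16_mono(raw: bytes) -> bytes:
    """Convert interleaved float32 stereo to little-endian 16-bit mono PCM."""
    samples = array("f")
    samples.frombytes(raw)  # raises ValueError if len(raw) is not a multiple of 4
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed to be little-endian

    mono = array("h")
    for left, right in zip(samples[0::2], samples[1::2]):
        value = (left + right) / 2.0           # average the two channels
        value = max(-1.0, min(1.0, value))     # clip out-of-range samples
        mono.append(round(value * 32767))      # scale to the int16 range

    if sys.byteorder == "big":
        mono.byteswap()  # emit little-endian bytes
    return mono.tobytes()
```

Sample-rate conversion is deliberately left out here, since original_sample_rate covers that step.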
For chunk sizing, aim for roughly 100 ms of audio per call: 1 600 samples, or about 3 200 bytes, of 16-bit mono audio at 16 kHz (proportionally more at higher sample rates). This gives VAD enough signal to react quickly without fragmenting the stream into too many tiny allocations.
Feeding an entire audio file as a single very large chunk delays VAD detection until the whole buffer has been processed. Prefer smaller chunks so voice activity detection and realtime updates keep pace with the audio.
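The chunk-size arithmetic can be sketched as follows. chunk_size_bytes and feed_in_chunks are hypothetical helper names introduced for this example; the recorder argument is any object exposing feed_audio(chunk, original_sample_rate=...) as described above.

```python
def chunk_size_bytes(sample_rate: int, chunk_ms: int = 100,
                     sample_width: int = 2, channels: int = 1) -> int:
    """Bytes needed for chunk_ms of PCM audio at the given sample rate."""
    frames = sample_rate * chunk_ms // 1000
    return frames * sample_width * channels


def feed_in_chunks(recorder, pcm: bytes, sample_rate: int, chunk_ms: int = 100) -> None:
    """Feed a large PCM buffer to the recorder in chunk_ms slices."""
    size = chunk_size_bytes(sample_rate, chunk_ms)
    # size is a multiple of the 2-byte sample width, so slices stay sample-aligned
    for start in range(0, len(pcm), size):
        recorder.feed_audio(pcm[start:start + size], original_sample_rate=sample_rate)
```

With the defaults, chunk_size_bytes(16000) gives 3200 and chunk_size_bytes(48000) gives 9600.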

Feeding Audio from a File

The following example reads a binary PCM file in 100 ms chunks and feeds each one to the recorder:
from RealtimeSTT import AudioToTextRecorder

CHUNK_BYTES = 3200  # 100 ms of 16-bit mono PCM at 16 kHz

if __name__ == "__main__":
    recorder = AudioToTextRecorder(use_microphone=False)

    with open("audio_stream.pcm", "rb") as audio_file:
        while True:
            chunk = audio_file.read(CHUNK_BYTES)
            if not chunk:
                break
            recorder.feed_audio(chunk, original_sample_rate=16000)

    print(recorder.text())
    recorder.shutdown()
Make sure the file includes enough trailing silence for VAD to finalize the last utterance. If text() never returns, try appending a second or two of silence to the stream, or lower post_speech_silence_duration in the constructor.
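One way to append trailing silence is to feed zero-valued samples after the last real chunk. The sketch below assumes digital silence (all-zero 16-bit samples) is enough for VAD to detect the end of speech; silence and feed_trailing_silence are hypothetical helper names, not RealtimeSTT functions.

```python
def silence(seconds: float, sample_rate: int = 16000) -> bytes:
    """Return the given duration of silence as 16-bit mono PCM."""
    return b"\x00\x00" * int(seconds * sample_rate)


def feed_trailing_silence(recorder, seconds: float = 1.0,
                          sample_rate: int = 16000, chunk_ms: int = 100) -> None:
    """Feed silence in chunk_ms slices so VAD can close the last utterance."""
    padding = silence(seconds, sample_rate)
    step = sample_rate * chunk_ms // 1000 * 2  # bytes per slice (2 bytes per sample)
    for start in range(0, len(padding), step):
        recorder.feed_audio(padding[start:start + step], original_sample_rate=sample_rate)
```

Call feed_trailing_silence(recorder) after the file loop and before recorder.text().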

Streaming Audio

For live sources such as WebSocket connections or inter-process pipes, feed chunks as they arrive and call text() on whatever application thread should wait for each final utterance:
def handle_pcm_packet(recorder, pcm_bytes, sample_rate):
    recorder.feed_audio(pcm_bytes, original_sample_rate=sample_rate)
The same pattern works for any iterator of PCM chunks:
from RealtimeSTT import AudioToTextRecorder


def pcm_chunks_from_process():
    while True:
        chunk = read_next_chunk_somehow()
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    recorder = AudioToTextRecorder(use_microphone=False)

    for chunk in pcm_chunks_from_process():
        recorder.feed_audio(chunk, original_sample_rate=16000)

    print(recorder.text())
    recorder.shutdown()
Replace read_next_chunk_somehow() with your pipe, socket, queue, or media framework integration. feed_audio() does not reorder chunks, so preserve arrival order when reading from parallel sources.
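If packets can arrive out of order, one option is to tag each chunk with a sequence number at the source and release chunks to the recorder only in sequence. The sketch below assumes such sequence numbers exist and start at 0; OrderedFeeder is a hypothetical class for illustration and is not thread-safe on its own.

```python
import heapq


class OrderedFeeder:
    """Buffer out-of-order chunks and feed them to the recorder in sequence."""

    def __init__(self, recorder, sample_rate: int = 16000, first_seq: int = 0):
        self._recorder = recorder
        self._sample_rate = sample_rate
        self._next_seq = first_seq
        self._pending = []  # min-heap of (seq, chunk)

    def push(self, seq: int, chunk: bytes) -> None:
        if seq < self._next_seq:
            return  # late duplicate of a chunk that was already fed
        heapq.heappush(self._pending, (seq, chunk))
        # Release every chunk that is now contiguous with what was already fed.
        while self._pending and self._pending[0][0] <= self._next_seq:
            ready_seq, ready = heapq.heappop(self._pending)
            if ready_seq == self._next_seq:
                self._recorder.feed_audio(ready, original_sample_rate=self._sample_rate)
                self._next_seq += 1
            # otherwise it duplicates a chunk fed a moment ago; drop it
```

A missing sequence number stalls this buffer indefinitely, so a real integration would also need a timeout or gap policy.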

Realtime Updates with External Audio

Enable enable_realtime_transcription to receive live interim text while audio is still arriving:
from RealtimeSTT import AudioToTextRecorder


def on_live(text):
    print("live:", text)


if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        use_microphone=False,
        enable_realtime_transcription=True,
        on_realtime_transcription_update=on_live,
    )

    # audio_chunks() stands in for your own iterator of 48 kHz mono 16-bit PCM chunks
    for chunk in audio_chunks():
        recorder.feed_audio(chunk, original_sample_rate=48000)

    print("final:", recorder.text())
    recorder.shutdown()
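Printing from the callback is fine for a demo; an application usually wants to keep the latest interim text somewhere another thread can read it. The sketch below assumes the callback may run on a background thread and may be called repeatedly with unchanged text; both are assumptions made for safety, and LiveTextBuffer is a hypothetical class.

```python
import threading


class LiveTextBuffer:
    """Thread-safe holder for the most recent interim transcription."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = ""
        self._updates = 0

    def on_live(self, text: str) -> None:
        """Callback to pass as on_realtime_transcription_update."""
        with self._lock:
            if text != self._latest:  # ignore repeats of the same text
                self._latest = text
                self._updates += 1

    def latest(self) -> str:
        with self._lock:
            return self._latest

    def update_count(self) -> int:
        with self._lock:
            return self._updates
```

To use it, create a LiveTextBuffer and pass its on_live method as on_realtime_transcription_update when constructing the recorder.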

Shutdown

When you use a context manager, __exit__ calls shutdown() automatically:
with AudioToTextRecorder(use_microphone=False) as recorder:
    # data: mono 16-bit PCM bytes from your own source
    recorder.feed_audio(data, original_sample_rate=16000)
    print(recorder.text())
If you construct the recorder outside a with block — as is typical when feeding audio in a loop — call shutdown() explicitly when the stream ends:
recorder = AudioToTextRecorder(use_microphone=False)
# ... feed audio, call text() ...
recorder.shutdown()
Skipping shutdown() will leave background threads running. Use one recorder per independent stream or session unless you are building a shared-engine server that injects its own executor callables.
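When the recorder is created outside a with block, a try/finally keeps shutdown() from being skipped if feeding raises. The sketch below wraps the feed loop from the earlier examples; transcribe_stream is a hypothetical helper name, and recorder is any object with the feed_audio(), text(), and shutdown() methods described on this page.

```python
def transcribe_stream(recorder, chunks, sample_rate: int = 16000) -> str:
    """Feed every chunk, return the next final utterance, and always shut down."""
    try:
        for chunk in chunks:
            recorder.feed_audio(chunk, original_sample_rate=sample_rate)
        return recorder.text()
    finally:
        # Runs on success and on error, so background threads are not left behind.
        recorder.shutdown()
```

Because the helper shuts the recorder down, it fits the one-recorder-per-stream pattern and should not be used with a recorder that is meant to be reused.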
