Beyond AudioToTextRecorder, the RealtimeSTT package exports three additional public classes: AudioToTextRecorderClient for communicating with a running stt-server over WebSockets, AudioInput for low-level microphone capture, and RealtimeSpeechBoundaryDetector (together with SpeechBoundaryEvent and SpeechBoundaryResult) for standalone acoustic boundary detection. All are importable directly from RealtimeSTT.
from RealtimeSTT import (
    AudioToTextRecorderClient,
    AudioInput,
    RealtimeSpeechBoundaryDetector,
    SpeechBoundaryEvent,
    SpeechBoundaryResult,
)

AudioToTextRecorderClient

AudioToTextRecorderClient provides the same text() / feed_audio() interface as AudioToTextRecorder but delegates all transcription work to a separate stt-server process over two WebSocket connections: a control socket (default port 8011) and a data socket (default port 8012). The client captures local audio and streams raw PCM to the server; all VAD, model inference, and result assembly happen server-side.
When to use the client:
  • Distributed setups where the GPU node is separate from the audio capture node.
  • Multiple microphone clients sharing a single high-memory model server.
  • Environments where loading a Whisper model locally is not practical (e.g. a Raspberry Pi or a browser proxy).

Constructor

from RealtimeSTT import AudioToTextRecorderClient

client = AudioToTextRecorderClient(
    model="small.en",
    language="en",
    control_url="ws://127.0.0.1:8011",
    data_url="ws://127.0.0.1:8012",
    autostart_server=True,
)
The constructor accepts nearly all the same parameters as AudioToTextRecorder (model, language, VAD settings, wake word settings, callbacks, etc.) and maps them to server startup arguments when autostart_server=True. Server-specific parameters:
| Parameter | Default | Description |
| --- | --- | --- |
| control_url | "ws://127.0.0.1:8011" | WebSocket URL for the server control channel. |
| data_url | "ws://127.0.0.1:8012" | WebSocket URL for the server audio data channel. |
| autostart_server | True | When True, automatically launches stt-server if it is not already reachable. |
| output_wav_file | None | If set, the client writes the captured audio stream to this WAV file path. |

text()

client.text(on_transcription_finished=None) -> str
Identical contract to AudioToTextRecorder.text(). Blocks until the server returns a full sentence, or calls on_transcription_finished(text) asynchronously if a callback is supplied.
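The two calling styles described above can be sketched with small helpers that work on any object exposing text(on_transcription_finished=None). The helper names collect_sentences and request_sentence are hypothetical and are not part of RealtimeSTT; the sketch only assumes the contract stated in this section.

```python
from typing import List


def collect_sentences(recorder, count: int) -> List[str]:
    """Blocking style: call text() until `count` non-empty sentences arrive."""
    sentences: List[str] = []
    while len(sentences) < count:
        sentence = recorder.text()      # blocks until a full sentence is ready
        if sentence:                    # skip empty results
            sentences.append(sentence)
    return sentences


def request_sentence(recorder, sink: List[str]) -> None:
    """Callback style: the finished sentence is handed to sink.append."""
    recorder.text(on_transcription_finished=sink.append)
```

With a real client, collect_sentences(client, 3) would return after three sentences, while request_sentence(client, results) relies on the callback to fill results once the transcription finishes.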

feed_audio()

client.feed_audio(chunk, audio_meta_data, original_sample_rate=16000)
Sends one raw audio chunk to the server over the data WebSocket. The chunk is framed with a small JSON metadata header containing the sample rate and any supplied metadata. Use this when use_microphone=False and you are pumping audio from an external source.
  • chunk — raw 16-bit mono PCM bytes.
  • audio_meta_data — optional dict of extra metadata (e.g. timing) merged into the frame header; pass None if not needed.
  • original_sample_rate — sample rate of the chunk; the server resamples to 16 kHz if necessary.
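When pumping audio from an external source, the PCM usually has to be cut into chunks before each feed_audio() call. The following is a minimal sketch of that chunking; the helper feed_pcm and its chunk_ms parameter are hypothetical, and the feed argument is assumed to be a callable with the feed_audio signature shown above.

```python
from typing import Callable, Optional


def feed_pcm(
    feed: Callable[..., None],
    pcm: bytes,
    sample_rate: int = 16000,
    chunk_ms: int = 100,
    meta: Optional[dict] = None,
) -> int:
    """Split 16-bit mono PCM into fixed-duration chunks and pass each to `feed`.

    Returns the number of chunks sent. The last chunk may be shorter.
    """
    bytes_per_sample = 2  # 16-bit mono
    chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    sent = 0
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        # Argument order follows the documented feed_audio signature.
        feed(chunk, meta, original_sample_rate=sample_rate)
        sent += 1
    return sent
```

With a client created with use_microphone=False, the call would look like feed_pcm(client.feed_audio, pcm_bytes). At 16 kHz, a 100 ms chunk is 1600 samples, or 3200 bytes.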

Other Methods

| Method | Description |
| --- | --- |
| abort() | Requests the server to abort the active recording. |
| wakeup() | Requests the server to bypass wake-word gating and begin recording immediately. |
| clear_audio_queue() | Requests the server to discard queued audio. |
| perform_final_transcription() | Requests a final transcription from the server immediately. |
| stop() | Requests the server to stop recording. |
| set_parameter(parameter, value) | Sends a live parameter update to the running server. |
| get_parameter(parameter) | Retrieves the current value of a server parameter (blocks up to 5 s). |
| set_microphone(microphone_on) | Mutes or un-mutes the local microphone capture. |
| list_devices() | Prints available audio input devices and their supported sample rates. |
| shutdown() | Closes WebSocket connections and stops all client threads. |
| connect() | Opens WebSocket connections to the server (called automatically in __init__). |

Basic Usage

from RealtimeSTT import AudioToTextRecorderClient

if __name__ == "__main__":
    # Start stt-server separately, or rely on autostart_server=True (the default)
    with AudioToTextRecorderClient(
        model="small.en",
        language="en",
        enable_realtime_transcription=True,
        on_realtime_transcription_update=lambda t: print(f"\r{t}   ", end=""),
    ) as client:
        while True:
            text = client.text()
            if text:
                print(f"\n{text}")
AudioToTextRecorderClient also supports the context manager protocol: the with block calls shutdown() on exit.
When autostart_server=True and no server is running, the client spawns stt-server as a background subprocess and waits up to 20 seconds for it to become reachable. After the first client exits, the server continues running for subsequent clients to reuse.
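If you start stt-server yourself and want to know when it is ready before constructing a client, a plain TCP poll is one option. This is an independent sketch, not the check the client performs internally (which is not described here); the helper wait_for_port is hypothetical. A successful TCP connection only shows that something is listening on the port, not that the WebSocket handshake will succeed.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 20.0, interval: float = 0.25) -> bool:
    """Poll a TCP port until it accepts a connection or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a listener is bound to the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused or timed out; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the default control socket this would be wait_for_port("127.0.0.1", 8011), with the 20-second default mirroring the wait described above.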

AudioInput

AudioInput is a self-contained PyAudio wrapper that handles device selection, sample rate negotiation, optional anti-aliasing, and polyphase resampling to a target rate. It is used internally by both AudioToTextRecorder (in its microphone capture loop) and AudioToTextRecorderClient, and is exported for use in custom audio pipelines.

Constructor

import pyaudio  # needed for the pyaudio.paInt16 constant below
from RealtimeSTT import AudioInput

audio = AudioInput(
    input_device_index=None,   # None → PyAudio default device
    debug_mode=False,
    target_samplerate=16000,
    chunk_size=1024,
    audio_format=pyaudio.paInt16,
    channels=1,
    resample_to_target=True,
)
| Parameter | Default | Description |
| --- | --- | --- |
| input_device_index | None | PyAudio device index. None selects the system default. |
| debug_mode | False | Prints device selection and stream setup details. |
| target_samplerate | 16000 | Target sample rate after optional resampling. |
| chunk_size | 1024 | Frames per read call. |
| audio_format | pyaudio.paInt16 | PyAudio sample format constant. |
| channels | 1 | Number of input channels. |
| resample_to_target | True | When True, resamples device audio to target_samplerate if the device uses a different rate. |

Key Methods

audio.setup()                       # opens the stream; returns True on success
audio.read_chunk() -> bytes         # reads one chunk from the open stream
audio.list_devices()                # prints all input devices with sample rates
audio.cleanup()                     # stops and closes the stream, terminates PyAudio

Usage Example

from RealtimeSTT import AudioInput

audio = AudioInput(target_samplerate=16000, chunk_size=1024)
if audio.setup():
    try:
        for _ in range(100):
            chunk = audio.read_chunk()
            # process chunk ...
    finally:
        # Always release the stream and PyAudio, even if processing fails
        audio.cleanup()
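One simple way to process the captured chunks is to save them to a WAV file. The sketch below uses only the standard-library wave module; the helper write_wav is hypothetical, and it assumes the chunks are 16-bit PCM (paInt16) at the stated sample rate, which matches the constructor defaults shown above.

```python
import wave
from typing import Iterable


def write_wav(path: str, chunks: Iterable[bytes], sample_rate: int = 16000, channels: int = 1) -> int:
    """Write 16-bit PCM chunks to a WAV file and return the number of frames written."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)             # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            frames += len(chunk) // (2 * channels)
    return frames
```

Collecting the result of each audio.read_chunk() call into a list and passing that list to write_wav would produce a playable recording, provided the byte format assumption holds for your device settings.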

RealtimeSpeechBoundaryDetector

RealtimeSpeechBoundaryDetector is a lightweight streaming detector that identifies likely inter-syllable acoustic boundaries in real time by analysing log-energy envelopes and voiced-energy valleys. It operates on 10 ms frames and uses a short lookahead window to confirm candidates before emitting them. The detector does not require a neural model: it runs in pure Python/NumPy and adds negligible latency.
It is embedded inside AudioToTextRecorder when realtime_transcription_use_syllable_boundaries=True. You can also instantiate it standalone when you need boundary events without the full recorder stack.

Constructor

from RealtimeSTT import RealtimeSpeechBoundaryDetector

detector = RealtimeSpeechBoundaryDetector(
    sample_rate=16000,
    sensitivity=0.6,         # 0 = conservative, 1 = eager
)
| Parameter | Default | Description |
| --- | --- | --- |
| sample_rate | 16000 | Sample rate of the incoming audio. |
| frame_ms | 10.0 | Analysis frame duration in milliseconds. |
| lookahead_ms | 30.0 | Lookahead window used to confirm boundary candidates. |
| history_ms | 900.0 | Rolling history of frames kept for context. |
| sensitivity | 0.6 | Detection sensitivity from 0 (conservative) to 1 (eager). Tunes all derived thresholds. |
| min_rms | 0.004 | Minimum RMS for a frame to be considered as containing speech. |
| min_boundary_interval_ms | 160.0 | Minimum gap between consecutive boundary events. |
| min_voiced_ms | 70.0 | Minimum voiced audio before a boundary can be declared. |
| min_vowel_ms | 40.0 | Minimum vowel-like audio required in the look-back window. |
Advanced thresholds (speech_margin_db, drop_db, valley_depth_db, recovery_db, vowel_margin_db, min_voicing_score, max_vowel_zero_crossing_rate) are all derived automatically from sensitivity but can be overridden individually.
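To get a feel for the millisecond parameters, it can help to convert them into samples and frames. The arithmetic below is illustrative only: the helper names are hypothetical, and rounding to the nearest integer is an assumption (the detector may round differently internally).

```python
def samples_per_frame(sample_rate: int = 16000, frame_ms: float = 10.0) -> int:
    """Number of audio samples in one analysis frame."""
    return int(round(sample_rate * frame_ms / 1000.0))


def frames_for(duration_ms: float, frame_ms: float = 10.0) -> int:
    """Number of analysis frames spanned by a duration in milliseconds."""
    return int(round(duration_ms / frame_ms))


if __name__ == "__main__":
    print(samples_per_frame())   # samples in a 10 ms frame at 16 kHz
    print(frames_for(30.0))      # frames in the default lookahead window
    print(frames_for(900.0))     # frames in the default rolling history
```

With the defaults, a frame is 160 samples, the lookahead window is 3 frames, and the rolling history is 90 frames.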

Processing Methods

result = detector.process_bytes(pcm_chunk: bytes) -> SpeechBoundaryResult
result = detector.process_samples(samples)        -> SpeechBoundaryResult
detector.reset()                                  # clears all history
process_bytes accepts raw little-endian int16 PCM bytes. process_samples accepts a NumPy array of int16 or float32 values. Both return a SpeechBoundaryResult.
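For quick experiments without a recording, you can synthesise input in the byte format process_bytes expects. The sketch below builds a sine tone as little-endian int16 mono PCM using only the standard library; tone_pcm is a hypothetical helper, and a steady tone is only useful for checking the plumbing, since it is unlikely to contain speech-like boundaries.

```python
import math
import struct


def tone_pcm(freq_hz: float, duration_s: float, sample_rate: int = 16000, amplitude: float = 0.3) -> bytes:
    """Return a sine tone as little-endian int16 mono PCM bytes."""
    count = round(sample_rate * duration_s)
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(count)
    )
    # "<" selects little-endian byte order, "h" is a signed 16-bit integer.
    return struct.pack(f"<{count}h", *samples)
```

A 100 ms chunk from tone_pcm(220.0, 0.1) is 3200 bytes and could be passed straight to detector.process_bytes(...).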

SpeechBoundaryResult

result.boundary_detected   # bool — True if at least one event was emitted
result.events              # List[SpeechBoundaryEvent]
result.latest_event        # SpeechBoundaryEvent | None — most recent event
result.current_energy_db   # float — energy of the last processed frame (dB)
result.current_rms         # float — RMS amplitude of the last frame
result.noise_floor_db      # float — rolling noise floor estimate (dB)
result.is_speech           # bool — whether the last frame was classified as speech
result.is_vowel_like       # bool — whether the last frame was vowel-like
result.voicing_score       # float — voicing confidence [0, 1] of the last frame
result.processed_frames    # int — number of frames processed in this call
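The RMS and dB fields are related by a logarithm. The sketch below shows the conventional conversion, assuming amplitudes normalised to [-1, 1] so that 0 dB is full scale; the reference level the detector actually uses is not stated here, so treat rms_to_db (a hypothetical helper) as a way to reason about the numbers rather than a reproduction of the library's calculation.

```python
import math


def rms_to_db(rms: float, floor_db: float = -120.0) -> float:
    """Convert a normalised RMS amplitude to decibels relative to full scale."""
    if rms <= 0.0:
        return floor_db                 # log10 is undefined for zero or negative input
    return max(floor_db, 20.0 * math.log10(rms))
```

Under this convention the default min_rms of 0.004 corresponds to roughly -48 dB.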

SpeechBoundaryEvent

Each element of result.events is a SpeechBoundaryEvent with the following attributes:
event.boundary_sample        # int   — sample index of the boundary
event.boundary_time_seconds  # float — boundary time in seconds from stream start
event.score                  # float — detection confidence score
event.reason                 # str   — "vowel-ended" or "vowel-to-pause"
event.energy_db              # float — frame energy at the boundary (dB)
event.noise_floor_db         # float — noise floor at detection time (dB)
event.drop_db                # float — energy drop from peak to boundary (dB)
event.valley_depth_db        # float — valley depth relative to surrounding frames (dB)
event.latency_ms             # float — detection latency in milliseconds
event.created_at             # float — Unix timestamp when the event was created
event.as_dict()              # -> Dict[str, float]  — plain metadata dict

Standalone Example

from RealtimeSTT import RealtimeSpeechBoundaryDetector

detector = RealtimeSpeechBoundaryDetector(sample_rate=16000, sensitivity=0.6)

CHUNK_BYTES = 3200  # 100 ms of 16 kHz int16 mono

with open("speech.pcm", "rb") as f:
    while True:
        chunk = f.read(CHUNK_BYTES)
        if not chunk:
            break

        result = detector.process_bytes(chunk)
        if result.boundary_detected:
            for event in result.events:
                print(
                    f"Boundary at {event.boundary_time_seconds:.3f}s "
                    f"(score={event.score:.2f}, reason={event.reason!r}, "
                    f"latency={event.latency_ms:.1f}ms)"
                )
When to use standalone vs. embedded:
  • Embedded (realtime_transcription_use_syllable_boundaries=True on AudioToTextRecorder): the detector schedules realtime transcription calls at acoustic boundaries in addition to the fixed realtime_processing_pause timer. This reduces transcription lag on natural speech rhythms.
  • Standalone: use the detector directly when you need boundary signals for a purpose other than transcription scheduling — for example, to segment audio into utterance-level clips, to drive UI animations, or to feed a custom event loop.
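As an illustration of the standalone segmentation use case, the sketch below cuts a PCM buffer at a list of boundary sample indices. The helper split_at_boundaries is hypothetical, and it assumes that event.boundary_sample counts samples from the start of the same stream the buffer was taken from.

```python
from typing import List, Sequence


def split_at_boundaries(pcm: bytes, boundary_samples: Sequence[int], bytes_per_sample: int = 2) -> List[bytes]:
    """Cut int16 mono PCM into clips at the given sample indices."""
    total_samples = len(pcm) // bytes_per_sample
    # Drop duplicates and indices outside the buffer, then cut in order.
    cuts = sorted({s for s in boundary_samples if 0 < s < total_samples})
    clips: List[bytes] = []
    start = 0
    for cut in cuts + [total_samples]:
        clips.append(pcm[start * bytes_per_sample:cut * bytes_per_sample])
        start = cut
    return clips
```

In practice you would accumulate event.boundary_sample values from each result.events list while also buffering the bytes you pass to process_bytes, then call split_at_boundaries(buffer, indices).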

CLI Commands

Installing RealtimeSTT also installs several console scripts.
The server commands are provided by the RealtimeSTT_server package, which is installed as a dependency. Run each command with --help to see all available flags.
# Start the WebSocket STT server (requires RealtimeSTT_server)
stt-server --model small.en --language en --control_port 8011 --data_port 8012

# Connect the interactive CLI client to a running server
stt

# Build and install the Kroko-ONNX extension
stt-install-kroko
| Command | Purpose |
| --- | --- |
| stt-server | Starts the WebSocket transcription server that AudioToTextRecorderClient connects to. Accepts --model, --language, --control_port, --data_port, and most recorder parameters as flags. |
| stt | Interactive CLI client that connects to a running stt-server and prints transcriptions to stdout. |
| stt-install-kroko | Builds and installs the Kroko-ONNX backend from source. Only needed when using the Kroko transcription engine. |
# Example: start a server with a larger model on a specific port
stt-server --model large-v2 --language en --control_port 9011 --data_port 9012 --debug
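When the server runs on non-default ports, the client URLs have to match. The sketch below keeps the two in sync by deriving both the command line and the client URLs from the same port numbers. The helpers build_server_command and client_urls are hypothetical; they use only the flags and URL forms shown on this page.

```python
import shlex
from typing import Dict, List


def build_server_command(
    model: str = "small.en",
    language: str = "en",
    control_port: int = 8011,
    data_port: int = 8012,
    debug: bool = False,
) -> List[str]:
    """Build an argument list for stt-server from the flags documented above."""
    cmd = [
        "stt-server",
        "--model", model,
        "--language", language,
        "--control_port", str(control_port),
        "--data_port", str(data_port),
    ]
    if debug:
        cmd.append("--debug")
    return cmd


def client_urls(host: str, control_port: int, data_port: int) -> Dict[str, str]:
    """Return matching control_url / data_url values for the client constructor."""
    return {
        "control_url": f"ws://{host}:{control_port}",
        "data_url": f"ws://{host}:{data_port}",
    }


if __name__ == "__main__":
    # Print the command in copy-pasteable shell form.
    print(shlex.join(build_server_command("large-v2", "en", 9011, 9012, debug=True)))
```

The argument list can be passed to subprocess.Popen, and the dictionary can be unpacked into AudioToTextRecorderClient(**client_urls("127.0.0.1", 9011, 9012), autostart_server=False).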