Beyond AudioToTextRecorder, the RealtimeSTT package exports three additional public classes: AudioToTextRecorderClient for communicating with a running stt-server over WebSockets, AudioInput for low-level microphone capture, and RealtimeSpeechBoundaryDetector (together with SpeechBoundaryEvent and SpeechBoundaryResult) for standalone acoustic boundary detection. All are importable directly from RealtimeSTT.
from RealtimeSTT import (
    AudioToTextRecorderClient,
    AudioInput,
    RealtimeSpeechBoundaryDetector,
    SpeechBoundaryEvent,
    SpeechBoundaryResult,
)
AudioToTextRecorderClient
AudioToTextRecorderClient provides the same text() / feed_audio() interface as AudioToTextRecorder but delegates all transcription work to a separate stt-server process over two WebSocket connections — a control socket (default port 8011) and a data socket (default port 8012). The client captures local audio and streams raw PCM to the server; all VAD, model inference, and result assembly happen server-side.
When to use the client:
- Distributed setups where the GPU node is separate from the audio capture node.
- Multiple microphone clients sharing a single high-memory model server.
- Environments where loading a Whisper model locally is not practical (e.g. a Raspberry Pi or a browser proxy).
Constructor
from RealtimeSTT import AudioToTextRecorderClient
client = AudioToTextRecorderClient(
    model="small.en",
    language="en",
    control_url="ws://127.0.0.1:8011",
    data_url="ws://127.0.0.1:8012",
    autostart_server=True,
)
The constructor accepts nearly all the same parameters as AudioToTextRecorder (model, language, VAD settings, wake word settings, callbacks, etc.) and maps them to server startup arguments when autostart_server=True. Server-specific parameters:
| Parameter | Default | Description |
|---|---|---|
| control_url | "ws://127.0.0.1:8011" | WebSocket URL for the server control channel. |
| data_url | "ws://127.0.0.1:8012" | WebSocket URL for the server audio data channel. |
| autostart_server | True | When True, automatically launches stt-server if it is not already reachable. |
| output_wav_file | None | If set, the client writes the captured audio stream to this WAV file path. |
text()
client.text(on_transcription_finished=None) -> str
Identical contract to AudioToTextRecorder.text(). Blocks until the server returns a full sentence, or calls on_transcription_finished(text) asynchronously if a callback is supplied.
feed_audio()
client.feed_audio(chunk, audio_meta_data, original_sample_rate=16000)
Sends one raw audio chunk to the server over the data WebSocket. The chunk is framed with a small JSON metadata header containing the sample rate and any supplied metadata. Use this when use_microphone=False and you are pumping audio from an external source.
- chunk: raw 16-bit mono PCM bytes.
- audio_meta_data: optional dict of extra metadata (e.g. timing) merged into the frame header; pass None if not needed.
- original_sample_rate: sample rate of the chunk; the server resamples to 16 kHz if necessary.
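As an illustration of pumping audio from an external source, the sketch below reads a 16-bit mono WAV file with the standard-library wave module and passes each chunk to feed_audio(). The helper names iter_pcm_chunks and feed_wav are hypothetical and not part of RealtimeSTT; the only library call assumed is feed_audio(chunk, audio_meta_data, original_sample_rate=...) as documented above.

```python
import wave
from typing import Iterator, Tuple


def iter_pcm_chunks(path: str, chunk_ms: int = 100) -> Iterator[Tuple[bytes, int]]:
    """Yield (pcm_bytes, sample_rate) pairs from a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        frames_per_chunk = max(1, rate * chunk_ms // 1000)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data, rate


def feed_wav(client, path: str, chunk_ms: int = 100) -> int:
    """Send every chunk to client.feed_audio(); return the number of chunks sent."""
    sent = 0
    for data, rate in iter_pcm_chunks(path, chunk_ms):
        # No extra metadata here; the file's own rate is passed so the
        # server can resample if it differs from 16 kHz.
        client.feed_audio(data, None, original_sample_rate=rate)
        sent += 1
    return sent
```

With a client created with use_microphone=False, feed_wav(client, "speech.wav") would stream the whole file. The loop sends chunks as fast as they can be read; whether the server expects real-time pacing is not covered here.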
Other Methods
| Method | Description |
|---|---|
| abort() | Requests the server to abort the active recording. |
| wakeup() | Requests the server to bypass wake-word gating and begin recording immediately. |
| clear_audio_queue() | Requests the server to discard queued audio. |
| perform_final_transcription() | Requests a final transcription from the server immediately. |
| stop() | Requests the server to stop recording. |
| set_parameter(parameter, value) | Sends a live parameter update to the running server. |
| get_parameter(parameter) | Retrieves the current value of a server parameter (blocks up to 5 s). |
| set_microphone(microphone_on) | Mutes or un-mutes the local microphone capture. |
| list_devices() | Prints available audio input devices and their supported sample rates. |
| shutdown() | Closes WebSocket connections and stops all client threads. |
| connect() | Opens WebSocket connections to the server (called automatically in __init__). |
Basic Usage
from RealtimeSTT import AudioToTextRecorderClient
if __name__ == "__main__":
    # Start stt-server separately, or rely on autostart_server=True (the default)
    with AudioToTextRecorderClient(
        model="small.en",
        language="en",
        enable_realtime_transcription=True,
        on_realtime_transcription_update=lambda t: print(f"\r{t} ", end=""),
    ) as client:
        while True:
            text = client.text()
            if text:
                print(f"\n{text}")
AudioToTextRecorderClient also supports the context manager protocol: the with block calls shutdown() on exit.
When autostart_server=True and no server is running, the client spawns stt-server as a background subprocess and waits up to 20 seconds for it to become reachable. After the first client exits, the server continues running for subsequent clients to reuse.
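If you start the server yourself, a similar wait can be done before constructing the client. The sketch below polls a TCP port until it accepts a connection or a deadline passes. It is not the library's internal check, and wait_until_reachable is a hypothetical helper; a successful TCP connect only shows that something is listening on the port, not that the WebSocket handshake will succeed.

```python
import socket
import time


def wait_until_reachable(host: str, port: int, timeout: float = 20.0,
                         interval: float = 0.25) -> bool:
    """Return True once host:port accepts a TCP connection, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a reachability probe.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the default setup this would be called as wait_until_reachable("127.0.0.1", 8011) and again for port 8012 before connecting.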
AudioInput
AudioInput is a self-contained PyAudio wrapper that handles device selection, sample rate negotiation, optional anti-aliasing, and polyphase resampling to a target rate. It is used internally by both AudioToTextRecorder (in its microphone capture loop) and AudioToTextRecorderClient, and is exported for use in custom audio pipelines.
Constructor
import pyaudio  # needed for the paInt16 format constant
from RealtimeSTT import AudioInput
audio = AudioInput(
    input_device_index=None,  # None → PyAudio default device
    debug_mode=False,
    target_samplerate=16000,
    chunk_size=1024,
    audio_format=pyaudio.paInt16,
    channels=1,
    resample_to_target=True,
)
| Parameter | Default | Description |
|---|---|---|
| input_device_index | None | PyAudio device index. None selects the system default. |
| debug_mode | False | Prints device selection and stream setup details. |
| target_samplerate | 16000 | Target sample rate after optional resampling. |
| chunk_size | 1024 | Frames per read call. |
| audio_format | pyaudio.paInt16 | PyAudio sample format constant. |
| channels | 1 | Number of input channels. |
| resample_to_target | True | When True, resamples device audio to target_samplerate if the device uses a different rate. |
Key Methods
audio.setup() # opens the stream; returns True on success
audio.read_chunk() -> bytes # reads one chunk from the open stream
audio.list_devices() # prints all input devices with sample rates
audio.cleanup() # stops and closes the stream, terminates PyAudio
Usage Example
from RealtimeSTT import AudioInput
audio = AudioInput(target_samplerate=16000, chunk_size=1024)
if audio.setup():
    try:
        for _ in range(100):
            chunk = audio.read_chunk()
            # process chunk ...
    finally:
        # Always release the stream, even if processing raises
        audio.cleanup()
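One possible way to fill in the "process chunk" step is to collect the chunks into a WAV file using the standard-library wave module. The sketch below assumes the chunks are 16-bit mono PCM at target_samplerate, which matches the defaults shown above but would not hold for other audio_format or channels settings. write_wav is a hypothetical helper, not part of RealtimeSTT.

```python
import wave
from typing import Iterable


def write_wav(path: str, chunks: Iterable[bytes], sample_rate: int = 16000) -> int:
    """Write 16-bit mono PCM chunks to a WAV file; return the number of frames written."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = int16
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            frames += len(chunk) // 2
    return frames
```

Because the function accepts any iterable of bytes, a generator that calls audio.read_chunk() repeatedly can be passed in directly.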
RealtimeSpeechBoundaryDetector
RealtimeSpeechBoundaryDetector is a lightweight streaming detector that identifies likely inter-syllable acoustic boundaries in real time by analysing log-energy envelopes and voiced-energy valleys. It operates on 10 ms frames and uses a short lookahead window to confirm candidates before emitting them. The detector does not require a neural model — it runs in pure Python/NumPy and adds negligible latency.
It is embedded inside AudioToTextRecorder when realtime_transcription_use_syllable_boundaries=True. You can also instantiate it standalone when you need boundary events without the full recorder stack.
Constructor
from RealtimeSTT import RealtimeSpeechBoundaryDetector
detector = RealtimeSpeechBoundaryDetector(
    sample_rate=16000,
    sensitivity=0.6,  # 0 = conservative, 1 = eager
)
| Parameter | Default | Description |
|---|---|---|
| sample_rate | 16000 | Sample rate of the incoming audio. |
| frame_ms | 10.0 | Analysis frame duration in milliseconds. |
| lookahead_ms | 30.0 | Lookahead window used to confirm boundary candidates. |
| history_ms | 900.0 | Rolling history of frames kept for context. |
| sensitivity | 0.6 | Detection sensitivity from 0 (conservative) to 1 (eager). Tunes all derived thresholds. |
| min_rms | 0.004 | Minimum RMS for a frame to be considered as containing speech. |
| min_boundary_interval_ms | 160.0 | Minimum gap between consecutive boundary events. |
| min_voiced_ms | 70.0 | Minimum voiced audio before a boundary can be declared. |
| min_vowel_ms | 40.0 | Minimum vowel-like audio required in the look-back window. |
Advanced thresholds (speech_margin_db, drop_db, valley_depth_db, recovery_db, vowel_margin_db, min_voicing_score, max_vowel_zero_crossing_rate) are all derived automatically from sensitivity but can be overridden individually.
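To get a feel for the time-based defaults, it can help to express them in analysis frames. The arithmetic below converts each millisecond value to a frame count at the default frame_ms of 10. How the detector itself rounds these values is an assumption (nearest frame is used here), and the helper names are hypothetical.

```python
def ms_to_frames(duration_ms: float, frame_ms: float = 10.0) -> int:
    """Number of analysis frames covering duration_ms, rounded to the nearest frame."""
    return int(round(duration_ms / frame_ms))


def samples_per_frame(sample_rate: int = 16000, frame_ms: float = 10.0) -> int:
    """Number of samples in one analysis frame."""
    return int(round(sample_rate * frame_ms / 1000.0))


# Default durations copied from the parameter table.
DEFAULTS_MS = {
    "lookahead_ms": 30.0,
    "history_ms": 900.0,
    "min_boundary_interval_ms": 160.0,
    "min_voiced_ms": 70.0,
    "min_vowel_ms": 40.0,
}
FRAMES = {name: ms_to_frames(ms) for name, ms in DEFAULTS_MS.items()}
```

Under these assumptions a frame is 160 samples at 16 kHz, the lookahead spans 3 frames, and two boundaries are at least 16 frames apart.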
Processing Methods
result = detector.process_bytes(pcm_chunk: bytes) -> SpeechBoundaryResult
result = detector.process_samples(samples) -> SpeechBoundaryResult
detector.reset() # clears all history
process_bytes accepts raw little-endian int16 PCM bytes. process_samples accepts a NumPy array of int16 or float32 values. Both return a SpeechBoundaryResult.
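The following standard-library sketch shows what "raw little-endian int16 PCM bytes" means in practice. It decodes such bytes into floats and computes an RMS level and its value in dB relative to full scale. The normalization by 32768 and the dB reference are assumptions for illustration; the detector's own scaling for current_rms and current_energy_db may differ. Both helper names are hypothetical.

```python
import array
import math
import sys


def pcm16_to_floats(pcm: bytes) -> list:
    """Decode little-endian int16 PCM bytes into floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # drop a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian, so swap on big-endian hosts
    return [s / 32768.0 for s in samples]


def rms_and_db(samples) -> tuple:
    """Return (rms, level in dB relative to full scale) for a block of float samples."""
    if not samples:
        return 0.0, float("-inf")
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    db = 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")
    return rms, db
```

Under the same normalization assumption, comparing the returned rms with 0.004 gives a rough idea of whether a chunk would clear the default min_rms.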
SpeechBoundaryResult
result.boundary_detected # bool — True if at least one event was emitted
result.events # List[SpeechBoundaryEvent]
result.latest_event # SpeechBoundaryEvent | None — most recent event
result.current_energy_db # float — energy of the last processed frame (dB)
result.current_rms # float — RMS amplitude of the last frame
result.noise_floor_db # float — rolling noise floor estimate (dB)
result.is_speech # bool — whether the last frame was classified as speech
result.is_vowel_like # bool — whether the last frame was vowel-like
result.voicing_score # float — voicing confidence [0, 1] of the last frame
result.processed_frames # int — number of frames processed in this call
SpeechBoundaryEvent
Each element of result.events is a SpeechBoundaryEvent with the following attributes:
event.boundary_sample # int — sample index of the boundary
event.boundary_time_seconds # float — boundary time in seconds from stream start
event.score # float — detection confidence score
event.reason # str — "vowel-ended" or "vowel-to-pause"
event.energy_db # float — frame energy at the boundary (dB)
event.noise_floor_db # float — noise floor at detection time (dB)
event.drop_db # float — energy drop from peak to boundary (dB)
event.valley_depth_db # float — valley depth relative to surrounding frames (dB)
event.latency_ms # float — detection latency in milliseconds
event.created_at # float — Unix timestamp when the event was created
event.as_dict() # -> Dict[str, float] — plain metadata dict
Standalone Example
from RealtimeSTT import RealtimeSpeechBoundaryDetector
detector = RealtimeSpeechBoundaryDetector(sample_rate=16000, sensitivity=0.6)
CHUNK_BYTES = 3200 # 100 ms of 16 kHz int16 mono
with open("speech.pcm", "rb") as f:
    while True:
        chunk = f.read(CHUNK_BYTES)
        if not chunk:
            break
        result = detector.process_bytes(chunk)
        if result.boundary_detected:
            for event in result.events:
                print(
                    f"Boundary at {event.boundary_time_seconds:.3f}s "
                    f"(score={event.score:.2f}, reason={event.reason!r}, "
                    f"latency={event.latency_ms:.1f}ms)"
                )
Standalone vs. embedded use:
- Embedded (realtime_transcription_use_syllable_boundaries=True on AudioToTextRecorder): the detector schedules realtime transcription calls at acoustic boundaries in addition to the fixed realtime_processing_pause timer. This reduces transcription lag on natural speech rhythms.
- Standalone: use the detector directly when you need boundary signals for a purpose other than transcription scheduling — for example, to segment audio into utterance-level clips, to drive UI animations, or to feed a custom event loop.
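As a sketch of the segmentation use case, the function below cuts a mono PCM buffer into clips at a list of sample indices, such as the boundary_sample values collected from events. It assumes those indices count samples from the start of the stream, which is consistent with boundary_time_seconds being measured from stream start but is not confirmed here. split_at_boundaries is a hypothetical helper.

```python
from typing import Iterable, List


def split_at_boundaries(pcm: bytes, boundary_samples: Iterable[int],
                        bytes_per_sample: int = 2) -> List[bytes]:
    """Cut mono PCM into consecutive clips at the given absolute sample indices."""
    total = len(pcm) // bytes_per_sample
    # Keep only in-range cut points, sorted and de-duplicated.
    cuts = sorted({s for s in boundary_samples if 0 < s < total})
    clips, start = [], 0
    for cut in cuts + [total]:
        clips.append(pcm[start * bytes_per_sample: cut * bytes_per_sample])
        start = cut
    return clips
```

Joining the returned clips reproduces the original buffer, so no audio is dropped at the cut points.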
CLI Commands
Installing RealtimeSTT also installs several console scripts.
The server commands are provided by the RealtimeSTT_server package, which is installed as a dependency. Run each command with --help to see all available flags.
# Start the WebSocket STT server (requires RealtimeSTT_server)
stt-server --model small.en --language en --control_port 8011 --data_port 8012
# Connect the interactive CLI client to a running server
stt
# Build and install the Kroko-ONNX extension
stt-install-kroko
| Command | Purpose |
|---|---|
| stt-server | Starts the WebSocket transcription server that AudioToTextRecorderClient connects to. Accepts --model, --language, --control_port, --data_port, and most recorder parameters as flags. |
| stt | Interactive CLI client that connects to a running stt-server and prints transcriptions to stdout. |
| stt-install-kroko | Builds and installs the Kroko-ONNX backend from source. Only needed when using the Kroko transcription engine. |
# Example: start a server with a larger model on a specific port
stt-server --model large-v2 --language en --control_port 9011 --data_port 9012 --debug
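When the server runs on non-default ports, the client URLs have to match. The sketch below builds the stt-server argument list from Python using only the flags shown above, and derives the corresponding control_url and data_url values. Both function names are hypothetical helpers, not part of RealtimeSTT.

```python
from typing import List, Tuple


def stt_server_command(model: str, language: str,
                       control_port: int = 8011, data_port: int = 8012) -> List[str]:
    """Build an argv list for stt-server using only the flags documented above."""
    return [
        "stt-server",
        "--model", model,
        "--language", language,
        "--control_port", str(control_port),
        "--data_port", str(data_port),
    ]


def client_urls(host: str, control_port: int = 8011,
                data_port: int = 8012) -> Tuple[str, str]:
    """Return (control_url, data_url) matching the server's ports."""
    return f"ws://{host}:{control_port}", f"ws://{host}:{data_port}"
```

The argument list can be handed to subprocess.Popen, and the two URLs passed as control_url and data_url to AudioToTextRecorderClient with autostart_server=False.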