The example_fastapi_server directory in the RealtimeSTT repository contains a reference FastAPI application that streams microphone audio from a browser into per-session RealtimeSTT recorder instances. It serves a polished browser UI and exposes a WebSocket endpoint that handles concurrent sessions, shared inference workers, health checks, and operational metrics.
The FastAPI server is not installed by pip install RealtimeSTT. You must clone the repository (or install directly from Git) to use it. For pip-only setups, use the standalone Python recorder/API examples instead.

Installation

1. Clone the repository and create a virtual environment

git clone https://github.com/KoljaB/RealtimeSTT.git
cd RealtimeSTT
python -m venv .venv-fastapi
source .venv-fastapi/bin/activate        # Windows: .\.venv-fastapi\Scripts\Activate.ps1
python -m pip install -U pip setuptools wheel

2. Install server dependencies

pip install -r requirements.txt
pip install -r example_fastapi_server/requirements.txt

3. Install a transcription engine

The default engine is faster-whisper:
pip install "RealtimeSTT[faster-whisper]"
For other engines see Engine Selection below.
The FastAPI server is intentionally kept source-only so that the core wheel carries no web-server dependencies. For pip-only installs, use the Python recorder examples in the examples/ directory instead.

Starting the Server

python example_fastapi_server/server.py --host 0.0.0.0 --port 8010
Then open http://localhost:8010 in your browser. The UI connects automatically to the WebSocket endpoint and streams microphone audio.

Engine Selection

Pass --engine and --model (plus --realtime-engine and --realtime-model for interim transcription) to select a different backend.
python example_fastapi_server/server.py \
  --host 0.0.0.0 \
  --port 8010 \
  --engine faster_whisper \
  --model small.en \
  --realtime-model tiny.en \
  --device cuda \
  --language en
Use --use-main-model-for-realtime to share a single inference lane for both final and realtime work, reducing GPU memory usage:
python example_fastapi_server/server.py \
  --engine parakeet \
  --model nvidia/parakeet-tdt-0.6b-v3 \
  --use-main-model-for-realtime \
  --profile parakeet-low-latency \
  --device cuda \
  --language en

Multi-User Session Isolation

The server is designed for concurrent browser clients. Each WebSocket connection receives a unique sessionId and owns its own lightweight state machine:
  • Per-session: audio buffer, VAD state (WebRTC + Silero), transcript segment IDs, realtime text, final text, warnings, and error state.
  • Shared: heavy ASR inference workers — one final model lane and one realtime model lane (or a single shared lane with --use-main-model-for-realtime).
Inference is scheduled through per-session fair queues. Final jobs are preserved up to the configured per-session limit. Stale realtime jobs are coalesced so a single noisy client cannot fill the global queue with obsolete interim work. Control capacity with these flags:
python example_fastapi_server/server.py \
  --max-sessions 4 \
  --max-active-speakers 4 \
  --max-global-inference-queue-depth 64 \
  --max-final-queue-depth-per-session 8 \
  --max-realtime-queue-age-ms 1500 \
  --max-audio-queue-seconds-per-session 30
When --max-sessions is reached, new WebSocket clients receive an admission error and the connection is closed with WebSocket close code 1013 (Try Again Later). When active-speaker capacity is reached, already-accepted sessions receive a warning while their existing final work is preserved.
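The scheduling idea described above can be sketched in a few lines of Python. This is an illustration only, not the server's scheduler: the class and method names are hypothetical, it models a single shared lane, it rejects final jobs beyond the per-session limit (an assumption about what happens at that limit), and it does not model --max-realtime-queue-age-ms. It shows round-robin service across sessions, bounded final queues, and coalescing of realtime jobs so that only the newest interim job per session survives.

```python
from collections import deque


class FairScheduler:
    """Hypothetical round-robin scheduler with realtime coalescing."""

    def __init__(self, max_final_per_session: int = 8) -> None:
        self.max_final_per_session = max_final_per_session
        self.final = {}       # session_id -> deque of pending final jobs
        self.realtime = {}    # session_id -> newest pending realtime job only
        self.order = deque()  # round-robin order of sessions that have work

    def _mark_ready(self, session_id) -> None:
        if session_id not in self.order:
            self.order.append(session_id)

    def submit_final(self, session_id, job) -> bool:
        queue = self.final.setdefault(session_id, deque())
        if len(queue) >= self.max_final_per_session:
            return False  # assumption: the caller reports the rejection
        queue.append(job)
        self._mark_ready(session_id)
        return True

    def submit_realtime(self, session_id, job) -> None:
        # Coalesce: a newer interim job replaces the pending one.
        self.realtime[session_id] = job
        self._mark_ready(session_id)

    def next_job(self):
        """Return (session_id, kind, job), or None when idle."""
        if not self.order:
            return None
        session_id = self.order.popleft()
        queue = self.final.get(session_id)
        if queue:
            kind, job = "final", queue.popleft()
        else:
            kind, job = "realtime", self.realtime.pop(session_id)
        # A session with remaining work goes to the back of the line.
        if self.final.get(session_id) or session_id in self.realtime:
            self.order.append(session_id)
        return session_id, kind, job


sched = FairScheduler(max_final_per_session=2)
for i in range(100):
    sched.submit_realtime("noisy", f"interim-{i}")
sched.submit_final("quiet", "utterance-1")
sched.submit_final("quiet", "utterance-2")
accepted = sched.submit_final("quiet", "utterance-3")  # over the limit

drained = []
while (item := sched.next_job()) is not None:
    drained.append(item)
```

In this sketch the "noisy" session contributes exactly one job (its newest interim request) no matter how many it submitted, while both of the "quiet" session's final jobs are served in order.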

WebSocket Protocol

The browser sends binary audio packets to /ws/transcribe with the following layout:
[4 bytes: little-endian uint32 metadata length]
[N bytes: UTF-8 JSON metadata]
[remaining bytes: 16-bit little-endian mono PCM audio]
Example metadata object sent by the browser:
{
  "sampleRate": 48000,
  "channels": 1,
  "format": "pcm_s16le",
  "frames": 1920
}
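The packet layout above can be assembled and parsed with the standard struct module. The following is a minimal sketch: the function names are hypothetical, and the validation rules are assumptions about what a careful parser might check, not the server's actual behaviour.

```python
import json
import struct


def encode_audio_packet(metadata: dict, pcm: bytes) -> bytes:
    """Build one binary packet: length prefix, JSON metadata, raw PCM."""
    meta_bytes = json.dumps(metadata, separators=(",", ":")).encode("utf-8")
    # "<I" is a little-endian unsigned 32-bit integer (4 bytes).
    return struct.pack("<I", len(meta_bytes)) + meta_bytes + pcm


def decode_audio_packet(packet: bytes) -> tuple[dict, bytes]:
    """Split a packet into (metadata, pcm); raise ValueError if malformed."""
    if len(packet) < 4:
        raise ValueError("packet shorter than the 4-byte length prefix")
    (meta_len,) = struct.unpack_from("<I", packet, 0)
    if 4 + meta_len > len(packet):
        raise ValueError("metadata length exceeds packet size")
    metadata = json.loads(packet[4:4 + meta_len].decode("utf-8"))
    pcm = packet[4 + meta_len:]
    if len(pcm) % 2:
        raise ValueError("PCM payload is not a whole number of 16-bit samples")
    return metadata, pcm


# 1920 mono frames of silence at 16 bits per sample = 3840 bytes.
metadata = {"sampleRate": 48000, "channels": 1, "format": "pcm_s16le", "frames": 1920}
packet = encode_audio_packet(metadata, b"\x00\x00" * 1920)
decoded_meta, decoded_pcm = decode_audio_packet(packet)
```

At a 48000 Hz sample rate, the 1920 frames in this example correspond to 40 ms of audio per packet.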
Text commands are sent as JSON objects:
{ "type": "start" }
Supported commands: start, stop, clear, ping, metrics.

Server event types:

| Event type | Description |
| --- | --- |
| hello | Assigns clientId and sessionId to the new connection. |
| ready | Model lanes are initialized and the session can begin streaming. |
| timeline | Segment timing and wake word state transitions. |
| realtime | Interim transcript text for a session-local segmentId. |
| final | Final transcript text for the same segmentId; replaces the interim block. |
| status | Session or server state update. |
| warning | Recoverable issue (e.g., approaching capacity limits). |
| error | Command, packet, admission, or runtime error. |
| clear | Session transcript reset acknowledgement. |
| pong | Response to a ping command. |
| metrics | Per-session metrics in response to a metrics command. |
Transcript-bearing events (realtime, final) include sessionId and are routed only to the session that produced them. They may also include a segment object containing recording start/end timestamps, duration, pre-recording buffer range, and wake word timing when available.
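A client might track transcript state roughly as follows. This is a sketch under stated assumptions: the class is hypothetical, the "text" field name and the top-level placement of segmentId are guesses about the payload shape, the sample messages are invented, and ignoring an interim update that arrives after the final one is a defensive choice rather than documented behaviour. It illustrates how a final event replaces the interim block for the same segmentId.

```python
import json


class TranscriptView:
    """Hypothetical client-side transcript state for one session."""

    def __init__(self) -> None:
        self.session_id = None
        self.blocks = {}  # segmentId -> {"text": str, "final": bool}

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "hello":
            self.session_id = event.get("sessionId")
        elif kind in ("realtime", "final"):
            seg = event.get("segmentId")
            block = self.blocks.get(seg)
            # A late interim update must not overwrite an already-final block.
            if kind == "realtime" and block and block["final"]:
                return
            # "text" is an assumed field name for the transcript payload.
            self.blocks[seg] = {"text": event.get("text", ""), "final": kind == "final"}
        elif kind == "clear":
            self.blocks.clear()
        # Other event types (ready, timeline, status, warning, error, pong,
        # metrics) are ignored in this sketch.

    def text(self) -> str:
        # Dicts keep insertion order, so segments stay in arrival order.
        return " ".join(block["text"] for block in self.blocks.values())


view = TranscriptView()
for message in [
    '{"type": "hello", "clientId": "c1", "sessionId": "s1"}',
    '{"type": "realtime", "sessionId": "s1", "segmentId": 1, "text": "hello wor"}',
    '{"type": "final", "sessionId": "s1", "segmentId": 1, "text": "Hello world."}',
    '{"type": "realtime", "sessionId": "s1", "segmentId": 1, "text": "stale"}',
    '{"type": "realtime", "sessionId": "s1", "segmentId": 2, "text": "how are"}',
]:
    view.handle(message)
```

Keying blocks by segmentId means an interim block is overwritten in place, so the rendered transcript keeps its segment order as finals arrive.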

Health and Metrics

Use the /health endpoint for readiness checks and basic load information:
curl http://localhost:8010/health
Use /api/metrics for detailed operational data including queue depth, latency percentiles, coalescing counters, drop counters, and worker utilization:
curl http://localhost:8010/api/metrics
Additional endpoints:

| Endpoint | Method | Description |
| --- | --- | --- |
| / | GET | Browser UI. |
| /health | GET | Readiness, active sessions/speakers, startup errors, scheduler state. |
| /api/config | GET | Public settings, limits, supported engines, and runtime settings contract. |
| /api/config | PATCH | Update runtime-safe settings without restarting. |
| /api/metrics | GET | Counters, queue depth, p50/p95 latency, coalescing, drops, worker busy ratio. |
| /ws/transcribe | WebSocket | Browser audio stream and command channel. |
Some settings can be changed while the server is running:
curl -X PATCH http://localhost:8010/api/config \
  -H 'Content-Type: application/json' \
  -d '{"settings":{"max_sessions":8,"wake_words":"jarvis"}}'
The GET /api/config response separates settings into activeSessionSafe, newSessionOnly, and startupOnly buckets. Engine and model paths are startupOnly — changing them after startup is rejected because shared inference workers are already initialized.
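The same PATCH request can be issued from Python using only the standard library. The helper below is a hypothetical sketch that mirrors the curl call above; it builds the request without sending it, and the response format is not assumed.

```python
import json
import urllib.request


def build_config_patch(base_url: str, settings: dict) -> urllib.request.Request:
    """Build a PATCH /api/config request wrapping settings in {"settings": ...}."""
    body = json.dumps({"settings": settings}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


request = build_config_patch(
    "http://localhost:8010",
    {"max_sessions": 8, "wake_words": "jarvis"},
)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Because startupOnly settings are rejected after startup, a caller would probably want to catch urllib.error.HTTPError around urlopen and surface the server's error body.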

Wake Word Mode

Pass wake word flags to enable wake word activation for all browser sessions:
python example_fastapi_server/server.py \
  --engine faster_whisper \
  --model small.en \
  --realtime-model tiny.en \
  --wakeword-backend pvporcupine \
  --wake-words jarvis \
  --wake-words-sensitivity 0.7 \
  --wake-word-timeout 5 \
  --wake-word-followup-window 5
Wake word state transitions (wait, detect, timeout, follow-up voice window) are surfaced as timeline WebSocket events and visualised in the browser UI.

Tests

Fast unit tests use fake schedulers and do not load any ASR models:
python -m unittest -v \
  tests.unit.test_fastapi_server_protocol \
  tests.unit.test_fastapi_server_multi_user
Opt-in real-engine integration test (streams a reference audio file through multiple parallel sessions and compares WER):
REALTIMESTT_RUN_FASTAPI_MULTI_USER_PERF=1 \
python -m unittest -v tests.unit.test_fastapi_server_multi_user_asr_integration
