Run the RealtimeSTT FastAPI Browser Streaming Server

The example_fastapi_server directory in the RealtimeSTT repository contains a reference FastAPI application that streams microphone audio from a browser into per-session RealtimeSTT recorder instances. It serves a browser UI, exposes a WebSocket endpoint for audio and commands, and provides HTTP endpoints for health checks and operational metrics. Concurrent sessions share a small set of inference workers.

The FastAPI server is not installed by pip install RealtimeSTT. You must clone the repository (or install directly from Git) to use it. For pip-only setups, use the standalone Python recorder/API examples instead.

Installation

Clone the repository and create a virtual environment

git clone https://github.com/KoljaB/RealtimeSTT.git
cd RealtimeSTT
python -m venv .venv-fastapi
source .venv-fastapi/bin/activate        # Windows: .\.venv-fastapi\Scripts\Activate.ps1
python -m pip install -U pip setuptools wheel

Install server dependencies

pip install -r requirements.txt
pip install -r example_fastapi_server/requirements.txt

Install a transcription engine

The default engine is faster-whisper:

pip install "RealtimeSTT[faster-whisper]"

For other engines see Engine Selection below.

The FastAPI server is intentionally kept source-only so that the core wheel does not pull in web-server dependencies. Pip-only installs should use the Python recorder examples in the repository's examples/ directory instead.

Starting the Server

python example_fastapi_server/server.py --host 0.0.0.0 --port 8010

Then open http://localhost:8010 in your browser. The UI connects automatically to the WebSocket endpoint and streams microphone audio.

Engine Selection

Pass --engine and --model (plus --realtime-engine and --realtime-model for interim transcription) to select a different backend.

faster-whisper (default)

python example_fastapi_server/server.py \
  --host 0.0.0.0 \
  --port 8010 \
  --engine faster_whisper \
  --model small.en \
  --realtime-model tiny.en \
  --device cuda \
  --language en

whisper.cpp (CPU)

pip install pywhispercpp
python example_fastapi_server/server.py \
  --host 0.0.0.0 \
  --port 8010 \
  --engine whisper_cpp \
  --model tiny.en \
  --realtime-engine whisper_cpp \
  --realtime-model tiny.en \
  --device cpu \
  --beam-size 5 \
  --beam-size-realtime 1 \
  --download-root test-model-cache/pywhispercpp \
  --engine-options '{"model":{"n_threads":8,"redirect_whispercpp_logs_to":null}}' \
  --realtime-engine-options '{"model":{"n_threads":8},"transcribe":{"single_segment":true,"no_context":true,"print_timestamps":false}}'

sherpa-onnx Moonshine (CPU)

pip install sherpa-onnx
python example_fastapi_server/server.py \
  --engine sherpa_onnx_moonshine \
  --model sherpa-onnx-moonshine-tiny-en-int8 \
  --realtime-engine sherpa_onnx_moonshine \
  --realtime-model sherpa-onnx-moonshine-tiny-en-int8 \
  --device cpu \
  --language en \
  --download-root test-model-cache/sherpa-onnx \
  --engine-options '{"num_threads":2,"provider":"cpu"}' \
  --realtime-engine-options '{"num_threads":2,"provider":"cpu"}' \
  --realtime-processing-pause 0.8 \
  --realtime-use-syllable-boundaries

Parakeet (CUDA)

pip install "nemo_toolkit[asr]" soundfile librosa
python example_fastapi_server/server.py \
  --engine parakeet \
  --model nvidia/parakeet-tdt-0.6b-v3 \
  --realtime-engine faster_whisper \
  --realtime-model tiny.en \
  --device cuda \
  --language en

Use --use-main-model-for-realtime to share a single inference lane for both final and realtime work, reducing GPU memory usage:

python example_fastapi_server/server.py \
  --engine parakeet \
  --model nvidia/parakeet-tdt-0.6b-v3 \
  --use-main-model-for-realtime \
  --profile parakeet-low-latency \
  --device cuda \
  --language en
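
The --engine-options and --realtime-engine-options values shown in this section are JSON strings wrapped in POSIX single quotes, which do not carry over unchanged to PowerShell or cmd. One way around shell quoting is to build the argument list in Python and let json.dumps produce the string. The helper below is a minimal sketch: build_server_command is a hypothetical name, not part of the repository, and it only uses flags that appear in the commands above.

```python
import json
import sys


def build_server_command(engine, model, engine_options=None, extra_args=()):
    """Return an argv list for launching the example server (illustrative helper)."""
    argv = [
        sys.executable,
        "example_fastapi_server/server.py",
        "--engine", engine,
        "--model", model,
    ]
    if engine_options is not None:
        # Serialize the options dict to the JSON string passed to the flag.
        argv += ["--engine-options", json.dumps(engine_options)]
    argv += list(extra_args)
    return argv


if __name__ == "__main__":
    command = build_server_command(
        "sherpa_onnx_moonshine",
        "sherpa-onnx-moonshine-tiny-en-int8",
        engine_options={"num_threads": 2, "provider": "cpu"},
        extra_args=["--device", "cpu", "--language", "en"],
    )
    print(command)
    # To launch from the repository root, pass the list to subprocess.run:
    # import subprocess; subprocess.run(command, check=True)
```

Because the command is passed as a list rather than a single string, no shell is involved and the JSON reaches the server exactly as serialized.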

Multi-User Session Isolation

The server is designed for concurrent browser clients. Each WebSocket connection receives a unique sessionId and owns its own lightweight state machine:

Per-session: audio buffer, VAD state (WebRTC + Silero), transcript segment IDs, realtime text, final text, warnings, and error state.
Shared: heavy ASR inference workers — one final model lane and one realtime model lane (or a single shared lane with --use-main-model-for-realtime).

Inference is scheduled through per-session fair queues. Final jobs are preserved up to the configured per-session limit. Stale realtime jobs are coalesced so a single noisy client cannot fill the global queue with obsolete interim work. Control capacity with these flags:

python example_fastapi_server/server.py \
  --max-sessions 4 \
  --max-active-speakers 4 \
  --max-global-inference-queue-depth 64 \
  --max-final-queue-depth-per-session 8 \
  --max-realtime-queue-age-ms 1500 \
  --max-audio-queue-seconds-per-session 30

When --max-sessions is reached, new WebSocket clients receive an admission error and the connection is closed with code 1013 (Try Again Later). When active speaker capacity is reached, already accepted sessions receive a warning and their existing final work is preserved.
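
To make the scheduling idea concrete, here is a small sketch of a per-session fair queue with realtime coalescing. It is an illustration of the behaviour described above, not the server's actual scheduler: the class and method names are hypothetical, it uses a single queue instead of separate final and realtime lanes, and serving final jobs before realtime jobs within a session is an assumption.

```python
from collections import deque


class FairInferenceQueue:
    """Illustrative per-session fair queue; not the server's real implementation."""

    def __init__(self, max_final_per_session=8):
        self.max_final_per_session = max_final_per_session
        self._final = {}       # session_id -> deque of pending final jobs
        self._realtime = {}    # session_id -> newest realtime job only
        self._order = deque()  # round-robin order of sessions that have work

    def _mark_ready(self, session_id):
        if session_id not in self._order:
            self._order.append(session_id)

    def submit_final(self, session_id, job):
        """Queue a final job; return False if the per-session limit is hit."""
        queue = self._final.setdefault(session_id, deque())
        if len(queue) >= self.max_final_per_session:
            return False
        queue.append(job)
        self._mark_ready(session_id)
        return True

    def submit_realtime(self, session_id, job):
        """Coalesce: a newer interim job replaces any older one for the session."""
        self._realtime[session_id] = job
        self._mark_ready(session_id)

    def next_job(self):
        """Pop one job, rotating across sessions so none can starve the others."""
        while self._order:
            session_id = self._order.popleft()
            finals = self._final.get(session_id)
            if finals:
                job = ("final", session_id, finals.popleft())
            elif session_id in self._realtime:
                job = ("realtime", session_id, self._realtime.pop(session_id))
            else:
                continue
            if self._final.get(session_id) or session_id in self._realtime:
                self._order.append(session_id)  # still has work: back of the line
            return job
        return None
```

In this sketch a client that submits a hundred interim jobs still occupies one slot, because each submission overwrites the previous one, while final jobs from every session are kept in order up to the per-session limit.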

WebSocket Protocol

The browser sends binary audio packets to /ws/transcribe with the following layout:

[4 bytes: little-endian uint32 metadata length]
[N bytes: UTF-8 JSON metadata]
[remaining bytes: 16-bit little-endian mono PCM audio]

Example metadata object sent by the browser:

{
  "sampleRate": 48000,
  "channels": 1,
  "format": "pcm_s16le",
  "frames": 1920
}
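
The packet layout can be sketched in Python with the standard struct and json modules. The function names below are hypothetical, and the sketch assumes that frames counts sample frames (samples per channel), which matches the example above where 1920 frames at 48000 Hz is 40 ms of audio.

```python
import json
import struct


def encode_audio_packet(pcm_s16le: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Build one binary packet: length prefix, JSON metadata, then raw PCM."""
    bytes_per_frame = 2 * channels  # 16-bit samples
    if len(pcm_s16le) % bytes_per_frame:
        raise ValueError("PCM payload is not a whole number of frames")
    metadata = {
        "sampleRate": sample_rate,
        "channels": channels,
        "format": "pcm_s16le",
        "frames": len(pcm_s16le) // bytes_per_frame,  # assumed: sample frames
    }
    meta_bytes = json.dumps(metadata).encode("utf-8")
    # "<I" is a little-endian unsigned 32-bit integer.
    return struct.pack("<I", len(meta_bytes)) + meta_bytes + pcm_s16le


def decode_audio_packet(packet: bytes):
    """Split a packet back into (metadata dict, PCM bytes)."""
    if len(packet) < 4:
        raise ValueError("packet is shorter than the 4-byte length prefix")
    (meta_len,) = struct.unpack_from("<I", packet, 0)
    if 4 + meta_len > len(packet):
        raise ValueError("metadata length exceeds packet size")
    metadata = json.loads(packet[4:4 + meta_len].decode("utf-8"))
    return metadata, packet[4 + meta_len:]


if __name__ == "__main__":
    silence = b"\x00\x00" * 1920  # 1920 mono frames of silence
    meta, pcm = decode_audio_packet(encode_audio_packet(silence, 48000))
    print(meta, len(pcm), meta["frames"] / meta["sampleRate"])
```

A decoder written this way can reject malformed packets before touching the audio, since both the prefix and the metadata length are checked against the packet size.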

Text commands are sent as JSON objects:

{ "type": "start" }

Supported commands: start, stop, clear, ping, metrics.

The server emits the following event types:

Event type	Description
`hello`	Assigns `clientId` and `sessionId` to the new connection.
`ready`	Model lanes are initialized and the session can begin streaming.
`timeline`	Segment timing and wake word state transitions.
`realtime`	Interim transcript text for a session-local `segmentId`.
`final`	Final transcript text for the same `segmentId`; replaces the interim block.
`status`	Session or server state update.
`warning`	Recoverable issue (e.g., approaching capacity limits).
`error`	Command, packet, admission, or runtime error.
`clear`	Session transcript reset acknowledgement.
`pong`	Response to a `ping` command.
`metrics`	Per-session metrics in response to a `metrics` command.

Transcript-bearing events (realtime, final) include sessionId and are routed only to the session that produced them. They may also include a segment object containing recording start/end timestamps, duration, pre-recording buffer range, and wake word timing when available.
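
The way a final event replaces the interim block for the same segmentId can be sketched as a small client-side store. This is only an illustration: TranscriptView is a hypothetical class, and the sketch assumes that segmentId and a text field sit at the top level of each event, which the event table does not spell out.

```python
import json


class TranscriptView:
    """Illustrative client-side store keyed by segmentId."""

    def __init__(self):
        # segmentId -> {"text": str, "final": bool}; dicts keep insertion order.
        self.blocks = {}

    def apply(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        kind = event.get("type")
        if kind == "clear":
            self.blocks.clear()
        elif kind in ("realtime", "final"):
            segment_id = event["segmentId"]  # assumed field location
            block = self.blocks.get(segment_id)
            if kind == "realtime" and block and block["final"]:
                return  # ignore a late interim update for a finalized segment
            self.blocks[segment_id] = {
                "text": event.get("text", ""),  # assumed field name
                "final": kind == "final",
            }
        # Other event types (status, warning, pong, ...) are ignored here.

    def render(self) -> str:
        return " ".join(b["text"] for b in self.blocks.values() if b["text"])


if __name__ == "__main__":
    view = TranscriptView()
    view.apply('{"type": "realtime", "segmentId": 1, "text": "hello wor"}')
    view.apply('{"type": "final", "segmentId": 1, "text": "Hello world."}')
    view.apply('{"type": "realtime", "segmentId": 2, "text": "how are"}')
    print(view.render())
```

Keying blocks by segmentId keeps each segment in its original position when the final text arrives, so the transcript does not reorder as interim text is replaced.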

Health and Metrics

Use the /health endpoint for readiness checks and basic load information:

curl http://localhost:8010/health

Use /api/metrics for detailed operational data including queue depth, latency percentiles, coalescing counters, drop counters, and worker utilization:

curl http://localhost:8010/api/metrics

All HTTP and WebSocket endpoints:

Endpoint	Method	Description
`/`	`GET`	Browser UI.
`/health`	`GET`	Readiness, active sessions/speakers, startup errors, scheduler state.
`/api/config`	`GET`	Public settings, limits, supported engines, and runtime settings contract.
`/api/config`	`PATCH`	Update runtime-safe settings without restarting.
`/api/metrics`	`GET`	Counters, queue depth, p50/p95 latency, coalescing, drops, worker busy ratio.
`/ws/transcribe`	`WebSocket`	Browser audio stream and command channel.

Some settings can be changed while the server is running:

curl -X PATCH http://localhost:8010/api/config \
  -H 'Content-Type: application/json' \
  -d '{"settings":{"max_sessions":8,"wake_words":"jarvis"}}'

The GET /api/config response separates settings into activeSessionSafe, newSessionOnly, and startupOnly buckets. Engine and model paths are startupOnly — changing them after startup is rejected because shared inference workers are already initialized.
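
The same runtime update can be issued from Python with the standard library. The sketch below mirrors the curl command above; the function names are hypothetical, and it assumes the server answers a successful PATCH with a JSON body and a rejected one with an HTTP error status.

```python
import json
import urllib.request


def build_config_patch(base_url: str, settings: dict) -> urllib.request.Request:
    """Prepare a PATCH /api/config request without sending it."""
    body = json.dumps({"settings": settings}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/config",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def send_config_patch(request: urllib.request.Request, timeout: float = 5.0) -> dict:
    """Send the request; urllib raises HTTPError on a 4xx or 5xx response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumed: the response body is JSON.
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    request = build_config_patch("http://localhost:8010", {"max_sessions": 8})
    print(request.get_method(), request.full_url, request.data)
    # With the server running: print(send_config_patch(request))
```

Separating request construction from sending makes it easy to log or inspect the payload before applying a change to a live server.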

Wake Word Mode

Pass wake word flags to enable wake word activation for all browser sessions:

python example_fastapi_server/server.py \
  --engine faster_whisper \
  --model small.en \
  --realtime-model tiny.en \
  --wakeword-backend pvporcupine \
  --wake-words jarvis \
  --wake-words-sensitivity 0.7 \
  --wake-word-timeout 5 \
  --wake-word-followup-window 5

Wake word state transitions (wait, detect, timeout, follow-up voice window) are surfaced as timeline WebSocket events and visualised in the browser UI.

Tests

Fast unit tests use fake schedulers and do not load any ASR models:

python -m unittest -v \
  tests.unit.test_fastapi_server_protocol \
  tests.unit.test_fastapi_server_multi_user

The opt-in real-engine integration test streams a reference audio file through multiple parallel sessions and compares word error rate (WER):

REALTIMESTT_RUN_FASTAPI_MULTI_USER_PERF=1 \
python -m unittest -v tests.unit.test_fastapi_server_multi_user_asr_integration
