This page covers common RealtimeSTT setup and runtime issues organized by category. Engine-specific notes live in the individual engine pages under the Engines section.

Installation Issues

Install PortAudio system packages before running pip install.
Linux:
sudo apt-get update
sudo apt-get install python3-dev portaudio19-dev
pip install "RealtimeSTT[faster-whisper]"
macOS:
brew install portaudio
pip install "RealtimeSTT[faster-whisper]"
Windows: Prefer pre-built wheels from PyPI where available. Run pip install from a normal terminal with the intended Python environment active.
RealtimeSTT imports optional engines lazily. If an engine import fails at runtime, install that backend’s package:
Engine | Required package
whisper_cpp | pywhispercpp
openai_whisper | openai-whisper
sherpa_onnx_* | sherpa-onnx
parakeet | nemo_toolkit[asr], soundfile
granite_speech, moonshine, cohere_transcribe | transformers, torch
qwen3_asr | qwen-asr
omnilingual_asr | RealtimeSTT[omnilingual] on Linux/WSL2 with Python 3.11.x, plus a compatible PyTorch stack
kroko_onnx | RealtimeSTT[kroko-builder,silero-onnx-cpu], then stt-install-kroko --build for recorder-based smoke tests and live microphone use

Audio and Microphone Issues

If no audio is captured or the wrong microphone is used:
  • Confirm the OS granted microphone permission to the terminal or Python application.
  • Check the default input device in your system’s sound settings.
  • Pass input_device_index explicitly if the wrong device is being selected.
  • Run a small PyAudio device-list script or the recorder demo to enumerate available devices.
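A minimal sketch of such a device-list script, assuming PyAudio is installed. The helper names input_devices and list_input_devices are illustrative and not part of RealtimeSTT; the index printed for each device is the value to try for input_device_index.

```python
def input_devices(device_infos):
    """Return (index, name) pairs for devices that expose input channels."""
    return [
        (int(info["index"]), info["name"])
        for info in device_infos
        if info.get("maxInputChannels", 0) > 0
    ]


def list_input_devices():
    """Ask PyAudio for every device and keep only the input-capable ones."""
    import pyaudio  # imported lazily so input_devices() works without PyAudio

    pa = pyaudio.PyAudio()
    try:
        infos = [
            pa.get_device_info_by_index(i)
            for i in range(pa.get_device_count())
        ]
    finally:
        pa.terminate()  # always release the PortAudio handle
    return input_devices(infos)


# Usage (requires a working PortAudio/PyAudio install):
# for index, name in list_input_devices():
#     print(f"{index}: {name}")
```

Devices that report zero input channels (speakers, HDMI outputs) are filtered out, so the remaining indices are the only plausible candidates.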
Overflow warnings mean audio is arriving faster than it is being consumed. Try one or more of the following:
  • Switch to a smaller or faster model.
  • Set device="cuda" when a GPU is available.
  • Increase realtime_processing_pause.
  • Lower the realtime beam size.
  • Increase queue or capacity settings in the FastAPI server.
When feeding audio via feed_audio(), ensure your chunks are:
  • 16-bit signed PCM bytes
  • Mono channel
  • Labeled with the correct original_sample_rate
  • In the correct byte order
See the external audio documentation for a complete guide.
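As an illustration of the chunk requirements above, the following sketch converts floating-point frames into 16-bit signed mono PCM bytes using only the standard library. The helper name to_pcm16_mono is hypothetical, and little-endian byte order is an assumption (it is the native order on typical x86 and ARM machines); check the external audio documentation for the format your setup expects.

```python
import struct


def to_pcm16_mono(frames):
    """Convert frames of float samples in [-1.0, 1.0] to 16-bit mono PCM.

    `frames` is an iterable of per-frame channel tuples, for example
    [(left, right), (left, right), ...] for stereo input.
    """
    samples = []
    for frame in frames:
        mono = sum(frame) / len(frame)        # average channels down to mono
        mono = max(-1.0, min(1.0, mono))      # clip to the valid range
        samples.append(int(round(mono * 32767)))
    # "<h" = little-endian signed 16-bit (assumed byte order)
    return struct.pack("<%dh" % len(samples), *samples)


# Usage, assuming `recorder` is an AudioToTextRecorder and the source
# audio was captured at 48 kHz:
# chunk = to_pcm16_mono(stereo_frames)
# recorder.feed_audio(chunk, original_sample_rate=48000)
```

Passing the source rate as original_sample_rate matters because the bytes themselves carry no rate information.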

CUDA and GPU Issues

Install a PyTorch build that matches your CUDA runtime and driver version. Use the PyTorch install selector for your exact CUDA version. For example, for CUDA 12.1:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
If GPU setup is not required, use device="cpu" with a small model.
If the GPU runs out of memory:
  • Use a smaller model.
  • Use a smaller realtime model than the final model.
  • Set use_main_model_for_realtime=True to avoid loading two models, accepting some inference contention.
  • Reduce batch sizes.
  • Use CPU INT8 engines such as sherpa_onnx_moonshine for CPU-only deployments.
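One way to avoid hard-coding device="cuda" on machines where the GPU stack may be missing is to probe for it at startup. This is a sketch, not RealtimeSTT behavior; pick_device is a hypothetical helper, and it only checks that PyTorch can see a CUDA device, not that there is enough free memory.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # may be absent or broken on CPU-only installs
    except (ImportError, OSError):
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Usage, assuming the recorder accepts the `device` parameter described above:
# recorder = AudioToTextRecorder(device=pick_device(), ...)
```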

Model Issues

If a model download fails:
  • Check network access from the machine running the script.
  • Set download_root to a writable directory.
  • Authenticate if the model is gated: use huggingface-cli login or set the HF_TOKEN environment variable.
  • Pre-download models in the same environment that runs the application.
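Two of these checks can be automated before the application starts. The sketch below is a hypothetical preflight helper, not part of RealtimeSTT. It verifies that the directory is writable and reports whether HF_TOKEN is set; a missing HF_TOKEN is only a hint, since a token stored by huggingface-cli login would not show up in the environment.

```python
import os
import tempfile


def preflight(download_root):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    try:
        os.makedirs(download_root, exist_ok=True)
        # Creating a temporary file proves the directory is writable.
        with tempfile.TemporaryFile(dir=download_root):
            pass
    except OSError as exc:
        problems.append(f"download_root is not writable: {exc}")
    if not os.environ.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (only needed for gated models)")
    return problems


# Usage:
# for problem in preflight("models"):
#     print("warning:", problem)
```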
sherpa-onnx engines do not download model bundles automatically. Download and extract the required archive, then pass the extracted directory path:
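from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="sherpa_onnx_moonshine",
    # Path to the extracted model directory, not the downloaded archive
    model="test-model-cache/sherpa-onnx/sherpa-onnx-moonshine-tiny-en-int8",
    device="cpu",
    language="en",
)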
If a required model file is missing, the error message names it. Confirm you are pointing at the extracted directory, not the .tar.bz2 archive file.
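The directory-versus-archive mistake can be caught before the recorder is constructed. The following is a sketch with a hypothetical helper name, check_model_dir; it does not know which files a given sherpa-onnx bundle requires, so it only confirms that the path is a non-empty directory.

```python
from pathlib import Path


def check_model_dir(path):
    """Return the path as a string if it is a non-empty directory."""
    p = Path(path)
    if p.is_file():
        # Typical mistake: passing the downloaded .tar.bz2 itself.
        raise ValueError(
            f"{p} is a file; extract the archive and pass the extracted directory"
        )
    if not p.is_dir():
        raise FileNotFoundError(f"{p} does not exist")
    if not any(p.iterdir()):
        raise ValueError(f"{p} is empty; the archive may not have been extracted")
    return str(p)


# Usage:
# model_dir = check_model_dir(
#     "test-model-cache/sherpa-onnx/sherpa-onnx-moonshine-tiny-en-int8"
# )
```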

Runtime Behavior

The recorder waits for VAD to detect speech end before returning. If text() hangs:
  • Feed trailing silence when using external audio via feed_audio().
  • Lower post_speech_silence_duration for faster turn finalization.
  • Check that microphone input is being received and VAD sensitivity is tuned appropriately.
  • Confirm speech duration exceeds min_length_of_recording.
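For the external-audio case, trailing silence is just a run of zero-valued samples. A minimal sketch, assuming 16-bit mono PCM as described for feed_audio(); the helper name silence_pcm16 is illustrative, and the amount of silence needed is an assumption (it should comfortably exceed post_speech_silence_duration).

```python
def silence_pcm16(duration_s, sample_rate=16000):
    """Return `duration_s` seconds of 16-bit mono PCM silence as bytes."""
    n_samples = int(duration_s * sample_rate)
    return b"\x00\x00" * n_samples  # two zero bytes per 16-bit sample


# Usage, assuming `recorder` is an AudioToTextRecorder fed via feed_audio():
# recorder.feed_audio(silence_pcm16(1.0), original_sample_rate=16000)
```

Feeding this after the last real chunk gives the VAD a quiet stretch in which to detect the end of speech.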
If final transcription is slow:
  • Use a smaller final model.
  • Enable CUDA with device="cuda".
  • Reduce beam_size.
  • Enable early_transcription_on_silence carefully.
  • For realtime UX, keep final model accuracy high but use a smaller separate realtime model.
If realtime updates lag behind speech:
  • Use realtime_model_type="tiny.en" or another small realtime model.
  • Set beam_size_realtime=1.
  • Increase realtime_processing_pause.
  • Enable syllable-boundary scheduling and tune follow-up delays.
  • Use a separate, lighter realtime engine and model if the final model is heavy.
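The latency-related settings above can be collected in one place, as in the sketch below. The numeric values and the "small.en" model name are illustrative starting points rather than recommendations, and enable_realtime_transcription is assumed to be the switch that turns realtime updates on; verify the names against the configuration reference for your version.

```python
# Illustrative values only; tune them for your hardware and model.
recorder_kwargs = {
    "model": "small.en",                   # final model (assumed name)
    "beam_size": 3,                        # lower = faster final pass
    "post_speech_silence_duration": 0.4,   # shorter = faster turn end
    "enable_realtime_transcription": True, # assumed flag name
    "realtime_model_type": "tiny.en",      # small separate realtime model
    "beam_size_realtime": 1,               # cheapest realtime decoding
    "realtime_processing_pause": 0.3,      # larger = fewer realtime passes
    "device": "cuda",                      # use "cpu" when no GPU is present
}

# Usage:
# from RealtimeSTT import AudioToTextRecorder
# recorder = AudioToTextRecorder(**recorder_kwargs)
```

Keeping the settings in a dictionary makes it easy to change one value at a time and compare the effect on latency.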
If the wake word is not detected:
  • Confirm wake_words or wakeword_backend is set.
  • For Porcupine, use one of the supported built-in keyword names.
  • For OpenWakeWord, pass valid model files and the matching framework.
  • Tune wake_words_sensitivity.
  • Test in a quiet room with the microphone close to your mouth first.
If the wake word itself shows up in the transcription, increase wake_word_buffer_duration so more of the wake word audio is excluded from the following recording window.

FastAPI Server Issues

If the browser client cannot connect:
  • Confirm the server is running and listening on the expected host and port.
  • Open http://localhost:8010 from the same machine first to rule out a networking issue.
  • Check that the browser has microphone permission.
  • Check the /health endpoint for startup errors.
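The /health check can be scripted with the standard library. This sketch assumes the default address shown above (localhost:8010) and makes no assumption about the response body format; it simply returns whatever text the server sends. The function names are hypothetical.

```python
import urllib.error
import urllib.request


def health_url(host="localhost", port=8010):
    """Build the URL of the server's /health endpoint."""
    return f"http://{host}:{port}/health"


def check_health(url, timeout=3.0):
    """Return (ok, detail); ok is True only for an HTTP 200 response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(2048).decode("utf-8", errors="replace")
            return resp.status == 200, body
    except urllib.error.HTTPError as exc:
        return False, f"HTTP {exc.code}"       # server answered with an error
    except OSError as exc:
        return False, f"unreachable: {exc}"    # refused, timed out, DNS, ...


# Usage:
# ok, detail = check_health(health_url())
# print("healthy" if ok else "problem", detail)
```

An "unreachable" result points at the host, port, or firewall; an HTTP error points at the server itself.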
If the server refuses new sessions, it has reached the --max-sessions limit. Increase the limit only if the selected engine and hardware can handle the additional load.
The server intentionally coalesces stale realtime work to preserve final transcription accuracy. Tune these parameters to improve throughput:
  • --max-realtime-queue-age-ms
  • --realtime-processing-pause
  • --max-global-inference-queue-depth
  • model size and engine choice

Windows Notes

Some multiprocessing and model-loading tests may need to run from a normal terminal rather than a restricted sandbox environment. Parakeet/NeMo and Qwen vLLM are Linux-oriented; use WSL2 for those real-model paths on a Windows workstation.
