RealtimeSTT includes thin adapter engines for several model families that run through Hugging Face Transformers or Transformers-compatible tooling. These adapters share a common pattern: models are downloaded automatically from Hugging Face on first use, and backend behavior is tunable through transcription_engine_options.
All Transformers-backed engines download model files from Hugging Face on first use. Gated or private models require you to accept the model’s license terms on Hugging Face and may need an HF_TOKEN environment variable or hf_token option. Set download_root to a writable path to control where model files are cached.
- Granite Speech: IBM Granite speech-to-text via AutoModelForSpeechSeq2Seq
- Qwen3 ASR: Alibaba Qwen3 ASR via the qwen-asr package, with optional vLLM backend
- Moonshine (Transformers): see the dedicated Moonshine page
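The token handling described in the note can be sketched as a small helper. This is a minimal sketch, not RealtimeSTT's own code: the function name resolve_hf_token is hypothetical, and where the resulting value is passed (for example as hf_token inside transcription_engine_options) is an assumption to check against your installed version.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return an explicit token if given, else HF_TOKEN from the environment.

    Hypothetical helper: it only decides which token value to use and does
    not contact Hugging Face or RealtimeSTT.
    """
    env = os.environ if env is None else env
    token = explicit or env.get("HF_TOKEN")
    # Treat missing, empty, or whitespace-only values as "no token".
    if token is None or not token.strip():
        return None
    return token.strip()
```

Preferring the explicit value over the environment keeps per-run overrides possible without touching the shell configuration.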
Shared Installation
Installing the transformers extra covers Granite Speech, Moonshine, and Cohere Transcribe.
The extra installs transformers and torch. For GPU use, install a CUDA-enabled PyTorch wheel first.
Granite Speech
The Granite Speech engine (granite_speech / granite) uses IBM’s ibm-granite/granite-speech-4.1-2b model through AutoModelForSpeechSeq2Seq and AutoProcessor.
Install
Basic Usage
The default model is ibm-granite/granite-speech-4.1-2b. On CPU, the model loads in float32; on GPU it defaults to bfloat16.
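The device-dependent dtype default and the engine selection can be illustrated with two small helpers. This is a sketch under stated assumptions: both function names are hypothetical, the dtype is returned as a plain string, not a torch dtype, and the transcription_engine keyword is assumed to be how the recorder selects the adapter.

```python
def default_granite_dtype(device: str) -> str:
    """Mirror the documented default: bfloat16 on GPU, float32 on CPU."""
    return "bfloat16" if device.lower().startswith("cuda") else "float32"


def granite_recorder_kwargs(device: str = "cpu") -> dict:
    """Build keyword arguments for a recorder that uses Granite Speech.

    The "transcription_engine" key is an assumption about the constructor
    signature; verify it against your installed RealtimeSTT version.
    """
    return {
        "transcription_engine": "granite_speech",
        "model": "ibm-granite/granite-speech-4.1-2b",
        "device": device,
    }
```

If the assumption about the constructor holds, the dictionary can be unpacked into AudioToTextRecorder with the double-star syntax.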
Controlling the Model Cache
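Because download_root must point to a writable path, it can help to verify the directory before the first model download starts. The helper below is a minimal sketch with a hypothetical name (ensure_writable_cache); it only prepares the path and leaves passing it as download_root to your own configuration.

```python
import tempfile
from pathlib import Path


def ensure_writable_cache(path: str) -> str:
    """Create the cache directory if needed and confirm it is writable.

    Raises OSError (for example PermissionError) when the directory cannot
    be created or written to.
    """
    root = Path(path).expanduser().resolve()
    root.mkdir(parents=True, exist_ok=True)
    # Creating and removing a temporary file tests writability directly,
    # which is more reliable than inspecting permission bits.
    with tempfile.NamedTemporaryFile(dir=root):
        pass
    return str(root)
```

The returned absolute path is what you would then supply as download_root.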
Granite Options Reference
| Option | Meaning |
|---|---|
| engine_options["processor"] | Processor load options passed to AutoProcessor.from_pretrained. |
| engine_options["model"] | Model load options passed to AutoModelForSpeechSeq2Seq.from_pretrained. |
| engine_options["generate"] | Generation options merged into model.generate(...). |
| engine_options["prompt"] | Prompt text used for transcription. Defaults to "<|audio|>transcribe the speech with proper punctuation and capitalization." |
| engine_options["include_language_in_prompt"] | Appends the active language to the prompt when language is set. |
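The options in the table can be collected into one dictionary. The sketch below assumes these keys live in the dictionary passed as transcription_engine_options; the nested values are illustrative choices (low_cpu_mem_usage and max_new_tokens are standard Transformers arguments), and unknown_option_keys is a hypothetical helper for catching misspelled keys before they are silently ignored.

```python
# Keys documented in the Granite options table.
GRANITE_OPTION_KEYS = {
    "processor",
    "model",
    "generate",
    "prompt",
    "include_language_in_prompt",
}

# Illustrative values only; adjust them to your model and hardware.
granite_options = {
    "processor": {},
    "model": {"low_cpu_mem_usage": True},
    "generate": {"max_new_tokens": 256, "num_beams": 1},
    "prompt": "<|audio|>transcribe the speech with proper punctuation and capitalization.",
    "include_language_in_prompt": True,
}


def unknown_option_keys(options: dict, allowed=GRANITE_OPTION_KEYS) -> list:
    """Return option keys that are not in the documented set, sorted."""
    return sorted(set(options) - set(allowed))
```

Running the check on a dictionary with a typo such as "promt" returns that key, which makes the mistake visible early.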
Qwen3 ASR
The Qwen3 ASR engine (qwen3_asr / qwen_asr) uses Alibaba’s Qwen3-ASR model family through the qwen-asr package. It supports a standard Transformers backend and an optional vLLM backend for high-throughput server deployments.
Install
- Transformers backend
- vLLM backend
Basic Usage
The default model is Qwen/Qwen3-ASR-1.7B. Two-letter ISO language codes are mapped to full language names for common languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, and Chinese.
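The code-to-name mapping can be sketched as a lookup table. This illustrates the idea and is not the engine's actual implementation: the names LANGUAGE_NAMES and to_language_name are hypothetical, and passing unmapped codes through unchanged is an assumption about the fallback behavior.

```python
from typing import Optional

# ISO 639-1 codes for the languages listed above.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pt": "Portuguese",
    "ru": "Russian",
    "zh": "Chinese",
}


def to_language_name(code: Optional[str]) -> Optional[str]:
    """Map a two-letter code to its full name, case-insensitively.

    Unmapped values are returned unchanged (assumed fallback); empty or
    missing input yields None.
    """
    if not code:
        return None
    return LANGUAGE_NAMES.get(code.strip().lower(), code)
```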
Using the vLLM Backend
Qwen3 ASR Options Reference
| Option | Meaning |
|---|---|
| engine_options["backend"] | "transformers" (default) or "vllm". |
| engine_options["model"] | Model loader options. On the vLLM path, the model key is passed to LLM(...). |
| engine_options["transcribe"] | Transcription options merged into model.transcribe(...). |
| engine_options["language"] | Language when not passed through the top-level language option. |
| engine_options["return_time_stamps"] | Request timestamps where supported. |
| engine_options["sample_rate"] | Sample rate for in-memory audio. Defaults to 16000. |
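A small builder makes the documented defaults explicit and rejects unsupported backend names early. This is a minimal sketch: qwen_options is a hypothetical helper, and it assumes the resulting dictionary is what you pass as transcription_engine_options.

```python
from typing import Optional

SUPPORTED_BACKENDS = ("transformers", "vllm")


def qwen_options(
    backend: str = "transformers",
    language: Optional[str] = None,
    return_time_stamps: bool = False,
    sample_rate: int = 16000,
) -> dict:
    """Build a Qwen3 ASR options dictionary using the documented keys."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"backend must be one of {SUPPORTED_BACKENDS}, got {backend!r}"
        )
    options = {
        "backend": backend,
        "return_time_stamps": return_time_stamps,
        "sample_rate": sample_rate,
    }
    # Only set the language here when it is not given at the top level.
    if language:
        options["language"] = language
    return options
```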
Moonshine (Transformers)
Moonshine uses the same Transformers infrastructure. See the Moonshine engine page for install instructions, model options, and the sherpa-onnx CPU INT8 alternative path.
Resource Considerations
Transformers-backed engines generally have:
- Larger model downloads than the default faster-whisper path
- Higher memory requirements, especially on GPU
- Stricter CUDA/PyTorch compatibility: start from a clean virtual environment if you hit dependency conflicts
Troubleshooting
- transformers missing a required class: upgrade Transformers to the latest release.
- Model download failures: check Hugging Face network access, confirm the model’s gated access terms have been accepted, and set download_root to a writable directory.
- CUDA out of memory: reduce model size, switch to CPU, or run one shared model lane in the server.
- Qwen vLLM fails on Windows: move the run to Linux or WSL2.
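Two of the items above can be detected before any model is loaded. The following preflight check is a sketch, not part of RealtimeSTT: preflight_issues is a hypothetical helper that uses only the standard library to report a missing transformers package and a vLLM request on Windows.

```python
import importlib.util
import platform
from typing import List, Optional


def preflight_issues(
    backend: str = "transformers",
    system: Optional[str] = None,
) -> List[str]:
    """Return human-readable problems found before loading a model."""
    system = platform.system() if system is None else system
    issues = []
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec("transformers") is None:
        issues.append("transformers is not installed")
    if backend == "vllm" and system == "Windows":
        issues.append("vLLM backend on Windows: run on Linux or WSL2 instead")
    return issues
```

The system parameter exists only so the check can be exercised on any platform; in normal use it is left unset.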
