RealtimeSTT includes thin adapter engines for several model families that run through Hugging Face Transformers or Transformers-compatible tooling. These adapters share a common pattern: model files are downloaded from Hugging Face on first use, and backend behavior is tuned through transcription_engine_options.
Gated or private models require you to accept the model’s license terms on Hugging Face and may also need an HF_TOKEN environment variable or the hf_token option. To control where model files are cached, set download_root to a writable path.
The engines covered on this page are:
  • Granite Speech — IBM Granite speech-to-text via AutoModelForSpeechSeq2Seq
  • Qwen3 ASR — Alibaba Qwen3 ASR via the qwen-asr package, with optional vLLM backend
  • Moonshine (Transformers) — see the dedicated Moonshine page

Shared Installation

Installing the transformers extra covers Granite Speech, Moonshine, and Cohere Transcribe:
pip install "RealtimeSTT[transformers]"
This installs transformers and torch. For GPU use, install a CUDA-enabled PyTorch wheel first.
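
Before passing device="cuda" to a recorder, it can help to confirm that the installed PyTorch build actually sees a GPU. The following is a minimal sketch, not part of RealtimeSTT: pick_device is a hypothetical helper name, and it relies only on torch.cuda.is_available().

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build can see a GPU, else "cpu"."""
    try:
        import torch  # installed by the transformers extra
    except ImportError:
        # PyTorch is not installed at all, so only CPU is a safe answer.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can then be passed as the device argument of AudioToTextRecorder. If this prints "cpu" on a machine with an NVIDIA GPU, the installed wheel is most likely a CPU-only build.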

Granite Speech

The Granite Speech engine (granite_speech / granite) uses IBM’s ibm-granite/granite-speech-4.1-2b model through AutoModelForSpeechSeq2Seq and AutoProcessor.

Install

pip install "RealtimeSTT[granite]"

Basic Usage

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="granite_speech",
    model="ibm-granite/granite-speech-4.1-2b",
    device="cuda",
    transcription_engine_options={
        "generate": {
            "max_new_tokens": 200,
            "do_sample": False,
        },
    },
)
The default model is ibm-granite/granite-speech-4.1-2b. On CPU, the model loads in float32; on GPU it defaults to bfloat16.
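
The dtype rule above has a direct effect on memory use. As a rough sketch, the helper below mirrors the documented default and estimates the size of the weights alone. The function names are hypothetical, and the parameter count of 2e9 is an assumption read from the "2b" suffix in the model name, so treat the numbers as ballpark figures only.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def default_dtype_name(device: str) -> str:
    """Mirror the documented default: float32 on CPU, bfloat16 on GPU."""
    return "bfloat16" if device.lower().startswith("cuda") else "float32"


def estimate_weight_gb(num_params: float, dtype_name: str) -> float:
    """Approximate size of the weights in GB (activations and caches not included)."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1e9


ASSUMED_PARAMS = 2e9  # assumption: "2b" in the model name means about 2 billion parameters

for device in ("cpu", "cuda"):
    dtype_name = default_dtype_name(device)
    size_gb = estimate_weight_gb(ASSUMED_PARAMS, dtype_name)
    print(f"{device}: {dtype_name}, about {size_gb:.1f} GB of weights")
```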

Controlling the Model Cache

recorder = AudioToTextRecorder(
    transcription_engine="granite_speech",
    model="ibm-granite/granite-speech-4.1-2b",
    download_root="models/hf",
)
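
Since download_root must be writable and gated models may need HF_TOKEN, a small pre-flight check can surface both problems before the first download starts. This is a sketch using only the standard library; prepare_download_root and hf_token_is_set are hypothetical helper names, not RealtimeSTT functions.

```python
import os
from pathlib import Path


def prepare_download_root(path: str) -> str:
    """Create the cache directory if needed and confirm it is writable."""
    root = Path(path).expanduser().resolve()
    root.mkdir(parents=True, exist_ok=True)
    if not os.access(root, os.W_OK):
        raise PermissionError(f"download_root is not writable: {root}")
    return str(root)


def hf_token_is_set() -> bool:
    """Report whether HF_TOKEN is present and non-empty in the environment."""
    return bool(os.environ.get("HF_TOKEN"))


cache_dir = prepare_download_root("models/hf")
print("Cache directory:", cache_dir)
print("HF_TOKEN set:", hf_token_is_set())
```

The returned path can be passed as download_root. A missing HF_TOKEN is only a problem for gated or private models, so the check reports the state rather than raising.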

Granite Options Reference

Option                                        Meaning
engine_options["processor"]                   Processor load options passed to AutoProcessor.from_pretrained.
engine_options["model"]                       Model load options passed to AutoModelForSpeechSeq2Seq.from_pretrained.
engine_options["generate"]                    Generation options merged into model.generate(...).
engine_options["prompt"]                      Prompt text used for transcription. Defaults to "<|audio|>transcribe the speech with proper punctuation and capitalization."
engine_options["include_language_in_prompt"]  Appends the active language to the prompt when language is set.
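
To show how these options combine, here is a hedged sketch that assembles an options dictionary. It assumes that engine_options in the table refers to the dictionary passed as transcription_engine_options, as the Basic Usage example suggests. build_granite_options is a hypothetical helper, the custom prompt text is purely illustrative, and keeping the <|audio|> placeholder in a custom prompt is an assumption based on the default prompt.

```python
from typing import Any, Dict, Optional


def build_granite_options(
    max_new_tokens: int = 200,
    prompt: Optional[str] = None,
    include_language: bool = False,
) -> Dict[str, Any]:
    """Assemble a Granite options dictionary from a few common settings."""
    options: Dict[str, Any] = {
        # Merged into model.generate(...), per the table above.
        "generate": {"max_new_tokens": max_new_tokens, "do_sample": False},
    }
    if prompt is not None:
        options["prompt"] = prompt  # overrides the default transcription prompt
    if include_language:
        options["include_language_in_prompt"] = True
    return options


# Illustrative custom prompt; the wording is hypothetical.
options = build_granite_options(
    max_new_tokens=128,
    prompt="<|audio|>transcribe the speech verbatim.",
    include_language=True,
)
print(options)
```

The resulting dictionary would be passed as transcription_engine_options when constructing AudioToTextRecorder with transcription_engine="granite_speech".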

Qwen3 ASR

The Qwen3 ASR engine (qwen3_asr / qwen_asr) uses Alibaba’s Qwen3-ASR model family through the qwen-asr package. It supports a standard Transformers backend and an optional vLLM backend for high-throughput server deployments.

Install

pip install "RealtimeSTT[qwen]"

Basic Usage

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="qwen3_asr",
    model="Qwen/Qwen3-ASR-1.7B",
    language="en",
    device="cuda",
)
The default model is Qwen/Qwen3-ASR-1.7B. Two-letter ISO language codes are mapped to full language names for common languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, and Chinese.
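
The code-to-name mapping described above can be pictured as a simple lookup table. The sketch below is an illustration rather than the engine's actual implementation: the dictionary and function names are hypothetical, the codes are the standard ISO 639-1 codes for the listed languages, and passing unknown values through unchanged is an assumption.

```python
# ISO 639-1 codes for the languages listed above.
QWEN_LANGUAGE_NAMES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pt": "Portuguese",
    "ru": "Russian",
    "zh": "Chinese",
}


def to_language_name(code: str) -> str:
    """Map a two-letter code to a full language name.

    Unknown values are returned unchanged (an assumption made for this sketch).
    """
    return QWEN_LANGUAGE_NAMES.get(code.strip().lower(), code)


print(to_language_name("en"))
print(to_language_name("DE"))
```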

Using the vLLM Backend

recorder = AudioToTextRecorder(
    transcription_engine="qwen3_asr",
    model="Qwen/Qwen3-ASR-1.7B",
    transcription_engine_options={
        "backend": "vllm",
    },
)

Qwen3 ASR Options Reference

Option                                Meaning
engine_options["backend"]             "transformers" (default) or "vllm".
engine_options["model"]               Model loader options. On the vLLM path, the model key is passed to LLM(...).
engine_options["transcribe"]          Transcription options merged into model.transcribe(...).
engine_options["language"]            Language when not passed through the top-level language option.
engine_options["return_time_stamps"]  Request timestamps where supported.
engine_options["sample_rate"]         Sample rate for in-memory audio. Defaults to 16000.
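
As a sketch of how these options might be combined, the helper below builds a Qwen3 ASR options dictionary and rejects backend names that the table does not list. It assumes that engine_options corresponds to the transcription_engine_options argument, as in the examples above. build_qwen_options is a hypothetical name.

```python
from typing import Any, Dict, Optional

VALID_BACKENDS = ("transformers", "vllm")


def build_qwen_options(
    backend: str = "transformers",
    return_time_stamps: bool = False,
    sample_rate: int = 16000,
    transcribe: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a Qwen3 ASR options dictionary, validating the backend name."""
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {VALID_BACKENDS}, got {backend!r}")
    options: Dict[str, Any] = {"backend": backend, "sample_rate": sample_rate}
    if return_time_stamps:
        options["return_time_stamps"] = True
    if transcribe:
        # Copied so later changes to the caller's dict do not leak in.
        options["transcribe"] = dict(transcribe)
    return options


qwen_options = build_qwen_options(backend="vllm", return_time_stamps=True)
print(qwen_options)
```

Validating the backend string up front gives a clear error at configuration time instead of a failure during model loading.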

Moonshine (Transformers)

Moonshine uses the same Transformers infrastructure. See the Moonshine engine page for install instructions, model options, and the sherpa-onnx CPU INT8 alternative path.

Resource Considerations

Transformers-backed engines generally have:
  • Larger model downloads than the default faster-whisper path
  • Higher memory requirements, especially on GPU
  • Stricter CUDA/PyTorch compatibility — start from a clean virtual environment if you hit dependency conflicts
For server deployments, use the FastAPI server’s shared inference lanes instead of instantiating one model per session.

Troubleshooting

  • transformers missing a required class — upgrade Transformers to the latest release.
  • Model download failures — check Hugging Face network access, confirm the model’s gated access terms have been accepted, and set download_root to a writable directory.
  • CUDA out of memory — reduce model size, switch to CPU, or run one shared model lane in the server.
  • Qwen vLLM fails on Windows — move the run to Linux or WSL2.
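
The Windows note above can be turned into a guard that picks the Qwen backend from the host platform. This is a minimal sketch with a hypothetical helper name. It relies on platform.system() from the standard library, which reports "Linux" inside WSL2, and it conservatively falls back to the Transformers backend everywhere else.

```python
import platform
from typing import Optional


def choose_qwen_backend(prefer_vllm: bool = True, system: Optional[str] = None) -> str:
    """Return "vllm" only on Linux (including WSL2), otherwise "transformers"."""
    system = system or platform.system()
    if prefer_vllm and system == "Linux":
        return "vllm"
    return "transformers"


backend = choose_qwen_backend()
print(f"Selected Qwen backend: {backend}")
```

The result can be placed under the "backend" key of the Qwen3 ASR options. Choosing "vllm" here only reflects the platform; the vLLM package still has to be installed and compatible with the local CUDA setup.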
