Hugging Face Transformers ASR Engines for RealtimeSTT

RealtimeSTT includes thin adapter engines for several model families that run through Hugging Face Transformers or Transformers-compatible tooling. These adapters share a common pattern: models are downloaded automatically from Hugging Face on first use, and backend behavior is tunable through transcription_engine_options.

Gated models require you to accept the model’s license terms on Hugging Face. Gated or private models may also need an access token, supplied through the HF_TOKEN environment variable or the hf_token option. Set download_root to a writable path to control where model files are cached.
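
As a minimal sketch of the cache and token setup described above, the helpers below create a cache directory, confirm it is writable, and report whether HF_TOKEN is set. Both function names are hypothetical and are not part of RealtimeSTT; the returned path is what you would pass as download_root.

```python
import os
from pathlib import Path


def prepare_download_root(path: str) -> str:
    """Create the cache directory if needed and confirm it is writable.

    Hypothetical helper for illustration; not part of RealtimeSTT.
    """
    root = Path(path).expanduser().resolve()
    root.mkdir(parents=True, exist_ok=True)
    if not os.access(root, os.W_OK):
        raise PermissionError(f"download_root is not writable: {root}")
    return str(root)


def hf_token_is_set() -> bool:
    """Return True when HF_TOKEN holds a non-empty value."""
    return bool(os.environ.get("HF_TOKEN", "").strip())
```

Running a check like this before constructing the recorder surfaces permission and token problems earlier than a failed model download would.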

The engines covered on this page are:

Granite Speech — IBM Granite speech-to-text via AutoModelForSpeechSeq2Seq
Qwen3 ASR — Alibaba Qwen3 ASR via the qwen-asr package, with optional vLLM backend
Moonshine (Transformers) — see the dedicated Moonshine page

Shared Installation

Installing the transformers extra covers Granite Speech, Moonshine, and Cohere Transcribe:

pip install "RealtimeSTT[transformers]"

This installs transformers and torch. For GPU use, install a CUDA-enabled PyTorch wheel first.
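
To confirm that the installed PyTorch build can actually use a GPU, a small check along these lines may help. The pick_device helper is a hypothetical name, not part of RealtimeSTT; it falls back to "cpu" when PyTorch is missing or no CUDA device is visible.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build can see a GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed at all
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"
```

The result can be passed as the device argument of AudioToTextRecorder.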

Granite Speech

The Granite Speech engine (granite_speech / granite) uses IBM’s ibm-granite/granite-speech-4.1-2b model through AutoModelForSpeechSeq2Seq and AutoProcessor.

Install

pip install "RealtimeSTT[granite]"

Basic Usage

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="granite_speech",
    model="ibm-granite/granite-speech-4.1-2b",
    device="cuda",
    transcription_engine_options={
        "generate": {
            "max_new_tokens": 200,
            "do_sample": False,
        },
    },
)

The default model is ibm-granite/granite-speech-4.1-2b. On CPU, the model loads in float32; on GPU it defaults to bfloat16.

Controlling the Model Cache

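from RealtimeSTT import AudioToTextRecorder

# download_root sets the directory where model files are cached.
recorder = AudioToTextRecorder(
    transcription_engine="granite_speech",
    model="ibm-granite/granite-speech-4.1-2b",
    download_root="models/hf",
)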

Granite Options Reference

| Option | Meaning |
| --- | --- |
| `engine_options["processor"]` | Processor load options passed to `AutoProcessor.from_pretrained`. |
| `engine_options["model"]` | Model load options passed to `AutoModelForSpeechSeq2Seq.from_pretrained`. |
| `engine_options["generate"]` | Generation options merged into `model.generate(...)`. |
| `engine_options["prompt"]` | Prompt text used for transcription. Defaults to `"<\|audio\|>transcribe the speech with proper punctuation and capitalization."` |
| `engine_options["include_language_in_prompt"]` | Appends the active language to the prompt when `language` is set. |
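
The options in the table can be combined in one dictionary. The sketch below assumes the keys are forwarded as the table describes; build_granite_options is a hypothetical helper, and low_cpu_mem_usage is only one example of a from_pretrained keyword you might pass through.

```python
def build_granite_options(max_new_tokens: int = 200, add_language: bool = False) -> dict:
    """Assemble a transcription_engine_options dict for the Granite engine.

    Hypothetical helper for illustration; not part of RealtimeSTT.
    """
    return {
        # Forwarded to AutoModelForSpeechSeq2Seq.from_pretrained.
        "model": {"low_cpu_mem_usage": True},
        # Merged into model.generate(...): greedy decoding with a token cap.
        "generate": {"max_new_tokens": max_new_tokens, "do_sample": False},
        # The documented default prompt, written out explicitly.
        "prompt": "<|audio|>transcribe the speech with proper punctuation and capitalization.",
        # Append the active language to the prompt when `language` is set.
        "include_language_in_prompt": add_language,
    }
```

The returned dictionary would be passed as transcription_engine_options when constructing the recorder.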

Qwen3 ASR

The Qwen3 ASR engine (qwen3_asr / qwen_asr) uses Alibaba’s Qwen3-ASR model family through the qwen-asr package. It supports a standard Transformers backend and an optional vLLM backend for high-throughput server deployments.

Install

Transformers backend:

pip install "RealtimeSTT[qwen]"

vLLM backend:

pip install "RealtimeSTT[qwen-vllm]"

vLLM targets Linux. On Windows, run the vLLM backend under WSL2.

Basic Usage

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder(
    transcription_engine="qwen3_asr",
    model="Qwen/Qwen3-ASR-1.7B",
    language="en",
    device="cuda",
)

The default model is Qwen/Qwen3-ASR-1.7B. Two-letter ISO language codes are mapped to full language names for common languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, and Chinese.
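
The code-to-name mapping can be pictured roughly as follows. This is an illustrative sketch, not the engine's actual table: the dictionary and the to_language_name helper are hypothetical, and passing unknown values through unchanged is an assumption.

```python
# ISO 639-1 codes for the languages listed above.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
    "ja": "Japanese",
    "ko": "Korean",
    "pt": "Portuguese",
    "ru": "Russian",
    "zh": "Chinese",
}


def to_language_name(code_or_name: str) -> str:
    """Map a two-letter code to a full name; return other input unchanged (assumed)."""
    key = code_or_name.strip().lower()
    return LANGUAGE_NAMES.get(key, code_or_name)
```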

Using the vLLM Backend

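from RealtimeSTT import AudioToTextRecorder

# Requires the qwen-vllm extra; see the install notes above.
recorder = AudioToTextRecorder(
    transcription_engine="qwen3_asr",
    model="Qwen/Qwen3-ASR-1.7B",
    transcription_engine_options={
        "backend": "vllm",
    },
)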
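
Because vLLM is Linux-oriented, code that runs on several platforms may want to choose the backend at startup. A minimal sketch, assuming a fallback to the Transformers backend is acceptable; choose_qwen_backend is a hypothetical helper, not part of RealtimeSTT.

```python
import importlib.util
import sys


def choose_qwen_backend(platform: str = sys.platform) -> str:
    """Pick "vllm" only on Linux (including WSL2) with vllm installed."""
    on_linux = platform.startswith("linux")
    has_vllm = importlib.util.find_spec("vllm") is not None
    return "vllm" if on_linux and has_vllm else "transformers"
```

The result can be used as the "backend" value in transcription_engine_options.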

Qwen3 ASR Options Reference

| Option | Meaning |
| --- | --- |
| `engine_options["backend"]` | `"transformers"` (default) or `"vllm"`. |
| `engine_options["model"]` | Model loader options. On the vLLM path, the `model` key is passed to `LLM(...)`. |
| `engine_options["transcribe"]` | Transcription options merged into `model.transcribe(...)`. |
| `engine_options["language"]` | Language when not passed through the top-level `language` option. |
| `engine_options["return_time_stamps"]` | Request timestamps where supported. |
| `engine_options["sample_rate"]` | Sample rate for in-memory audio. Defaults to `16000`. |
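
Option names such as return_time_stamps are easy to mistype, so a small pre-flight check can be useful. The sketch below validates a dictionary against the keys documented in the table; check_qwen_options is a hypothetical helper, and the engine may accept keys that are not listed here.

```python
# Keys and backend values taken from the options table above.
QWEN_OPTION_KEYS = {
    "backend", "model", "transcribe", "language", "return_time_stamps", "sample_rate",
}
QWEN_BACKENDS = {"transformers", "vllm"}


def check_qwen_options(options: dict) -> list:
    """Return a list of problems; an empty list means nothing looked wrong."""
    problems = []
    for key in sorted(set(options) - QWEN_OPTION_KEYS):
        problems.append(f"unknown option: {key}")
    backend = options.get("backend", "transformers")
    if backend not in QWEN_BACKENDS:
        problems.append(f"unsupported backend: {backend}")
    return problems
```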

Moonshine (Transformers)

Moonshine uses the same Transformers infrastructure. See the Moonshine engine page for install instructions, model options, and the sherpa-onnx CPU INT8 alternative path.

Resource Considerations

Transformers-backed engines generally have:

Larger model downloads than the default faster-whisper path
Higher memory requirements, especially on GPU
Stricter CUDA/PyTorch compatibility — start from a clean virtual environment if you hit dependency conflicts
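
For a rough sense of the memory requirement, weight size scales with parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only: it ignores activations, caches, and framework overhead, and the parameter count in the usage note is an assumption based on the "2b" in the Granite model name.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimate the memory needed for model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30
```

Under that assumption, weight_memory_gib(2e9, "bfloat16") gives roughly 3.7 GiB, and the float32 figure used on CPU is twice that.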

For server deployments, use the FastAPI server’s shared inference lanes instead of instantiating one model per session.
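
The idea behind a shared lane can be illustrated with a generic queue-and-worker pattern: one thread owns the model and serves every session in turn. This is a sketch of the general technique, not the FastAPI server's actual implementation; SharedLane and transcribe_fn are hypothetical names.

```python
import queue
import threading
from concurrent.futures import Future


class SharedLane:
    """One worker thread that owns a single model and serves many sessions."""

    def __init__(self, transcribe_fn):
        self._transcribe = transcribe_fn  # callable wrapping the loaded model
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, audio) -> Future:
        """Queue audio from any session and return a Future for its result."""
        future = Future()
        self._jobs.put((audio, future))
        return future

    def _run(self):
        # Jobs are processed one at a time, so the model is never loaded twice.
        while True:
            audio, future = self._jobs.get()
            try:
                future.set_result(self._transcribe(audio))
            except Exception as exc:
                future.set_exception(exc)
```

Each session calls submit(...) and waits on the returned Future, so memory use stays at one model regardless of how many sessions are connected.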

Troubleshooting

transformers missing a required class — upgrade Transformers to the latest release.
Model download failures — check Hugging Face network access, confirm the model’s gated access terms have been accepted, and set download_root to a writable directory.
CUDA out of memory — reduce model size, switch to CPU, or run one shared model lane in the server.
Qwen vLLM fails on Windows — move the run to Linux or WSL2.
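
For the first item, a quick way to see whether the installed Transformers release exposes the classes the Granite engine relies on is sketched below. The helper name is hypothetical, and the class list covers only the two classes named on this page.

```python
import importlib

# Classes named in the Granite Speech section of this page.
REQUIRED_CLASSES = ("AutoModelForSpeechSeq2Seq", "AutoProcessor")


def missing_transformers_classes() -> list:
    """Return the required class names that cannot be loaded from transformers."""
    try:
        transformers = importlib.import_module("transformers")
    except Exception:
        return list(REQUIRED_CLASSES)  # not installed, or the install is broken
    missing = []
    for name in REQUIRED_CLASSES:
        try:
            getattr(transformers, name)
        except Exception:
            missing.append(name)
    return missing
```

If the returned list is not empty, upgrading with pip install --upgrade transformers is the usual first step.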
