The Transformers backend is the simplest way to run Qwen3-ASR. It relies entirely on the standard Hugging Face transformers stack, so you can get up and running with a single pip install qwen-asr. It is the recommended starting point for experimentation, fine-tuning workflows, and deployments where installing vLLM is not practical.
When to Use the Transformers Backend
Use Transformers when…
- You need a minimal, dependency-light setup
- You are running on a single GPU or CPU
- You are prototyping or evaluating the model
- You need multi-device placement via device_map
Consider vLLM when…
- You need maximum throughput at scale
- You are serving concurrent requests
- You need streaming transcription support
- Batch sizes regularly exceed 64 items
Loading a Model
Use Qwen3ASRModel.from_pretrained to load the model with the Transformers backend. Model weights are downloaded automatically from Hugging Face on first use.
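A minimal loading sketch, assuming the package installed by pip install qwen-asr is importable as qwen_asr and exposes Qwen3ASRModel (the import path is an assumption). The helper names default_load_kwargs and load_model are hypothetical, and max_new_tokens is the assumed keyword for the per-chunk token limit described under Parameters. The heavy imports sit inside load_model so the option-building part can be inspected on a machine without a GPU.

```python
def default_load_kwargs(use_flash_attention: bool = False) -> dict:
    """Build keyword arguments for Qwen3ASRModel.from_pretrained.

    dtype is added in load_model, where torch is imported.
    """
    kwargs = {
        "device_map": "cuda:0",          # or "auto" for multi-device placement
        "max_inference_batch_size": 32,  # lower this on GPU out-of-memory errors
        "max_new_tokens": 512,           # assumed keyword name for the token limit
    }
    if use_flash_attention:
        # Requires FlashAttention 2 to be installed separately.
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_model(model_id: str = "Qwen/Qwen3-ASR-1.7B", **overrides):
    """Load the model; weights are downloaded from Hugging Face on first use."""
    import torch
    from qwen_asr import Qwen3ASRModel  # assumed import path

    kwargs = {**default_load_kwargs(), "dtype": torch.bfloat16, **overrides}
    return Qwen3ASRModel.from_pretrained(model_id, **kwargs)
```

Calling load_model() with no arguments would load the 1.7B checkpoint in bfloat16 on cuda:0; any keyword passed as an override, such as device_map="auto", replaces the default.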
Parameters
- Model (first positional argument): Hugging Face repository ID (e.g. "Qwen/Qwen3-ASR-1.7B") or a local directory path containing the model weights and config.
- forced_aligner: Repository ID or local path of a Qwen3ForcedAligner model (e.g. "Qwen/Qwen3-ForcedAligner-0.6B"). Required when you intend to call transcribe(..., return_time_stamps=True). If omitted, timestamp requests will raise a ValueError.
- forced_aligner_kwargs: Keyword arguments forwarded verbatim to Qwen3ForcedAligner.from_pretrained(...). Accepts the same keys as **kwargs here, such as dtype, device_map, and attn_implementation.
- max_inference_batch_size: Maximum number of audio chunks processed in a single forward pass. Set to -1 to disable chunking and process all inputs at once. Reduce this value when encountering GPU out-of-memory errors, especially with long audio inputs.
- max_new_tokens: Maximum number of tokens the decoder may generate per chunk. The library default is 512. Increase this for very long audio inputs or dense speech; reduce it to speed up inference on short clips.
- **kwargs: All remaining keyword arguments are forwarded directly to AutoModel.from_pretrained(...). Common options include dtype (e.g. torch.bfloat16), device_map (e.g. "cuda:0" or "auto"), and attn_implementation (e.g. "flash_attention_2").
Basic Transcription
Pass a single audio file as a URL, local path, base64 data URL, or a (np.ndarray, sr) waveform tuple. The result is always a list of ASRTranscription objects, one per input.
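The sketch below illustrates two of these input forms: building a base64 data URL from raw WAV bytes, and reading the first entry of the returned list. It assumes transcribe accepts the input under the keyword audio and that each result exposes language and text attributes; the helper names are hypothetical.

```python
import base64


def wav_bytes_to_data_url(wav_bytes: bytes) -> str:
    """Encode the raw bytes of a WAV file as a base64 data URL."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return "data:audio/wav;base64," + encoded


def transcribe_one(model, audio):
    """Transcribe a single input and return (language, text).

    model is a loaded Qwen3ASRModel; audio is a path, URL, data URL,
    or (np.ndarray, sample_rate) tuple.
    """
    results = model.transcribe(audio=audio, language=None)  # None = auto-detect
    first = results[0]  # a list is returned even for a single input
    return first.language, first.text
```

Because every input form goes through the same call, transcribe_one(model, wav_bytes_to_data_url(data)) and transcribe_one(model, "clip.wav") differ only in the audio argument (clip.wav is a placeholder path).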
transcribe Parameters
- audio: Audio input. Accepted formats:
  - str: local file path, HTTPS URL, or base64 data URL (data:audio/wav;base64,...)
  - (np.ndarray, int): tuple of a mono or multi-channel waveform and its sample rate
  - list of any of the above for batch inference
- context: Optional context string(s) prepended to the system prompt. Useful for domain hints or vocabulary biasing. A single string is broadcast to the full batch.
- language: Optional language override. When provided, the prompt is modified to force text-only output and skip language identification. Must be a canonical name from model.get_supported_languages() (e.g. "Chinese", "English"). Pass None for automatic language detection.
- return_time_stamps: When True, the model runs forced alignment after transcription and populates ASRTranscription.time_stamps with a ForcedAlignResult. Requires forced_aligner to have been provided at initialization.
Batch Transcription
Pass a list of audio inputs to process multiple files in a single call. You can mix URL strings, base64 data URLs, and (np.ndarray, sr) tuples freely in the same batch. Per-sample context and language overrides are also supported.
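A hedged sketch of a batch call with per-sample overrides follows. The wrapper transcribe_batch is hypothetical; it assumes the keywords audio, context, and language and the result attributes language and text. The length check is an extra safeguard added here, not documented library behavior.

```python
def transcribe_batch(model, audios, context=None, language=None):
    """Transcribe several inputs in one call and return (language, text) pairs.

    Lists passed as context or language must have one entry per audio input.
    """
    for name, value in (("context", context), ("language", language)):
        if isinstance(value, (list, tuple)) and len(value) != len(audios):
            raise ValueError(
                f"{name} has {len(value)} entries for {len(audios)} audio inputs"
            )

    kwargs = {"audio": list(audios), "language": language}
    if context is not None:
        kwargs["context"] = context  # only sent when the caller supplies one

    results = model.transcribe(**kwargs)
    return [(r.language, r.text) for r in results]
```

For instance, transcribe_batch(model, ["a.wav", "b.wav"], language=["Chinese", "English"]) would force a different language for each file, where the file names are placeholders.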
Forcing a Language
Set language to a canonical language name to skip language identification and request plain-text transcription output. This is slightly faster and avoids occasional misidentification on short clips.
Call model.get_supported_languages() to retrieve the full list of 30 supported languages and 22 Chinese dialects.
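Since the override must be a canonical name, it can help to normalize user input before calling transcribe. The helper below is a hypothetical sketch that assumes get_supported_languages() returns an iterable of name strings; the case-insensitive matching is a convenience added here, not library behavior.

```python
def resolve_language(model, requested: str) -> str:
    """Map user input such as ' english ' to the canonical name 'English'."""
    supported = list(model.get_supported_languages())
    by_lower = {name.lower(): name for name in supported}
    key = requested.strip().lower()
    if key not in by_lower:
        raise ValueError(
            f"Unsupported language {requested!r}; expected one of {supported}"
        )
    return by_lower[key]
```

The returned name can then be passed as the language argument of transcribe.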
Getting Timestamps
To obtain word- or character-level timestamps, load the model with a forced_aligner and set return_time_stamps=True in transcribe. The aligner runs after ASR and populates result.time_stamps with a ForcedAlignResult.
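A rough sketch of the two steps, assuming the keyword audio and a text attribute on each result. The helper names and ALIGNER_ID constant are hypothetical; the contents of the ForcedAlignResult are returned as-is because its fields are not described here.

```python
ALIGNER_ID = "Qwen/Qwen3-ForcedAligner-0.6B"


def aligner_load_kwargs(device_map: str = "cuda:0") -> dict:
    """Extra from_pretrained arguments that attach a forced aligner."""
    return {
        "forced_aligner": ALIGNER_ID,
        # Forwarded to Qwen3ForcedAligner.from_pretrained(...); add dtype or
        # attn_implementation here as needed.
        "forced_aligner_kwargs": {"device_map": device_map},
    }


def transcribe_with_timestamps(model, audio):
    """Return (text, time_stamps) for a single input.

    The model must have been loaded with a forced aligner; otherwise the
    timestamp request raises ValueError.
    """
    result = model.transcribe(audio=audio, return_time_stamps=True)[0]
    return result.text, result.time_stamps
```

The dictionary from aligner_load_kwargs() is meant to be unpacked into Qwen3ASRModel.from_pretrained alongside the usual model options.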
When return_time_stamps=True, the maximum audio length per chunk is capped at 180 seconds (MAX_FORCE_ALIGN_INPUT_SECONDS) instead of the usual 1200 seconds, because the forced aligner has a shorter input limit. Long audio is still split automatically.
ASRTranscription Result Object
Each call to transcribe returns a list[ASRTranscription], one entry per input audio.
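As an illustration of consuming that list, the sketch below flattens results into plain dictionaries for logging. It assumes each ASRTranscription exposes language, text, and time_stamps attributes, and that time_stamps is None when timestamps were not requested; both points are assumptions, and the helper names are hypothetical.

```python
def to_record(result) -> dict:
    """Flatten one ASRTranscription into a plain dict."""
    time_stamps = getattr(result, "time_stamps", None)
    return {
        "language": result.language,
        "text": result.text,
        # Assumed: time_stamps stays None unless return_time_stamps=True.
        "has_time_stamps": time_stamps is not None,
    }


def to_records(results) -> list:
    """Convert the list returned by transcribe, preserving input order."""
    return [to_record(r) for r in results]
```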
Memory and Performance Tips
Use bfloat16
Load the model with dtype=torch.bfloat16. This halves memory compared to float32 with negligible accuracy impact on modern GPUs.
Enable FlashAttention 2
Install FlashAttention 2 and pass attn_implementation="flash_attention_2" to both from_pretrained and forced_aligner_kwargs. This significantly reduces memory and speeds up inference on long audio.
Tune max_inference_batch_size
The default of 32 is a reasonable starting point. Reduce it if you hit OOM errors with long recordings. Set it to -1 to process all audio in one shot when you have ample VRAM and small inputs.
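To make the bfloat16 tip concrete, here is a back-of-the-envelope estimate of weight memory. It assumes the nominal 1.7 billion parameters suggested by the model name and counts the weights only; activations, generation caches, and the forced aligner are excluded, so real usage will be higher.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# 1.7e9 is the nominal parameter count taken from the model name.
fp32 = weight_memory_gb(1.7e9, "float32")
bf16 = weight_memory_gb(1.7e9, "bfloat16")
print(f"float32: {fp32:.1f} GB, bfloat16: {bf16:.1f} GB")
# prints "float32: 6.8 GB, bfloat16: 3.4 GB"
```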