The vLLM backend unlocks Qwen3-ASR’s full throughput potential. By routing all generation through vLLM’s continuous batching engine, you can process hundreds of audio files concurrently while keeping GPU utilization near 100%. The vLLM backend is the recommended choice for any production or high-volume workload, and it is the only backend that supports streaming transcription.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-ASR/llms.txt
Use this file to discover all available pages before exploring further.
When to Use vLLM vs. Transformers
vLLM backend
- Large-batch offline transcription
- Low-latency server deployments
- Streaming real-time transcription
- Concurrency of 128+ requests
Transformers backend
- Minimal dependencies
- Single-GPU experimentation
- Fine-tuning or custom hooks
- Environments where vLLM cannot be installed
Installation
The vLLM backend ships as an optional extra. Install it alongside the baseqwen-asr package:
Loading with Qwen3ASRModel.LLM
Use the Qwen3ASRModel.LLM class method to initialize the vLLM backend. This internally creates a vllm.LLM instance and registers the Qwen3-ASR model architecture.
Parameters
Hugging Face repository ID (e.g.
"Qwen/Qwen3-ASR-1.7B") or a local directory path. Passed directly to vllm.LLM(model=...).Repository ID or local path for
Qwen3ForcedAligner (e.g. "Qwen/Qwen3-ForcedAligner-0.6B"). Required when you intend to call transcribe(..., return_time_stamps=True).Keyword arguments forwarded to
Qwen3ForcedAligner.from_pretrained(...). Typically includes dtype and device_map.Maximum number of audio chunks submitted to vLLM in a single
generate call. The default -1 means unlimited — vLLM handles its own internal batching. Set a positive value to limit memory usage when inputs are very long.Maximum tokens to generate per audio chunk. The vLLM backend defaults to
4096, which is suitable for audio up to several minutes long.All remaining keyword arguments are forwarded to
vllm.LLM(...). Useful options include gpu_memory_utilization (float, default 0.9), tensor_parallel_size (int), and dtype.Batch Transcription
Thetranscribe method accepts the same audio input formats as the Transformers backend: URL strings, local file paths, base64 data URLs, and (np.ndarray, sr) tuples. Mix them freely in a single batch.
Getting Timestamps
Load the model with aforced_aligner and set return_time_stamps=True. The aligner model runs on the CPU/GPU device you specify in forced_aligner_kwargs, independently of vLLM’s GPU pool.
The if __name__ == '__main__': Guard
Serving via qwen-asr-serve
You can also deploy Qwen3-ASR as an OpenAI-compatible HTTP server using the bundled qwen-asr-serve command, which wraps vllm serve:
parse_asr_output utility:
parse_asr_output Utility
parse_asr_output(raw, user_language=None) parses a raw model output string into a (language, text) tuple. It handles the "language X<asr_text>..." format produced by the model, strips repetition artifacts, and falls back gracefully when the tag is absent. Import it directly from qwen_asr: