TheDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-ASR/llms.txt
Use this file to discover all available pages before exploring further.
qwen-asr package is the official Python library for running Qwen3-ASR models. It supports Python 3.9 through 3.13, ships two inference backends (HuggingFace Transformers and vLLM), and is available on PyPI under the name qwen-asr. This page covers every installation path from a one-line pip install to a full from-source development setup.
Requirements
Before installing, make sure your environment meets the following prerequisites.| Requirement | Minimum version | Notes |
|---|---|---|
| Python | 3.9 | 3.12 recommended (used in official Docker image) |
| CUDA GPU | Any CUDA-compatible GPU | Required for model inference |
| CUDA toolkit | 12.8 (Docker) | Lower versions may work with the Transformers backend |
| PyTorch | Installed by transformers | torch.bfloat16 or torch.float16 required for FlashAttention 2 |
transformers==4.57.6accelerate==1.12.0qwen-omni-utilslibrosa,soundfile,sox(audio I/O)nagisa==0.2.11,soynlp==0.0.493(Japanese/Korean tokenisation)pytz(timezone handling)gradio,flask(web demo CLI commands)
vllm extra adds vllm==0.14.0.
Installing with pip
- Transformers backend
- vLLM backend
Install the minimal package with HuggingFace Transformers support:This is the right choice for single-GPU workloads where you want the simplest possible setup.
Setting Up a Conda Environment
We strongly recommend using a clean, isolated environment to avoid dependency conflicts with other packages.Create and activate a fresh environment
Python 3.12 is the version used in the official Docker image and is the recommended choice:
Installing from Source
If you want to modify the package code, contribute to the project, or test unreleased changes, install from source in editable mode.FlashAttention 2 (Optional)
FlashAttention 2 is optional but significantly reduces GPU memory usage and speeds up inference, especially for long audio and large batch sizes. It is also the recommended way to accelerate theQwen3-ForcedAligner-0.6B model when timestamps are required.
FlashAttention 2 requires the model to be loaded in
torch.float16 or torch.bfloat16. It is not compatible with torch.float32. Enable it by passing attn_implementation="flash_attention_2" to from_pretrained.Downloading Model Weights Manually
By default, model weights are downloaded automatically from HuggingFace Hub the first time you callQwen3ASRModel.from_pretrained(...) or Qwen3ASRModel.LLM(...). If your runtime environment does not have internet access, pre-download the weights to a local directory and pass that path instead of the model name.
- ModelScope (recommended in Mainland China)
- Hugging Face CLI
Using the Official Docker Image
For the simplest possible setup — no driver configuration and no dependency management — use the pre-built Docker imageqwenllm/qwen3-asr. It includes Python 3, CUDA 12.8, qwen-asr[vllm], and FlashAttention 2.
/path/to/your/workspace with your actual local workspace path. Services inside the container must bind to 0.0.0.0 for the port mapping to work. To re-enter a stopped container: