The qwen-asr package is the official Python library for running Qwen3-ASR models. It supports Python 3.9 through 3.13, ships two inference backends (HuggingFace Transformers and vLLM), and is available on PyPI under the name qwen-asr. This page covers every installation path from a one-line pip install to a full from-source development setup.

Requirements

Before installing, make sure your environment meets the following prerequisites.
| Requirement | Minimum version | Notes |
| --- | --- | --- |
| Python | 3.9 | 3.12 recommended (used in official Docker image) |
| CUDA GPU | Any CUDA-compatible GPU | Required for model inference |
| CUDA toolkit | 12.8 (Docker) | Lower versions may work with the Transformers backend |
| PyTorch | Installed by transformers | torch.bfloat16 or torch.float16 required for FlashAttention 2 |
Key runtime dependencies (installed automatically by pip):
  • transformers==4.57.6
  • accelerate==1.12.0
  • qwen-omni-utils
  • librosa, soundfile, sox (audio I/O)
  • nagisa==0.2.11, soynlp==0.0.493 (Japanese/Korean tokenisation)
  • pytz (timezone handling)
  • gradio, flask (web demo CLI commands)
The optional vllm extra adds vllm==0.14.0.
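Because several of these dependencies are pinned to exact versions, an existing environment may already hold conflicting releases. The sketch below compares what is installed against the pins listed above, using only the standard-library importlib.metadata module. The helper name check_pins is illustrative and is not part of qwen-asr.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above.
PINS = {
    "transformers": "4.57.6",
    "accelerate": "1.12.0",
    "nagisa": "0.2.11",
    "soynlp": "0.0.493",
}


def check_pins(pins):
    """Return {package: status}; status is 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report


if __name__ == "__main__":
    for name, status in check_pins(PINS).items():
        print(f"{name}: {status}")
```

Running this before installation shows which packages pip would have to upgrade or downgrade, which is one way to decide whether a fresh environment is worth creating.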

Installing with pip

Install the minimal package with HuggingFace Transformers support:
pip install -U qwen-asr
This is the right choice for single-GPU workloads where you want the simplest possible setup.

Setting Up a Conda Environment

We strongly recommend using a clean, isolated environment to avoid dependency conflicts with other packages.
1. Create and activate a fresh environment

Python 3.12 is the version used in the official Docker image and is the recommended choice:
conda create -n qwen3-asr python=3.12 -y
conda activate qwen3-asr
2. Install qwen-asr

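Choose the install variant that matches your intended backend. For the Transformers backend:
pip install -U qwen-asr
For the vLLM backend, install the optional vllm extra:
pip install -U "qwen-asr[vllm]"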
3. Verify the installation

Confirm the package is importable and check the public API:
from qwen_asr import Qwen3ASRModel, Qwen3ForcedAligner, parse_asr_output
print("qwen-asr installed successfully")
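The import check confirms the package is present but says nothing about the interpreter version or GPU visibility. A minimal sketch of a broader check follows, assuming the Python 3.9 to 3.13 range stated at the top of this page. The function names python_supported and cuda_status are hypothetical helpers, not part of qwen-asr.

```python
import sys

SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 13)


def python_supported(version_info=None):
    """True if (major, minor) falls inside the range stated on this page."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def cuda_status():
    """Describe GPU visibility; returns a message instead of raising if torch is absent."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "torch installed, no CUDA device visible"
    return f"CUDA device: {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print(cuda_status())
```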

Installing from Source

If you want to modify the package code, contribute to the project, or test unreleased changes, install from source in editable mode.
1. Clone the repository

git clone https://github.com/QwenLM/Qwen3-ASR.git
cd Qwen3-ASR
2. Install in editable mode

pip install -e .
The -e flag means changes you make to the source files take effect immediately without reinstalling.
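One way to confirm that the editable install is the copy Python actually imports is to check that the module file resolves inside your clone. The sketch below assumes it is run from the repository root; resolves_inside is an illustrative helper, not part of qwen-asr.

```python
from pathlib import Path


def resolves_inside(module_file, repo_root):
    """True if module_file lives somewhere under repo_root."""
    module_path = Path(module_file).resolve()
    root = Path(repo_root).resolve()
    return root == module_path or root in module_path.parents


if __name__ == "__main__":
    import qwen_asr  # requires the editable install above

    # Run from the repository root (the Qwen3-ASR directory).
    print(qwen_asr.__file__)
    print("editable install active:", resolves_inside(qwen_asr.__file__, "."))
```

If the printed path points into site-packages instead of the clone, a previously installed PyPI copy is probably shadowing the source tree.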

FlashAttention 2 (Optional)

FlashAttention 2 is optional but significantly reduces GPU memory usage and speeds up inference, especially for long audio and large batch sizes. It is also the recommended way to accelerate the Qwen3-ForcedAligner-0.6B model when timestamps are required.
FlashAttention 2 requires the model to be loaded in torch.float16 or torch.bfloat16. It is not compatible with torch.float32. Enable it by passing attn_implementation="flash_attention_2" to from_pretrained.
Install FlashAttention 2 with:
pip install -U flash-attn --no-build-isolation
If your machine has less than 96 GB of RAM and many CPU cores, the parallel C++ compilation step can run out of memory. In that case, limit the number of parallel build jobs:
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
For full hardware compatibility information, refer to the FlashAttention repository.
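The dtype constraint described above can be encoded in a small guard so that FlashAttention 2 is requested only when it can actually be used. This is a sketch under assumptions: pick_attn_implementation is a hypothetical helper, and the "sdpa" fallback is a standard Transformers attention option that is assumed, not confirmed, to work with Qwen3-ASR.

```python
import importlib.util

HALF_PRECISION = {"torch.float16", "torch.bfloat16"}


def pick_attn_implementation(dtype, flash_available=None):
    """Return 'flash_attention_2' when usable, otherwise fall back to 'sdpa'.

    dtype may be a torch.dtype or its string form, e.g. 'torch.bfloat16'.
    """
    if flash_available is None:
        # Detect the flash-attn package without importing it.
        flash_available = importlib.util.find_spec("flash_attn") is not None
    if flash_available and str(dtype) in HALF_PRECISION:
        return "flash_attention_2"
    return "sdpa"  # assumption: fallback supported by the model


if __name__ == "__main__":
    import torch
    from qwen_asr import Qwen3ASRModel

    dtype = torch.bfloat16
    model = Qwen3ASRModel.from_pretrained(
        "Qwen/Qwen3-ASR-1.7B",
        dtype=dtype,
        device_map="cuda:0",
        attn_implementation=pick_attn_implementation(dtype),
    )
```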

Downloading Model Weights Manually

By default, model weights are downloaded automatically from HuggingFace Hub the first time you call Qwen3ASRModel.from_pretrained(...) or Qwen3ASRModel.LLM(...). If your runtime environment does not have internet access, pre-download the weights to a local directory on a machine that does, then pass that directory path in place of the model name:
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "./Qwen3-ASR-1.7B",   # local path instead of "Qwen/Qwen3-ASR-1.7B"
    dtype=torch.bfloat16,
    device_map="cuda:0",
)
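The pre-download step itself is not shown above. One possible approach, sketched here, uses snapshot_download from the huggingface_hub package, which is installed as a dependency of transformers. The helper names local_dir_for and download_weights are illustrative, and the target directory layout is an assumption chosen to match the "./Qwen3-ASR-1.7B" path used above.

```python
from pathlib import Path


def local_dir_for(repo_id, base="."):
    """Map 'Qwen/Qwen3-ASR-1.7B' to '<base>/Qwen3-ASR-1.7B'."""
    return Path(base) / repo_id.split("/")[-1]


def download_weights(repo_id, base="."):
    """Download a full model snapshot and return the local directory."""
    # Imported lazily so the path helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, base)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    # Run this on a machine with internet access, then copy the directory over.
    print(download_weights("Qwen/Qwen3-ASR-1.7B"))
```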

Using the Official Docker Image

For the simplest possible setup, with no CUDA toolkit installation and no dependency management, use the pre-built Docker image qwenllm/qwen3-asr. It includes Python 3.12, CUDA 12.8, qwen-asr[vllm], and FlashAttention 2.
LOCAL_WORKDIR=/path/to/your/workspace
HOST_PORT=8000
CONTAINER_PORT=80

docker run --gpus all --name qwen3-asr \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -p $HOST_PORT:$CONTAINER_PORT \
    --mount type=bind,source=$LOCAL_WORKDIR,target=/data/shared/Qwen3-ASR \
    --shm-size=4gb \
    -it qwenllm/qwen3-asr:latest
Replace /path/to/your/workspace with your actual local workspace path. Services inside the container must bind to 0.0.0.0 for the port mapping to work. To re-enter a stopped container:
docker start qwen3-asr
docker exec -it qwen3-asr bash
The NVIDIA Container Toolkit must be installed on the host before Docker can access your GPU. Follow the NVIDIA Container Toolkit installation guide if you have not done so already.
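To illustrate the 0.0.0.0 binding requirement mentioned above, here is a minimal standard-library sketch of a service that would be reachable through the port mapping. It is a stand-in for whatever server you actually run in the container; the Ping handler and make_server helper are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class Ping(BaseHTTPRequestHandler):
    """Answer every GET with a short plain-text body."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port, host="0.0.0.0"):
    """Bind on all interfaces so Docker's -p mapping can reach the service."""
    return HTTPServer((host, port), Ping)


if __name__ == "__main__":
    # 80 matches CONTAINER_PORT in the docker run command above.
    make_server(80).serve_forever()
```

A server bound to 127.0.0.1 instead would accept connections only from inside the container, so requests to HOST_PORT on the host would fail.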
