Installing Qwen3-ASR: pip, conda, Source, and Docker

The qwen-asr package, published on PyPI, is the official Python library for running Qwen3-ASR models. It supports Python 3.9 through 3.13 and ships two inference backends: HuggingFace Transformers and vLLM. This page covers every installation path, from a one-line pip install to a full from-source development setup.

Requirements

Before installing, make sure your environment meets the following prerequisites.

Requirement	Version	Notes
Python	3.9 to 3.13	3.12 recommended (used in the official Docker image)
GPU	Any CUDA-capable NVIDIA GPU	Required for model inference
CUDA toolkit	12.8 (Docker image)	Lower versions may work with the Transformers backend
PyTorch	Installed automatically as a dependency	`torch.bfloat16` or `torch.float16` required for FlashAttention 2

Key runtime dependencies (installed automatically by pip):

transformers==4.57.6
accelerate==1.12.0
qwen-omni-utils
librosa, soundfile, sox (audio I/O)
nagisa==0.2.11, soynlp==0.0.493 (Japanese/Korean tokenisation)
pytz (timezone handling)
gradio, flask (web demo CLI commands)

The optional vllm extra adds vllm==0.14.0.
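
Because several of these dependencies are pinned to exact versions, it can help to compare what is actually installed against the pins. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pins` are illustrative and not part of qwen-asr, and the pin table is copied from the list above.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "transformers": "4.57.6",
    "accelerate": "1.12.0",
    "nagisa": "0.2.11",
    "soynlp": "0.0.493",
}
# Only present when the optional vllm extra was installed.
OPTIONAL_PINNED = {"vllm": "0.14.0"}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Map each distribution name to 'ok', 'missing', or a mismatch message."""
    report = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
        elif found == expected:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (found {found}, expected {expected})"
    return report


if __name__ == "__main__":
    combined = {**check_pins(PINNED), **check_pins(OPTIONAL_PINNED)}
    for name, status in combined.items():
        print(f"{name}: {status}")
```

A "missing" entry for vllm is expected when only the Transformers backend was installed.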

Installing with pip

Transformers backend

Install the minimal package with HuggingFace Transformers support:

pip install -U qwen-asr

This is the right choice for single-GPU workloads where you want the simplest possible setup.

vLLM backend

Install the package with the optional vLLM extra for maximum throughput, async serving, and streaming inference. The quotes keep shells such as zsh from treating the square brackets as a glob pattern:

pip install -U "qwen-asr[vllm]"

vLLM uses Python multiprocessing with the spawn start method. You must wrap all vLLM-based code under if __name__ == '__main__': to avoid a RuntimeError. See the vLLM troubleshooting docs for details.
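
The entry-point guard described above might look like the following sketch. It assumes a script that builds the vLLM engine through `Qwen3ASRModel.LLM`; the `model=` keyword and the function name `main` are assumptions made for illustration, so check the package documentation for the exact parameters.

```python
import importlib.util


def main():
    """All vLLM-related work happens inside this function, never at import time."""
    if importlib.util.find_spec("qwen_asr") is None:
        print('qwen-asr is not installed; try: pip install -U "qwen-asr[vllm]"')
        return 1

    # Imported lazily so that merely importing this file stays cheap.
    from qwen_asr import Qwen3ASRModel

    # Assumption: the keyword argument below is illustrative only.
    model = Qwen3ASRModel.LLM(model="Qwen/Qwen3-ASR-1.7B")
    print("vLLM engine ready:", type(model).__name__)
    return 0


if __name__ == "__main__":
    # With the spawn start method, worker processes re-import this file under
    # a different module name, so this branch runs only in the parent process.
    main()
```

Without the guard, each spawned worker would re-run the module-level code while it is still starting up, which is what triggers the RuntimeError.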

Setting Up a Conda Environment

We strongly recommend using a clean, isolated environment to avoid dependency conflicts with other packages.

Create and activate a fresh environment

Python 3.12 is the version used in the official Docker image and is the recommended choice:

conda create -n qwen3-asr python=3.12 -y
conda activate qwen3-asr

Install qwen-asr

Choose the install variant that matches your intended backend. For the Transformers backend:

pip install -U qwen-asr

For the vLLM backend:

pip install -U "qwen-asr[vllm]"

Verify the installation

Confirm the package is importable and check the public API:

from qwen_asr import Qwen3ASRModel, Qwen3ForcedAligner, parse_asr_output
print("qwen-asr installed successfully")
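
Beyond the import check, a short script can confirm that the interpreter falls in the supported range and report which backends are importable. This is a sketch using only the standard library; `python_supported` and `available_backends` are hypothetical helper names, and the version bounds are taken from the 3.9 to 3.13 range stated at the top of this page.

```python
import importlib.util
import sys

SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 13)


def python_supported(version=None):
    """True if (major, minor) lies within the supported 3.9 to 3.13 range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def available_backends(find_spec=importlib.util.find_spec):
    """List the backends whose top-level module can be located."""
    backends = []
    if find_spec("transformers") is not None:
        backends.append("transformers")
    if find_spec("vllm") is not None:
        backends.append("vllm")
    return backends


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("Backends found:", available_backends())
```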

Installing from Source

If you want to modify the package code, contribute to the project, or test unreleased changes, install from source in editable mode.

Clone the repository

git clone https://github.com/QwenLM/Qwen3-ASR.git
cd Qwen3-ASR

Install in editable mode

pip install -e .

The -e flag means changes you make to the source files take effect immediately without reinstalling.

FlashAttention 2 (Optional)

FlashAttention 2 is optional but significantly reduces GPU memory usage and speeds up inference, especially for long audio and large batch sizes. It is also the recommended way to accelerate the Qwen3-ForcedAligner-0.6B model when timestamps are required.

FlashAttention 2 requires the model to be loaded in torch.float16 or torch.bfloat16. It is not compatible with torch.float32. Enable it by passing attn_implementation="flash_attention_2" to from_pretrained.
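
The dtype constraint above can be encoded in a small helper that decides which value to pass as attn_implementation. This is a sketch, not part of qwen-asr: the function name `pick_attn_implementation` is hypothetical, and falling back to "sdpa" (the PyTorch scaled-dot-product attention option in Transformers) is an assumption about what a reasonable default would be for these models.

```python
import importlib.util

HALF_PRECISION = {"float16", "bfloat16"}


def pick_attn_implementation(dtype, flash_attn_installed=None):
    """Return "flash_attention_2" only when both the dtype and the install allow it."""
    if flash_attn_installed is None:
        # The flash-attn distribution installs a module named flash_attn.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    # str(torch.bfloat16) is "torch.bfloat16", so torch dtypes and plain
    # strings such as "bfloat16" are both accepted.
    dtype_name = str(dtype).replace("torch.", "", 1)
    if flash_attn_installed and dtype_name in HALF_PRECISION:
        return "flash_attention_2"
    # Assumption: "sdpa" is used as the fallback for float32 or when
    # flash-attn is not installed.
    return "sdpa"
```

The returned string would then be passed as the attn_implementation argument to from_pretrained, alongside the matching dtype.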

Install FlashAttention 2 with:

pip install -U flash-attn --no-build-isolation

If your machine has less than 96 GB of RAM and many CPU cores, the parallel C++/CUDA compilation can run out of memory. In that case, limit the number of parallel build jobs:

MAX_JOBS=4 pip install -U flash-attn --no-build-isolation

For full hardware compatibility information, refer to the FlashAttention repository.

Downloading Model Weights Manually

By default, model weights are downloaded automatically from HuggingFace Hub the first time you call Qwen3ASRModel.from_pretrained(...) or Qwen3ASRModel.LLM(...). If your runtime environment does not have internet access, pre-download the weights to a local directory and pass that path instead of the model name.

ModelScope (recommended in Mainland China)

pip install -U modelscope
modelscope download --model Qwen/Qwen3-ASR-1.7B --local_dir ./Qwen3-ASR-1.7B
modelscope download --model Qwen/Qwen3-ASR-0.6B --local_dir ./Qwen3-ASR-0.6B
modelscope download --model Qwen/Qwen3-ForcedAligner-0.6B --local_dir ./Qwen3-ForcedAligner-0.6B

Hugging Face CLI

pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./Qwen3-ASR-1.7B
huggingface-cli download Qwen/Qwen3-ASR-0.6B --local-dir ./Qwen3-ASR-0.6B
huggingface-cli download Qwen/Qwen3-ForcedAligner-0.6B --local-dir ./Qwen3-ForcedAligner-0.6B

Once downloaded, pass the local directory path as the model name:

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "./Qwen3-ASR-1.7B",   # local path instead of "Qwen/Qwen3-ASR-1.7B"
    dtype=torch.bfloat16,
    device_map="cuda:0",
)
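
If the same script has to run both on machines with pre-downloaded weights and on machines with internet access, the choice between a local path and a Hub ID can be made at runtime. The sketch below assumes that a usable local snapshot contains a config.json file at its top level; `resolve_model_source` is a hypothetical helper, not a qwen-asr function.

```python
from pathlib import Path


def resolve_model_source(local_dir, hub_id):
    """Prefer a pre-downloaded directory; otherwise fall back to the Hub ID."""
    path = Path(local_dir)
    # Assumption: a complete snapshot has config.json at its top level.
    if path.is_dir() and (path / "config.json").is_file():
        return str(path)
    return hub_id


if __name__ == "__main__":
    source = resolve_model_source("./Qwen3-ASR-1.7B", "Qwen/Qwen3-ASR-1.7B")
    print("Loading model from:", source)
```

The returned string can be passed as the first argument to from_pretrained in place of the hard-coded path.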

Using the Official Docker Image

For the simplest possible setup, with no CUDA toolkit installation or Python dependency management on the host, use the pre-built Docker image qwenllm/qwen3-asr. It includes Python 3.12, CUDA 12.8, qwen-asr[vllm], and FlashAttention 2.

LOCAL_WORKDIR=/path/to/your/workspace
HOST_PORT=8000
CONTAINER_PORT=80

docker run --gpus all --name qwen3-asr \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -p $HOST_PORT:$CONTAINER_PORT \
    --mount type=bind,source=$LOCAL_WORKDIR,target=/data/shared/Qwen3-ASR \
    --shm-size=4gb \
    -it qwenllm/qwen3-asr:latest

Replace /path/to/your/workspace with your actual local workspace path. Services inside the container must bind to 0.0.0.0 for the port mapping to work. To re-enter a stopped container:

docker start qwen3-asr
docker exec -it qwen3-asr bash

The NVIDIA Container Toolkit must be installed on the host before Docker can access your GPU. Follow the NVIDIA Container Toolkit installation guide if you have not done so already.
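
The requirement that services inside the container bind to 0.0.0.0 can be illustrated with a standard-library HTTP server. This is a generic sketch rather than one of the package's own demo commands; `HealthHandler` and `make_server` are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short plain-text body."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port, host="0.0.0.0"):
    """Create a server that listens on all interfaces, not only loopback."""
    return HTTPServer((host, port), HealthHandler)
```

Inside the container, `make_server(80).serve_forever()` would listen on the port that CONTAINER_PORT maps to in the docker run command. Binding to 127.0.0.1 instead would leave the service unreachable through the published port.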
