Quickstart: Transcribe Audio with Qwen3-ASR in Minutes

This guide walks you through installing the qwen-asr package, loading a model, and running your first transcription in about 5 minutes. By the end you will have working code that accepts an audio URL or local file, detects the spoken language automatically, and returns the full transcript. The same API works for both the lightweight 0.6B and the flagship 1.7B checkpoints.

Qwen3-ASR requires a CUDA-capable GPU. The 1.7B model fits comfortably in 8 GB of VRAM with torch.bfloat16. The 0.6B model runs in around 3 GB of VRAM under the same dtype.
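
As a rough illustration of the VRAM guidance above, the sketch below picks a checkpoint from the amount of GPU memory available. The helper names are hypothetical, the thresholds are simply the 8 GB and 3 GB figures quoted above, and the 0.6B repository ID is assumed by analogy with the 1.7B name used later in this guide.

```python
# Hypothetical helper, not part of qwen-asr. Thresholds are the bfloat16
# figures quoted in this guide; treat them as rough guidance only.
VRAM_GUIDANCE_GB = [
    ("Qwen/Qwen3-ASR-1.7B", 8.0),
    ("Qwen/Qwen3-ASR-0.6B", 3.0),  # assumed ID, by analogy with the 1.7B name
]


def pick_checkpoint(vram_gb):
    """Return the largest checkpoint whose quoted VRAM figure fits, else None."""
    for name, needed_gb in VRAM_GUIDANCE_GB:
        if vram_gb >= needed_gb:
            return name
    return None


def detect_total_vram_gb(device_index=0):
    """Total memory of a CUDA device in decimal GB, or 0.0 without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    props = torch.cuda.get_device_properties(device_index)
    # Decimal GB, so a nominal "8 GB" card clears the 8 GB threshold.
    return props.total_memory / 1e9


# Example: pick_checkpoint(detect_total_vram_gb())
```

The check uses total rather than free memory, so other processes on the same GPU can still cause an out-of-memory error.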

Steps

Install the package

Install qwen-asr from PyPI. The base installation pulls in the HuggingFace Transformers backend and all required runtime dependencies.

pip install -U qwen-asr

If you want the vLLM backend for faster batch inference and streaming support, install the optional extra instead. The quotes keep shells such as zsh from interpreting the square brackets as a glob pattern:

pip install -U "qwen-asr[vllm]"

We recommend creating a fresh conda environment first to avoid dependency conflicts. See the Installation guide for a step-by-step environment setup.
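
After installing, a quick way to confirm what ended up in the environment is to query the installed distribution versions. This is a minimal sketch using only the standard library. The helper name is hypothetical, and the distribution names checked are assumptions based on the install commands and imports shown in this guide.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names assumed from the pip commands and imports in this guide.
for dist in ("qwen-asr", "torch", "vllm"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```

If you installed only the base package, vllm is expected to show as not installed.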

Load a model

Choose either the Transformers backend (simple, single-GPU) or the vLLM backend (high-throughput, streaming). Both expose the same model.transcribe(...) interface. The example below loads the 1.7B checkpoint with the Transformers backend.

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",  # optional, requires flash-attn
    max_inference_batch_size=32,  # set -1 for unlimited; lower values reduce OOM risk
    max_new_tokens=256,           # increase for very long audio
)

Model weights are downloaded automatically from HuggingFace on first use. For offline environments, see Downloading model weights manually.
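
The loading snippet above leaves attn_implementation commented out because it requires flash-attn. One way to switch it on only when that package is importable is sketched below. build_load_kwargs is a hypothetical helper, not part of qwen-asr, and it reuses only the keyword arguments shown in the snippet above.

```python
import importlib.util


def build_load_kwargs(dtype, device_map="cuda:0", use_flash_attn=None):
    """Assemble from_pretrained keyword arguments, adding FlashAttention 2 if available.

    Hypothetical helper: pass use_flash_attn=True/False to override detection.
    """
    if use_flash_attn is None:
        # flash-attn installs an importable module named "flash_attn".
        use_flash_attn = importlib.util.find_spec("flash_attn") is not None
    kwargs = {
        "dtype": dtype,
        "device_map": device_map,
        "max_inference_batch_size": 32,  # lower values reduce OOM risk
        "max_new_tokens": 256,           # increase for very long audio
    }
    if use_flash_attn:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


# Intended use (requires torch and qwen-asr):
# model = Qwen3ASRModel.from_pretrained(
#     "Qwen/Qwen3-ASR-1.7B", **build_load_kwargs(torch.bfloat16)
# )
```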

Transcribe audio

Call model.transcribe() with a URL, local path, base64 string, or a (np.ndarray, sr) tuple. Pass a list to run batch inference.

# Single file — automatic language detection
results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # set e.g. "English" to skip auto-detection
)

# Batch inference — mix URLs and local paths freely
results = model.transcribe(
    audio=[
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    ],
    language=["Chinese", "English"],  # or None for auto-detection per file
)
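
For the base64 input form mentioned above, a local file can be encoded with the standard library. This is a minimal sketch: audio_file_to_base64 is a hypothetical helper, and this guide does not state whether transcribe() expects bare base64 or a data: URI, so the bare form produced here is an assumption to check against the backend documentation.

```python
import base64
from pathlib import Path


def audio_file_to_base64(path):
    """Read an audio file and return its bytes as a bare base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


# Intended use (assumes the bare base64 form is accepted):
# results = model.transcribe(audio=audio_file_to_base64("sample.wav"), language=None)
```

Passing the local path directly is simpler when the file is on the same machine as the model. Encoding is mainly useful when the audio has to travel inside a JSON payload.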

Read the results

model.transcribe() returns a list, even for a single input. Each element has a .language and a .text attribute.

# Single result
print(results[0].language)  # e.g. "English"
print(results[0].text)      # e.g. "Hello, this is a test."

# Batch results
for r in results:
    print(r.language, r.text)
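
To persist transcripts, the two attributes can be serialised as JSON Lines. The sketch below relies only on .language and .text and uses SimpleNamespace objects as stand-ins for real results, so it runs without a model. results_to_jsonl is a hypothetical helper.

```python
import json
from types import SimpleNamespace


def results_to_jsonl(results):
    """Serialise objects exposing .language and .text as one JSON object per line."""
    return "\n".join(
        # ensure_ascii=False keeps non-Latin transcripts (e.g. Chinese) readable.
        json.dumps({"language": r.language, "text": r.text}, ensure_ascii=False)
        for r in results
    )


# Stand-in objects; in practice pass the list returned by model.transcribe().
fake_results = [
    SimpleNamespace(language="English", text="Hello, this is a test."),
]
print(results_to_jsonl(fake_results))
# → {"language": "English", "text": "Hello, this is a test."}
```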

The complete single-file example from start to finish:

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,
)

print(results[0].language)
print(results[0].text)

Parsing raw vLLM server output

When you query a deployed vLLM server directly over HTTP (for example, through the OpenAI-compatible chat completions endpoint), the model returns a raw string. Use parse_asr_output to split it into a (language, text) pair:

import requests
from qwen_asr import parse_asr_output

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()["choices"][0]["message"]["content"]

language, text = parse_asr_output(content)
print(language)
print(text)
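
When sending several files to the server, it can help to factor the request body and the response handling into small functions. The sketch below mirrors the payload and the response path used in the example above. build_payload and extract_content are hypothetical helpers, and the sample response is a hand-written stand-in, not real server output.

```python
def build_payload(audio_url):
    """Build the chat completions request body for a single audio URL."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            }
        ]
    }


def extract_content(response_json):
    """Return the raw model string from a chat completions response body."""
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written stand-in with the same shape as the response read above.
sample_response = {"choices": [{"message": {"content": "<raw model output>"}}]}
print(extract_content(sample_response))
```

The string returned by extract_content is what you would then pass to parse_asr_output.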

Next Steps

Transformers Backend

Deep-dive into batch inference, FlashAttention 2, and timestamp extraction with the Transformers backend.

vLLM Backend

Configure GPU memory utilisation, async serving, and the OpenAI-compatible REST API.

Forced Aligner

Add word- and character-level timestamps to any transcription.

Installation

Set up conda environments, install from source, and manage model weights.
