The vLLM backend routes all generation through vLLM’s continuous batching engine, so you can process hundreds of audio files concurrently while keeping GPU utilization high. It is the recommended backend for production or high-volume workloads, and it is the only backend that supports streaming transcription.

When to Use vLLM vs. Transformers

vLLM backend

  • Large-batch offline transcription
  • Low-latency server deployments
  • Streaming real-time transcription
  • Concurrency of 128+ requests

Transformers backend

  • Minimal dependencies
  • Single-GPU experimentation
  • Fine-tuning or custom hooks
  • Environments where vLLM cannot be installed

Installation

The vLLM backend ships as an optional extra. Install it alongside the base qwen-asr package:
pip install -U "qwen-asr[vllm]"
For even faster forced-alignment inference when using timestamps, also install FlashAttention 2:
pip install -U flash-attn --no-build-isolation
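
After installing, it can help to confirm that the optional dependencies are importable before loading a model. The following is a minimal sketch using only the standard library. It assumes the packages import as qwen_asr, vllm, and flash_attn; the helper name missing_modules is hypothetical and not part of qwen-asr.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for qwen-asr, vLLM, and FlashAttention 2.
    missing = missing_modules(["qwen_asr", "vllm", "flash_attn"])
    if missing:
        print("Missing optional modules:", ", ".join(missing))
    else:
        print("All backend modules are importable.")
```

A missing flash_attn is only a concern if you plan to enable attn_implementation="flash_attention_2" for the forced aligner.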

Loading with Qwen3ASRModel.LLM

Use the Qwen3ASRModel.LLM class method to initialize the vLLM backend. This internally creates a vllm.LLM instance and registers the Qwen3-ASR model architecture.
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.8,
        max_inference_batch_size=32,
        max_new_tokens=1024,
    )

Parameters

model (str, required)
Hugging Face repository ID (e.g. "Qwen/Qwen3-ASR-1.7B") or a local directory path. Passed directly to vllm.LLM(model=...).

forced_aligner (str, default: None)
Repository ID or local path for Qwen3ForcedAligner (e.g. "Qwen/Qwen3-ForcedAligner-0.6B"). Required when you intend to call transcribe(..., return_time_stamps=True).

forced_aligner_kwargs (dict, default: None)
Keyword arguments forwarded to Qwen3ForcedAligner.from_pretrained(...). Typically includes dtype and device_map.

max_inference_batch_size (int, default: -1)
Maximum number of audio chunks submitted to vLLM in a single generate call. The default -1 means unlimited: vLLM handles its own internal batching. Set a positive value to limit memory usage when inputs are very long.

max_new_tokens (int, default: 4096)
Maximum tokens to generate per audio chunk. The vLLM backend defaults to 4096, which is suitable for audio up to several minutes long.

**kwargs (any)
All remaining keyword arguments are forwarded to vllm.LLM(...). Useful options include gpu_memory_utilization (float, default 0.9), tensor_parallel_size (int), and dtype.
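
To make the max_inference_batch_size semantics concrete, here is a small sketch of how a list of audio chunks could be split under that limit. It illustrates the documented behavior only (a positive cap per generate call, -1 for unlimited) and is not the library's internal code; the helper iter_batches is hypothetical.

```python
def iter_batches(items, max_batch_size=-1):
    """Yield consecutive slices of items, each at most max_batch_size long.

    A non-positive max_batch_size (such as the default -1) yields everything
    in a single batch.
    """
    items = list(items)
    if max_batch_size <= 0:
        if items:
            yield items
        return
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


chunks = [f"chunk_{i}" for i in range(5)]
print(list(iter_batches(chunks, max_batch_size=2)))  # three batches: 2, 2, 1
print(list(iter_batches(chunks)))                    # one batch of 5
```

A smaller cap trades some throughput for a lower peak memory footprint per call.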

Batch Transcription

The transcribe method accepts the same audio input formats as the Transformers backend: URL strings, local file paths, base64 data URLs, and (np.ndarray, sr) tuples. Mix them freely in a single batch.
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.8,
        max_inference_batch_size=32,
        max_new_tokens=1024,
    )

    results = model.transcribe(
        audio=[
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
        ],
        language=[None, "English"],  # None = auto-detect
    )

    for i, r in enumerate(results):
        print(f"[{i}] {r.language}: {r.text}")
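
For the base64 data URL input format, the sketch below shows one way to build such a string with the standard library. It synthesizes a short sine tone so the example is self-contained. The helper names and the audio/wav MIME type are assumptions for illustration; check which data URL forms your installed qwen-asr version accepts.

```python
import base64
import io
import math
import struct
import wave


def make_sine_wav_bytes(freq_hz=440.0, seconds=0.1, sample_rate=16000):
    """Return the bytes of a mono 16-bit PCM WAV file containing a sine tone."""
    n_samples = int(round(seconds * sample_rate))
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buf.getvalue()


def to_data_url(audio_bytes, mime="audio/wav"):
    """Encode raw audio file bytes as a base64 data URL (MIME type is assumed)."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


data_url = to_data_url(make_sine_wav_bytes())
print(data_url[:40] + "...")
```

The resulting string could then be placed in the audio list alongside URLs and file paths.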

Getting Timestamps

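Load the model with a forced_aligner and set return_time_stamps=True. The aligner runs on the device you specify via device_map in forced_aligner_kwargs, separately from the vLLM engine. If the aligner shares a GPU with vLLM, leave it some headroom by lowering gpu_memory_utilization, as the example below does with 0.7.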
import torch
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
            # attn_implementation="flash_attention_2",
        ),
    )

    results = model.transcribe(
        audio=[
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
        ],
        language=["Chinese", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])

The if __name__ == '__main__': Guard

Always wrap your vLLM inference code inside if __name__ == '__main__':. vLLM starts worker processes with Python multiprocessing using the spawn start method, and each spawned worker re-imports the main module. Without the guard, every worker re-executes the top-level code, including the model construction, which either crashes with a RuntimeError or hangs indefinitely. The vLLM troubleshooting documentation lists this as a requirement.
from qwen_asr import Qwen3ASRModel

# ✅ Correct: the model is created only in the main process
if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(model="Qwen/Qwen3-ASR-1.7B")
    results = model.transcribe(audio="audio.wav")

# ❌ Will crash with spawn multiprocessing: this runs again in every worker
model = Qwen3ASRModel.LLM(model="Qwen/Qwen3-ASR-1.7B")
results = model.transcribe(audio="audio.wav")

Serving via qwen-asr-serve

You can also deploy Qwen3-ASR as an OpenAI-compatible HTTP server using the bundled qwen-asr-serve command, which wraps vllm serve:
qwen-asr-serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
Send transcription requests to the server and parse the response using the parse_asr_output utility:
import requests
from qwen_asr import parse_asr_output

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()['choices'][0]['message']['content']

# Parse the raw model output into structured (language, text)
language, text = parse_asr_output(content)
print(language)
print(text)
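
If you send many requests, it may be convenient to factor the payload construction and response extraction into small functions. The sketch below mirrors the request body and response path used above; the helper names build_asr_request and extract_content are hypothetical, and the response dictionary is a hand-written stand-in rather than real server output.

```python
def build_asr_request(audio_url):
    """Build a chat-completions body carrying one audio_url content part."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            }
        ]
    }


def extract_content(response_json):
    """Return the raw model output from the first choice of a response."""
    return response_json["choices"][0]["message"]["content"]


payload = build_asr_request(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
)

# Hand-written stand-in for response.json(); not real server output.
fake_response = {
    "choices": [{"message": {"content": "language English<asr_text>Hello world."}}]
}
print(extract_content(fake_response))
```

The returned string is the raw model output, which you would then pass to parse_asr_output.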

parse_asr_output Utility

parse_asr_output(raw, user_language=None) parses a raw model output string into a (language, text) tuple. It handles the "language X<asr_text>..." format produced by the model, strips repetition artifacts, and falls back gracefully when the tag is absent. Import it directly from qwen_asr:
from qwen_asr import parse_asr_output

language, text = parse_asr_output("language English<asr_text>Hello world.")
# language = "English", text = "Hello world."
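
To illustrate the "language X<asr_text>..." format itself, here is a simplified stand-alone splitter. It is not the library's implementation: it does not strip repetition artifacts, it ignores user_language, and its fallback (an empty language string when the tag is absent) is an assumption. Use parse_asr_output in real code.

```python
ASR_TAG = "<asr_text>"
LANGUAGE_PREFIX = "language "


def split_asr_output(raw):
    """Split 'language X<asr_text>TEXT' into (X, TEXT); simplified sketch."""
    head, tag, text = raw.partition(ASR_TAG)
    if not tag:
        # Assumed fallback: no tag found, so treat the whole string as text.
        return "", raw.strip()
    head = head.strip()
    if head.startswith(LANGUAGE_PREFIX):
        head = head[len(LANGUAGE_PREFIX):].strip()
    return head, text.strip()


print(split_asr_output("language English<asr_text>Hello world."))
```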
