High-Throughput Batch Inference with the vLLM Backend

The vLLM backend is the high-throughput path for Qwen3-ASR. It routes all generation through vLLM’s continuous batching engine, so you can submit hundreds of audio files concurrently and keep the GPU close to fully utilized. It is the recommended backend for production and high-volume workloads, and it is the only backend that supports streaming transcription.

When to Use vLLM vs. Transformers

vLLM backend

Large-batch offline transcription
Low-latency server deployments
Streaming real-time transcription
Concurrency of 128+ requests

Transformers backend

Minimal dependencies
Single-GPU experimentation
Fine-tuning or custom hooks
Environments where vLLM cannot be installed
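For environments where vLLM may or may not be installed, one option is to check at runtime and fall back to the Transformers backend. The sketch below uses only the standard library. `backend_available` and `pick_backend` are hypothetical helper names, not part of qwen-asr.

```python
from importlib.util import find_spec


def backend_available(module_name: str) -> bool:
    """Return True if the named top-level module can be imported."""
    return find_spec(module_name) is not None


def pick_backend(module_name: str = "vllm") -> str:
    """Prefer vLLM when it is installed, otherwise fall back to Transformers."""
    return "vllm" if backend_available(module_name) else "transformers"


if __name__ == "__main__":
    # Prints "vllm" or "transformers" depending on the current environment.
    print(pick_backend())
```

The check only tells you that the package is importable. It does not verify that a compatible GPU or driver is present.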

Installation

The vLLM backend ships as an optional extra. Install it alongside the base qwen-asr package:

pip install -U "qwen-asr[vllm]"

For even faster forced-alignment inference when using timestamps, also install FlashAttention 2:

pip install -U flash-attn --no-build-isolation

Loading with `Qwen3ASRModel.LLM`

Use the Qwen3ASRModel.LLM class method to initialize the vLLM backend. This internally creates a vllm.LLM instance and registers the Qwen3-ASR model architecture.

from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.8,
        max_inference_batch_size=32,
        max_new_tokens=1024,
    )

Parameters

model (str, required)

Hugging Face repository ID (e.g. "Qwen/Qwen3-ASR-1.7B") or a local directory path. Passed directly to vllm.LLM(model=...).

forced_aligner (str, default: None)

Repository ID or local path for Qwen3ForcedAligner (e.g. "Qwen/Qwen3-ForcedAligner-0.6B"). Required when you intend to call transcribe(..., return_time_stamps=True).

forced_aligner_kwargs (dict, default: None)

Keyword arguments forwarded to Qwen3ForcedAligner.from_pretrained(...). Typically includes dtype and device_map.

max_inference_batch_size (int, default: -1)

Maximum number of audio chunks submitted to vLLM in a single generate call. The default -1 means unlimited — vLLM handles its own internal batching. Set a positive value to limit memory usage when inputs are very long.

max_new_tokens (int, default: 4096)

Maximum tokens to generate per audio chunk. The vLLM backend defaults to 4096, which is suitable for audio up to several minutes long.

**kwargs (any)

All remaining keyword arguments are forwarded to vllm.LLM(...). Useful options include gpu_memory_utilization (float, default 0.9), tensor_parallel_size (int), and dtype.
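To make the effect of max_inference_batch_size concrete, the sketch below shows how a cap of this kind splits a list of audio chunks into sub-batches, with a non-positive value meaning "no cap". It illustrates the semantics described above and is not the library's implementation. `split_batches` is a hypothetical name.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_batches(items: Sequence[T], max_batch_size: int = -1) -> Iterator[List[T]]:
    """Yield sub-batches of at most max_batch_size items.

    A non-positive cap (such as the default -1) yields everything at once.
    """
    if max_batch_size <= 0:
        yield list(items)
        return
    for start in range(0, len(items), max_batch_size):
        yield list(items[start:start + max_batch_size])


chunks = [f"chunk_{i}" for i in range(70)]
print([len(b) for b in split_batches(chunks, 32)])  # [32, 32, 6]
print([len(b) for b in split_batches(chunks, -1)])  # [70]
```

With a cap of 32, 70 chunks become three generate calls. With -1 they go to vLLM in one call and its scheduler decides how to batch them.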

Batch Transcription

The transcribe method accepts the same audio input formats as the Transformers backend: URL strings, local file paths, base64 data URLs, and (np.ndarray, sr) tuples. Mix them freely in a single batch.

from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.8,
        max_inference_batch_size=32,
        max_new_tokens=1024,
    )

    results = model.transcribe(
        audio=[
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
        ],
        language=[None, "English"],  # None = auto-detect
    )

    for i, r in enumerate(results):
        print(f"[{i}] {r.language}: {r.text}")
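The base64 data URL input format mentioned above can be produced from in-memory audio with the standard library. The following is a minimal sketch: `make_silent_wav` and `to_data_url` are hypothetical helpers, and the `audio/wav` MIME type is an assumption about what the loader accepts.

```python
import base64
import io
import wave


def make_silent_wav(seconds: float = 0.1, sample_rate: int = 16000) -> bytes:
    """Build a tiny mono 16-bit PCM WAV in memory (silence), for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def to_data_url(audio_bytes: bytes, mime: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


data_url = to_data_url(make_silent_wav())
print(data_url[:40])
# Assumed usage: the string can sit next to paths and URLs in one batch, e.g.
# model.transcribe(audio=["local.wav", data_url])
```

This is mainly useful when audio arrives as bytes (for example from an upload) and you want to avoid writing a temporary file.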

Getting Timestamps

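Load the model with a forced_aligner and pass return_time_stamps=True to transcribe. The aligner is a separate model that runs on the device you specify in forced_aligner_kwargs (CPU or GPU) and is not part of vLLM’s memory pool. If it shares a GPU with vLLM, leave room for it by lowering gpu_memory_utilization; the example below uses 0.7.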

import torch
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
            # attn_implementation="flash_attention_2",
        ),
    )

    results = model.transcribe(
        audio=[
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
        ],
        language=["Chinese", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])
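vLLM reserves the fraction of GPU memory given by gpu_memory_utilization, so an aligner placed on the same GPU has to fit in what remains. The following is a rough back-of-the-envelope check. `headroom_gib` is a hypothetical helper and the 80 GiB card is only an example figure.

```python
def headroom_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """GPU memory (GiB) left outside vLLM's reserved fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return round(total_gib * (1.0 - gpu_memory_utilization), 2)


# Example: an 80 GiB card with the setting used above, then vLLM's default.
print(headroom_gib(80, 0.7))  # 24.0
print(headroom_gib(80, 0.9))  # 8.0
```

The estimate ignores memory used by the CUDA context and other processes, so treat it as an upper bound on what the aligner can use.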

The `if __name__ == '__main__':` Guard

Always wrap your vLLM inference code in if __name__ == '__main__':. vLLM starts worker processes with Python multiprocessing, and under the spawn start method each worker re-imports the main module. Without the guard, that import re-runs your top-level code, which then tries to start vLLM again and either fails with a RuntimeError or hangs indefinitely. The vLLM troubleshooting documentation lists this guard as a requirement.

from qwen_asr import Qwen3ASRModel

# ✅ Correct: model creation and inference run only in the main process
if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(model="Qwen/Qwen3-ASR-1.7B")
    results = model.transcribe(audio="audio.wav")

# ❌ Unguarded: re-executed by every spawned worker, which crashes or hangs
model = Qwen3ASRModel.LLM(model="Qwen/Qwen3-ASR-1.7B")
results = model.transcribe(audio="audio.wav")

Serving via `qwen-asr-serve`

You can also deploy Qwen3-ASR as an OpenAI-compatible HTTP server using the bundled qwen-asr-serve command, which wraps vllm serve:

qwen-asr-serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
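Model loading can take a while, so a client usually needs to wait before sending its first request. The sketch below polls a URL until it answers with HTTP 200. It assumes the wrapped server exposes vLLM's `/health` endpoint on the host and port shown above. `http_ok` and `wait_until_ready` are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 60,
    delay: float = 5.0,
) -> bool:
    """Poll url until probe succeeds or attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Assumed usage against the server started above:
# ready = wait_until_ready("http://localhost:8000/health")
```

The probe is injectable so the polling loop can be exercised without a running server.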

Send transcription requests to the server and parse the response using the parse_asr_output utility:

import requests
from qwen_asr import parse_asr_output

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()['choices'][0]['message']['content']

# Parse the raw model output into structured (language, text)
language, text = parse_asr_output(content)
print(language)
print(text)
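Because the server batches requests continuously, throughput generally improves when a client keeps several requests in flight instead of sending them one by one. The sketch below fans out requests with a thread pool, reusing the payload shape from the example above. `build_payload` and `send_all` are hypothetical helpers, and the transport is passed in as a function so the HTTP call stays under your control.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def build_payload(audio_url: str) -> Dict:
    """Build the chat-completions body for one audio URL."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "audio_url", "audio_url": {"url": audio_url}}
                ],
            }
        ]
    }


def send_all(
    audio_urls: List[str],
    send: Callable[[Dict], str],
    max_workers: int = 16,
) -> List[str]:
    """Send one request per URL in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, [build_payload(u) for u in audio_urls]))


# Example wiring with requests (assumes the server from the previous step):
# import requests
# def send(payload):
#     r = requests.post("http://localhost:8000/v1/chat/completions",
#                       json=payload, timeout=300)
#     r.raise_for_status()
#     return r.json()["choices"][0]["message"]["content"]
# raw_outputs = send_all(list_of_audio_urls, send)
```

Each returned string is raw model output and can be passed to parse_asr_output as shown above. The max_workers value of 16 is arbitrary; tune it to your server.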

`parse_asr_output` Utility

parse_asr_output(raw, user_language=None) parses a raw model output string into a (language, text) tuple. It handles the "language X<asr_text>..." format produced by the model, strips repetition artifacts, and falls back gracefully when the tag is absent. Import it directly from qwen_asr:

from qwen_asr import parse_asr_output

language, text = parse_asr_output("language English<asr_text>Hello world.")
# language = "English", text = "Hello world."
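To show what the "language X<asr_text>..." format looks like when taken apart, here is a simplified stand-in written in plain Python. It is not the library's implementation: it omits repetition stripping and the user_language argument, and returning None for the language when the tag is absent is an assumption of this sketch. `parse_tagged` is a hypothetical name.

```python
from typing import Optional, Tuple

TAG = "<asr_text>"
PREFIX = "language "


def parse_tagged(raw: str) -> Tuple[Optional[str], str]:
    """Split 'language X<asr_text>...' into (language, text).

    Returns (None, raw.strip()) when the tag is missing.
    """
    head, sep, tail = raw.partition(TAG)
    if not sep:
        return None, raw.strip()
    head = head.strip()
    language = head[len(PREFIX):].strip() if head.startswith(PREFIX) else head
    return (language or None), tail.strip()


print(parse_tagged("language English<asr_text>Hello world."))
# ('English', 'Hello world.')
print(parse_tagged("Hello world."))
# (None, 'Hello world.')
```

In real code, prefer parse_asr_output itself, since it also handles the artifacts this sketch ignores.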
