Qwen3-ASR integrates natively with vLLM, giving you a production-ready, OpenAI-compatible HTTP server for speech recognition. Once the server is running, you can call it with the OpenAI Python SDK, plain requests, or cURL, with no custom client code required.

Installation

vLLM provides day-0 support for Qwen3-ASR. Use uv to install the nightly wheel along with the extra audio dependencies:
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
uv pip install "vllm[audio]"   # additional audio dependencies
The qwen-asr-serve CLI command (described below) requires the vllm extra from the qwen-asr package. Install it with:
pip install -U "qwen-asr[vllm]"
Without this extra, qwen-asr-serve will raise an ImportError on startup.

Starting the Server

You have two ways to launch the inference server.

Option 1: qwen-asr-serve (recommended)

qwen-asr-serve is a thin wrapper around vllm serve that automatically registers the Qwen3-ASR model architecture. It accepts every flag that vllm serve supports:
qwen-asr-serve Qwen/Qwen3-ASR-1.7B \
  --gpu-memory-utilization 0.8 \
  --host 0.0.0.0 \
  --port 8000

Option 2: direct vllm serve

If you have already installed vLLM and registered the model separately, you can invoke vLLM directly:
vllm serve Qwen/Qwen3-ASR-1.7B
Both commands expose the same OpenAI-compatible endpoints on the configured host and port.
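
Loading the model may take some time, so a client started right after the launch command can hit connection errors. A minimal readiness check is sketched below. It assumes the server exposes the OpenAI-style GET /v1/models route on the host and port used above. The helper names server_is_ready and wait_for_server are illustrative and are not part of vLLM or qwen-asr.

```python
import time
import urllib.request


def server_is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError both derive from OSError).
        return False


def wait_for_server(base_url: str, attempts: int = 30, delay: float = 2.0,
                    probe=server_is_ready) -> bool:
    """Poll until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next attempt
    return False
```

Calling wait_for_server("http://localhost:8000/v1") before the first request returns True once the server answers and False if it never does within the allotted attempts.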

Sending Requests

After the server is running, you can query it using the OpenAI SDK, the transcription API, or cURL. The example below calls the chat completions endpoint through the OpenAI Python SDK:
from openai import OpenAI

# The SDK requires an api_key value; "EMPTY" is a placeholder.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)
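
The same request can be sent without the SDK, and a local recording can stand in for the remote URL. The sketch below uses only the Python standard library. It assumes the server accepts base64 data: URLs in the audio_url field, as vLLM's multimodal chat examples do; that is not confirmed here for Qwen3-ASR. The helper names and the sample.wav path are hypothetical.

```python
import base64
import json
import urllib.request


def audio_data_url(audio_bytes: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_chat_request(base_url: str, model: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST to {base_url}/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "audio_url", "audio_url": {"url": audio_url}}],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw model output string."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a running server, read the file with open("sample.wav", "rb"), pass the bytes to audio_data_url, and call send(build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3-ASR-1.7B", url)). Keeping request construction separate from sending makes the payload easy to inspect or log before it leaves the machine.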

Parsing the Response

The raw model output encodes both the detected language and the transcription text in a structured format. Use parse_asr_output from the qwen_asr package to split them apart:
from qwen_asr import parse_asr_output

# `content` is the string from response.choices[0].message.content
language, text = parse_asr_output(content)
print(language)   # e.g. "English"
print(text)       # the transcribed text
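
Putting the request and the parsing step together, a small helper might look like the sketch below. The function name transcribe is hypothetical. The parser is passed in as an argument (for example parse_asr_output), so the sketch relies only on the (language, text) return shape shown above.

```python
def transcribe(client, model: str, audio_url: str, parse):
    """Send one audio URL through chat completions and parse the reply.

    `client` is an OpenAI-style client and `parse` maps the raw output
    string to a (language, text) pair.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [{"type": "audio_url", "audio_url": {"url": audio_url}}],
            }
        ],
    )
    content = response.choices[0].message.content
    if content is None:
        # Fail loudly instead of handing None to the parser.
        raise ValueError("server returned an empty message")
    return parse(content)
```

A call such as transcribe(client, "Qwen/Qwen3-ASR-1.7B", url, parse_asr_output) would then return the language and the text in one step.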

Offline Inference with vllm.LLM

For batch processing without an HTTP server, use vLLM’s LLM class directly. Wrap your script in a __main__ guard to avoid multiprocessing issues:
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset


def main():
    # Initialize the LLM
    llm = LLM(model="Qwen/Qwen3-ASR-1.7B")

    # Load a bundled vLLM audio asset (or supply your own URL)
    audio_asset = AudioAsset("winning_call")

    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": audio_asset.url},
                }
            ],
        }
    ]

    sampling_params = SamplingParams(temperature=0.01, max_tokens=256)

    outputs = llm.chat(conversation, sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)


# The guard keeps worker processes from re-running the script on import.
if __name__ == "__main__":
    main()
Always wrap vLLM offline inference code in if __name__ == "__main__": to prevent the spawn error that arises from Python’s multiprocessing model. See the vLLM Troubleshooting guide for details.
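
Since the stated use case is batch processing, several recordings can be submitted in one call. The sketch below assumes that llm.chat accepts a list of conversations and returns one result per conversation in the same order, which matches vLLM's documented behavior at the time of writing. The helper names build_conversation and transcribe_batch are hypothetical.

```python
def build_conversation(audio_url: str) -> list:
    """One single-turn conversation containing one audio clip."""
    return [
        {
            "role": "user",
            "content": [{"type": "audio_url", "audio_url": {"url": audio_url}}],
        }
    ]


def transcribe_batch(llm, audio_urls, sampling_params):
    """Run all clips through llm.chat in one call; return the raw output strings."""
    conversations = [build_conversation(url) for url in audio_urls]
    outputs = llm.chat(conversations, sampling_params=sampling_params)
    return [output.outputs[0].text for output in outputs]
```

Inside the __main__ guard, create the LLM and SamplingParams objects as in the example above and call transcribe_batch(llm, urls, sampling_params). Each returned string is raw model output and can be passed to parse_asr_output.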
