Deploy Qwen3-ASR as an OpenAI-Compatible API Server

Qwen3-ASR integrates natively with vLLM, giving you a production-ready, OpenAI-compatible HTTP server for speech recognition. Once the server is running you can call it with the OpenAI Python SDK, plain requests, or cURL — no custom client code required.

Installation

vLLM provides day-0 support for Qwen3-ASR. Use uv to install the nightly wheel along with the extra audio dependencies:

uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
uv pip install "vllm[audio]"   # additional audio dependencies

The qwen-asr-serve CLI command (described below) requires the vllm extra from the qwen-asr package. Install it with:

pip install -U "qwen-asr[vllm]"

Without this extra, qwen-asr-serve will raise an ImportError on startup.
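
Before launching the server, it can help to confirm that both packages are visible in the active environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical, and the check only confirms that the distributions are present, not that the vllm extra of qwen-asr was selected.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names taken from the install commands above
    for name, version in installed_versions(["vllm", "qwen-asr"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```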

Starting the Server

You have two ways to launch the inference server.

Option 1: qwen-asr-serve (recommended)

qwen-asr-serve is a thin wrapper around vllm serve that automatically registers the Qwen3-ASR model architecture. It accepts every flag that vllm serve supports:

qwen-asr-serve Qwen/Qwen3-ASR-1.7B \
  --gpu-memory-utilization 0.8 \
  --host 0.0.0.0 \
  --port 8000

Option 2: direct vllm serve

If you have already installed vLLM and registered the model separately, you can invoke vLLM directly:

vllm serve Qwen/Qwen3-ASR-1.7B

Both commands expose the same OpenAI-compatible endpoints on the configured host and port.
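
Model loading can take a while, so a client script may want to wait until the server answers before sending audio. The sketch below polls the model-listing route of the OpenAI-compatible API with the standard library. The helper names, the retry counts, and the assumption that the server runs without an API key (an authenticated server would answer 401 here and be treated as not ready) are illustrative, not part of Qwen3-ASR.

```python
import time
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def wait_until_ready(base_url, attempts=30, delay=2.0, get_status=http_status):
    """Poll `<base_url>/models` until it answers 200 or attempts run out."""
    url = base_url.rstrip("/") + "/models"
    for attempt in range(attempts):
        try:
            if get_status(url) == 200:
                return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error statuses
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example (requires a running server):
#   wait_until_ready("http://localhost:8000/v1")
```

The status lookup is passed in as a parameter so the polling logic can be exercised without a live server.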

Sending Requests

After the server is running, you can query it using the OpenAI SDK, the transcription API, or cURL. The example below uses the OpenAI Python SDK against the chat completions endpoint:

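from openai import OpenAI

# Point the SDK at the local server. The SDK requires a key, but a
# placeholder works unless the server was started with --api-key.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)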

response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)
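
The request above references audio by a public URL. For a file on local disk, one common approach with vLLM's chat endpoint is to embed the bytes as a base64 data URL in the same audio_url field. The sketch below assumes the server accepts data URLs for audio and that audio/wav is the right MIME type for your file; both helper names are hypothetical.

```python
import base64


def audio_bytes_to_data_url(data, mime="audio/wav"):
    """Encode raw audio bytes as a base64 data URL (MIME type is an assumption)."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_audio_messages(audio_url):
    """Build a single-turn chat payload in the same shape as the example above."""
    return [
        {
            "role": "user",
            "content": [{"type": "audio_url", "audio_url": {"url": audio_url}}],
        }
    ]


# Example (requires a local file and a running server):
#   from pathlib import Path
#   url = audio_bytes_to_data_url(Path("sample.wav").read_bytes())
#   response = client.chat.completions.create(
#       model="Qwen/Qwen3-ASR-1.7B",
#       messages=build_audio_messages(url),
#   )
```

Base64 inflates the payload by roughly a third, so very long recordings may be better served from a URL the server can fetch.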

Parsing the Response

The raw model output encodes both the detected language and the transcription text in a structured format. Use parse_asr_output from the qwen_asr package to split them apart:

from qwen_asr import parse_asr_output

# Continuing from the previous example: take the raw model output
content = response.choices[0].message.content
language, text = parse_asr_output(content)
print(language)   # e.g. "English"
print(text)       # the transcribed text

Offline Inference with `vllm.LLM`

For batch processing without an HTTP server, use vLLM’s LLM class directly. Wrap your script in a __main__ guard to avoid multiprocessing issues:

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset


def main():
    # Initialize the LLM
    llm = LLM(model="Qwen/Qwen3-ASR-1.7B")

    # Load a bundled vLLM audio asset (or supply your own URL)
    audio_asset = AudioAsset("winning_call")

    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": audio_asset.url},
                }
            ],
        }
    ]

    # Near-zero temperature keeps the transcription close to greedy decoding
    sampling_params = SamplingParams(temperature=0.01, max_tokens=256)

    outputs = llm.chat(conversation, sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)


# The guard stops spawned worker processes from re-running the script body
if __name__ == "__main__":
    main()

Always wrap vLLM offline inference code in if __name__ == "__main__": to prevent the spawn error that arises from Python’s multiprocessing model. See the vLLM Troubleshooting guide for details.
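
For batch processing of several files, the single-conversation call can be extended to a list of conversations. The sketch below assumes that llm.chat accepts a list of conversations and returns outputs in input order, which matches current vLLM behavior as far as I know; build_conversations and transcribe_batch are hypothetical helper names.

```python
def build_conversations(audio_urls):
    """Return one single-turn conversation per audio URL."""
    return [
        [
            {
                "role": "user",
                "content": [{"type": "audio_url", "audio_url": {"url": url}}],
            }
        ]
        for url in audio_urls
    ]


def transcribe_batch(llm, audio_urls, sampling_params):
    """Run one llm.chat call over all URLs and pair each URL with its raw output."""
    outputs = llm.chat(
        build_conversations(audio_urls), sampling_params=sampling_params
    )
    return [(url, out.outputs[0].text) for url, out in zip(audio_urls, outputs)]


# Example (call from inside the __main__ guard, with `llm` and
# `sampling_params` created as in the script above):
#   for url, raw in transcribe_batch(llm, urls, sampling_params):
#       print(url, raw)
```

Each returned string is raw model output, so it still needs parse_asr_output to separate the language from the text.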
