Qwen3-ASR integrates natively with vLLM, giving you a production-ready, OpenAI-compatible HTTP server for speech recognition. Once the server is running you can call it with the OpenAI Python SDK, plainDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-ASR/llms.txt
Use this file to discover all available pages before exploring further.
requests, or cURL — no custom client code required.
Installation
vLLM provides day-0 support for Qwen3-ASR. Useuv to install the nightly wheel along with the extra audio dependencies:
Starting the Server
You have two ways to launch the inference server: Option 1 —qwen-asr-serve (recommended)
qwen-asr-serve is a thin wrapper around vllm serve that automatically registers the Qwen3-ASR model architecture. It accepts every flag that vllm serve supports:
vllm serve
If you have already installed vLLM and registered the model separately, you can invoke vLLM directly:
Sending Requests
After the server is running, you can query it using the OpenAI SDK, the transcription API, or cURL.Parsing the Response
The raw model output encodes both the detected language and the transcription text in a structured format. Useparse_asr_output from the qwen_asr package to split them apart:
Offline Inference with vllm.LLM
For batch processing without an HTTP server, use vLLM’s LLM class directly. Wrap your script in a __main__ guard to avoid multiprocessing issues:
Always wrap vLLM offline inference code in
if __name__ == "__main__": to prevent the spawn error that arises from Python’s multiprocessing model. See the vLLM Troubleshooting guide for details.