Qwen3-ASR CLI Reference: demo, streaming, and serve

The qwen-asr package installs three command-line entry points that cover the main deployment scenarios: an interactive Gradio web UI, a minimal Flask-based streaming demo, and a vLLM-powered inference server. This page documents every flag and provides ready-to-run examples for each command.

All three commands are installed automatically when you run pip install qwen-asr. The vLLM backend (qwen-asr-demo --backend vllm and qwen-asr-serve) additionally requires pip install qwen-asr[vllm].

qwen-asr-demo
qwen-asr-demo-streaming
qwen-asr-serve

qwen-asr-demo

Launches a Gradio web UI demo backed by either the transformers or vllm inference backend. The UI lets users upload audio, choose a language, and optionally enable timestamp visualization when a ForcedAligner checkpoint is provided.

Flag Reference

--asr-checkpoint

string

required

Path to a local model directory or a HuggingFace repository ID for the Qwen3-ASR model. Example: Qwen/Qwen3-ASR-1.7B or ./Qwen3-ASR-1.7B

--aligner-checkpoint

string

Path to a local directory or HuggingFace repository ID for the Qwen3-ForcedAligner-0.6B model. Optional. When provided, the UI displays a timestamps panel and a visualization button. Example: Qwen/Qwen3-ForcedAligner-0.6B

--backend

string

default:"transformers"

Inference backend for the ASR model. Accepted values: transformers, vllm.

--cuda-visible-devices

string

default:"0"

Sets CUDA_VISIBLE_DEVICES for the demo process. Use a single GPU index or a comma-separated list of indices (e.g., 0 or 0,1). Because vLLM does not follow the cuda:0 device selection style, this flag is the recommended way to control which GPU is used.

--backend-kwargs

string

JSON dict of backend-specific keyword arguments passed to the model loader, excluding the checkpoint path. Merged over sensible defaults.

Transformers default: {"device_map": "cuda:0", "dtype": "bfloat16", "max_inference_batch_size": 4, "max_new_tokens": 512}

vLLM default: {"gpu_memory_utilization": 0.8, "max_inference_batch_size": 4, "max_new_tokens": 4096}
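Because --backend-kwargs takes a JSON string, shell quoting is an easy place to slip. The sketch below generates the string with json.dumps and quotes it with shlex.join, then previews the effective settings. The merge_backend_kwargs helper is hypothetical and assumes a shallow merge of overrides on top of the documented transformers defaults; the actual merge logic inside qwen-asr-demo may differ.

```python
import json
import shlex

# Transformers defaults as documented for --backend-kwargs.
TRANSFORMERS_DEFAULTS = {
    "device_map": "cuda:0",
    "dtype": "bfloat16",
    "max_inference_batch_size": 4,
    "max_new_tokens": 512,
}


def merge_backend_kwargs(defaults: dict, overrides_json: str) -> dict:
    """Hypothetical helper: shallow-merge JSON overrides over defaults (assumed semantics)."""
    overrides = json.loads(overrides_json) if overrides_json else {}
    if not isinstance(overrides, dict):
        raise ValueError("--backend-kwargs must be a JSON object")
    return {**defaults, **overrides}


# json.dumps guarantees valid JSON (double quotes, lowercase booleans).
overrides_json = json.dumps({"max_new_tokens": 1024, "max_inference_batch_size": 8})
effective = merge_backend_kwargs(TRANSFORMERS_DEFAULTS, overrides_json)

# shlex.join quotes the JSON argument so a POSIX shell passes it through intact.
command = shlex.join([
    "qwen-asr-demo",
    "--asr-checkpoint", "Qwen/Qwen3-ASR-1.7B",
    "--backend-kwargs", overrides_json,
])

print(effective)
print(command)
```

Keys that are not overridden, such as dtype, keep their default values under this assumed merge.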

--aligner-kwargs

string

JSON dict of keyword arguments for the ForcedAligner model loader. Only used when --aligner-checkpoint is set. Default: {"dtype": "bfloat16", "device_map": "cuda:0"}

--ip

string

default:"0.0.0.0"

Server bind IP address for the Gradio server.

--port

integer

default:"8000"

Server port for the Gradio server.

--concurrency

integer

default:"16"

Gradio queue concurrency limit — the maximum number of requests processed simultaneously.

Whether to create a public Gradio sharing link. Disabled by default.

--ssl-certfile

string

Path to an SSL certificate file (PEM format) for serving over HTTPS. Required to avoid browser microphone permission issues when accessed remotely.

--ssl-keyfile

string

Path to the SSL private key file (PEM format) matching --ssl-certfile.

--ssl-verify / --no-ssl-verify

boolean

default:"true"

Whether to verify the SSL certificate. Pass --no-ssl-verify when using a self-signed certificate.

Usage Examples

qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000

After launching, open http://<your-ip>:8000 (or https:// when using SSL) in your browser, or use port forwarding in VS Code to access it locally.

Timestamps are only shown in the UI when --aligner-checkpoint is provided. Without it, the timestamps panel is hidden automatically.
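For scripted launches, the flags above can be assembled programmatically. The following is a minimal sketch: build_demo_command is a hypothetical helper, not part of qwen-asr, and it uses only flags listed in the reference above. Its check that the certificate and key are passed together is the helper's own assumption, and cert.pem and key.pem are placeholder paths.

```python
import shlex


def build_demo_command(asr_checkpoint, *, aligner_checkpoint=None,
                       backend="transformers", cuda_visible_devices="0",
                       ip="0.0.0.0", port=8000,
                       ssl_certfile=None, ssl_keyfile=None, self_signed=False):
    """Hypothetical helper: return an argv list for qwen-asr-demo."""
    if backend not in ("transformers", "vllm"):
        raise ValueError(f"unsupported backend: {backend!r}")
    # Assumption of this sketch: certificate and key are always supplied as a pair.
    if (ssl_certfile is None) != (ssl_keyfile is None):
        raise ValueError("pass ssl_certfile and ssl_keyfile together")

    argv = [
        "qwen-asr-demo",
        "--asr-checkpoint", asr_checkpoint,
        "--backend", backend,
        "--cuda-visible-devices", cuda_visible_devices,
        "--ip", ip,
        "--port", str(port),
    ]
    if aligner_checkpoint:
        # Enables the timestamps panel in the UI.
        argv += ["--aligner-checkpoint", aligner_checkpoint]
    if ssl_certfile:
        argv += ["--ssl-certfile", ssl_certfile, "--ssl-keyfile", ssl_keyfile]
        if self_signed:
            argv.append("--no-ssl-verify")
    return argv


argv = build_demo_command(
    "Qwen/Qwen3-ASR-1.7B",
    aligner_checkpoint="Qwen/Qwen3-ForcedAligner-0.6B",
    backend="vllm",
    ssl_certfile="cert.pem",   # placeholder path
    ssl_keyfile="key.pem",     # placeholder path
    self_signed=True,
)
print(shlex.join(argv))
# To actually launch: subprocess.run(argv, check=True)
```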

qwen-asr-demo-streaming

Launches a minimal Flask-based streaming web demo. The browser captures microphone audio, resamples it to 16,000 Hz, and continuously pushes PCM chunks to the model for real-time transcription. This command requires the vLLM backend — install with pip install qwen-asr[vllm].

Streaming inference does not support batch processing or timestamp generation. Use qwen-asr-demo if you need those features.

Flag Reference

--asr-model-path

string

default:"Qwen/Qwen3-ASR-1.7B"

Model name (HuggingFace repository ID) or path to a local model directory.

--gpu-memory-utilization

float

default:"0.8"

Fraction of GPU memory to allocate for the vLLM engine (0.0 – 1.0).

--host

string

default:"0.0.0.0"

Host address for the Flask server to bind to.

--port

integer

default:"8000"

Port for the Flask server.

--unfixed-chunk-num

integer

default:"4"

Number of unfixed audio chunks retained in the streaming context window.

--unfixed-token-num

integer

default:"5"

Number of unfixed tokens retained at the streaming boundary.

--chunk-size-sec

float

default:"1.0"

Duration of each pushed audio chunk in seconds.
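To get a feel for what --chunk-size-sec means in terms of data, the arithmetic below converts a chunk duration into a sample count at the 16,000 Hz rate mentioned above. The byte figure assumes 16-bit mono PCM, which is an assumption of this sketch; the demo's actual sample format is not stated here.

```python
SAMPLE_RATE_HZ = 16_000  # resampling target used by the streaming demo


def chunk_layout(chunk_size_sec: float, bytes_per_sample: int = 2) -> tuple[int, int]:
    """Return (samples, bytes) per chunk; bytes_per_sample=2 assumes 16-bit mono PCM."""
    if chunk_size_sec <= 0:
        raise ValueError("chunk_size_sec must be positive")
    samples = round(SAMPLE_RATE_HZ * chunk_size_sec)
    return samples, samples * bytes_per_sample


for seconds in (0.5, 1.0, 2.0):
    samples, size_bytes = chunk_layout(seconds)
    print(f"{seconds:.1f} s -> {samples} samples, {size_bytes} bytes")
```

Smaller chunks mean more frequent pushes to the model, while larger chunks give each push more audio to work with.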

Usage Example

qwen-asr-demo-streaming \
  --asr-model-path Qwen/Qwen3-ASR-1.7B \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000

After launching, open http://<your-ip>:8000 in your browser. Click Start to begin microphone capture and real-time transcription. The current transcript and detected language update continuously as audio is received.

qwen-asr-serve

Starts a vLLM OpenAI-compatible inference server for Qwen3-ASR. This command is a thin wrapper around vllm serve: it registers the Qwen3-ASR model architecture and then passes all arguments through to the vLLM CLI unchanged. Every argument supported by vllm serve is also accepted here. Requires pip install qwen-asr[vllm].

Usage Examples

Basic server on port 8000

qwen-asr-serve Qwen/Qwen3-ASR-1.7B \
  --gpu-memory-utilization 0.8 \
  --host 0.0.0.0 \
  --port 8000

0.6B model with higher GPU utilization

qwen-asr-serve Qwen/Qwen3-ASR-0.6B \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000

Local model directory

qwen-asr-serve ./Qwen3-ASR-1.7B \
  --gpu-memory-utilization 0.8 \
  --host 0.0.0.0 \
  --port 8000
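The server may need some time to load the model before it accepts requests. One way to handle this in scripts is to poll the server's /health endpoint, which vLLM's OpenAI-compatible server exposes, until it answers with HTTP 200. The sketch below uses only the standard library; wait_until_ready and http_ok are hypothetical helper names, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, deadline_sec: float = 300.0, interval_sec: float = 2.0,
                     probe=http_ok, sleep=time.sleep) -> bool:
    """Poll url until probe succeeds or roughly deadline_sec of sleeping has elapsed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= deadline_sec:
            return False
        sleep(interval_sec)
        waited += interval_sec


# Example call against the basic server started above:
# ready = wait_until_ready("http://localhost:8000/health")
```

The probe and sleep parameters are injectable so the loop can be exercised without a running server.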

Sending requests to the server

Once the server is running, you can send audio for transcription using the standard OpenAI chat completions format or the OpenAI audio transcriptions API.

import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()["choices"][0]["message"]["content"]
print(content)

# Optionally parse the structured ASR output
from qwen_asr import parse_asr_output
language, text = parse_asr_output(content)
print(language)
print(text)
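The audio transcriptions route mentioned above can be exercised in a similar way. The following sketch builds a multipart/form-data upload with the standard library and posts it to /v1/audio/transcriptions, the standard OpenAI path. It assumes the server was started as in the basic example, that the model field should match the served model name (Qwen/Qwen3-ASR-1.7B here), and that the response is JSON with a text field as in the OpenAI API. build_multipart and transcribe are hypothetical helpers.

```python
import json
import urllib.request
import uuid


def build_multipart(fields: dict, file_field: str, filename: str,
                    file_bytes: bytes, content_type: str) -> tuple[bytes, str]:
    """Encode text fields plus one file as multipart/form-data; return (body, header value)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"'.encode(),
        f"Content-Type: {content_type}".encode(),
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, wav_bytes: bytes, model: str = "Qwen/Qwen3-ASR-1.7B") -> str:
    """POST a WAV file to the transcriptions route and return the text (assumed response shape)."""
    body, content_type = build_multipart({"model": model}, "file", "audio.wav",
                                         wav_bytes, "audio/wav")
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["text"]


# Example call with a local file (path is a placeholder):
# with open("sample.wav", "rb") as f:
#     print(transcribe("http://localhost:8000", f.read()))
```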

qwen-asr-serve passes all arguments directly to vllm serve. Refer to the vLLM documentation for the full list of supported flags such as --max-model-len, --tensor-parallel-size, --dtype, and more.
