Launch the Qwen3-ASR Gradio and Streaming Web Demos

The qwen-asr package ships two ready-to-run web demos: a full-featured Gradio interface (qwen-asr-demo) and a minimal Flask streaming demo (qwen-asr-demo-streaming). Both let you transcribe audio from a browser without writing any application code.

Gradio Demo

The qwen-asr-demo command launches a Gradio web UI backed by either the transformers or vLLM inference engine. It supports file uploads, optional timestamp visualization, and HTTPS.

Basic Usage

# Minimal launch — transformers backend on GPU 0
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000

Then open http://<your-ip>:8000 in a browser, or use port forwarding in VS Code.

Flag Reference

| Flag | Default | Description |
| --- | --- | --- |
| `--asr-checkpoint` | (required) | Qwen3-ASR model checkpoint path or Hugging Face repo ID. |
| `--aligner-checkpoint` | `None` | Qwen3-ForcedAligner checkpoint (enables timestamps when provided). |
| `--backend` | `transformers` | Inference backend: `transformers` or `vllm`. |
| `--cuda-visible-devices` | `0` | GPU index to expose to the demo process. |
| `--backend-kwargs` | `None` | JSON dict of backend-specific init arguments. |
| `--aligner-kwargs` | `None` | JSON dict of forced aligner init arguments. |
| `--ip` | `0.0.0.0` | Server bind address. |
| `--port` | `8000` | Server port. |
| `--ssl-certfile` | `None` | Path to SSL certificate file (enables HTTPS). |
| `--ssl-keyfile` | `None` | Path to SSL private key file (enables HTTPS). |
| `--no-ssl-verify` | - | Disable SSL certificate verification (useful for self-signed certs). |
| `--share` | `false` | Create a public Gradio share link (disabled by default). |
| `--concurrency` | `16` | Gradio queue concurrency limit. |
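
As one way to keep these flags in a single place, the sketch below wraps the launch command in a small shell function. The function name launch_demo and the environment variables ASR_CHECKPOINT, BACKEND, GPU, PORT, and DRY_RUN are hypothetical and not part of qwen-asr; only the qwen-asr-demo flags themselves come from the table above.

```shell
#!/bin/sh
# Hypothetical wrapper: launch_demo and its environment variables are not part of qwen-asr.
# Override any default through the environment, e.g. BACKEND=vllm PORT=9000.
launch_demo() {
  checkpoint="${ASR_CHECKPOINT:-Qwen/Qwen3-ASR-1.7B}"
  backend="${BACKEND:-transformers}"
  gpu="${GPU:-0}"
  port="${PORT:-8000}"

  # The demo accepts only these two backends.
  case "$backend" in
    transformers|vllm) ;;
    *) echo "unsupported backend: $backend" >&2; return 2 ;;
  esac

  # With DRY_RUN=1, print the command instead of running it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    runner=echo
  else
    runner=
  fi

  $runner qwen-asr-demo \
    --asr-checkpoint "$checkpoint" \
    --backend "$backend" \
    --cuda-visible-devices "$gpu" \
    --ip 0.0.0.0 --port "$port"
}
```

Running `DRY_RUN=1 BACKEND=vllm launch_demo` prints the assembled command without starting the server, which is a cheap way to check the flags before loading a model.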

Choosing a Backend

All backend-specific initialization parameters are passed via --backend-kwargs as a JSON string. If not provided, the demo uses sensible defaults.
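
Hand-written JSON inside shell quotes is easy to get wrong, so it can help to assemble the string from variables first. The sketch below does that with printf; the variable names gpu_mem, max_new_tokens, and backend_kwargs are illustrative, and the two keys are the ones used in the vLLM examples on this page.

```shell
#!/bin/sh
# Illustrative values; adjust them to your GPU and workload.
gpu_mem="0.65"
max_new_tokens="256"

# Build the JSON once so the quoting is handled in a single place.
backend_kwargs=$(printf '{"gpu_memory_utilization":%s,"max_new_tokens":%s}' \
  "$gpu_mem" "$max_new_tokens")

echo "$backend_kwargs"

# Then pass it as one double-quoted argument:
#   qwen-asr-demo ... --backend vllm --backend-kwargs "$backend_kwargs"
```

Printing the value before launching makes a stray quote or comma visible immediately, rather than as a JSON parse error at startup.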

Transformers Backend

The transformers backend is the simplest to set up and is recommended for development or single-GPU use.

# Basic transformers launch
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000

Override init arguments with --backend-kwargs:

# Enable FlashAttention 2 for reduced memory usage
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"device_map":"cuda:0","dtype":"bfloat16","attn_implementation":"flash_attention_2"}' \
  --ip 0.0.0.0 --port 8000

vLLM Backend

The vLLM backend provides higher throughput and is the recommended choice for serving multiple concurrent users.

The vLLM backend requires the vllm extra. Install it with pip install -U "qwen-asr[vllm]" before launching with --backend vllm.

# Basic vLLM launch
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend vllm \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000

Tune GPU memory usage via --backend-kwargs:

# Use 65% GPU memory with vLLM
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend vllm \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"gpu_memory_utilization":0.65}' \
  --ip 0.0.0.0 --port 8000

Enabling Timestamps

Word- and character-level timestamps are available when --aligner-checkpoint is provided. The Gradio UI will show a timestamp visualization panel automatically; without the flag it is hidden.

# Transformers backend with forced aligner
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"device_map":"cuda:0","dtype":"bfloat16","max_inference_batch_size":8,"max_new_tokens":256}' \
  --aligner-kwargs '{"device_map":"cuda:0","dtype":"bfloat16"}' \
  --ip 0.0.0.0 --port 8000

# vLLM backend with forced aligner
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B \
  --backend vllm \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"gpu_memory_utilization":0.7,"max_inference_batch_size":8,"max_new_tokens":2048}' \
  --aligner-kwargs '{"device_map":"cuda:0","dtype":"bfloat16"}' \
  --ip 0.0.0.0 --port 8000

For best aligner performance, install FlashAttention 2 first:

pip install -U flash-attn --no-build-isolation

HTTPS Setup

Browsers allow microphone access only in a secure context (HTTPS or localhost). If you open the demo from a remote machine over plain HTTP, the browser silently denies microphone permission and recording will not work. To record audio remotely, serve the demo over HTTPS.

Generate a self-signed certificate

Create a private key and a self-signed certificate valid for 365 days:

openssl req -x509 -newkey rsa:2048 \
  -keyout key.pem -out cert.pem \
  -days 365 -nodes \
  -subj "/CN=localhost"
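
If you will reach the demo by IP address rather than through localhost, a variant of the command above can record that address in the certificate's subjectAltName field. The sketch below is one way to do it, assuming OpenSSL 1.1.1 or newer (for -addext); the helper name make_self_signed_cert is hypothetical, and 192.0.2.10 is a documentation-only placeholder address. The browser will still warn about a self-signed certificate.

```shell
#!/bin/sh
# Hypothetical helper; make_self_signed_cert is not part of qwen-asr.
# Usage: make_self_signed_cert <server-ip> [output-dir]
make_self_signed_cert() {
  server_ip=$1
  out_dir=${2:-.}

  # -addext needs OpenSSL 1.1.1 or newer.
  openssl req -x509 -newkey rsa:2048 \
    -keyout "$out_dir/key.pem" -out "$out_dir/cert.pem" \
    -days 365 -nodes \
    -subj "/CN=$server_ip" \
    -addext "subjectAltName=IP:$server_ip,DNS:localhost" || return 1

  # Show the subject and expiry date of the certificate just written.
  openssl x509 -in "$out_dir/cert.pem" -noout -subject -enddate
}

# Example (replace the placeholder with your server's address):
#   make_self_signed_cert 192.0.2.10 .
```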

Launch the demo with SSL flags

Pass the certificate and key files to qwen-asr-demo:

qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify

Open the HTTPS URL

Navigate to https://<your-ip>:8000. Your browser will display a security warning for the self-signed certificate; click Advanced → Proceed to continue.

For production deployments, replace the self-signed certificate with one issued by a trusted CA.

Streaming Demo

The qwen-asr-demo-streaming command launches a minimal Flask-based demo that captures microphone audio in the browser, resamples it to 16,000 Hz, and continuously pushes PCM chunks to the model for real-time transcription.

qwen-asr-demo-streaming \
  --asr-model-path Qwen/Qwen3-ASR-1.7B \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000

Then open http://<your-ip>:8000.
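
To get a feel for how much data the browser pushes, the arithmetic below estimates the size of one PCM chunk. Only the 16,000 Hz sample rate comes from the description above. The 500 ms chunk length, the mono channel layout, and the two sample widths are assumptions chosen for illustration, and the helper name pcm_chunk_bytes is hypothetical.

```shell
#!/bin/sh
# Bytes in one mono PCM chunk: rate (Hz) * duration (ms) / 1000 * bytes per sample.
pcm_chunk_bytes() {
  rate_hz=$1
  chunk_ms=$2
  bytes_per_sample=$3
  echo $(( rate_hz * chunk_ms / 1000 * bytes_per_sample ))
}

# 16000 Hz is from the demo description; 500 ms and the sample widths are assumptions.
pcm_chunk_bytes 16000 500 2   # 16-bit integer samples
pcm_chunk_bytes 16000 500 4   # 32-bit float samples
```

Under these assumptions a half-second chunk is 16,000 or 32,000 bytes, so the upload rate stays in the tens of kilobytes per second.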

Streaming Demo Flags

| Flag | Default | Description |
| --- | --- | --- |
| `--asr-model-path` | `Qwen/Qwen3-ASR-1.7B` | Model name or local path. |
| `--gpu-memory-utilization` | `0.8` | vLLM GPU memory fraction (0.0–1.0). |
| `--host` | `0.0.0.0` | Bind host for the Flask server. |
| `--port` | `8000` | Bind port for the Flask server. |

The streaming demo uses the vLLM backend exclusively and requires pip install -U "qwen-asr[vllm]". Streaming inference does not support batch processing or timestamp output.

CUDA Device Selection

Because vLLM does not respect the cuda:N device-string style, qwen-asr-demo controls GPU selection by setting the CUDA_VISIBLE_DEVICES environment variable. Use --cuda-visible-devices to choose which physical GPU the process sees:

# Use GPU 0 (the first GPU)
--cuda-visible-devices 0

# Use GPU 1 (the second GPU)
--cuda-visible-devices 1

This applies to both the transformers and vllm backends in qwen-asr-demo.
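
The underlying mechanism is the ordinary CUDA_VISIBLE_DEVICES environment variable, which the sketch below demonstrates with a plain child shell instead of the demo. A process started with CUDA_VISIBLE_DEVICES=1 sees only that physical GPU, and CUDA renumbers it as device 0 inside the process, which is presumably why the examples on this page keep "device_map":"cuda:0" in their kwargs. The variable name seen_by_child is illustrative.

```shell
#!/bin/sh
# An assignment placed before a command applies to that command only.
seen_by_child=$(CUDA_VISIBLE_DEVICES=1 sh -c 'printf "%s" "$CUDA_VISIBLE_DEVICES"')

echo "child process sees CUDA_VISIBLE_DEVICES=$seen_by_child"
```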
