The qwen-asr package ships two ready-to-run web demos: a full-featured Gradio interface (qwen-asr-demo) and a minimal Flask streaming demo (qwen-asr-demo-streaming). Both let you transcribe audio from a browser without writing any application code.

Gradio Demo

The qwen-asr-demo command launches a Gradio web UI backed by either the transformers or vLLM inference engine. It supports file uploads, optional timestamp visualization, and HTTPS.

Basic Usage

# Minimal launch — transformers backend on GPU 0
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000
Then open http://<your-ip>:8000 in a browser, or use port forwarding in VS Code.

Flag Reference

| Flag | Default | Description |
| --- | --- | --- |
| --asr-checkpoint | (required) | Qwen3-ASR model checkpoint path or Hugging Face repo ID. |
| --aligner-checkpoint | None | Qwen3-ForcedAligner checkpoint (enables timestamps when provided). |
| --backend | transformers | Inference backend: transformers or vllm. |
| --cuda-visible-devices | 0 | GPU index to expose to the demo process. |
| --backend-kwargs | None | JSON dict of backend-specific init arguments. |
| --aligner-kwargs | None | JSON dict of forced aligner init arguments. |
| --ip | 0.0.0.0 | Server bind address. |
| --port | 8000 | Server port. |
| --ssl-certfile | None | Path to SSL certificate file (enables HTTPS). |
| --ssl-keyfile | None | Path to SSL private key file (enables HTTPS). |
| --no-ssl-verify | | Disable SSL certificate verification (useful for self-signed certs). |
| --share | false | Create a public Gradio share link (disabled by default). |
| --concurrency | 16 | Gradio queue concurrency limit. |

Choosing a Backend

All backend-specific initialization parameters are passed via --backend-kwargs as a JSON string. If not provided, the demo uses sensible defaults.
The transformers backend is the simplest to set up and is recommended for development or single-GPU use.
# Basic transformers launch
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000
Override init arguments with --backend-kwargs:
# Enable FlashAttention 2 for reduced memory usage
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"device_map":"cuda:0","dtype":"bfloat16","attn_implementation":"flash_attention_2"}' \
  --ip 0.0.0.0 --port 8000
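Because --backend-kwargs and --aligner-kwargs take a JSON string, a quoting slip on the command line is an easy mistake to make. The sketch below checks a kwargs string before launch. The helper name validate_kwargs is hypothetical and not part of qwen-asr; it only assumes python3 is on the PATH.

```shell
# Hypothetical helper (not part of qwen-asr): succeed only if $1 parses
# as a JSON object, the shape described for --backend-kwargs.
validate_kwargs() {
  python3 -c '
import json, sys
value = json.loads(sys.argv[1])
sys.exit(0 if isinstance(value, dict) else 1)
' "$1" 2>/dev/null
}

BACKEND_KWARGS='{"device_map":"cuda:0","dtype":"bfloat16","attn_implementation":"flash_attention_2"}'

if validate_kwargs "$BACKEND_KWARGS"; then
  echo "backend kwargs: valid JSON object"
else
  echo "backend kwargs: not a valid JSON object" >&2
fi
```

Keeping the JSON in a single-quoted shell variable, as here, avoids having to escape the inner double quotes; the same variable can then be passed as --backend-kwargs "$BACKEND_KWARGS".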

Enabling Timestamps

Word- and character-level timestamps are available when --aligner-checkpoint is provided. The Gradio UI will show a timestamp visualization panel automatically; without the flag it is hidden.
# Transformers backend with forced aligner
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"device_map":"cuda:0","dtype":"bfloat16","max_inference_batch_size":8,"max_new_tokens":256}' \
  --aligner-kwargs '{"device_map":"cuda:0","dtype":"bfloat16"}' \
  --ip 0.0.0.0 --port 8000

# vLLM backend with forced aligner
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B \
  --backend vllm \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"gpu_memory_utilization":0.7,"max_inference_batch_size":8,"max_new_tokens":2048}' \
  --aligner-kwargs '{"device_map":"cuda:0","dtype":"bfloat16"}' \
  --ip 0.0.0.0 --port 8000
For best aligner performance, install FlashAttention 2 first:
pip install -U flash-attn --no-build-isolation

HTTPS Setup

Browsers allow microphone access only in a secure context (HTTPS or localhost). If you open the demo from a remote machine over plain HTTP, the browser silently denies the permission and recording does not work. To record audio remotely, serve the demo over HTTPS.
1. Generate a self-signed certificate

Create a private key and a self-signed certificate valid for 365 days:
openssl req -x509 -newkey rsa:2048 \
  -keyout key.pem -out cert.pem \
  -days 365 -nodes \
  -subj "/CN=localhost"
2. Launch the demo with SSL flags

Pass the certificate and key files to qwen-asr-demo:
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify
3. Open the HTTPS URL

Navigate to https://<your-ip>:8000. Your browser will display a security warning for the self-signed certificate; click Advanced → Proceed to continue.
For production deployments, replace the self-signed certificate with one issued by a trusted CA.
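Before launching with --ssl-certfile and --ssl-keyfile, it can help to confirm that the two files actually belong together. The following is a minimal sketch, not something the demo does itself: cert_matches_key is a hypothetical helper, and the modulus comparison applies to RSA keys such as the one generated in step 1. To stay self-contained, it creates a throwaway pair in a scratch directory.

```shell
# Hypothetical helper: succeed if the RSA certificate ($1) and private key ($2)
# share the same modulus, i.e. they form a matching pair.
cert_matches_key() {
  cert_mod="$(openssl x509 -in "$1" -noout -modulus 2>/dev/null)" || return 1
  key_mod="$(openssl rsa -in "$2" -noout -modulus 2>/dev/null)" || return 1
  [ -n "$cert_mod" ] && [ "$cert_mod" = "$key_mod" ]
}

# Throwaway pair in a scratch directory, created with the same
# openssl req options as step 1.
scratch="$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 \
  -keyout "$scratch/key.pem" -out "$scratch/cert.pem" \
  -days 365 -nodes \
  -subj "/CN=localhost" 2>/dev/null

if cert_matches_key "$scratch/cert.pem" "$scratch/key.pem"; then
  echo "certificate and key match"
fi

# Show the subject and validity window of the certificate.
openssl x509 -in "$scratch/cert.pem" -noout -subject -dates
```

For real use, point the helper at your own cert.pem and key.pem instead of the scratch files.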

Streaming Demo

The qwen-asr-demo-streaming command launches a minimal Flask-based demo that captures microphone audio in the browser, resamples it to 16,000 Hz, and continuously pushes PCM chunks to the model for real-time transcription.
qwen-asr-demo-streaming \
  --asr-model-path Qwen/Qwen3-ASR-1.7B \
  --gpu-memory-utilization 0.9 \
  --host 0.0.0.0 \
  --port 8000
Then open http://<your-ip>:8000.
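To get a feel for the data rate implied by 16,000 Hz audio, the arithmetic below works out how many samples one chunk contains. Only the sample rate comes from the description above; the 500 ms chunk length, the 16-bit sample width, and the helper name samples_per_chunk are assumptions made for illustration, not documented properties of the demo.

```shell
SAMPLE_RATE=16000   # Hz, the rate the browser audio is resampled to

# Hypothetical helper: number of samples in one chunk of the given length (ms).
samples_per_chunk() {
  echo $(( SAMPLE_RATE * $1 / 1000 ))
}

# 500 ms is an assumed chunk length, chosen only for illustration.
samples="$(samples_per_chunk 500)"
echo "500 ms chunk: $samples samples"

# Assuming 16-bit (2-byte) mono PCM samples:
echo "500 ms chunk: $(( samples * 2 )) bytes"
```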

Streaming Demo Flags

| Flag | Default | Description |
| --- | --- | --- |
| --asr-model-path | Qwen/Qwen3-ASR-1.7B | Model name or local path. |
| --gpu-memory-utilization | 0.8 | vLLM GPU memory fraction (0.0–1.0). |
| --host | 0.0.0.0 | Bind host for the Flask server. |
| --port | 8000 | Bind port for the Flask server. |
The streaming demo uses the vLLM backend exclusively and requires pip install -U "qwen-asr[vllm]". Streaming inference does not support batch processing or timestamp output.
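Since the streaming demo depends on the vLLM extra, a quick preflight check can save a failed launch. This is a sketch under the assumption that python3 on the PATH is the same interpreter the demo is installed into; has_module is a hypothetical helper, not part of qwen-asr.

```shell
# Hypothetical helper: succeed if Python can locate the named top-level module.
has_module() {
  python3 -c '
import importlib.util, sys
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)
' "$1" 2>/dev/null
}

if has_module vllm; then
  echo "vllm found: the streaming demo backend is available"
else
  echo "vllm not found: run pip install -U \"qwen-asr[vllm]\"" >&2
fi
```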

CUDA Device Selection

Because vLLM does not respect the cuda:N device-string style, qwen-asr-demo selects the GPU by setting the CUDA_VISIBLE_DEVICES environment variable. Use --cuda-visible-devices to choose which physical GPU the process sees:
# Use GPU 0 (the first GPU)
--cuda-visible-devices 0

# Use GPU 1 (the second GPU)
--cuda-visible-devices 1
This applies to both the transformers and vllm backends in qwen-asr-demo.
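The mechanism is ordinary environment inheritance: a process started with CUDA_VISIBLE_DEVICES set sees only the listed GPUs, and CUDA renumbers them from 0 inside that process. As a consequence, a device string such as cuda:0 in --backend-kwargs or --aligner-kwargs refers to the first visible GPU, whichever physical index that is. The sketch below only demonstrates the inheritance part; show_visible_devices is a hypothetical helper.

```shell
# Hypothetical helper: print CUDA_VISIBLE_DEVICES as seen by a child process
# that was started with the variable set to $1.
show_visible_devices() {
  CUDA_VISIBLE_DEVICES="$1" sh -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
}

# With physical GPU 1 exposed, the child sees a single device,
# which CUDA addresses as cuda:0 inside that process.
show_visible_devices 1
```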
