Qwen3-ASR is a family of open-source automatic speech recognition models developed by the Qwen team at Alibaba Cloud. The series includes Qwen3-ASR-1.7B and Qwen3-ASR-0.6B for multilingual speech, singing voice, and song transcription across 30 languages and 22 Chinese dialects, plus Qwen3-ForcedAligner-0.6B for word- and character-level timestamp prediction in 11 languages. Both ASR models support offline and streaming inference from a single checkpoint.
Quickstart
Transcribe your first audio file in under five minutes using the qwen-asr Python package.
Installation
Install the package with pip, set up a conda environment, and optionally enable FlashAttention 2.
Transformers Backend
Run offline batch inference with the Hugging Face Transformers backend and get timestamps.
vLLM Backend
Maximize throughput with the high-performance vLLM backend for production workloads.
Streaming Inference
Transcribe live audio in real time using the vLLM-powered streaming API.
Forced Aligner
Align existing transcripts to audio and obtain precise per-word timestamps.
Model Reference
Compare the capabilities and downloads of Qwen3-ASR-1.7B, Qwen3-ASR-0.6B, and Qwen3-ForcedAligner-0.6B.
Fine-Tuning
Fine-tune Qwen3-ASR on your own audio data with single-GPU or multi-GPU training.
Key Features
52 Languages
30 spoken languages plus 22 Chinese dialects including Cantonese, Sichuan, and Wu.
Language Detection
Automatic language identification alongside transcription — no manual language tag required.
Offline & Streaming
A single model checkpoint handles both offline batch and real-time streaming transcription.
Music & Song Support
Transcribes singing voice and full songs with background music, not just clean speech.
OpenAI-Compatible API
vLLM serving exposes an OpenAI-compatible endpoint for easy integration with existing tooling.
Word Timestamps
Qwen3-ForcedAligner produces character- or word-level timestamps for 11 languages.
Quick Example
Getting Started
Transcribe audio
Pass a local file path, URL, base64 string, or a (numpy_array, sample_rate) tuple to transcribe().
Enable timestamps (optional)
Add the Qwen3-ForcedAligner to get word-level timestamps alongside transcription.
See the Forced Aligner guide for details.
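The "Transcribe audio" step lists several input forms. As a rough illustration of two of them, the sketch below uses only the Python standard library to build a short test tone, encode it as a base64 string, and decode it into a samples-plus-sample-rate pair. The helper names (`make_sine_wav`, `wav_to_base64`, `wav_to_samples`) are hypothetical and not part of qwen-asr, the 16 kHz rate is an arbitrary choice, and whether transcribe() expects a bare base64 string or some other wrapping is an assumption to check against the package documentation.

```python
import base64
import io
import math
import struct
import wave


def make_sine_wav(freq_hz=440.0, seconds=0.5, sample_rate=16000):
    """Return the bytes of a mono 16-bit PCM WAV file holding a sine tone."""
    n_samples = int(seconds * sample_rate)
    frames = b"".join(
        struct.pack(
            "<h",
            int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)),
        )
        for i in range(n_samples)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    return buf.getvalue()


def wav_to_base64(wav_bytes):
    """Encode raw WAV bytes as an ASCII base64 string."""
    return base64.b64encode(wav_bytes).decode("ascii")


def wav_to_samples(wav_bytes):
    """Decode mono 16-bit WAV bytes into (floats in [-1, 1), sample_rate)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [x / 32768.0 for x in ints], sample_rate


wav_bytes = make_sine_wav()
audio_b64 = wav_to_base64(wav_bytes)              # base64 string form
samples, sample_rate = wav_to_samples(wav_bytes)  # samples + rate form
print(len(audio_b64), len(samples), sample_rate)
```

With NumPy installed, `numpy.asarray(samples, dtype="float32")` would turn the sample list into the array half of the (numpy_array, sample_rate) tuple. Either value would then be handed to transcribe() as described above.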