Qwen3-ASR is an open-source automatic speech recognition (ASR) package from the Alibaba Qwen team. Built on the Qwen3-Omni foundation model, it delivers state-of-the-art transcription accuracy across 30 languages and 22 Chinese dialects, handles speech, singing voice, and songs with background music, and ships with a production-ready Python package (qwen-asr) that supports both HuggingFace Transformers and vLLM backends.
What Is Qwen3-ASR
Qwen3-ASR is a family of all-in-one ASR models that combine automatic language identification with high-accuracy speech recognition in a single forward pass. Released by Alibaba Qwen in January 2026, the qwen-asr PyPI package provides a unified Python API for loading any Qwen3-ASR checkpoint, running single or batch inference, streaming transcription, and generating word- or character-level timestamps via the companion forced-alignment model.
Qwen3-ASR-1.7B achieves state-of-the-art performance among open-source ASR models on public multilingual benchmarks and is competitive with the strongest proprietary commercial APIs, while remaining fully open-source and self-hostable.
Model Family
The Qwen3-ASR release includes three model checkpoints, each with a distinct purpose.
Qwen3-ASR-1.7B
The flagship model. Supports language identification and speech recognition for 30 languages and 22 Chinese dialects. Best accuracy; recommended for production workloads.
Qwen3-ASR-0.6B
A lightweight, efficiency-optimised model reaching 2,000× throughput at a concurrency of 128. Ideal for latency-sensitive or resource-constrained deployments.
Qwen3-ForcedAligner-0.6B
A non-autoregressive forced-alignment model that predicts word- or character-level timestamps for up to 3 minutes of speech across 11 languages.
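Because the aligner handles at most 3 minutes of speech per call, longer recordings have to be aligned in pieces. The sketch below shows one way to plan those pieces and map chunk-relative timestamps back onto the full recording. It is an illustration only: WordStamp, plan_chunks, and shift_stamps are hypothetical helpers, not part of the qwen-asr package, and the fixed-window split is an assumption made for simplicity.

```python
from dataclasses import dataclass

MAX_ALIGN_SECONDS = 180.0  # "up to 3 minutes" per the model description


@dataclass
class WordStamp:
    """Hypothetical container for one aligned word or character."""
    text: str
    start: float  # seconds
    end: float    # seconds


def plan_chunks(total_seconds: float,
                max_seconds: float = MAX_ALIGN_SECONDS) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into consecutive windows no longer than max_seconds."""
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


def shift_stamps(stamps: list[WordStamp], offset: float) -> list[WordStamp]:
    """Convert chunk-relative timestamps to positions in the full recording."""
    return [WordStamp(s.text, s.start + offset, s.end + offset) for s in stamps]


# A 400-second recording needs three alignment calls.
print(plan_chunks(400.0))
```

Fixed windows can cut through the middle of a word, so a real pipeline would more likely split at silences or sentence boundaries and divide the transcript to match.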
Key Features
52 Languages & Dialects
Covers 30 languages including Chinese, English, Japanese, Arabic, and Hindi, plus 22 Chinese dialects such as Cantonese, Sichuan, Wu, and Minnan. Language is identified automatically, so there is no need to specify it upfront.
All Audio Types
Transcribes standard speech, singing voice, and full songs with background music (BGM). Qwen3-ASR-1.7B is the only open-source model with competitive song transcription performance.
Offline & Streaming
Both offline batch inference and streaming inference are supported within a single model. Streaming is available via the vLLM backend and the qwen-asr-demo-streaming CLI command.
Word-Level Timestamps
Pair any ASR model with Qwen3-ForcedAligner-0.6B to get precise start and end times for every word or character. Timestamp accuracy surpasses existing E2E forced-alignment models.
Two Inference Backends
Choose between the HuggingFace Transformers backend for straightforward single-GPU usage, or the vLLM backend for maximum throughput, async serving, and OpenAI-compatible API endpoints.
Flexible Audio Inputs
Pass audio as a local file path, HTTPS URL, base64 string, or a (np.ndarray, sr) tuple. All inputs are automatically resampled to 16 kHz mono internally.
Architecture Overview
The Qwen3-ASR models are built on Qwen3-Omni, a multimodal foundation model. The audio encoder processes raw waveforms into acoustic features, which are fed into the language model decoder to produce a structured output containing the detected language and the transcribed text. This autoregressive design enables the model to use language-model reasoning to handle challenging acoustic conditions, accents, complex vocabulary, and mixed-language speech.
Qwen3-ForcedAligner-0.6B uses a separate non-autoregressive (NAR) architecture. Given a speech segment and a known transcript, the aligner predicts the exact boundaries of each token within the audio, delivering millisecond-accurate timestamps without requiring a second full-inference pass.
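As noted under Flexible Audio Inputs, every input is brought to 16 kHz mono before it reaches the audio encoder. The sketch below shows roughly what that normalization involves for the (np.ndarray, sr) case. It is not the package's actual implementation: to_mono_16k is a hypothetical helper, the (samples, channels) layout is an assumption, and plain linear interpolation stands in for whatever resampler the library really uses.

```python
import numpy as np

TARGET_SR = 16_000  # sample rate stated in the feature description


def to_mono_16k(wave: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz (illustrative only)."""
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim == 2:
        # Assumption: shape is (samples, channels); average channels to downmix.
        wave = wave.mean(axis=1)
    if sr == TARGET_SR or wave.size == 0:
        return wave
    duration = wave.shape[0] / sr
    n_out = int(round(duration * TARGET_SR))
    # Positions of the output samples, expressed in input-sample units.
    src_pos = np.arange(n_out) * (sr / TARGET_SR)
    resampled = np.interp(src_pos, np.arange(wave.shape[0]), wave)
    return resampled.astype(np.float32)


# One second of 48 kHz stereo noise becomes one second of 16 kHz mono.
stereo = np.random.default_rng(0).standard_normal((48_000, 2))
mono = to_mono_16k(stereo, 48_000)
print(mono.shape)  # (16000,)
```

Linear interpolation applies no low-pass filter, so downsampling this way can alias high frequencies; production resamplers filter first. The sketch is only meant to make the shape and rate conversion concrete.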
Next Steps
Quickstart
Transcribe your first audio file in under 5 minutes with a minimal code example.
Installation
Install qwen-asr via pip, set up a conda environment, or build from source.
Transformers Backend
Learn how to use the HuggingFace Transformers backend for single-GPU inference.
vLLM Backend
Unlock maximum throughput and streaming with the vLLM backend.
Forced Aligner
Generate word- and character-level timestamps with Qwen3-ForcedAligner-0.6B.
Model Reference
Full reference for all released checkpoints, supported languages, and inference modes.