Qwen3-ASR: Multilingual Open-Source Speech Recognition

Qwen3-ASR is an open-source automatic speech recognition (ASR) model family from the Alibaba Qwen team. Built on the Qwen3-Omni foundation model, it delivers state-of-the-art transcription accuracy across 30 languages and 22 Chinese dialects and handles speech, singing voice, and songs with background music. It ships with a production-ready Python package (qwen-asr) that supports both Hugging Face Transformers and vLLM backends.

What Is Qwen3-ASR

Qwen3-ASR is a family of all-in-one ASR models that combine automatic language identification with high-accuracy speech recognition in a single forward pass. Released by Alibaba Qwen in January 2026, the qwen-asr PyPI package provides a unified Python API for loading any Qwen3-ASR checkpoint, running single or batch inference, streaming transcription, and generating word- or character-level timestamps via the companion forced-alignment model.

Qwen3-ASR-1.7B achieves state-of-the-art performance among open-source ASR models on public multilingual benchmarks and is competitive with the strongest commercial APIs, while remaining fully open source and self-hostable.

Model Family

The Qwen3-ASR release includes three model checkpoints, each with a distinct purpose.

Qwen3-ASR-1.7B

The flagship model. Supports language identification and speech recognition for 30 languages and 22 Chinese dialects. Best accuracy; recommended for production workloads.

Qwen3-ASR-0.6B

A lightweight, efficiency-optimised model reaching 2,000× throughput at a concurrency of 128. Ideal for latency-sensitive or resource-constrained deployments.
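To put the throughput figure in perspective, here is a small arithmetic sketch. It assumes that "2,000× throughput" means 2,000 seconds of audio transcribed per second of wall-clock time, summed across all 128 concurrent requests. That reading is an assumption, not something this page states, and the helper names are purely illustrative.

```python
def wall_clock_seconds(audio_seconds: float, throughput_x: float = 2000.0) -> float:
    """Wall-clock time needed to transcribe `audio_seconds` of audio,
    assuming an aggregate speed-up of `throughput_x` over real time."""
    if throughput_x <= 0:
        raise ValueError("throughput_x must be positive")
    return audio_seconds / throughput_x


def per_stream_speedup(throughput_x: float = 2000.0, concurrency: int = 128) -> float:
    """Average speed-up seen by one request if the aggregate
    throughput is shared evenly across `concurrency` streams."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return throughput_x / concurrency


# Under the stated assumption: 100 hours of audio, then the per-request share.
print(wall_clock_seconds(100 * 3600))  # 180.0 seconds
print(per_stream_speedup())            # 15.625 times real time per stream
```

Real deployments will deviate from these numbers depending on hardware, audio length distribution, and batching behaviour.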

Qwen3-ForcedAligner-0.6B

A non-autoregressive forced-alignment model that predicts word- or character-level timestamps for up to 3 minutes of speech across 11 languages.

Key Features

52 Languages & Dialects

Covers 30 languages including Chinese, English, Japanese, Arabic, and Hindi, plus 22 Chinese dialects such as Cantonese, Sichuan, Wu, and Minnan. Language is identified automatically, so there is no need to specify it up front.

All Audio Types

Transcribes standard speech, singing voice, and full songs with background music (BGM). Qwen3-ASR-1.7B is the only open-source model with competitive song transcription performance.

Offline & Streaming

Both offline batch inference and streaming inference are supported within a single model. Streaming is available via the vLLM backend and the qwen-asr-demo-streaming CLI command.
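Streaming inference consumes audio incrementally rather than as one complete file. The sketch below shows one way a client might slice 16 kHz mono samples into fixed-duration chunks before feeding them to a streaming session. The function name `iter_chunks` and the 2-second chunk size are hypothetical choices for illustration; this is not the qwen-asr streaming API.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000  # Hz; this page states inputs are resampled to 16 kHz mono


def iter_chunks(
    samples: Sequence[float],
    chunk_seconds: float = 2.0,       # hypothetical chunk size
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of `samples`, each at most `chunk_seconds` long.
    The final chunk may be shorter than the others."""
    chunk_len = int(chunk_seconds * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must be at least 1 sample")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# 5 seconds of silence split into 2-second chunks.
audio = [0.0] * (5 * SAMPLE_RATE)
durations = [len(chunk) / SAMPLE_RATE for chunk in iter_chunks(audio)]
print(durations)  # [2.0, 2.0, 1.0]
```

Smaller chunks lower the delay before the first partial transcript but give the model less context per step, so the chunk size is a latency and accuracy trade-off.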

Word-Level Timestamps

Pair either ASR model with Qwen3-ForcedAligner-0.6B to get precise start and end times for every word or character. Timestamp accuracy surpasses existing end-to-end (E2E) forced-alignment models.
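A common use of word-level timestamps is subtitle generation. The sketch below groups timestamped words into SubRip (SRT) cues. The `WordStamp` record and both helper functions are hypothetical; the actual structure returned by the aligner is not described on this page, so treat the field names as assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordStamp:
    """Hypothetical record for one aligned unit (word or character)."""
    text: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio


def format_srt_time(seconds: float) -> str:
    """Render seconds as the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(words: List[WordStamp], max_words: int = 7) -> str:
    """Group words into cues of at most `max_words` and emit SRT text.
    Each cue spans from its first word's start to its last word's end."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        index = i // max_words + 1
        span = f"{format_srt_time(group[0].start)} --> {format_srt_time(group[-1].end)}"
        line = " ".join(w.text for w in group)
        cues.append(f"{index}\n{span}\n{line}\n")
    return "\n".join(cues)


words = [WordStamp("hello", 0.0, 0.42), WordStamp("world", 0.5, 0.9)]
print(to_srt(words))
```

Joining with spaces suits space-delimited languages; for character-level output in Chinese or Japanese, joining with an empty string would be the more natural choice.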

Two Inference Backends

Choose between the Hugging Face Transformers backend for straightforward single-GPU usage and the vLLM backend for maximum throughput, async serving, and OpenAI-compatible API endpoints.

Flexible Audio Inputs

Pass audio as a local file path, HTTPS URL, base64 string, or a (np.ndarray, sr) tuple. All inputs are automatically resampled to 16 kHz mono internally.
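To illustrate what "resampled to 16 kHz mono" involves, here is a minimal pure-Python sketch that averages channels and then resamples by linear interpolation. It is not the package's internal implementation, and the function names are hypothetical. Production resamplers also apply a low-pass filter before downsampling to avoid aliasing, which this sketch omits.

```python
from typing import List, Sequence

TARGET_SR = 16_000  # Hz, the internal rate stated on this page


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Downmix by averaging the channel values of every frame."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(
    samples: Sequence[float], src_sr: int, dst_sr: int = TARGET_SR
) -> List[float]:
    """Resample `samples` from `src_sr` to `dst_sr` by linear interpolation."""
    if src_sr <= 0 or dst_sr <= 0:
        raise ValueError("sample rates must be positive")
    if src_sr == dst_sr or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_sr / src_sr))
    step = src_sr / dst_sr  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One second of 48 kHz stereo: left channel 1.0, right channel 0.0.
stereo = [(1.0, 0.0)] * 48_000
mono_16k = resample_linear(to_mono(stereo), src_sr=48_000)
print(len(mono_16k), mono_16k[0])  # 16000 0.5
```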

Architecture Overview

The Qwen3-ASR models are built on Qwen3-Omni, a multimodal foundation model. The audio encoder processes raw waveforms into acoustic features, which are fed into the language model decoder to produce a structured output containing the detected language and the transcribed text. This autoregressive design enables the model to use language-model reasoning to handle challenging acoustic conditions, accents, complex vocabulary, and mixed-language speech.

Qwen3-ForcedAligner-0.6B uses a separate non-autoregressive (NAR) architecture. Given a speech segment and a known transcript, the aligner predicts the exact boundaries of each token within the audio, delivering millisecond-accurate timestamps without requiring a second full-inference pass.
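Because the decoder emits the detected language and the transcript together in one structured string, a thin parsing step is enough to separate them. The sketch below assumes a raw format of the shape `language <name><asr_text><transcript>`. That format, the tag name, and the `parse_output` helper are illustrative assumptions, not a documented contract; the qwen-asr package performs its own parsing.

```python
from typing import NamedTuple, Optional


class Transcript(NamedTuple):
    language: Optional[str]  # None when no language prefix is present
    text: str


# Assumed markers for the illustrative raw format.
LANG_PREFIX = "language "
TEXT_TAG = "<asr_text>"


def parse_output(raw: str) -> Transcript:
    """Split a raw decoder string into (language, text).
    Falls back to treating the whole string as text if the tag is missing."""
    head, sep, tail = raw.partition(TEXT_TAG)
    if not sep:
        return Transcript(None, raw.strip())
    head = head.strip()
    language = head[len(LANG_PREFIX):].strip() if head.startswith(LANG_PREFIX) else None
    return Transcript(language or None, tail.strip())


print(parse_output("language English<asr_text>Hello world."))
```

Generating the language tag before the text means language identification and recognition share a single decoding pass, which matches the single-forward-pass design described earlier on this page.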

The qwen-asr package abstracts over both backends and both model families. You use the same model.transcribe(...) call regardless of which backend or checkpoint you choose.
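The value of a shared `transcribe(...)` entry point is that calling code does not need to know which backend it is talking to. The sketch below illustrates that idea with a structural interface and two stand-in classes. Everything here except the method name `transcribe` is hypothetical: the stand-ins do no real inference, and the real method's parameters and return type are not specified on this page.

```python
from typing import List, Protocol


class Transcriber(Protocol):
    """Anything exposing transcribe(); the signature here is an assumption."""

    def transcribe(self, audio: str) -> List[str]: ...


class FakeTransformersBackend:
    """Stand-in for a model loaded through the Transformers backend."""

    def transcribe(self, audio: str) -> List[str]:
        return [f"[transformers] {audio}"]


class FakeVLLMBackend:
    """Stand-in for a model loaded through the vLLM backend."""

    def transcribe(self, audio: str) -> List[str]:
        return [f"[vllm] {audio}"]


def run(model: Transcriber, audio: str) -> str:
    """Application code depends only on the shared method, not the backend."""
    return model.transcribe(audio)[0]


for backend in (FakeTransformersBackend(), FakeVLLMBackend()):
    print(run(backend, "meeting.wav"))
```

With this shape, switching backends is a change at model construction time only, and the rest of the pipeline stays untouched.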

Next Steps

Quickstart

Transcribe your first audio file in under 5 minutes with a minimal code example.

Installation

Install qwen-asr via pip, set up a conda environment, or build from source.

Transformers Backend

Learn how to use the Hugging Face Transformers backend for single-GPU inference.

vLLM Backend

Unlock maximum throughput and streaming with the vLLM backend.

Forced Aligner

Generate word- and character-level timestamps with Qwen3-ForcedAligner-0.6B.

Model Reference

Full reference for all released checkpoints, supported languages, and inference modes.
