The Qwen3-ASR family consists of three released models built on the Qwen3-Omni foundation. Two autoregressive ASR models handle speech recognition, singing voice, and full songs across 52 languages and dialects, while a third non-autoregressive model provides precise timestamp alignment. This page describes each model’s characteristics, how to download them, and how to load them from a local directory.

Model Comparison

Qwen3-ASR-1.7B

1.7 billion parameters. State-of-the-art open-source ASR. Recommended when transcription quality is the priority. Supports offline and streaming inference.

Qwen3-ASR-0.6B

0.6 billion parameters. Balances accuracy against efficiency. Reaches 2000× real-time throughput at a concurrency of 128. Supports offline and streaming inference.

Qwen3-ForcedAligner-0.6B

0.6 billion parameters. Non-autoregressive forced aligner. Returns word- or character-level timestamps for up to 5 minutes of speech in 11 languages.

Model | Params | Inference Mode | Audio Types | HuggingFace
Qwen3-ASR-1.7B | 1.7B | Offline + Streaming | Speech, Singing Voice, Songs with BGM | Qwen/Qwen3-ASR-1.7B
Qwen3-ASR-0.6B | 0.6B | Offline + Streaming | Speech, Singing Voice, Songs with BGM | Qwen/Qwen3-ASR-0.6B
Qwen3-ForcedAligner-0.6B | 0.6B | NAR (Non-autoregressive) | Speech | Qwen/Qwen3-ForcedAligner-0.6B

Qwen3-ASR-1.7B

Qwen3-ASR-1.7B is the flagship model in the series. It achieves state-of-the-art performance among open-source ASR models and is competitive with the strongest proprietary commercial APIs across English, Chinese, and multilingual benchmarks. Key highlights:
  • English: 1.63 / 3.38 WER on LibriSpeech clean/other — best published result among open-source models on the other split.
  • Chinese: 4.97 / 5.88 WER on WenetSpeech net/meeting — top performance across Mandarin test sets.
  • Multilingual: 8.55 average WER across 8 MLS languages; 4.90 across 12 Fleurs languages.
  • Singing: 5.98 WER on M4Singer — best among all compared models, including proprietary APIs.
  • Language ID: 97.9% average accuracy across MLS, CommonVoice, MLC-SLM, and Fleurs.
  • Streaming: 1.95 / 4.51 WER on LibriSpeech in streaming mode — same single model, no re-training needed.
Use Qwen3-ASR-1.7B when transcription accuracy is your primary concern, especially for challenging conditions such as accented speech, dialects, noisy audio, tongue twisters, or singing.
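
For readers unfamiliar with the metric, the WER figures above are word error rates: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words and expressed as a percentage. The sketch below is a minimal illustration of that definition, not the evaluation script behind the reported numbers. The function name `word_error_rate` is hypothetical, and the text normalization used for the benchmarks is not stated on this page, so this version simply splits on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Under this definition, a score of 1.63 means roughly 1.6 word errors per 100 reference words. Character-based scoring, which is common for Chinese test sets, would replace `split()` with a per-character split.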

Qwen3-ASR-0.6B

Qwen3-ASR-0.6B targets high-throughput deployments. It shares the 1.7B model's architecture and feature set but runs at a fraction of the compute cost. Key highlights:
  • Throughput: Reaches 2000× real-time throughput at a concurrency of 128 requests, making it suitable for large-scale batch processing pipelines.
  • Quality: Still competitive with Whisper-large-v3 on most benchmarks.
  • Streaming: Fully supports unified offline and streaming inference with a single model.
  • Audio types: Handles speech, singing voice, and songs with background music — identical to the 1.7B model.
Use Qwen3-ASR-0.6B for production services with high concurrency requirements, or when GPU memory is constrained and throughput is more important than marginal accuracy gains.
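
To put the throughput figure in perspective, the arithmetic below converts a real-time speedup into wall-clock time. It is a back-of-the-envelope sketch: it assumes the 2000× figure is the aggregate rate across all 128 concurrent requests, which this page does not spell out, and both function names are hypothetical.

```python
def wall_clock_seconds(audio_hours: float, speedup: float = 2000.0) -> float:
    """Seconds needed to process `audio_hours` of audio at `speedup` x real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours * 3600.0 / speedup


def per_request_speedup(aggregate_speedup: float = 2000.0,
                        concurrency: int = 128) -> float:
    """Average speedup seen by one request if the aggregate is shared evenly.

    Assumption: the aggregate figure is split uniformly across requests.
    """
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return aggregate_speedup / concurrency


# 1000 hours of audio at 2000x real time takes 1800 s (30 minutes).
batch_seconds = wall_clock_seconds(1000)
# Shared evenly, each of the 128 requests runs at about 15.6x real time.
single_stream = per_request_speedup()
```

Actual numbers depend on hardware, audio length, and serving stack, so treat these values as an order-of-magnitude guide only.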

Qwen3-ForcedAligner-0.6B

Qwen3-ForcedAligner-0.6B is a novel non-autoregressive (NAR) model that aligns a known text transcript to its corresponding audio and returns word- or character-level timestamps. It is not a general-purpose transcription model — it requires both audio and text as input. Key highlights:
  • Languages supported: Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish (11 languages).
  • Audio length: Handles up to 5 minutes of speech per call.
  • Accuracy: Average accumulated average shift (AAS, lower is better) of 42.9 ms on the MFA-Labeled Raw benchmarks, well ahead of Monotonic-Aligner (161.1 ms), NFA (129.8 ms), and WhisperX (133.2 ms).
  • Long audio: On MFA-Labeled Concat-300s (5-minute clips), achieves 52.9 ms average AAS versus 2708.4 ms for WhisperX and 246.7 ms for NFA.
The ForcedAligner requires that you already have the transcript text. It aligns the text to the audio rather than generating the transcript from scratch. To get both a transcript and timestamps in one call, pass forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B" when loading Qwen3ASRModel.
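
Because the aligner is limited to 11 languages and 5 minutes of speech per call, it can help to check a clip before sending it for alignment. The following is a minimal pre-flight sketch using only the Python standard library. The helper names are hypothetical, the language strings are taken from the list above and may not match the exact spellings the library expects, and `wave` reads only uncompressed PCM WAV files.

```python
import wave

# Languages listed for Qwen3-ForcedAligner-0.6B on this page.
# Assumption: these spellings are illustrative, not the library's identifiers.
ALIGNER_LANGUAGES = frozenset({
    "Chinese", "English", "Cantonese", "French", "German", "Italian",
    "Japanese", "Korean", "Portuguese", "Russian", "Spanish",
})
MAX_ALIGNER_SECONDS = 5 * 60  # "up to 5 minutes of speech per call"


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def can_align(language: str, duration_seconds: float) -> bool:
    """True if the clip's language and length fall within the stated limits."""
    return (
        language in ALIGNER_LANGUAGES
        and 0 < duration_seconds <= MAX_ALIGNER_SECONDS
    )
```

A clip that fails the check would need to be split, together with its transcript, into shorter segments before alignment.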

Downloading Models

Model weights are downloaded automatically when you use a HuggingFace repository ID in the qwen-asr package or vLLM. If your runtime environment does not allow downloads during execution, use the commands below to manually save the weights to a local directory.
pip install -U "huggingface_hub[cli]"

huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./Qwen3-ASR-1.7B
huggingface-cli download Qwen/Qwen3-ASR-0.6B --local-dir ./Qwen3-ASR-0.6B
huggingface-cli download Qwen/Qwen3-ForcedAligner-0.6B --local-dir ./Qwen3-ForcedAligner-0.6B
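
The same downloads can be scripted from Python with `snapshot_download` from `huggingface_hub`. This is a sketch of one way to mirror the commands above rather than an officially documented workflow for this project; the helper names `local_dir_for` and `download_all` are hypothetical.

```python
from pathlib import Path

REPO_IDS = (
    "Qwen/Qwen3-ASR-1.7B",
    "Qwen/Qwen3-ASR-0.6B",
    "Qwen/Qwen3-ForcedAligner-0.6B",
)


def local_dir_for(repo_id: str, root: str = ".") -> Path:
    """Map 'Qwen/Qwen3-ASR-1.7B' to '<root>/Qwen3-ASR-1.7B'."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root: str = ".") -> list[Path]:
    """Download every repository in REPO_IDS into its own folder under `root`."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    saved = []
    for repo_id in REPO_IDS:
        target = local_dir_for(repo_id, root)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
        saved.append(target)
    return saved
```

Calling `download_all()` on a machine with network access produces the same directory layout as the three CLI commands.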

Loading from a Local Directory

Once the weights are on disk, pass the local path instead of the HuggingFace repository ID. The API is identical.
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "./Qwen3-ASR-1.7B",       # local directory
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/audio.wav",
    language=None,            # None = automatic language detection
)

print(results[0].language)
print(results[0].text)
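
Since the local path and the repository ID are interchangeable, a small helper can prefer the on-disk copy and fall back to the repository ID when the weights have not been downloaded yet. The sketch below is illustrative: `resolve_model_source` is a hypothetical name, and treating the presence of `config.json` as the sign of a usable download is an assumption based on the usual HuggingFace repository layout.

```python
from pathlib import Path


def resolve_model_source(local_dir: str, repo_id: str) -> str:
    """Return the local directory if it looks like a downloaded model, else the repo ID."""
    path = Path(local_dir)
    # Assumption: a downloaded model directory contains a config.json file.
    if path.is_dir() and (path / "config.json").is_file():
        return str(path.resolve())
    return repo_id
```

The returned string can be passed as the first argument to `Qwen3ASRModel.from_pretrained` in place of the hard-coded path shown above.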
