This page presents the full evaluation results for Qwen3-ASR-1.7B, Qwen3-ASR-0.6B, and Qwen3-ForcedAligner-0.6B across public and internal benchmarks. Results span English ASR, Chinese Mandarin and dialect ASR, multilingual ASR, language identification, singing voice and song transcription, and forced alignment accuracy.
Evaluation Methodology
All models were evaluated on the vLLM backend with dtype=torch.bfloat16, max_new_tokens=1024, and greedy decoding. No language parameter was specified, so the model detected the language automatically. Whisper-large-v3 and other open-source baselines were evaluated under equivalent conditions where possible. The settings are summarized below:
| Setting | Value |
|---|---|
| Precision | torch.bfloat16 |
| Backend | vLLM |
| Decoding | Greedy search |
| Language parameter | None (automatic detection) |
| max_new_tokens | 1024 |
English ASR Benchmarks (WER ↓)
Lower WER is better.
| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|---|---|
| LibriSpeech clean / other | 1.39 / 3.75 | 2.89 / 3.56 | 2.78 / 5.70 | 1.51 / 3.97 | 1.68 / 4.03 | 2.11 / 4.55 | 1.63 / 3.38 |
| GigaSpeech | 25.50 | 9.37 | 9.55 | 9.76 | — | 8.88 | 8.45 |
| CV-en | 9.08 | 14.49 | 13.78 | 9.90 | 9.90 | 9.92 | 7.39 |
| Fleurs-en | 2.40 | 2.94 | 6.31 | 4.08 | 5.49 | 4.39 | 3.35 |
| MLS-en | 5.12 | 3.68 | 7.09 | 4.87 | — | 6.00 | 4.58 |
| Tedlium | 7.69 | 6.15 | 4.91 | 6.84 | — | 3.85 | 4.50 |
| VoxPopuli | 10.29 | 11.36 | 12.12 | 12.05 | — | 9.96 | 9.15 |
Among all systems compared, Qwen3-ASR-1.7B has the lowest WER on LibriSpeech other (3.38), GigaSpeech (8.45), CV-en (7.39), and VoxPopuli (9.15). Qwen3-ASR-0.6B has the lowest WER on Tedlium (3.85).
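For readers who want to reproduce a WER figure on their own transcripts, the metric can be sketched as word-level edit distance divided by the number of reference words. This is a minimal illustration, not the scoring script used for the tables: it assumes whitespace tokenization and applies no text normalization, and the function name `word_error_rate` is hypothetical. For Chinese test sets the unit is commonly the character rather than the whitespace-delimited word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Because insertions count as errors, WER can exceed 100% when a system produces much more text than the reference contains.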
Chinese ASR Benchmarks (WER ↓)
| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|---|---|
| WenetSpeech net / meeting | 15.30 / 32.27 | 14.43 / 13.47 | N/A | 9.86 / 19.11 | 6.35 / — | 5.97 / 6.88 | 4.97 / 5.88 |
| AISHELL-2-test | 4.24 | 11.62 | 2.85 | 5.06 | — | 3.15 | 2.71 |
| SpeechIO | 12.86 | 5.30 | 2.93 | 7.56 | — | 3.44 | 2.88 |
| Fleurs-zh | 2.44 | 2.71 | 2.69 | 4.09 | 3.51 | 2.88 | 2.41 |
| CV-zh | 6.32 | 7.70 | 5.95 | 12.91 | 6.20 | 6.89 | 5.35 |
Chinese Dialect Benchmarks (WER ↓)
| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|---|---|
| KeSpeech | 26.87 | 24.71 | 5.27 | 28.79 | — | 7.08 | 5.10 |
| Fleurs-yue | 4.98 | 9.43 | 4.98 | 9.18 | — | 5.79 | 3.98 |
| CV-yue | 11.36 | 18.76 | 13.20 | 16.23 | — | 9.50 | 7.57 |
| CV-zh-tw (Traditional) | 6.32 | 7.31 | 4.06 | 7.84 | — | 5.59 | 3.77 |
| WenetSpeech-Yue short / long | 15.62 / 25.29 | 25.19 / 11.23 | 9.74 / 11.40 | 32.26 / 46.64 | — / — | 7.54 / 9.92 | 5.82 / 8.85 |
| WenetSpeech-Chuan easy / hard | 34.81 / 53.98 | 43.79 / 67.30 | 11.40 / 20.20 | 14.35 / 26.80 | — / — | 13.92 / 24.45 | 11.99 / 21.63 |
Internal Chinese Benchmarks (WER ↓)
| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|---|---|
| Elders & Kids | 14.27 | 36.93 | 4.17 | 10.61 | 4.54 | 4.48 | 3.81 |
| ExtremeNoise | 36.11 | 29.06 | 17.04 | 63.17 | 36.55 | 17.88 | 16.17 |
| TongueTwister | 20.87 | 4.97 | 3.47 | 16.63 | 9.02 | 4.06 | 2.44 |
| Dialog-Mandarin | 20.73 | 12.50 | 6.61 | 14.01 | 7.32 | 7.06 | 6.54 |
| Dialog-Cantonese | 16.05 | 14.98 | 7.56 | 31.04 | 5.85 | 4.80 | 4.12 |
| Dialog-Chinese Dialects (avg. 22 dialects) | 45.37 | 47.70 | 19.85 | 44.55 | 19.41 | 18.24 | 15.94 |
Accented English (WER ↓)
| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|---|---|
| Dialog-Accented English (avg. 16 accents) | 28.56 | 23.85 | 20.41 | 21.30 | 19.96 | 16.62 | 16.07 |
Multilingual ASR Benchmarks (WER ↓)
The following results compare Qwen3-ASR against GLM-ASR-Nano-2512, Whisper-large-v3, and Fun-ASR-MLT-Nano across open-source and internal multilingual benchmarks.
| Dataset | Languages | GLM-ASR-Nano-2512 | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|---|
| MLS | 8 (de, en, es, fr, it, nl, pl, pt) | 13.32 | 8.62 | 28.70 | 13.19 | 8.55 |
| CommonVoice | 13 (en, zh, yue, zh-TW, ar, de, es, fr, it, ja, ko, pt, ru) | 19.40 | 10.77 | 17.25 | 12.75 | 9.18 |
| MLC-SLM | 11 (en, fr, de, it, pt, es, ja, ko, ru, th, vi) | 34.93 | 15.68 | 29.94 | 15.84 | 12.74 |
| Fleurs (12 languages) | en, zh, yue, ar, de, es, fr, it, ja, ko, pt, ru | 16.08 | 5.27 | 10.03 | 7.57 | 4.90 |
| Fleurs† (8 additional languages) | hi, id, ms, nl, pl, th, tr, vi | 20.05 | 6.85 | 31.89 | 10.37 | 6.62 |
| Fleurs†† (10 further languages) | cs, da, el, fa, fi, fil, hu, mk, ro, sv | 24.83 | 8.16 | 47.84 | 21.80 | 12.60 |
| News-Multilingual (internal) | 15 languages | 49.40 | 14.80 | 65.07 | 17.39 | 12.80 |
Language Identification Accuracy (% ↑)
Higher accuracy is better.
| Benchmark | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
|---|---|---|---|
| MLS | 99.9 | 99.3 | 99.9 |
| CommonVoice | 92.7 | 98.2 | 98.7 |
| MLC-SLM | 89.2 | 92.7 | 94.1 |
| Fleurs (30 languages) | 94.6 | 97.1 | 98.7 |
| Average | 94.1 | 96.8 | 97.9 |
Qwen3-ASR-1.7B achieves 97.9% average language identification accuracy. Qwen3-ASR-0.6B achieves 96.8% and outperforms Whisper-large-v3 on CommonVoice, MLC-SLM, and Fleurs; on MLS, Whisper-large-v3 is higher (99.9% vs. 99.3%).
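The Average row is consistent with an unweighted mean over the four benchmarks, rounded to one decimal place. The sketch below checks this from the table values; the weighting scheme is an inference from the numbers rather than something stated on this page, and the helper name `macro_average` is hypothetical. Decimal arithmetic is used so that a value such as 97.85 rounds predictably.

```python
from decimal import Decimal, ROUND_HALF_UP

# Accuracy values (percent) copied from the table above, in the order
# MLS, CommonVoice, MLC-SLM, Fleurs.
LID_ACCURACY = {
    "Whisper-large-v3": ["99.9", "92.7", "89.2", "94.6"],
    "Qwen3-ASR-0.6B": ["99.3", "98.2", "92.7", "97.1"],
    "Qwen3-ASR-1.7B": ["99.9", "98.7", "94.1", "98.7"],
}


def macro_average(scores):
    """Unweighted mean of per-benchmark accuracies, rounded half-up to 0.1."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {model: macro_average(scores) for model, scores in LID_ACCURACY.items()}
for model, avg in averages.items():
    print(f"{model}: {avg}")
```

An unweighted mean gives each benchmark equal influence regardless of how many utterances it contains, so a per-utterance accuracy over the pooled data could differ.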
Singing Voice and Song Transcription (WER ↓)
Qwen3-ASR-1.7B is evaluated on singing voice (isolated vocals) and full songs with background music (BGM). Results for Qwen3-ASR-0.6B are not available on these benchmarks.
| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR-1.0 | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-1.7B |
|---|---|---|---|---|---|---|
| Singing (isolated vocals) | | | | | | |
| M4Singer | 16.77 | 20.88 | 7.88 | 13.58 | 7.29 | 5.98 |
| MIR-1k-vocal | 11.87 | 9.85 | 6.56 | 11.71 | 8.17 | 6.25 |
| Opencpop | 7.93 | 6.49 | 3.80 | 9.52 | 2.98 | 3.08 |
| Popcs | 32.84 | 15.13 | 8.97 | 13.77 | 9.42 | 8.52 |
| Songs with BGM | | | | | | |
| EntireSongs-en | 30.71 | 12.18 | 33.51 | N/A | N/A | 14.60 |
| EntireSongs-zh | 34.86 | 18.68 | 23.99 | N/A | N/A | 13.91 |
Streaming vs. Offline Inference (WER ↓)
Both ASR models use a single set of weights for offline and streaming inference. The table below compares the two modes on the same test sets.
| Model | Mode | LibriSpeech clean | LibriSpeech other | Fleurs-en | Fleurs-zh | Avg. |
|---|---|---|---|---|---|---|
| Qwen3-ASR-1.7B | Offline | 1.63 | 3.38 | 3.35 | 2.41 | 2.69 |
| Qwen3-ASR-1.7B | Streaming | 1.95 | 4.51 | 4.02 | 2.84 | 3.33 |
| Qwen3-ASR-0.6B | Offline | 2.11 | 4.55 | 4.39 | 2.88 | 3.48 |
| Qwen3-ASR-0.6B | Streaming | 2.54 | 6.27 | 5.38 | 3.40 | 4.40 |
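To put the offline and streaming rows on a common scale, the gap can be expressed as a relative WER increase. The sketch below uses only the Avg. column from the table above; the helper name `relative_increase` is hypothetical.

```python
# Average WER (percent) from the Avg. column of the table above.
AVG_WER = {
    "Qwen3-ASR-1.7B": {"offline": 2.69, "streaming": 3.33},
    "Qwen3-ASR-0.6B": {"offline": 3.48, "streaming": 4.40},
}


def relative_increase(offline: float, streaming: float) -> float:
    """Relative WER increase of streaming over offline, in percent."""
    return 100.0 * (streaming - offline) / offline


for model, wer in AVG_WER.items():
    delta = relative_increase(wer["offline"], wer["streaming"])
    print(f"{model}: +{delta:.1f}% relative WER in streaming mode")
```

On these four test sets the streaming penalty is roughly a quarter of the offline WER for both model sizes.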
Forced Alignment Benchmarks (AAS ms ↓)
AAS (accumulated averaging shift, reported in milliseconds) measures how far predicted timestamps deviate from reference alignments. Lower is better.
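As a rough illustration of a timestamp error of this kind, the sketch below averages the absolute shift of every start and end boundary between predicted and reference spans and reports it in milliseconds. The exact definition used in the evaluation (which boundaries are counted and how units are aligned) is not given on this page, so treat this as an assumption; the name `mean_abs_shift_ms` is hypothetical.

```python
from typing import Sequence, Tuple

Span = Tuple[float, float]  # (start_s, end_s) of one word or character, in seconds


def mean_abs_shift_ms(predicted: Sequence[Span], reference: Sequence[Span]) -> float:
    """Mean absolute boundary shift in milliseconds (assumed metric definition)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("predicted and reference must be non-empty and equally long")
    total = 0.0
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        # Each unit contributes two boundaries: its start and its end.
        total += abs(p_start - r_start) + abs(p_end - r_end)
    return 1000.0 * total / (2 * len(reference))


predicted = [(0.00, 0.50), (0.50, 1.00)]
reference = [(0.02, 0.47), (0.55, 1.00)]
print(f"{mean_abs_shift_ms(predicted, reference):.1f} ms")
```

The sketch requires a one-to-one pairing between predicted and reference units, which holds for forced alignment because the transcript is given.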
MFA-Labeled Raw
| Language | Monotonic-Aligner | NFA | WhisperX | Qwen3-ForcedAligner-0.6B |
|---|---|---|---|---|
| Chinese | 161.1 | 109.8 | — | 33.1 |
| English | — | 107.5 | 92.1 | 37.5 |
| French | — | 100.7 | 145.3 | 41.7 |
| German | — | 122.7 | 165.1 | 46.5 |
| Italian | — | 142.7 | 155.5 | 75.5 |
| Japanese | — | — | — | 42.2 |
| Korean | — | — | — | 37.2 |
| Portuguese | — | — | — | 38.4 |
| Russian | — | 200.7 | — | 40.2 |
| Spanish | — | 124.7 | 108.0 | 36.8 |
| Avg. | 161.1 | 129.8 | 133.2 | 42.9 |
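The Avg. row appears to average only over the languages for which a system reports a result, so each column covers a different language subset. The sketch below reproduces two of the averages under that assumption; the helper name `average_reported` is hypothetical, and `None` stands for a missing cell.

```python
from statistics import mean
from typing import Dict, Optional

# AAS (ms) from the MFA-Labeled Raw table; None marks a missing cell.
NFA: Dict[str, Optional[float]] = {
    "Chinese": 109.8, "English": 107.5, "French": 100.7, "German": 122.7,
    "Italian": 142.7, "Japanese": None, "Korean": None, "Portuguese": None,
    "Russian": 200.7, "Spanish": 124.7,
}
WHISPERX: Dict[str, Optional[float]] = {
    "Chinese": None, "English": 92.1, "French": 145.3, "German": 165.1,
    "Italian": 155.5, "Japanese": None, "Korean": None, "Portuguese": None,
    "Russian": None, "Spanish": 108.0,
}


def average_reported(scores: Dict[str, Optional[float]]) -> float:
    """Mean over the languages that have a reported value, rounded to 0.1 ms."""
    values = [v for v in scores.values() if v is not None]
    return round(mean(values), 1)


print("NFA:", average_reported(NFA), "over", sum(v is not None for v in NFA.values()), "languages")
print("WhisperX:", average_reported(WHISPERX), "over", sum(v is not None for v in WHISPERX.values()), "languages")
```

Because the subsets differ (seven languages for NFA, five for WhisperX, ten for Qwen3-ForcedAligner-0.6B), per-language rows are the safer basis for comparing systems.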
MFA-Labeled Concat-300s (5-minute clips)
| Language | Monotonic-Aligner | NFA | WhisperX | Qwen3-ForcedAligner-0.6B |
|---|---|---|---|---|
| Chinese | 1742.4 | 235.0 | — | 36.5 |
| English | — | 226.7 | 227.2 | 58.6 |
| French | — | 230.6 | 2052.2 | 53.4 |
| German | — | 220.3 | 993.4 | 62.4 |
| Italian | — | 290.5 | 5719.4 | 81.6 |
| Japanese | — | — | — | 81.3 |
| Korean | — | — | — | 42.2 |
| Portuguese | — | — | — | 50.0 |
| Russian | — | 283.3 | — | 43.0 |
| Spanish | — | 240.2 | 4549.9 | 39.6 |
| Cross-lingual | — | — | — | 34.2 |
| Avg. | 1742.4 | 246.7 | 2708.4 | 52.9 |
Human-Labeled (Chinese)
| Condition | Monotonic-Aligner | NFA | Qwen3-ForcedAligner-0.6B |
|---|---|---|---|
| Raw | 49.9 | 88.6 | 27.8 |
| Raw-Noisy | 53.3 | 89.5 | 41.8 |
| Concat-60s | 51.1 | 86.7 | 25.3 |
| Concat-300s | 410.8 | 140.0 | 24.8 |
| Concat-Cross-lingual | — | — | 42.5 |
| Avg. | 141.3 | 101.2 | 32.4 |