Qwen3-ASR Benchmark Results and Evaluation Methodology

This page presents the full evaluation results for Qwen3-ASR-1.7B, Qwen3-ASR-0.6B, and Qwen3-ForcedAligner-0.6B across public and internal benchmarks. Results span English ASR, Chinese Mandarin and dialect ASR, multilingual ASR, language identification, singing voice and song transcription, and forced alignment accuracy.

Evaluation Methodology

All models were evaluated on the vLLM backend with dtype=torch.bfloat16 and max_new_tokens=1024, using greedy decoding throughout. No language parameter was specified, so the model identified the language of each input automatically. Whisper-large-v3 and other open-source baselines were evaluated under equivalent conditions where possible. The settings are summarized below:

Setting	Value
Precision	`torch.bfloat16`
Backend	vLLM
Decoding	Greedy search
Language parameter	None (automatic detection)
`max_new_tokens`	1024
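
The settings above can be captured in a small configuration object. The sketch below is illustrative only: `EVAL_SETTINGS` and `build_sampling_params` are hypothetical names, and it assumes greedy decoding is expressed in vLLM as `temperature=0.0` with `max_tokens` as the generation cap. The page does not show the actual evaluation script, so treat this as one plausible mapping rather than the authors' setup.

```python
# Hypothetical mirror of the settings table; not the authors' evaluation script.
EVAL_SETTINGS = {
    "dtype": "bfloat16",     # precision from the table (torch.bfloat16)
    "backend": "vllm",
    "max_new_tokens": 1024,
    "language": None,        # None = let the model detect the language
    "temperature": 0.0,      # assumption: greedy decoding expressed as temperature 0
}


def build_sampling_params(settings=EVAL_SETTINGS):
    """Translate the settings into vLLM sampling parameters (requires vllm)."""
    # Imported lazily so the settings dict stays usable without vllm installed.
    from vllm import SamplingParams

    return SamplingParams(
        temperature=settings["temperature"],
        max_tokens=settings["max_new_tokens"],
    )
```

Keeping the settings in one place makes it easier to confirm that every model in a comparison was run with the same precision, decoding strategy, and token budget.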

English ASR Benchmarks (WER ↓)

Lower WER is better.

Dataset	GPT-4o-Transcribe	Gemini-2.5-Pro	Doubao-ASR	Whisper-large-v3	Fun-ASR-MLT-Nano	Qwen3-ASR-0.6B	Qwen3-ASR-1.7B
LibriSpeech clean | other	1.39 | 3.75	2.89 | 3.56	2.78 | 5.70	1.51 | 3.97	1.68 | 4.03	2.11 | 4.55	1.63 | 3.38
GigaSpeech	25.50	9.37	9.55	9.76	—	8.88	8.45
CV-en	9.08	14.49	13.78	9.90	9.90	9.92	7.39
Fleurs-en	2.40	2.94	6.31	4.08	5.49	4.39	3.35
MLS-en	5.12	3.68	7.09	4.87	—	6.00	4.58
Tedlium	7.69	6.15	4.91	6.84	—	3.85	4.50
VoxPopuli	10.29	11.36	12.12	12.05	—	9.96	9.15

Qwen3-ASR-1.7B achieves the lowest WER among the compared models on LibriSpeech other, GigaSpeech, CV-en, and VoxPopuli. Qwen3-ASR-0.6B achieves the lowest WER on Tedlium.
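
For readers reproducing these numbers, WER is the number of word substitutions, deletions, and insertions divided by the number of reference words. The sketch below is a minimal corpus-level implementation. The function names are illustrative, and the text normalization (lowercasing and whitespace splitting only) is an assumption, since this page does not describe the normalization used in the evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent: total edits over total reference words."""
    edits = words = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumed normalization: lowercase and split on whitespace.
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        edits += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    return 100.0 * edits / words


refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello word"]
print(round(wer(refs, hyps), 2))  # 2 edits over 8 reference words -> 25.0
```

Summing edits and reference words over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.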

Chinese ASR Benchmarks (WER ↓)

Dataset	GPT-4o-Transcribe	Gemini-2.5-Pro	Doubao-ASR	Whisper-large-v3	Fun-ASR-MLT-Nano	Qwen3-ASR-0.6B	Qwen3-ASR-1.7B
WenetSpeech net | meeting	15.30 | 32.27	14.43 | 13.47	N/A	9.86 | 19.11	6.35 | —	5.97 | 6.88	4.97 | 5.88
AISHELL-2-test	4.24	11.62	2.85	5.06	—	3.15	2.71
SpeechIO	12.86	5.30	2.93	7.56	—	3.44	2.88
Fleurs-zh	2.44	2.71	2.69	4.09	3.51	2.88	2.41
CV-zh	6.32	7.70	5.95	12.91	6.20	6.89	5.35

Chinese Dialect Benchmarks (WER ↓)

Dataset	GPT-4o-Transcribe	Gemini-2.5-Pro	Doubao-ASR	Whisper-large-v3	Fun-ASR-MLT-Nano	Qwen3-ASR-0.6B	Qwen3-ASR-1.7B
KeSpeech	26.87	24.71	5.27	28.79	—	7.08	5.10
Fleurs-yue	4.98	9.43	4.98	9.18	—	5.79	3.98
CV-yue	11.36	18.76	13.20	16.23	—	9.50	7.57
CV-zh-tw (Traditional)	6.32	7.31	4.06	7.84	—	5.59	3.77
WenetSpeech-Yue short | long	15.62 | 25.29	25.19 | 11.23	9.74 | 11.40	32.26 | 46.64	— | —	7.54 | 9.92	5.82 | 8.85
WenetSpeech-Chuan easy | hard	34.81 | 53.98	43.79 | 67.30	11.40 | 20.20	14.35 | 26.80	— | —	13.92 | 24.45	11.99 | 21.63

Internal Chinese Benchmarks (WER ↓)

Dataset	GPT-4o-Transcribe	Gemini-2.5-Pro	Doubao-ASR	Whisper-large-v3	Fun-ASR-MLT-Nano	Qwen3-ASR-0.6B	Qwen3-ASR-1.7B
Elders & Kids	14.27	36.93	4.17	10.61	4.54	4.48	3.81
ExtremeNoise	36.11	29.06	17.04	63.17	36.55	17.88	16.17
TongueTwister	20.87	4.97	3.47	16.63	9.02	4.06	2.44
Dialog-Mandarin	20.73	12.50	6.61	14.01	7.32	7.06	6.54
Dialog-Cantonese	16.05	14.98	7.56	31.04	5.85	4.80	4.12
Dialog-Chinese Dialects (avg. 22 dialects)	45.37	47.70	19.85	44.55	19.41	18.24	15.94

Accented English (WER ↓)

Dataset	GPT-4o-Transcribe	Gemini-2.5-Pro	Doubao-ASR	Whisper-large-v3	Fun-ASR-MLT-Nano	Qwen3-ASR-0.6B	Qwen3-ASR-1.7B
Dialog-Accented English (avg. 16 accents)	28.56	23.85	20.41	21.30	19.96	16.62	16.07

Multilingual ASR Benchmarks (WER ↓)

The following results compare Qwen3-ASR against GLM-ASR-Nano-2512, Whisper-large-v3, and Fun-ASR-MLT-Nano across open-source and internal multilingual benchmarks.

Dataset	Languages	GLM-ASR-Nano-2512	Whisper-large-v3	Fun-ASR-MLT-Nano	Qwen3-ASR-0.6B	Qwen3-ASR-1.7B
MLS	8 (da, de, en, es, fr, it, pl, pt)	13.32	8.62	28.70	13.19	8.55
CommonVoice	13 (en, zh, yue, zh-TW, ar, de, es, fr, it, ja, ko, pt, ru)	19.40	10.77	17.25	12.75	9.18
MLC-SLM	11 (en, fr, de, it, pt, es, ja, ko, ru, th, vi)	34.93	15.68	29.94	15.84	12.74
Fleurs (12 languages)	en, zh, yue, ar, de, es, fr, it, ja, ko, pt, ru	16.08	5.27	10.03	7.57	4.90
Fleurs† (8 additional languages)	hi, id, ms, nl, pl, th, tr, vi	20.05	6.85	31.89	10.37	6.62
Fleurs†† (10 further languages)	cs, da, el, fa, fi, fil, hu, mk, ro, sv	24.83	8.16	47.84	21.80	12.60
News-Multilingual (internal)	15 languages	49.40	14.80	65.07	17.39	12.80

Language Identification Accuracy (% ↑)

Higher accuracy is better.

Benchmark	Whisper-large-v3	Qwen3-ASR-0.6B	Qwen3-ASR-1.7B
MLS	99.9	99.3	99.9
CommonVoice	92.7	98.2	98.7
MLC-SLM	89.2	92.7	94.1
Fleurs (30 languages)	94.6	97.1	98.7
Average	94.1	96.8	97.9

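Qwen3-ASR-1.7B achieves 97.9% average language identification accuracy, matching or exceeding Whisper-large-v3 on all four benchmarks. Qwen3-ASR-0.6B achieves 96.8%, outperforming Whisper-large-v3 on CommonVoice, MLC-SLM, and Fleurs but trailing it on MLS (99.3 vs. 99.9).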
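
The Average row is consistent with an unweighted mean of the four benchmark scores. The sketch below recomputes it from the table values. The half-up rounding to one decimal is an assumption, and `macro_average` is an illustrative name rather than part of any evaluation toolkit.

```python
from decimal import Decimal, ROUND_HALF_UP

# Accuracy (%) per benchmark, copied from the table in the order
# MLS, CommonVoice, MLC-SLM, Fleurs. Strings avoid binary float rounding.
LID_ACCURACY = {
    "Whisper-large-v3": ["99.9", "92.7", "89.2", "94.6"],
    "Qwen3-ASR-0.6B": ["99.3", "98.2", "92.7", "97.1"],
    "Qwen3-ASR-1.7B": ["99.9", "98.7", "94.1", "98.7"],
}


def macro_average(values):
    """Unweighted mean of benchmark scores, rounded half-up to one decimal."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model, scores in LID_ACCURACY.items():
    print(model, macro_average(scores))
```

Because the mean is unweighted, each benchmark counts equally regardless of how many utterances or languages it contains.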

Singing Voice and Song Transcription (WER ↓)

Qwen3-ASR-1.7B is evaluated on singing voice (isolated vocals) and full songs with background music (BGM). Results for Qwen3-ASR-0.6B are not available on these benchmarks.

Dataset	GPT-4o-Transcribe	Gemini-2.5-Pro	Doubao-ASR-1.0	Whisper-large-v3	Fun-ASR-MLT-Nano	Qwen3-ASR-1.7B
Singing (isolated vocals)
M4Singer	16.77	20.88	7.88	13.58	7.29	5.98
MIR-1k-vocal	11.87	9.85	6.56	11.71	8.17	6.25
Opencpop	7.93	6.49	3.80	9.52	2.98	3.08
Popcs	32.84	15.13	8.97	13.77	9.42	8.52
Songs with BGM
EntireSongs-en	30.71	12.18	33.51	N/A	N/A	14.60
EntireSongs-zh	34.86	18.68	23.99	N/A	N/A	13.91

Streaming vs. Offline Performance (WER ↓)

Each ASR model uses a single set of weights for both offline and streaming inference.

Model	Mode	LibriSpeech clean | other	Fleurs-en	Fleurs-zh	Avg.
Qwen3-ASR-1.7B	Offline	1.63 | 3.38	3.35	2.41	2.69
Qwen3-ASR-1.7B	Streaming	1.95 | 4.51	4.02	2.84	3.33
Qwen3-ASR-0.6B	Offline	2.11 | 4.55	4.39	2.88	3.48
Qwen3-ASR-0.6B	Streaming	2.54 | 6.27	5.38	3.40	4.40
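
To put the streaming penalty in perspective, the Avg. column can be turned into absolute and relative WER increases. The snippet below is a small worked calculation on the table values; `streaming_cost` is an illustrative helper, not part of the evaluation code.

```python
# Avg. WER (%) from the table above.
AVG_WER = {
    "Qwen3-ASR-1.7B": {"offline": 2.69, "streaming": 3.33},
    "Qwen3-ASR-0.6B": {"offline": 3.48, "streaming": 4.40},
}


def streaming_cost(offline, streaming):
    """Return (absolute WER increase in points, relative increase in percent)."""
    absolute = streaming - offline
    return absolute, 100.0 * absolute / offline


for model, scores in AVG_WER.items():
    absolute, relative = streaming_cost(scores["offline"], scores["streaming"])
    print(f"{model}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

For the averages above, this works out to increases of 0.64 and 0.92 WER points, roughly 24% and 26% relative, for the 1.7B and 0.6B models respectively.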

Forced Alignment Benchmarks (AAS ms ↓)

AAS (Accumulated Average Shift, reported in milliseconds) measures how far predicted timestamps deviate from reference alignments. Lower is better.
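
This page does not spell out how AAS is computed. As a rough guide to what the numbers mean, the sketch below assumes the metric is the mean absolute shift between predicted and reference boundaries, taken over both the start and end of every aligned unit. The function name and the input layout are illustrative assumptions.

```python
def aas_ms(predicted, reference):
    """Mean absolute boundary shift in milliseconds.

    Each argument is a list of (start_sec, end_sec) pairs, one per aligned unit.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and equal in length")
    shifts = []
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        shifts.append(abs(p_start - r_start))
        shifts.append(abs(p_end - r_end))
    return 1000.0 * sum(shifts) / len(shifts)


reference = [(0.00, 0.50), (0.50, 1.00)]
predicted = [(0.02, 0.48), (0.53, 1.01)]
print(round(aas_ms(predicted, reference), 1))  # shifts of 20, 20, 30, 10 ms -> 20.0
```

Under this reading, a score of 42.9 ms means predicted boundaries land about 43 ms from the reference on average.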

MFA-Labeled Raw

Language	Monotonic-Aligner	NFA	WhisperX	Qwen3-ForcedAligner-0.6B
Chinese	161.1	109.8	—	33.1
English	—	107.5	92.1	37.5
French	—	100.7	145.3	41.7
German	—	122.7	165.1	46.5
Italian	—	142.7	155.5	75.5
Japanese	—	—	—	42.2
Korean	—	—	—	37.2
Portuguese	—	—	—	38.4
Russian	—	200.7	—	40.2
Spanish	—	124.7	108.0	36.8
Avg.	161.1	129.8	133.2	42.9

MFA-Labeled Concat-300s (5-minute clips)

Language	Monotonic-Aligner	NFA	WhisperX	Qwen3-ForcedAligner-0.6B
Chinese	1742.4	235.0	—	36.5
English	—	226.7	227.2	58.6
French	—	230.6	2052.2	53.4
German	—	220.3	993.4	62.4
Italian	—	290.5	5719.4	81.6
Japanese	—	—	—	81.3
Korean	—	—	—	42.2
Portuguese	—	—	—	50.0
Russian	—	283.3	—	43.0
Spanish	—	240.2	4549.9	39.6
Cross-lingual	—	—	—	34.2
Avg.	1742.4	246.7	2708.4	52.9

Human-Labeled (Chinese)

Condition	Monotonic-Aligner	NFA	Qwen3-ForcedAligner-0.6B
Raw	49.9	88.6	27.8
Raw-Noisy	53.3	89.5	41.8
Concat-60s	51.1	86.7	25.3
Concat-300s	410.8	140.0	24.8
Concat-Cross-lingual	—	—	42.5
Avg.	141.3	101.2	32.4
