This page presents the full evaluation results for Qwen3-ASR-1.7B, Qwen3-ASR-0.6B, and Qwen3-ForcedAligner-0.6B across public and internal benchmarks. Results span English ASR, Chinese Mandarin and dialect ASR, multilingual ASR, language identification, singing voice and song transcription, and forced alignment accuracy.

Evaluation Methodology

All ASR inference was conducted with the following settings:

| Setting | Value |
| --- | --- |
| Precision | torch.bfloat16 |
| Backend | vLLM |
| Decoding | Greedy search |
| Language parameter | None (automatic detection) |
| max_new_tokens | 1024 |

All models were evaluated with dtype=torch.bfloat16 and max_new_tokens=1024 using the vLLM backend, with greedy decoding throughout. No language parameter was specified, so the model identified the language of each input automatically. Whisper-large-v3 and other open-source baselines were evaluated under equivalent conditions where possible.

English ASR Benchmarks (WER ↓)

Lower WER is better.
DatasetGPT-4o-TranscribeGemini-2.5-ProDoubao-ASRWhisper-large-v3Fun-ASR-MLT-NanoQwen3-ASR-0.6BQwen3-ASR-1.7B
LibriSpeech clean | other1.39 | 3.752.89 | 3.562.78 | 5.701.51 | 3.971.68 | 4.032.11 | 4.551.63 | 3.38
GigaSpeech25.509.379.559.768.888.45
CV-en9.0814.4913.789.909.909.927.39
Fleurs-en2.402.946.314.085.494.393.35
MLS-en5.123.687.094.876.004.58
Tedlium7.696.154.916.843.854.50
VoxPopuli10.2911.3612.1212.059.969.15
Among the systems compared here, Qwen3-ASR-1.7B achieves the lowest WER on LibriSpeech other (3.38), GigaSpeech (8.45), and CV-en (7.39). Qwen3-ASR-0.6B leads on Tedlium (3.85), while the lowest VoxPopuli WER (9.15) belongs to Qwen3-ASR-1.7B.
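
For readers less familiar with the metric, WER is conventionally the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the evaluation script behind these benchmarks. The `wer` function name and the plain whitespace tokenization are assumptions made for illustration, and real pipelines normally normalize text (case, punctuation, numerals) before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word in a four-word reference.
print(f"{100 * wer('the cat sat down', 'the cat sat town'):.2f}")
```

Scores for Chinese text are commonly computed over characters rather than whitespace-separated words; under that convention the same routine would be applied to a list of characters instead of the output of `split()`.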

Chinese ASR Benchmarks (WER ↓)

DatasetGPT-4o-TranscribeGemini-2.5-ProDoubao-ASRWhisper-large-v3Fun-ASR-MLT-NanoQwen3-ASR-0.6BQwen3-ASR-1.7B
WenetSpeech net | meeting15.30 | 32.2714.43 | 13.47N/A9.86 | 19.116.35 | —5.97 | 6.884.97 | 5.88
AISHELL-2-test4.2411.622.855.063.152.71
SpeechIO12.865.302.937.563.442.88
Fleurs-zh2.442.712.694.093.512.882.41
CV-zh6.327.705.9512.916.206.895.35

Chinese Dialect Benchmarks (WER ↓)

DatasetGPT-4o-TranscribeGemini-2.5-ProDoubao-ASRWhisper-large-v3Fun-ASR-MLT-NanoQwen3-ASR-0.6BQwen3-ASR-1.7B
KeSpeech26.8724.715.2728.797.085.10
Fleurs-yue4.989.434.989.185.793.98
CV-yue11.3618.7613.2016.239.507.57
CV-zh-tw (Traditional)6.327.314.067.845.593.77
WenetSpeech-Yue short | long15.62 | 25.2925.19 | 11.239.74 | 11.4032.26 | 46.64— | —7.54 | 9.925.82 | 8.85
WenetSpeech-Chuan easy | hard34.81 | 53.9843.79 | 67.3011.40 | 20.2014.35 | 26.80— | —13.92 | 24.4511.99 | 21.63

Internal Chinese Benchmarks (WER ↓)

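| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Elders & Kids | 14.27 | 36.93 | 4.17 | 10.61 | 4.54 | 4.48 | 3.81 |
| ExtremeNoise | 36.11 | 29.06 | 17.04 | 63.17 | 36.55 | 17.88 | 16.17 |
| TongueTwister | 20.87 | 4.97 | 3.47 | 16.63 | 9.02 | 4.06 | 2.44 |
| Dialog-Mandarin | 20.73 | 12.50 | 6.61 | 14.01 | 7.32 | 7.06 | 6.54 |
| Dialog-Cantonese | 16.05 | 14.98 | 7.56 | 31.04 | 5.85 | 4.80 | 4.12 |
| Dialog-Chinese Dialects (avg. 22 dialects) | 45.37 | 47.70 | 19.85 | 44.55 | 19.41 | 18.24 | 15.94 |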

Accented English (WER ↓)

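| Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dialog-Accented English (avg. 16 accents) | 28.56 | 23.85 | 20.41 | 21.30 | 19.96 | 16.62 | 16.07 |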

Multilingual ASR Benchmarks (WER ↓)

The following results compare Qwen3-ASR against GLM-ASR-Nano-2512, Whisper-large-v3, and Fun-ASR-MLT-Nano across open-source and internal multilingual benchmarks.

| Dataset | Languages | GLM-ASR-Nano-2512 | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
| --- | --- | --- | --- | --- | --- | --- |
| MLS | 8 (da, de, en, es, fr, it, pl, pt) | 13.32 | 8.62 | 28.70 | 13.19 | 8.55 |
| CommonVoice | 13 (en, zh, yue, zh-TW, ar, de, es, fr, it, ja, ko, pt, ru) | 19.40 | 10.77 | 17.25 | 12.75 | 9.18 |
| MLC-SLM | 11 (en, fr, de, it, pt, es, ja, ko, ru, th, vi) | 34.93 | 15.68 | 29.94 | 15.84 | 12.74 |
| Fleurs (12 languages) | en, zh, yue, ar, de, es, fr, it, ja, ko, pt, ru | 16.08 | 5.27 | 10.03 | 7.57 | 4.90 |
| Fleurs† (8 additional languages) | hi, id, ms, nl, pl, th, tr, vi | 20.05 | 6.85 | 31.89 | 10.37 | 6.62 |
| Fleurs†† (10 further languages) | cs, da, el, fa, fi, fil, hu, mk, ro, sv | 24.83 | 8.16 | 47.84 | 21.80 | 12.60 |
| News-Multilingual (internal) | 15 languages | 49.40 | 14.80 | 65.07 | 17.39 | 12.80 |

Language Identification Accuracy (% ↑)

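Higher accuracy is better.

| Benchmark | Whisper-large-v3 | Qwen3-ASR-0.6B | Qwen3-ASR-1.7B |
| --- | --- | --- | --- |
| MLS | 99.9 | 99.3 | 99.9 |
| CommonVoice | 92.7 | 98.2 | 98.7 |
| MLC-SLM | 89.2 | 92.7 | 94.1 |
| Fleurs (30 languages) | 94.6 | 97.1 | 98.7 |
| Average | 94.1 | 96.8 | 97.9 |

Qwen3-ASR-1.7B achieves 97.9% average language identification accuracy, matching Whisper-large-v3 on MLS (99.9%) and exceeding it on the other three benchmarks. Qwen3-ASR-0.6B achieves 96.8% on average and outperforms Whisper-large-v3 on CommonVoice, MLC-SLM, and Fleurs, but trails it on MLS (99.3% versus 99.9%).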
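
The Average row is consistent with an unweighted mean over the four benchmarks. The sketch below recomputes it from the per-benchmark figures in the table; the dictionary names are illustrative, and the assumption that every benchmark carries equal weight is inferred from the numbers rather than stated by the source.

```python
from statistics import mean

# Accuracy (%) per benchmark, copied from the table above.
# Order: MLS, CommonVoice, MLC-SLM, Fleurs (30 languages).
accuracy = {
    "Whisper-large-v3": [99.9, 92.7, 89.2, 94.6],
    "Qwen3-ASR-0.6B": [99.3, 98.2, 92.7, 97.1],
    "Qwen3-ASR-1.7B": [99.9, 98.7, 94.1, 98.7],
}
reported_average = {
    "Whisper-large-v3": 94.1,
    "Qwen3-ASR-0.6B": 96.8,
    "Qwen3-ASR-1.7B": 97.9,
}

for model, scores in accuracy.items():
    macro = mean(scores)  # unweighted mean: each benchmark counts equally
    # Allow half a unit in the last reported digit, plus floating-point slack.
    assert abs(macro - reported_average[model]) < 0.06
    print(f"{model}: {macro:.3f} (reported {reported_average[model]})")
```

An average weighted by the number of utterances per benchmark would generally give different values, so comparisons against other papers should check which convention they use.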

Singing Voice and Song Transcription (WER ↓)

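Qwen3-ASR-1.7B is evaluated on singing voice (isolated vocals) and full songs with background music (BGM). Results for Qwen3-ASR-0.6B are not available on these benchmarks.

| Category | Dataset | GPT-4o-Transcribe | Gemini-2.5-Pro | Doubao-ASR-1.0 | Whisper-large-v3 | Fun-ASR-MLT-Nano | Qwen3-ASR-1.7B |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Singing (isolated vocals) | M4Singer | 16.77 | 20.88 | 7.88 | 13.58 | 7.29 | 5.98 |
| Singing (isolated vocals) | MIR-1k-vocal | 11.87 | 9.85 | 6.56 | 11.71 | 8.17 | 6.25 |
| Singing (isolated vocals) | Opencpop | 7.93 | 6.49 | 3.80 | 9.52 | 2.98 | 3.08 |
| Singing (isolated vocals) | Popcs | 32.84 | 15.13 | 8.97 | 13.77 | 9.42 | 8.52 |
| Songs with BGM | EntireSongs-en | 30.71 | 12.18 | 33.51 | N/A | N/A | 14.60 |
| Songs with BGM | EntireSongs-zh | 34.86 | 18.68 | 23.99 | N/A | N/A | 13.91 |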

Streaming vs. Offline Performance (WER ↓)

Each ASR model uses a single set of weights for both offline and streaming inference.

| Model | Mode | LibriSpeech clean | LibriSpeech other | Fleurs-en | Fleurs-zh | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-ASR-1.7B | Offline | 1.63 | 3.38 | 3.35 | 2.41 | 2.69 |
| Qwen3-ASR-1.7B | Streaming | 1.95 | 4.51 | 4.02 | 2.84 | 3.33 |
| Qwen3-ASR-0.6B | Offline | 2.11 | 4.55 | 4.39 | 2.88 | 3.48 |
| Qwen3-ASR-0.6B | Streaming | 2.54 | 6.27 | 5.38 | 3.40 | 4.40 |
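
To put the streaming penalty in perspective, the gap can be expressed relative to the offline score. The sketch below does this for the Avg. column; the `relative_increase` helper is a hypothetical name introduced here, and the input values are copied from the table above.

```python
# Average WER (%) from the "Avg." column of the table above.
avg_wer = {
    "Qwen3-ASR-1.7B": {"offline": 2.69, "streaming": 3.33},
    "Qwen3-ASR-0.6B": {"offline": 3.48, "streaming": 4.40},
}


def relative_increase(offline: float, streaming: float) -> float:
    """Relative WER increase of streaming over offline, in percent."""
    return 100.0 * (streaming - offline) / offline


for model, scores in avg_wer.items():
    gap = scores["streaming"] - scores["offline"]
    rel = relative_increase(scores["offline"], scores["streaming"])
    print(f"{model}: +{gap:.2f} absolute, +{rel:.1f}% relative")
```

On these averages, streaming costs roughly 24% relative WER for the 1.7B model and roughly 26% for the 0.6B model.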

Forced Alignment Benchmarks (AAS ms ↓)

AAS (Accumulated Average Shift, reported in milliseconds) measures how far predicted timestamps deviate from the reference alignments. Lower is better.
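
Read as a mean absolute timestamp deviation, the metric can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's scoring code: the function name `average_shift_ms` is hypothetical, timestamps are assumed to be given in seconds, and predicted and reference boundaries are assumed to be already paired one to one.

```python
def average_shift_ms(predicted, reference):
    """Mean absolute timestamp deviation in milliseconds.

    Both arguments are equal-length sequences of timestamps in seconds,
    for example the start and end time of every aligned word, in the
    same order.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one timestamp is required")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return 1000.0 * total / len(reference)  # seconds -> milliseconds


# Hypothetical boundaries (seconds): deviations of 20 ms, 30 ms, and 40 ms.
reference = [0.50, 1.20, 2.00]
predicted = [0.52, 1.17, 2.04]
print(f"{average_shift_ms(predicted, reference):.1f} ms")
```

Because the deviation is taken in absolute value, early and late predictions do not cancel out.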

MFA-Labeled Raw

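| Language | Monotonic-Aligner | NFA | WhisperX | Qwen3-ForcedAligner-0.6B |
| --- | --- | --- | --- | --- |
| Chinese | 161.1 | 109.8 | - | 33.1 |
| English | - | 107.5 | 92.1 | 37.5 |
| French | - | 100.7 | 145.3 | 41.7 |
| German | - | 122.7 | 165.1 | 46.5 |
| Italian | - | 142.7 | 155.5 | 75.5 |
| Japanese | - | - | - | 42.2 |
| Korean | - | - | - | 37.2 |
| Portuguese | - | - | - | 38.4 |
| Russian | - | 200.7 | - | 40.2 |
| Spanish | - | 124.7 | 108.0 | 36.8 |
| Avg. | 161.1 | 129.8 | 133.2 | 42.9 |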
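
The Avg. row is consistent with averaging each aligner only over the languages for which it reports a result. The sketch below checks this on the MFA-Labeled Raw figures; the assignment of values to aligners is the one under which the per-aligner means reproduce the reported averages, and `None` marks a language with no listed result.

```python
from statistics import mean

aligners = ("Monotonic-Aligner", "NFA", "WhisperX", "Qwen3-ForcedAligner-0.6B")

# AAS (ms) per language on MFA-Labeled Raw, in the aligner order above.
rows = {
    "Chinese": (161.1, 109.8, None, 33.1),
    "English": (None, 107.5, 92.1, 37.5),
    "French": (None, 100.7, 145.3, 41.7),
    "German": (None, 122.7, 165.1, 46.5),
    "Italian": (None, 142.7, 155.5, 75.5),
    "Japanese": (None, None, None, 42.2),
    "Korean": (None, None, None, 37.2),
    "Portuguese": (None, None, None, 38.4),
    "Russian": (None, 200.7, None, 40.2),
    "Spanish": (None, 124.7, 108.0, 36.8),
}

for idx, name in enumerate(aligners):
    # Skip languages the aligner has no result for.
    values = [row[idx] for row in rows.values() if row[idx] is not None]
    print(f"{name}: {mean(values):.1f} ms over {len(values)} language(s)")
```

Since each average covers a different set of languages, the Avg. values are not directly comparable across aligners; a like-for-like comparison would restrict every aligner to the languages they share.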

MFA-Labeled Concat-300s (5-minute clips)

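| Language | Monotonic-Aligner | NFA | WhisperX | Qwen3-ForcedAligner-0.6B |
| --- | --- | --- | --- | --- |
| Chinese | 1742.4 | 235.0 | - | 36.5 |
| English | - | 226.7 | 227.2 | 58.6 |
| French | - | 230.6 | 2052.2 | 53.4 |
| German | - | 220.3 | 993.4 | 62.4 |
| Italian | - | 290.5 | 5719.4 | 81.6 |
| Japanese | - | - | - | 81.3 |
| Korean | - | - | - | 42.2 |
| Portuguese | - | - | - | 50.0 |
| Russian | - | 283.3 | - | 43.0 |
| Spanish | - | 240.2 | 4549.9 | 39.6 |
| Cross-lingual | - | - | - | 34.2 |
| Avg. | 1742.4 | 246.7 | 2708.4 | 52.9 |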

Human-Labeled (Chinese)

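| Condition | Monotonic-Aligner | NFA | Qwen3-ForcedAligner-0.6B |
| --- | --- | --- | --- |
| Raw | 49.9 | 88.6 | 27.8 |
| Raw-Noisy | 53.3 | 89.5 | 41.8 |
| Concat-60s | 51.1 | 86.7 | 25.3 |
| Concat-300s | 410.8 | 140.0 | 24.8 |
| Concat-Cross-lingual | - | - | 42.5 |
| Avg. | 141.3 | 101.2 | 32.4 |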
