Fine-tuning Qwen3-ASR lets you adapt the model’s speech recognition to your specific domain, acoustic environment, or language variety. The finetuning/qwen3_asr_sft.py script wraps Hugging Face Trainer and handles data loading, audio preprocessing, label masking, and checkpoint management, so you only need to supply a JSONL file and a handful of flags.
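For readers unfamiliar with label masking, here is a minimal sketch of the general idea in supervised fine-tuning: prompt positions get an ignore value so the loss is computed only on the transcript tokens. This is not taken from qwen3_asr_sft.py; the function name `mask_prompt_labels` and the token IDs are hypothetical, and the script's actual masking logic may differ.

```python
# Illustrative only: not the implementation used by qwen3_asr_sft.py.
# -100 is the default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def mask_prompt_labels(input_ids, prompt_len):
    """Return labels in which the first `prompt_len` positions are ignored.

    `input_ids` is the full token sequence (prompt followed by transcript);
    only the transcript positions keep their token IDs as training targets.
    """
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


if __name__ == "__main__":
    # Hypothetical token IDs: two prompt tokens followed by two transcript tokens.
    print(mask_prompt_labels([11, 12, 21, 22], prompt_len=2))
```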
When to Fine-Tune
Out of the box, Qwen3-ASR covers 52 languages and dialects and performs competitively against the strongest commercial APIs. Fine-tuning is worthwhile in these cases:
- Domain-specific vocabulary: medical, legal, financial, or technical terms not well-represented in pre-training data.
- Accent or dialect adaptation: recordings from a speaker population whose accent differs substantially from the training distribution.
- Low-resource languages or dialects: languages where the pre-trained model has limited coverage.
- Proprietary or stylized speech: internal meeting recordings, call-centre audio, or other controlled audio sources where you hold transcripts.
- Consistent output formatting: you want the model to always follow a fixed punctuation or capitalisation style.
Fine-tuning modifies the existing weights rather than training from scratch, so even a few hundred labelled utterances can yield a meaningful improvement over the base model.
Prerequisites
Install the required packages before running the fine-tuning script.
Install FlashAttention 2 (recommended)
FlashAttention 2 reduces GPU memory usage and speeds up training. It requires compatible hardware and a model loaded in torch.float16 or torch.bfloat16. If your machine has fewer than 96 GB of RAM but many CPU cores, limit the number of parallel compile jobs when building it, for example with the MAX_JOBS environment variable described in the FlashAttention README. See the FlashAttention repository for full hardware compatibility details.
Prepare your JSONL training data
You need at least one JSONL file of audio-transcript pairs. An optional evaluation JSONL file can be supplied to monitor validation loss during training. See the Data Format page for the exact schema.
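As a rough illustration of what preparing such a file can look like, the sketch below builds one JSON object per line. The `language {Language}<asr_text>{transcript}` pattern for the text field comes from this documentation; the key names `audio` and `text`, the helper names, and the example paths are assumptions, so check them against the Data Format page before use.

```python
import json


def make_record(audio_path, language, transcript):
    """Build one training record.

    The key names "audio" and "text" are assumptions for illustration;
    the Data Format page defines the authoritative schema.
    """
    return {
        "audio": audio_path,
        "text": f"language {language}<asr_text>{transcript}",
    }


def to_jsonl(records):
    """Serialise records as JSONL: one JSON object per line."""
    # ensure_ascii=False keeps non-Latin transcripts human-readable.
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


if __name__ == "__main__":
    # Hypothetical audio paths and transcripts.
    rows = [
        make_record("/data/wavs/utt0001.wav", "English", "This is a test sentence."),
        make_record("/data/wavs/utt0002.wav", "English", "Another short utterance."),
    ]
    with open("train.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(rows))
```

Writing the file with an explicit UTF-8 encoding avoids platform-dependent defaults when transcripts contain non-ASCII characters.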
Fine-Tuning Workflow
The end-to-end process covers two main steps: formatting your data and launching the training script.
Data Format
Learn the JSONL schema, the language {Language}<asr_text>{transcript} text field format, and best practices for preparing WAV files.
Training
Run the fine-tuning script on a single GPU or scale to multiple GPUs with torchrun. Covers all CLI flags, checkpoint resumption, and a one-click shell script.
Quick Inference After Fine-Tuning
Once training completes, each saved checkpoint is a fully self-contained model directory that can be loaded directly with Qwen3ASRModel.from_pretrained. The script copies all required tokenizer and config files into the checkpoint folder automatically.
When loading, replace the example path "qwen3-asr-finetuning-out/checkpoint-200" with the path to whichever checkpoint you want to evaluate. The checkpoint number corresponds to the global training step at which it was saved.
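If you simply want the most recent checkpoint, a small helper can pick it by step number. The sketch below assumes the output directory contains subdirectories named checkpoint-<step>, as in the example path above; the helper name `latest_checkpoint` is hypothetical and is not part of the fine-tuning script.

```python
import re
from pathlib import Path

# Matches directory names such as "checkpoint-200".
_CKPT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(output_dir).iterdir():
        match = _CKPT_PATTERN.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare steps numerically: sorted as strings,
            # "checkpoint-1000" would come before "checkpoint-200".
            if step > best_step:
                best_step, best_path = step, child
    return best_path
```

Calling `latest_checkpoint("qwen3-asr-finetuning-out")` returns a `Path` that can be converted with `str()` and used as the checkpoint path, or `None` if no checkpoint directory exists yet.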