Fine-tuning Qwen3-ASR lets you adapt the model’s speech recognition to your specific domain, acoustic environment, or language variety. The finetuning/qwen3_asr_sft.py script wraps the Hugging Face Trainer and handles data loading, audio preprocessing, label masking, and checkpoint management, so you only need to supply a JSONL file and a handful of flags.

When to Fine-Tune

Out of the box, Qwen3-ASR covers 52 languages and dialects and performs competitively against the strongest commercial APIs. Fine-tuning is worthwhile when:
  • Domain-specific vocabulary — medical, legal, financial, or technical terms not well-represented in pre-training data.
  • Accent or dialect adaptation — recordings from a speaker population whose accent differs substantially from the training distribution.
  • Low-resource languages or dialects — languages where the pre-trained model has limited coverage.
  • Proprietary or stylized speech — internal meeting recordings, call-centre audio, or other controlled audio sources where you hold transcripts.
  • Consistent output formatting — you want the model to always follow a fixed punctuation or capitalisation style.
Fine-tuning modifies the existing weights rather than training from scratch, so even a few hundred labelled utterances can yield a meaningful improvement over the base model.

Prerequisites

Install the required packages before running the fine-tuning script.
1. Install core packages

pip install -U qwen-asr datasets
2. Install FlashAttention 2 (recommended)

FlashAttention 2 reduces GPU memory usage and speeds up training. It requires compatible hardware and a model loaded in torch.float16 or torch.bfloat16.
pip install -U flash-attn --no-build-isolation
If your machine has fewer than 96 GB of RAM but many CPU cores, limit the number of parallel compile jobs:
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
See the FlashAttention repository for full hardware compatibility details.
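
After installation, it can help to confirm that the package is visible to your Python environment. The following is a minimal sketch and not part of the Qwen3-ASR tooling: the helper names are hypothetical, and "flash_attention_2" and "sdpa" are the attention-implementation names used by Hugging Face Transformers. Whether the fine-tuning script accepts such an option is an assumption to check against the Training page.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be located in this environment.

    This only checks that the package is installed; it does not verify that
    the compiled kernels match your GPU or CUDA version.
    """
    return importlib.util.find_spec("flash_attn") is not None


def pick_attention_backend() -> str:
    """Suggest an attention implementation name (hypothetical helper)."""
    return "flash_attention_2" if flash_attn_available() else "sdpa"


if __name__ == "__main__":
    print("flash_attn installed:", flash_attn_available())
    print("suggested attention backend:", pick_attention_backend())
```

If the check reports that flash_attn is missing, training can usually still proceed with a slower default attention implementation.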
3. Prepare your JSONL training data

You need at least one JSONL file of audio-transcript pairs. An optional evaluation JSONL file can be supplied to monitor validation loss during training. See the Data Format page for the exact schema.
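
As a rough illustration, a training file could be assembled with a few lines of Python. This is a sketch under stated assumptions: the key names "audio" and "text" are assumed rather than confirmed here, the file paths and transcripts are made up, and only the `language {Language}<asr_text>{transcript}` layout of the text field comes from this documentation. Check the Data Format page before relying on it.

```python
import json
from pathlib import Path


def make_record(audio_path: str, language: str, transcript: str) -> dict:
    """Build one training example.

    The "audio" and "text" key names are assumptions; the text value follows
    the `language {Language}<asr_text>{transcript}` pattern.
    """
    return {
        "audio": audio_path,
        "text": f"language {language}<asr_text>{transcript}",
    }


def write_jsonl(records: list[dict], out_path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with Path(out_path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical file names and transcripts, for illustration only.
    examples = [
        make_record("data/utt_0001.wav", "English", "Please schedule the follow-up for Monday."),
        make_record("data/utt_0002.wav", "English", "The patient reports mild dizziness."),
    ]
    write_jsonl(examples, "train.jsonl")
```

Writing with ensure_ascii=False keeps transcripts in non-Latin scripts human-readable in the file, which makes spot-checking the data easier.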

Fine-Tuning Workflow

The end-to-end process covers two main steps: formatting your data and launching the training script.

Data Format

Learn the JSONL schema, the format of the text field (language {Language}<asr_text>{transcript}), and best practices for preparing WAV files.

Training

Run the fine-tuning script on a single GPU or scale to multiple GPUs with torchrun. Covers all CLI flags, checkpoint resumption, and a one-click shell script.

Quick Inference After Fine-Tuning

Once training completes, each saved checkpoint is a fully self-contained model directory that can be loaded directly with Qwen3ASRModel.from_pretrained. The script copies all required tokenizer and config files into the checkpoint folder automatically.
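import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned weights from a saved checkpoint directory.
model = Qwen3ASRModel.from_pretrained(
    "qwen3-asr-finetuning-out/checkpoint-200",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# transcribe returns a list of results; index 0 is the result for this file.
results = model.transcribe(audio="path/to/audio.wav")
print(results[0].language)  # language reported by the model
print(results[0].text)      # transcript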
Replace "qwen3-asr-finetuning-out/checkpoint-200" with the path to whichever checkpoint you want to evaluate. The checkpoint number corresponds to the global training step at which it was saved.
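
Because the checkpoint number is the global training step, the most recent checkpoint is the one with the highest number. Here is a minimal sketch for locating it, assuming the checkpoint-<step> directory naming shown in the example path. The helper name latest_checkpoint is hypothetical and not part of qwen_asr.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-200" and captures the step.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    # Output directory name taken from the inference example.
    print(latest_checkpoint("qwen3-asr-finetuning-out"))
```

The returned path can be passed as a string to Qwen3ASRModel.from_pretrained in place of the hard-coded checkpoint directory. Note that the latest checkpoint is not necessarily the best one; if you supplied an evaluation file, compare validation loss across checkpoints.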
