Fine-Tune Qwen3-ASR on Custom Audio Data

Fine-tuning Qwen3-ASR adapts the model's speech recognition to your specific domain, acoustic environment, or language variety. The finetuning/qwen3_asr_sft.py script wraps the Hugging Face Trainer and handles data loading, audio preprocessing, label masking, and checkpoint management, so you only need to supply a JSONL file and a handful of flags.

When to Fine-Tune

Out of the box, Qwen3-ASR covers 52 languages and dialects and performs competitively against the strongest commercial APIs. Fine-tuning is worthwhile in the following situations:

Domain-specific vocabulary — medical, legal, financial, or technical terms not well-represented in pre-training data.
Accent or dialect adaptation — recordings from a speaker population whose accent differs substantially from the training distribution.
Low-resource languages or dialects — languages where the pre-trained model has limited coverage.
Proprietary or stylized speech — internal meeting recordings, call-centre audio, or other controlled audio sources where you hold transcripts.
Consistent output formatting — you want the model to always follow a fixed punctuation or capitalisation style.

Fine-tuning modifies the existing weights rather than training from scratch, so even a few hundred labelled utterances can yield a meaningful improvement over the base model.

Prerequisites

Install the required packages before running the fine-tuning script.

Install core packages

pip install -U qwen-asr datasets

Install FlashAttention 2 (recommended)

FlashAttention 2 reduces GPU memory usage and speeds up training. It requires compatible hardware and a model loaded in torch.float16 or torch.bfloat16.

pip install -U flash-attn --no-build-isolation

If your machine has fewer than 96 GB of RAM but many CPU cores, limit the number of parallel compile jobs:

MAX_JOBS=4 pip install -U flash-attn --no-build-isolation

See the FlashAttention repository for full hardware compatibility details.

Prepare your JSONL training data

You need at least one JSONL file of audio-transcript pairs. An optional evaluation JSONL file can be supplied to monitor validation loss during training. See the Data Format page for the exact schema.

Fine-Tuning Workflow

The end-to-end process covers two main steps: formatting your data and launching the training script.

Data Format

Learn the JSONL schema, the text field format (language {Language}<asr_text>{transcript}), and best practices for preparing WAV files.
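As a rough illustration of the text field format mentioned above, the sketch below builds one JSONL line per utterance. The key names audio and text, the helpers make_record and write_jsonl, and the example paths are assumptions made for illustration and are not confirmed by this page; check the Data Format page for the exact schema before using it.

```python
import json


def make_record(audio_path: str, language: str, transcript: str) -> dict:
    """Build one training example (hypothetical helper).

    The key names "audio" and "text" are assumed; only the
    "language {Language}<asr_text>{transcript}" layout comes from the docs.
    """
    return {
        "audio": audio_path,
        "text": f"language {language}<asr_text>{transcript}",
    }


def write_jsonl(records, out_path: str) -> None:
    """Write one JSON object per line, keeping non-ASCII transcripts readable."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

A call such as write_jsonl([make_record("data/utt_0001.wav", "English", "hello world")], "train.jsonl") would then produce a one-line training file under these assumptions.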

Training

Run the fine-tuning script on a single GPU or scale to multiple GPUs with torchrun. Covers all CLI flags, checkpoint resumption, and a one-click shell script.

Quick Inference After Fine-Tuning

Once training completes, each saved checkpoint is a fully self-contained model directory that can be loaded directly with Qwen3ASRModel.from_pretrained. The script copies all required tokenizer and config files into the checkpoint folder automatically.

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "qwen3-asr-finetuning-out/checkpoint-200",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)
results = model.transcribe(audio="path/to/audio.wav")
print(results[0].language)
print(results[0].text)

Replace "qwen3-asr-finetuning-out/checkpoint-200" with the path to whichever checkpoint you want to evaluate. The checkpoint number corresponds to the global training step at which it was saved.
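Because the checkpoint number is the global training step, the most recent checkpoint is the one with the highest number. A minimal sketch for locating it is shown below, assuming checkpoints are saved as checkpoint-<step> subdirectories of the output directory, as in the example path above. The helper names checkpoint_step and latest_checkpoint are hypothetical and not part of qwen-asr.

```python
import re
from pathlib import Path


def checkpoint_step(name: str):
    """Return the step encoded in a 'checkpoint-<step>' name, or None."""
    match = re.fullmatch(r"checkpoint-(\d+)", name)
    return int(match.group(1)) if match else None


def latest_checkpoint(output_dir: str):
    """Return the checkpoint directory with the highest step, or None.

    Steps are compared numerically, so checkpoint-1000 ranks above
    checkpoint-200 even though it sorts lower as a string.
    """
    candidates = []
    for path in Path(output_dir).iterdir():
        step = checkpoint_step(path.name)
        if path.is_dir() and step is not None:
            candidates.append((step, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path can be converted with str() and passed to Qwen3ASRModel.from_pretrained in place of the hard-coded checkpoint path.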
