The fine-tuning script expects training data as a JSONL file (one JSON object per line) where each line maps a WAV audio file to its ground-truth transcript. Preparing this file correctly is the most important step before you start training, because the `text` field encoding directly controls whether the model learns language identification alongside transcription.
JSONL Format Specification
Each line in the training file must be a valid JSON object containing exactly two fields:

| Field | Type | Description |
|---|---|---|
| `audio` | string | Absolute or relative path to a WAV audio file on disk |
| `text` | string | Transcript string, including a required language prefix (see below) |

Save the file with a `.jsonl` extension (or `.json` for a JSON-lines file). Use one record per line, no trailing commas, and UTF-8 encoding.
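The structural rules above can be checked before training starts. The following is a minimal sketch using only the standard library; `check_jsonl_lines` and `check_jsonl_file` are hypothetical helper names, not part of the fine-tuning script, and treating blank lines as errors is a conservative assumption.

```python
import json


def check_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            # Assumption: a blank line is not a valid record.
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Trailing commas and truncated lines end up here.
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for field in ("audio", "text"):
            if not isinstance(record.get(field), str):
                problems.append((number, f"missing or non-string '{field}'"))
    return problems


def check_jsonl_file(path):
    # Decode strictly as UTF-8 so encoding problems surface here, not mid-training.
    with open(path, encoding="utf-8") as handle:
        return check_jsonl_lines(handle)
```

An empty return value means every line parsed as an object with string `audio` and `text` fields. Extra fields are tolerated rather than reported.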
The script uses the Hugging Face `datasets` library to load the JSONL file. Any additional fields present in each line are silently dropped during preprocessing.

The `text` Field Format
The `text` value is not a bare transcript. It follows a structured template that mirrors the model’s output format: the prefix `language {Language}<asr_text>`, followed by the transcript.
The `<asr_text>` token is a fixed delimiter; the model was trained with this separator and expects it at inference time as well.
With Known Language
When you know the spoken language, supply it as a capitalised English name immediately after `language `, for example `language English<asr_text>`.
Without Language Information
If you do not have reliable language labels for your data, use the literal keyword `None`, which gives the prefix `language None<asr_text>`.
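Both cases can be handled by one small helper. This is a minimal sketch: `build_text` is a hypothetical name, and the code assumes there is no whitespace on either side of the `<asr_text>` delimiter.

```python
ASR_DELIMITER = "<asr_text>"


def build_text(transcript, language=None):
    """Compose a `text` value: 'language {Language}<asr_text>' plus the transcript."""
    # Python's None maps to the literal keyword "None" used for unlabeled samples.
    label = "None" if language is None else language
    # Assumption: no spaces around the delimiter.
    return f"language {label}{ASR_DELIMITER}{transcript}"


print(build_text("Hello world.", "English"))
# language English<asr_text>Hello world.
print(build_text("Bonjour tout le monde."))
# language None<asr_text>Bonjour tout le monde.
```

Keeping the delimiter in a single constant avoids typos in a token that has to match exactly.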
Example JSONL File
A representative training file combines multiple languages and the `None` fallback.
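As an illustration, records of this kind can be serialized with the standard `json` module. The paths and transcripts here are invented placeholders, and `write_jsonl` is a hypothetical helper rather than part of the fine-tuning script.

```python
import json

# Placeholder records: paths and transcripts are invented for illustration only.
records = [
    {"audio": "/data/wavs/en_0001.wav",
     "text": "language English<asr_text>Turn on the kitchen lights."},
    {"audio": "/data/wavs/zh_0001.wav",
     "text": "language Chinese<asr_text>今天天气很好。"},
    {"audio": "/data/wavs/fr_0001.wav",
     "text": "language French<asr_text>Où est la gare ?"},
    {"audio": "/data/wavs/unk_0001.wav",
     "text": "language None<asr_text>Thanks for calling."},
]


def write_jsonl(records, path):
    """Write one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-ASCII transcripts readable in the file.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Preview the lines that write_jsonl(records, "train.jsonl") would produce.
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

`json.dumps` emits each record on a single line and escapes any embedded newlines, so the one-record-per-line rule holds even for multi-sentence transcripts.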
Language Prefix Options
The `{Language}` placeholder should be a capitalised English language name. Examples of valid values:
- `English`, `Chinese`, `French`, `German`, `Spanish`, `Japanese`, `Korean`
- `Cantonese`, `Arabic`, `Hindi`, `Portuguese`, `Russian`, `Thai`, `Vietnamese`
- Any of the 52 languages and dialects supported by Qwen3-ASR
- `None`: when language information is unavailable
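A light sanity check on the prefix can catch lowercase labels or a missing delimiter before training. The sketch below is an assumption-laden illustration: `split_text_field` is a hypothetical helper, it checks only the shape of the prefix, and it does not verify the label against the full list of supported languages, which is not reproduced here.

```python
def split_text_field(text):
    """Split a `text` value into (language_label, transcript).

    Raises ValueError when the prefix does not match the expected template.
    """
    prefix, delimiter, transcript = text.partition("<asr_text>")
    if not delimiter:
        raise ValueError("missing <asr_text> delimiter")
    if not prefix.startswith("language "):
        raise ValueError("prefix must start with 'language '")
    label = prefix[len("language "):]
    # "None" is itself capitalised, so one check covers both cases.
    if not label or not label[0].isupper() or label != label.strip():
        raise ValueError(f"unexpected language label: {label!r}")
    return label, transcript


print(split_text_field("language English<asr_text>Hello world."))
# ('English', 'Hello world.')
```

Running this over every record also gives a quick count of samples per label, including how many fall back to `None`.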
Using `language None` does not harm transcription quality. It simply prevents that sample from contributing to the language-identification head’s gradient update.

Preparing Audio Files
The data collator inside `qwen3_asr_sft.py` loads each WAV file at runtime using `librosa` and resamples to 16,000 Hz mono. Requirements for your audio files:
- Format: WAV (uncompressed PCM is safest; `librosa` can handle most sub-formats)
- Sample rate: any; the loader resamples to 16 kHz automatically
- Channels: any; the loader forces mono
- Paths: must be accessible from the machine running the training script; absolute paths are the safest option
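Because files are loaded at runtime, a missing or unreadable path only surfaces once training reaches that sample. A pre-flight pass over the `audio` paths can catch this earlier. The sketch below uses the standard-library `wave` module rather than the script's own `librosa` loader, so it recognises only PCM WAV; a failure here does not necessarily mean `librosa` would fail. `inspect_wav` is a hypothetical helper name.

```python
import os
import wave


def inspect_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    # The stdlib reader only understands PCM WAV; other sub-formats raise wave.Error.
    with wave.open(path, "rb") as reader:
        rate = reader.getframerate()
        channels = reader.getnchannels()
        duration = reader.getnframes() / rate
    return rate, channels, duration
```

The sample rate and channel count are informational only, since the loader resamples and downmixes. The duration is useful for spotting empty or unexpectedly long clips.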
Optional Evaluation File
You can pass a second JSONL file with the `--eval_file` flag. It must follow the same format as the training file. When provided, the script evaluates on this split every `--save_steps` steps and logs the validation loss alongside the training loss.
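If you have a single pool of labeled data, one way to obtain the second file is to hold out a fraction of its lines. This is a minimal sketch, not something the script does for you; `split_records`, the 10% default, and the fixed seed are illustrative assumptions.

```python
import random


def split_records(lines, eval_fraction=0.1, seed=0):
    """Shuffle JSONL lines deterministically and hold out a share for evaluation.

    Returns (train_lines, eval_lines).
    """
    shuffled = list(lines)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one line whenever there is any data at all.
    n_eval = max(1, int(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]
```

Write the two returned lists to separate files and pass the held-out one via `--eval_file`. If several clips come from the same speaker or recording, splitting by speaker rather than by line gives a more honest validation loss.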