The fine-tuning script expects training data as a JSONL file — one JSON object per line — where each line maps a WAV audio file to its ground-truth transcript. Preparing this file correctly is the most important step before you start training, because the text field encoding directly controls whether the model learns language identification alongside transcription.

JSONL Format Specification

Each line in the training file must be a valid JSON object containing the following two fields:

| Field | Type | Description |
| --- | --- | --- |
| audio | string | Absolute or relative path to a WAV audio file on disk |
| text | string | Transcript string, including a required language prefix (see below) |
The file must use the .jsonl extension (a .json extension also works, provided the content is JSON Lines). Put one record on each line, with no trailing commas, and save the file as UTF-8.
The script uses the Hugging Face datasets library to load the JSONL file. Any additional fields present in each line are silently dropped during preprocessing.
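The format rules above can be checked before training with a short standard-library script. This is a minimal sketch, not part of the fine-tuning script: the names `check_record` and `check_file` are hypothetical, and treating blank lines as errors is an assumption made here for strictness.

```python
import json

REQUIRED_FIELDS = ("audio", "text")


def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"field '{field}' must be a non-empty string")
    return problems


def check_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; clean lines are omitted."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            # Assumption: a blank line is flagged rather than silently skipped.
            problems = check_record(line) if line.strip() else ["blank line"]
            if problems:
                report[number] = problems
    return report
```

An empty dictionary from `check_file` means every line parsed as a JSON object with string `audio` and `text` fields. Extra fields are deliberately not flagged, since they are dropped during preprocessing anyway.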

The text Field Format

The text value is not a bare transcript. It follows a structured template that mirrors the model’s output format:
language {Language}<asr_text>{Transcript}
The <asr_text> token is a fixed delimiter — the model was trained with this separator and expects it at inference time as well.
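Because the template has a fixed prefix and a fixed delimiter, a templated string can be split back into its two parts with plain string operations. The helper below is an illustrative sketch; `parse_text_field` is a hypothetical name and is not provided by the fine-tuning script.

```python
ASR_DELIMITER = "<asr_text>"
LANGUAGE_PREFIX = "language "


def parse_text_field(value: str) -> tuple[str, str]:
    """Split 'language X<asr_text>T' into (X, T).

    Raises ValueError when the prefix or the delimiter is missing.
    """
    # partition() splits on the first occurrence only, so a transcript that
    # happens to contain the delimiter text again is left untouched.
    head, found, transcript = value.partition(ASR_DELIMITER)
    if not found or not head.startswith(LANGUAGE_PREFIX):
        raise ValueError(f"not in 'language X<asr_text>...' form: {value!r}")
    language = head[len(LANGUAGE_PREFIX):].strip()
    return language, transcript
```

Running this over every text value in a training file is a quick way to catch records that were written as bare transcripts.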

With Known Language

When you know the spoken language, supply it as a capitalised English language name immediately after the language keyword:
language English<asr_text>This is a test sentence.
language Chinese<asr_text>你好世界
language French<asr_text>Bonjour le monde
language Japanese<asr_text>こんにちは世界
Training with an explicit language label teaches the model to produce correct language identification output for that utterance.

Without Language Information

If you do not have reliable language labels for your data, use the literal keyword None:
language None<asr_text>Transcript text here
When language None is used, the model will not learn language-detection behaviour from that example — only the transcription path is trained. Use this when your language labels are unavailable or unreliable rather than guessing the language, which could introduce noisy supervision.
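Both cases, a known language and the None fallback, can be produced by one small formatting helper. This is a sketch under the assumption that your own pipeline represents a missing label as Python `None` or an empty string; `build_text_field` is a hypothetical name.

```python
from typing import Optional


def build_text_field(transcript: str, language: Optional[str] = None) -> str:
    """Format a transcript with the language prefix.

    A missing or empty language becomes the literal keyword None.
    """
    label = language if language else "None"
    return f"language {label}<asr_text>{transcript}"
```

For example, `build_text_field("Bonjour le monde", "French")` yields `language French<asr_text>Bonjour le monde`, while omitting the language yields a string that starts with `language None<asr_text>`.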

Example JSONL File

Below is a representative training file combining multiple languages and the None fallback:
{"audio":"/data/wavs/utt0001.wav","text":"language English<asr_text>This is a test sentence."}
{"audio":"/data/wavs/utt0002.wav","text":"language Chinese<asr_text>你好世界"}
{"audio":"/data/wavs/utt0003.wav","text":"language None<asr_text>transcript without language info"}
{"audio":"/data/wavs/utt0004.wav","text":"language English<asr_text>Another example from the training set."}
{"audio":"/data/wavs/utt0005.wav","text":"language French<asr_text>Une autre phrase en français."}
Each line is independent. You can mix languages freely within the same file.
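A file like the one above can be generated with the standard `json` module. The sketch below assumes each record is already a dictionary with `audio` and `text` keys; `write_jsonl` is a hypothetical helper name. Passing `ensure_ascii=False` keeps non-Latin transcripts readable in the file instead of escaping them as `\uXXXX` sequences, and the compact separators match the layout of the example lines.

```python
import json


def write_jsonl(records, path):
    """Write one compact JSON object per line, encoded as UTF-8.

    Only the 'audio' and 'text' keys are written. Returns the record count.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            row = {"audio": record["audio"], "text": record["text"]}
            handle.write(json.dumps(row, ensure_ascii=False, separators=(",", ":")))
            handle.write("\n")
            count += 1
    return count
```

Serialising each record with `json.dumps` rather than building the line by hand ensures that quotes and backslashes inside transcripts are escaped correctly.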

Language Prefix Options

The {Language} placeholder should be a capitalised English language name. Examples of valid values:
  • English, Chinese, French, German, Spanish, Japanese, Korean
  • Cantonese, Arabic, Hindi, Portuguese, Russian, Thai, Vietnamese
  • Any of the 52 languages and dialects supported by Qwen3-ASR
  • None — when language information is unavailable
Using language None does not harm transcription quality. It simply prevents that sample from contributing to the language-identification head’s gradient update.

Preparing Audio Files

The data collator inside qwen3_asr_sft.py loads each WAV file at runtime using librosa and resamples to 16,000 Hz mono. Requirements for your audio files:
  • Format: WAV (uncompressed PCM is safest; librosa can handle most sub-formats)
  • Sample rate: any — the loader resamples to 16 kHz automatically
  • Channels: any — the loader forces mono
  • Paths: must be accessible from the machine running the training script; absolute paths are the safest option
Missing or unreadable audio files cause a runtime error during the collation step. Verify that every path in your JSONL file resolves correctly before starting a long training run.
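One way to run that check is a pre-flight pass over the JSONL file. The sketch below is an assumption-laden illustration rather than the script's own logic: `find_unreadable_audio` and its `base_dir` parameter are hypothetical, and it uses the standard-library `wave` module, which only understands PCM WAV. A file that `wave` rejects may therefore still load fine through librosa, so treat the second kind of failure as a warning.

```python
import json
import wave
from pathlib import Path


def find_unreadable_audio(jsonl_path, base_dir="."):
    """Return (line_number, audio_path, reason) for entries that fail to open."""
    failures = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            audio = json.loads(line)["audio"]
            # Assumption: relative paths are resolved against base_dir.
            # An absolute audio path overrides base_dir.
            resolved = Path(base_dir, audio)
            if not resolved.is_file():
                failures.append((number, audio, "file not found"))
                continue
            try:
                with wave.open(str(resolved), "rb") as wav:
                    wav.getnframes()  # forces the header to be parsed
            except (wave.Error, EOFError) as exc:
                failures.append((number, audio, f"not a readable PCM WAV: {exc}"))
    return failures
```

Set `base_dir` to the directory from which you plan to launch training, so that relative paths are checked the same way they will later be resolved.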

Optional Evaluation File

You can pass a second JSONL file with the --eval_file flag. It must follow the same format as the training file. When provided, the script evaluates on this split every --save_steps steps and logs the validation loss alongside the training loss.
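If you have a single JSONL file and no separate validation set, one option is to hold out a small random portion of it. The following is a minimal sketch; the function name, the 5% default, and the fixed seed are illustrative assumptions, not recommendations from the fine-tuning script.

```python
import random


def split_jsonl(source, train_path, eval_path, eval_fraction=0.05, seed=0):
    """Shuffle the records in `source` and write disjoint train/eval files.

    Returns (train_count, eval_count).
    """
    with open(source, encoding="utf-8") as handle:
        records = [line.rstrip("\n") for line in handle if line.strip()]
    if len(records) < 2:
        raise ValueError("need at least two records to split")
    random.Random(seed).shuffle(records)  # seeded so the split is reproducible
    eval_count = max(1, int(len(records) * eval_fraction))
    held_out, training = records[:eval_count], records[eval_count:]
    for path, subset in ((train_path, training), (eval_path, held_out)):
        with open(path, "w", encoding="utf-8") as handle:
            handle.writelines(record + "\n" for record in subset)
    return len(training), len(held_out)
```

The resulting evaluation file can then be supplied through --eval_file. A random split does not keep speakers or recordings apart, so if the same speaker appears on both sides the validation loss may look better than it would on unseen speakers.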
