Preparing Training Data for Qwen3-ASR Fine-Tuning

The fine-tuning script expects training data as a JSONL file — one JSON object per line — where each line maps a WAV audio file to its ground-truth transcript. Preparing this file correctly is the most important step before you start training, because the text field encoding directly controls whether the model learns language identification alongside transcription.

JSONL Format Specification

Each line in the training file must be a valid JSON object with two required fields:

Field	Type	Description
`audio`	string	Absolute or relative path to a WAV audio file on disk
`text`	string	Transcript string, including a required language prefix (see below)

The file must use the .jsonl extension (.json is also accepted, provided the content is still JSON Lines). Write one record per line, with no trailing commas, in UTF-8 encoding.

The script uses the Hugging Face datasets library to load the JSONL file. Any additional fields present in each line are silently dropped during preprocessing.
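The format rules above can be checked before training starts. The following is a minimal sketch using only the Python standard library; the function name `validate_jsonl_lines` is hypothetical and not part of the fine-tuning script. Treating blank lines as errors is a conservative assumption, not documented behaviour of the loader.

```python
import json

REQUIRED_FIELDS = ("audio", "text")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: blank lines are flagged rather than skipped.
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            # Both required fields must be present and must be strings.
            if not isinstance(record.get(field), str):
                problems.append((number, f"missing or non-string field: {field}"))
    return problems
```

To check a file, open it with `encoding="utf-8"` and pass the file object to the function; an empty result means every line has both required fields.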

The `text` Field Format

The text value is not a bare transcript. It follows a structured template that mirrors the model’s output format:

language {Language}<asr_text>{Transcript}

The <asr_text> token is a fixed delimiter — the model was trained with this separator and expects it at inference time as well.

With Known Language

When you know the spoken language, supply its capitalised English name immediately after the keyword language and a single space:

language English<asr_text>This is a test sentence.
language Chinese<asr_text>你好世界
language French<asr_text>Bonjour le monde
language Japanese<asr_text>こんにちは世界

Training with an explicit language label teaches the model to produce correct language identification output for that utterance.

Without Language Information

If you do not have reliable language labels for your data, use the literal keyword None:

language None<asr_text>Transcript text here

When language None is used, the model does not learn language-detection behaviour from that example; only the transcription path is trained. Prefer this over guessing when your language labels are unavailable or unreliable, because a wrong label introduces noisy supervision.
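The template for both cases can be produced by one small helper. This is an illustrative sketch; `make_text` and `ASR_DELIMITER` are hypothetical names, not identifiers from the fine-tuning script.

```python
ASR_DELIMITER = "<asr_text>"


def make_text(transcript, language=None):
    """Build the `text` value: 'language {Language}<asr_text>{Transcript}'."""
    # Fall back to the literal keyword None when no reliable label exists.
    label = language if language else "None"
    return f"language {label}{ASR_DELIMITER}{transcript}"


print(make_text("This is a test sentence.", "English"))
print(make_text("Transcript text here"))
```

The helper does not check that the language name is one the model supports; it only assembles the string.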

Example JSONL File

Below is a representative training file combining multiple languages and the None fallback:

{"audio":"/data/wavs/utt0001.wav","text":"language English<asr_text>This is a test sentence."}
{"audio":"/data/wavs/utt0002.wav","text":"language Chinese<asr_text>你好世界"}
{"audio":"/data/wavs/utt0003.wav","text":"language None<asr_text>transcript without language info"}
{"audio":"/data/wavs/utt0004.wav","text":"language English<asr_text>Another example from the training set."}
{"audio":"/data/wavs/utt0005.wav","text":"language French<asr_text>Une autre phrase en français."}

Each line is independent. You can mix languages freely within the same file.
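A file like the one above could be generated programmatically. The sketch below assumes the records are already available as Python dictionaries; `write_jsonl` is a hypothetical helper name. It uses `ensure_ascii=False` so that non-ASCII transcripts are stored as readable UTF-8 instead of `\uXXXX` escapes (both forms are valid JSON).

```python
import json


def write_jsonl(records, path):
    """Write one compact JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # separators=(",", ":") produces the compact style used in the example file.
            handle.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")))
            handle.write("\n")
```

For example, `write_jsonl([{"audio": "/data/wavs/utt0002.wav", "text": "language Chinese<asr_text>你好世界"}], "train.jsonl")` writes a one-line file matching the second record shown above.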

Language Prefix Options

The {Language} placeholder should be a capitalised English language name. Examples of valid values:

English, Chinese, French, German, Spanish, Japanese, Korean
Cantonese, Arabic, Hindi, Portuguese, Russian, Thai, Vietnamese
Any of the 52 languages and dialects supported by Qwen3-ASR
None — when language information is unavailable

Using language None does not harm transcription quality. It simply prevents that sample from contributing to the language-identification head’s gradient update.
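A quick structural check of the prefix can catch typos such as a lowercase language name or a missing delimiter. The sketch below is a hypothetical helper, not part of the fine-tuning script. It assumes the language name is a single capitalised ASCII word, as in the examples listed above; it does not verify that the name is actually supported by Qwen3-ASR, and a multi-word name would need a looser pattern.

```python
import re

# Assumption: the language label is one capitalised ASCII word (or the keyword None).
TEXT_PATTERN = re.compile(r"language ([A-Z][A-Za-z]*)<asr_text>(.*)", re.DOTALL)


def parse_text(text):
    """Split a `text` value into (language, transcript); language is None for 'None'."""
    match = TEXT_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"text does not match the expected template: {text!r}")
    language, transcript = match.groups()
    return (None if language == "None" else language), transcript
```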

Preparing Audio Files

The data collator inside qwen3_asr_sft.py loads each WAV file at runtime using librosa and resamples to 16,000 Hz mono. Requirements for your audio files:

Format: WAV (uncompressed PCM is safest; librosa can handle most sub-formats)
Sample rate: any — the loader resamples to 16 kHz automatically
Channels: any — the loader forces mono
Paths: must be accessible from the machine running the training script; absolute paths are the safest option

Missing or unreadable audio files cause a runtime error during the collation step. Verify that every path in your JSONL file resolves correctly before starting a long training run.
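The path check suggested above can be sketched with the standard library alone. `find_missing_audio` is a hypothetical helper; it only tests that each file exists and does not confirm that librosa can decode it. Relative paths are resolved against the current working directory here, which is an assumption, so run the check from the directory where you launch training.

```python
import json
import os


def find_missing_audio(jsonl_path):
    """Return (line_number, audio_path) for every record whose file is missing."""
    missing = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip blank lines; they carry no record
            audio_path = json.loads(line)["audio"]
            # os.path.isfile resolves relative paths against the working directory.
            if not os.path.isfile(audio_path):
                missing.append((number, audio_path))
    return missing
```

An empty list means every referenced path exists on the machine running the check.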

Optional Evaluation File

You can pass a second JSONL file with the --eval_file flag. It must follow the same format as the training file. When provided, the script evaluates on this split every --save_steps steps and logs the validation loss alongside the training loss.
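If you have a single pool of records and want to hold some out for evaluation, one possible approach is a deterministic shuffle followed by a split. This is a sketch only: `split_records` is a hypothetical helper, and the 10% fraction and seed are arbitrary illustrative defaults, not values recommended by the fine-tuning script.

```python
import random


def split_records(records, eval_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train_records, eval_records)."""
    shuffled = list(records)
    # A seeded Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    eval_count = int(len(shuffled) * eval_fraction)
    return shuffled[eval_count:], shuffled[:eval_count]
```

Each returned list can then be written to its own JSONL file, with the second one passed through --eval_file.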
