Run Qwen3-ASR Fine-Tuning: Single and Multi-GPU

The finetuning/qwen3_asr_sft.py script drives the entire fine-tuning loop. It loads a Qwen3-ASR checkpoint, wraps it with Hugging Face Trainer, and saves fully self-contained checkpoints that can be used for inference without any additional steps. This page walks through the complete training workflow, covers every command-line argument, and provides a ready-to-run shell script.

Training Workflow

Prepare your JSONL data

Create a training file (and optionally a validation file) in the JSONL format described on the Data Format page.

Choose single-GPU or multi-GPU

For a single GPU, run python qwen3_asr_sft.py. For multiple GPUs, run torchrun --nproc_per_node=N qwen3_asr_sft.py, where N is the number of GPUs. See the commands below.

Monitor training

Loss and other metrics are printed every --log_steps steps. Checkpoints are written to {output_dir}/checkpoint-{global_step} every --save_steps steps.

Resume if interrupted

Pass --resume 1 to automatically pick up from the latest checkpoint in output_dir, or --resume_from ./path/to/checkpoint to resume from a specific one.

Run inference on your checkpoint

Load any saved checkpoint directly with Qwen3ASRModel.from_pretrained. See Overview — Quick Inference After Fine-Tuning.

Launch Commands

Single GPU

Run the script directly with python. This is the simplest setup and requires no distributed configuration.

python qwen3_asr_sft.py \
  --model_path Qwen/Qwen3-ASR-1.7B \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --batch_size 32 \
  --grad_acc 4 \
  --lr 2e-5 \
  --epochs 1 \
  --save_steps 200 \
  --save_total_limit 5

Checkpoints are written to ./qwen3-asr-finetuning-out/checkpoint-<global_step>.

Multi-GPU (torchrun)

Use torchrun for data-parallel training across multiple GPUs on a single node. Set CUDA_VISIBLE_DEVICES to control which GPUs are used, and set --nproc_per_node to the number of selected GPUs.

export CUDA_VISIBLE_DEVICES=0,1
torchrun --nproc_per_node=2 qwen3_asr_sft.py \
  --model_path Qwen/Qwen3-ASR-1.7B \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --batch_size 32 \
  --grad_acc 4 \
  --lr 2e-5 \
  --epochs 1 \
  --save_steps 200

The effective batch size per update is batch_size × grad_acc × nproc_per_node.
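
As a worked example of that formula, the snippet below plugs in the values from the two launch commands on this page. The helper name effective_batch_size is only illustrative and is not part of qwen3_asr_sft.py.

```python
def effective_batch_size(batch_size: int, grad_acc: int, nproc_per_node: int) -> int:
    """Samples that contribute to one optimiser update across all processes."""
    return batch_size * grad_acc * nproc_per_node


# Two-GPU torchrun command: --batch_size 32 --grad_acc 4 --nproc_per_node=2
print(effective_batch_size(batch_size=32, grad_acc=4, nproc_per_node=2))  # 256

# Single-GPU command: one process
print(effective_batch_size(batch_size=32, grad_acc=4, nproc_per_node=1))  # 128
```

If you add GPUs and want to keep the effective batch size unchanged, reduce --batch_size or --grad_acc by the same factor.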

Resuming Training

If a training run is interrupted, you can resume from any saved checkpoint without losing progress.

Explicit checkpoint path

Point --resume_from at a specific checkpoint directory:

python qwen3_asr_sft.py \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --resume_from ./qwen3-asr-finetuning-out/checkpoint-200

Auto-resume (latest checkpoint)

Set --resume 1 to let the script find and load the highest-numbered checkpoint inside output_dir automatically:

python qwen3_asr_sft.py \
  --train_file ./train.jsonl \
  --output_dir ./qwen3-asr-finetuning-out \
  --resume 1
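
To illustrate what "highest-numbered checkpoint" means, here is a minimal sketch of such a lookup. It assumes only the checkpoint-{global_step} directory naming described on this page; the function name find_latest_checkpoint is hypothetical, and the script's own lookup may be implemented differently.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-200".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> subdirectory with the largest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Compare steps as integers so that checkpoint-1000 ranks above checkpoint-200.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return str(best_path) if best_path is not None else None


print(find_latest_checkpoint("./qwen3-asr-finetuning-out"))
```

The integer comparison matters: sorting the directory names as strings would place checkpoint-200 after checkpoint-1000.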

Training Arguments Reference

Paths

--model_path

string

default:"Qwen/Qwen3-ASR-1.7B"

Path to a local model directory or a Hugging Face Hub repository ID. The script calls Qwen3ASRModel.from_pretrained with this value, so any valid Hub ID or local path works.

--train_file

string

default:"train.jsonl"

Path to the JSONL training file. Each line must contain audio and text fields. Required.
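
Before launching a long run, it can help to check that every line of the file parses and carries both fields. The sketch below is a standalone pre-flight check, not something the training script provides; find_bad_lines is a hypothetical helper, and the sample records are placeholders. The exact content expected in each field is defined on the Data Format page.

```python
import json
from typing import List

REQUIRED_FIELDS = ("audio", "text")


def find_bad_lines(jsonl_text: str) -> List[str]:
    """Return one message per line that is not a JSON object with the required fields."""
    problems = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
    return problems


# Placeholder records: the second one lacks the "text" field.
sample = '{"audio": "a.wav", "text": "hello"}\n{"audio": "b.wav"}\n'
for message in find_bad_lines(sample):
    print(message)
```

To check a real file, pass its contents, for example find_bad_lines(open("train.jsonl", encoding="utf-8").read()).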

--eval_file

string

default:""

Optional path to a JSONL evaluation file in the same format as --train_file. When provided, validation loss is computed every --save_steps steps.

--output_dir

string

default:"./qwen3-asr-finetuning-out"

Directory where checkpoints are written. Each checkpoint is saved as {output_dir}/checkpoint-{global_step} and contains the model weights plus all files needed for inference.

Audio

--sr

int

default:"16000"

Target sample rate in Hz for audio loading. All WAV files are resampled to this rate by librosa before being passed to the model’s processor. The Qwen3-ASR processor expects 16,000 Hz, so this value should not normally be changed.

Training Hyperparameters

--batch_size

int

default:"32"

Per-device training batch size. This is the number of samples processed on each GPU per forward-backward pass.

--grad_acc

int

default:"4"

Gradient accumulation steps. Gradients are accumulated over this many mini-batches before an optimiser step, effectively multiplying the batch size without increasing memory usage.

--lr

float

default:"2e-5"

Peak learning rate for the AdamW optimiser. The scheduler type defaults to linear with a warm-up ratio of 0.02.

--epochs

float

default:"1"

Number of training epochs. Fractional values are accepted (e.g., 0.5 for half an epoch).

--log_steps

int

default:"10"

Log training metrics every N global steps.

--lr_scheduler_type

string

default:"linear"

Learning rate scheduler type. Passed directly to TrainingArguments. Common values: "linear", "cosine", "constant".

--warmup_ratio

float

default:"0.02"

Fraction of total training steps used for linear learning-rate warm-up.
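
To get a feel for how --batch_size, --grad_acc, --epochs, and --warmup_ratio interact, the sketch below estimates the number of optimiser steps and warm-up steps for a run. It is an approximation under stated assumptions: one optimiser step per grad_acc batches, data split evenly across processes, and warm-up steps rounded up. The Trainer's own rounding can differ slightly, and the dataset size used here is a made-up figure.

```python
import math
from typing import Tuple


def estimate_schedule(
    num_samples: int,
    batch_size: int,
    grad_acc: int,
    world_size: int,
    epochs: float,
    warmup_ratio: float,
) -> Tuple[int, int]:
    """Return (total optimiser steps, warm-up steps) as a rough estimate."""
    # Batches each process sees per epoch when data is sharded across processes.
    batches_per_epoch = math.ceil(num_samples / (batch_size * world_size))
    # One optimiser step per grad_acc batches, with at least one step per epoch.
    updates_per_epoch = max(batches_per_epoch // grad_acc, 1)
    total_updates = math.ceil(epochs * updates_per_epoch)
    warmup_updates = math.ceil(total_updates * warmup_ratio)
    return total_updates, warmup_updates


# Hypothetical dataset of 100,000 samples with the defaults on two GPUs.
print(estimate_schedule(100_000, 32, 4, 2, epochs=1, warmup_ratio=0.02))  # (390, 8)
```

Comparing the estimated total with --save_steps shows roughly how many checkpoints a run will produce.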

Checkpoint Settings

--save_strategy

string

default:"steps"

When to save checkpoints. "steps" saves every --save_steps global steps. Other values accepted by Hugging Face TrainingArguments are also valid.

--save_steps

int

default:"200"

Save a checkpoint (and run evaluation, if --eval_file is provided) every N global steps.

--save_total_limit

int

default:"5"

Maximum number of checkpoints to keep on disk. Older checkpoints are deleted when this limit is exceeded.
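
The rotation described above can be pictured with a small model. The sketch assumes plain oldest-first deletion; the actual clean-up is performed by Hugging Face Trainer, whose rules may differ in detail. The function name rotate_checkpoints and the treatment of non-positive limits as "keep everything" are assumptions made for illustration.

```python
from typing import List, Tuple


def rotate_checkpoints(steps: List[int], save_total_limit: int) -> Tuple[List[int], List[int]]:
    """Split checkpoint steps into (kept, deleted), keeping the newest ones."""
    ordered = sorted(steps)
    if save_total_limit <= 0:
        # Assumption for this sketch: a non-positive limit disables rotation.
        return ordered, []
    cut = max(len(ordered) - save_total_limit, 0)
    return ordered[cut:], ordered[:cut]


# Seven checkpoints saved every 200 steps, with --save_total_limit 5.
kept, deleted = rotate_checkpoints([200, 400, 600, 800, 1000, 1200, 1400], 5)
print(kept)     # [600, 800, 1000, 1200, 1400]
print(deleted)  # [200, 400]
```

Copy any checkpoint you want to keep out of output_dir before it falls outside the limit.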

Resuming

--resume_from

string

default:""

Explicit path to a checkpoint directory to resume from. Takes precedence over --resume.

--resume

int

default:"0"

Set to 1 to automatically resume from the latest checkpoint found inside --output_dir. Ignored if --resume_from is also set.
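
The precedence between the two flags can be summarised in a few lines. This is a sketch of the documented behaviour rather than the script's code: resolve_resume_path is a hypothetical name, the latest checkpoint is passed in as an argument, and starting fresh when no checkpoint exists is an assumption.

```python
from typing import Optional


def resolve_resume_path(
    resume_from: str,
    resume: int,
    latest_in_output_dir: Optional[str],
) -> Optional[str]:
    """Pick the checkpoint to resume from, or None to start a fresh run."""
    if resume_from:
        # An explicit path wins, even if --resume 1 is also given.
        return resume_from
    if resume == 1:
        # May be None when output_dir holds no checkpoint yet (assumed: start fresh).
        return latest_in_output_dir
    return None


print(resolve_resume_path("./out/checkpoint-200", 1, "./out/checkpoint-400"))  # ./out/checkpoint-200
print(resolve_resume_path("", 1, "./out/checkpoint-400"))  # ./out/checkpoint-400
print(resolve_resume_path("", 0, "./out/checkpoint-400"))  # None
```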

DataLoader Performance Options

These flags control the PyTorch DataLoader used during training. Tuning them can improve GPU utilisation, especially when audio loading is the bottleneck.

--num_workers

int

default:"4"

Number of worker processes for the DataLoader. Increase this if CPU-side audio loading is a bottleneck. Set to 0 to load data in the main process (useful for debugging).

--pin_memory

int

default:"1"

Set to 1 to enable pinned (page-locked) memory for faster host-to-device transfers. Disable (0) if you experience memory pressure.

--persistent_workers

int

default:"1"

Set to 1 to keep worker processes alive between epochs, avoiding the overhead of relaunching them. Requires --num_workers > 0.

--prefetch_factor

int

default:"2"

Number of batches each worker prefetches. Higher values reduce idle GPU time but increase memory usage. Has no effect when --num_workers is 0.
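
The four flags interact: --persistent_workers and --prefetch_factor only make sense with at least one worker, and torch.utils.data.DataLoader rejects them when num_workers is 0. The sketch below shows one way the integer flags could be translated into DataLoader keyword arguments. The mapping is an assumption for illustration, and dataloader_kwargs is a hypothetical helper, not a function from the script.

```python
from typing import Any, Dict


def dataloader_kwargs(
    num_workers: int,
    pin_memory: int,
    persistent_workers: int,
    prefetch_factor: int,
) -> Dict[str, Any]:
    """Translate the 0/1 command-line flags into DataLoader keyword arguments."""
    kwargs: Dict[str, Any] = {
        "num_workers": num_workers,
        "pin_memory": bool(pin_memory),
    }
    if num_workers > 0:
        # Only valid with worker processes; omitted when loading in the main process.
        kwargs["persistent_workers"] = bool(persistent_workers)
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# Defaults from this page.
print(dataloader_kwargs(num_workers=4, pin_memory=1, persistent_workers=1, prefetch_factor=2))
# Debugging setup: load data in the main process.
print(dataloader_kwargs(num_workers=0, pin_memory=1, persistent_workers=1, prefetch_factor=2))
```

With num_workers set to 0, the worker-only options are dropped, which matches the notes above that they have no effect in that mode.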

One-Click Shell Script

The following self-contained script mirrors the full multi-GPU example from the fine-tuning README and sets all recommended DataLoader flags. Save it as run_finetune.sh, make it executable, and run it directly.

#!/usr/bin/env bash
set -e

export CUDA_VISIBLE_DEVICES=0,1

MODEL_PATH="Qwen/Qwen3-ASR-1.7B"
TRAIN_FILE="./train.jsonl"
EVAL_FILE="./eval.jsonl"
OUTPUT_DIR="./qwen3-asr-finetuning-out"

torchrun --nproc_per_node=2 qwen3_asr_sft.py \
  --model_path "${MODEL_PATH}" \
  --train_file "${TRAIN_FILE}" \
  --eval_file "${EVAL_FILE}" \
  --output_dir "${OUTPUT_DIR}" \
  --batch_size 32 \
  --grad_acc 4 \
  --lr 2e-5 \
  --epochs 1 \
  --log_steps 10 \
  --save_strategy steps \
  --save_steps 200 \
  --save_total_limit 5 \
  --num_workers 2 \
  --pin_memory 1 \
  --persistent_workers 1 \
  --prefetch_factor 2

Remove the --eval_file line if you do not have a validation set. The script will skip evaluation steps and only report training loss.
