Complete Training Script

Here’s the complete training script with full parameter documentation:
launch_training.sh
#!/bin/bash
# Complete QwenVL Training Launch Script with Full Parameter Documentation

# ======================
# Distributed Configuration
# ======================
# The ${VAR:-default} form keeps any value already set in the environment,
# so overrides such as `NPROC_PER_NODE=1 bash launch_training.sh` take effect.
MASTER_ADDR="${MASTER_ADDR:-127.0.0.1}"                                # [Required] Master node IP for multi-GPU training
MASTER_PORT="${MASTER_PORT:-$(shuf -i 20000-29999 -n 1)}"              # Random port to avoid conflicts
NPROC_PER_NODE="${NPROC_PER_NODE:-$(nvidia-smi --list-gpus | wc -l)}"  # Automatically detects available GPUs

# ======================
# Path Configuration
# ======================
MODEL_PATH="/path/to/Qwen2.5-VL-3B-Instruct"  # [ModelArguments] Pretrained model path
OUTPUT_DIR="./checkpoints"                   # Directory for saving checkpoints
CACHE_DIR="./cache"                          # [TrainingArguments] Cache directory for models

# ======================
# Dataset Configuration
# ======================
DATASETS="your_dataset%100"                  # [DataArguments] Dataset with sampling rate

# ======================
# Training Hyperparameters
# ======================
# Arguments are collected in a bash array: a comment placed after a trailing
# backslash breaks the line continuation, while array elements may carry comments.
ARGS=(
    # Core Arguments
    --model_name_or_path "$MODEL_PATH"   # [ModelArguments] Model identifier
    --tune_mm_llm True                   # [TrainingArguments] Train LLM or not
    --tune_mm_vision False               # [TrainingArguments] Train VIT or not
    --tune_mm_mlp False                  # [TrainingArguments] Train MLP or not
    --dataset_use "$DATASETS"            # [DataArguments] Dataset specification
    --output_dir "$OUTPUT_DIR"           # Output directory for checkpoints
    --cache_dir "$CACHE_DIR"             # [TrainingArguments] Model cache location

    # Precision & Memory
    --bf16                               # Use bfloat16 precision (Ampere+ GPUs)
    --per_device_train_batch_size 4      # Batch size per GPU
    --gradient_accumulation_steps 4      # Effective batch size multiplier

    # Learning Rate Configuration
    --learning_rate 2e-7                 # Base learning rate
    --mm_projector_lr 1e-5               # [TrainingArguments] Projector-specific LR
    --vision_tower_lr 1e-6               # [TrainingArguments] Vision encoder LR
    --optim adamw_torch                  # [TrainingArguments] Optimizer selection

    # Sequence Configuration
    --model_max_length 4096              # [TrainingArguments] Max sequence length
    --data_flatten True                  # [DataArguments] Concatenate batch sequences
    --data_packing True                  # [DataArguments] Using packing data

    # Image Processing ($((...)) passes the product as a plain integer)
    --max_pixels $((576*28*28))          # [DataArguments] Max image pixels (H*W) for image
    --min_pixels $((16*28*28))           # [DataArguments] Min image pixels for image

    # Video Processing
    --video_fps 2                        # [DataArguments] Video fps
    --video_max_frames 8                 # [DataArguments] Max frames per video
    --video_min_frames 4                 # [DataArguments] Min frames per video
    --video_max_pixels $((1664*28*28))   # [DataArguments] Max pixels per video
    --video_min_pixels $((256*28*28))    # [DataArguments] Min pixels per video

    # Training Schedule
    --num_train_epochs 3                 # Total training epochs
    --warmup_ratio 0.03                  # LR warmup proportion
    --lr_scheduler_type "cosine"         # Learning rate schedule
    --weight_decay 0.01                  # L2 regularization strength

    # Logging & Checkpoints
    --logging_steps 10                   # Log metrics interval
    --save_steps 500                     # Checkpoint save interval
    --save_total_limit 3                 # Max checkpoints to keep

    # LoRA Config
    --lora_enable True                   # [TrainingArguments] Enable LoRA
    --lora_r 8                           # [TrainingArguments] LoRA r
    --lora_alpha 16                      # [TrainingArguments] LoRA alpha
    --lora_dropout 0.0                   # [TrainingArguments] LoRA dropout

    # Advanced Options
    --deepspeed zero3.json               # DeepSpeed configuration
)

torchrun --nproc_per_node="$NPROC_PER_NODE" \
         --master_addr="$MASTER_ADDR" \
         --master_port="$MASTER_PORT" \
         qwenvl/train/train_qwen.py "${ARGS[@]}"

Parameter Categories

The script accepts arguments in three main categories:

Model Arguments

| Parameter | Description | Default |
| --- | --- | --- |
| --model_name_or_path | Path or identifier for pretrained model | Required |

Training Arguments

Component Training Flags

| Parameter | Description | Recommended |
| --- | --- | --- |
| --tune_mm_llm | Whether to train the language model | True |
| --tune_mm_vision | Whether to train the vision encoder | False (for mixed image/video) |
| --tune_mm_mlp | Whether to train the MLP projector | False |

When training with both image and video data, set --tune_mm_vision False to avoid instability.

Precision & Memory

| Parameter | Description | Value |
| --- | --- | --- |
| --bf16 | Use bfloat16 precision (requires Ampere+ GPUs) | Flag |
| --per_device_train_batch_size | Batch size per GPU | 4 |
| --gradient_accumulation_steps | Gradient accumulation steps | 4 |
| --cache_dir | Cache directory for models | ./cache |
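
The per-GPU batch size and the accumulation steps multiply with the number of GPUs to give the number of samples consumed per optimizer step. A minimal sketch of that arithmetic follows; the function name and the 4-GPU node are illustrative assumptions, not part of the launch script.

```shell
#!/bin/bash
# Effective (global) batch size = per-device batch * accumulation steps * number of GPUs.
effective_batch_size() {
    local per_device=$1 grad_accum=$2 num_gpus=$3
    echo $(( per_device * grad_accum * num_gpus ))
}

# Values from the launch script, on an assumed 4-GPU node.
effective_batch_size 4 4 4
```

Halving --per_device_train_batch_size while doubling --gradient_accumulation_steps leaves this product unchanged, which is why the two are usually adjusted together.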

Learning Rate Configuration

| Parameter | Description | Range |
| --- | --- | --- |
| --learning_rate | Base learning rate for the model | 1e-6 to 2e-7 |
| --mm_projector_lr | Learning rate for multimodal projector | 1e-5 |
| --vision_tower_lr | Learning rate for vision encoder | 1e-6 |
| --optim | Optimizer type | adamw_torch |

The suggested learning rate range is from 1e-6 to 2e-7. Start with 2e-7 for stable training.

Sequence Configuration

| Parameter | Description | Value |
| --- | --- | --- |
| --model_max_length | Maximum sequence length | 4096 |

Data Arguments

Dataset Selection

| Parameter | Description | Example |
| --- | --- | --- |
| --dataset_use | Dataset names with sampling rates | "my_dataset%100" |
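
The name%rate form reads as a dataset name followed by a sampling percentage. The sketch below splits such a value with shell parameter expansion, purely to illustrate the format. The comma-separated list of several datasets and the default of 100 when no % is given are assumptions, the function and the extra dataset names are hypothetical, and the real parsing happens inside the training code.

```shell
#!/bin/bash
# Illustration only: split "name%rate" entries of a --dataset_use value.
# Assumptions: entries are comma-separated, and a missing "%rate" means 100.
parse_dataset_spec() {
    local spec=$1 entry name rate
    local IFS=','
    for entry in $spec; do
        name=${entry%%\%*}          # text before the first '%'
        if [[ $entry == *%* ]]; then
            rate=${entry##*%}       # text after the last '%'
        else
            rate=100
        fi
        echo "$name $rate"
    done
}

parse_dataset_spec "my_dataset%100,other_dataset%50,third_dataset"
```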

Data Processing

| Parameter | Description | Default |
| --- | --- | --- |
| --data_flatten | Concatenate batch sequences into one | True |
| --data_packing | Use packed data (requires preprocessing) | True |

  • data_flatten=True means data in a batch are concatenated into one sequence
  • data_packing=True requires preprocessing with tools/pack_data.py

Image Processing

| Parameter | Description | Value |
| --- | --- | --- |
| --max_pixels | Maximum image pixels (H×W) | 576*28*28 |
| --min_pixels | Minimum image pixels | 16*28*28 |
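
The pixel limits are written as products, so N*28*28 can be read as N blocks of 28×28 pixels. The sketch below expands the two image limits into raw pixel counts; the function name is hypothetical, and treating each 28×28 block as roughly one visual token is an assumption about the model rather than something this page states.

```shell
#!/bin/bash
# Convert an "N*28*28" pixel budget into a raw pixel count.
pixel_budget() {
    local blocks=$1
    echo $(( blocks * 28 * 28 ))
}

echo "max_pixels: $(pixel_budget 576)"   # 576 blocks of 28x28 pixels
echo "min_pixels: $(pixel_budget 16)"    # 16 blocks of 28x28 pixels
```

For scale, 576*28*28 is the area of a 672×672 image, and 16*28*28 is the area of a 112×112 image.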

Video Processing

| Parameter | Description | Value |
| --- | --- | --- |
| --video_fps | Video frames per second | 2 |
| --video_max_frames | Maximum frames per video | 8 |
| --video_min_frames | Minimum frames per video | 4 |
| --video_max_pixels | Maximum pixels per video | 1664*28*28 |
| --video_min_pixels | Minimum pixels per video | 256*28*28 |

Training resolution is critical for model performance. Ensure --max_pixels and --min_pixels are properly set for your use case.
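
One plausible reading of how --video_fps, --video_min_frames, and --video_max_frames interact is that frames are sampled at the target rate and the count is then clamped to the allowed range. The sketch below implements only that assumed rule, with a hypothetical function name; the actual sampling logic in the training code may differ.

```shell
#!/bin/bash
# Assumed rule: frames = duration * fps, clamped to [min_frames, max_frames].
frames_to_sample() {
    local duration_s=$1 fps=$2 min_frames=$3 max_frames=$4
    local n=$(( duration_s * fps ))
    if (( n < min_frames )); then n=$min_frames; fi
    if (( n > max_frames )); then n=$max_frames; fi
    echo "$n"
}

# Clips of 1 s, 3 s and 60 s with the values from the table above.
frames_to_sample 1 2 4 8
frames_to_sample 3 2 4 8
frames_to_sample 60 2 4 8
```

Under this reading, any clip longer than four seconds is reduced to 8 frames, so long videos are sampled far more sparsely than the nominal 2 fps.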

Training Schedule

| Parameter | Description | Value |
| --- | --- | --- |
| --num_train_epochs | Total training epochs | 3 |
| --warmup_ratio | Learning rate warmup proportion | 0.03 |
| --lr_scheduler_type | Learning rate schedule | cosine |
| --weight_decay | L2 regularization strength | 0.01 |
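
Because --warmup_ratio is a fraction of the total number of optimizer steps, the absolute warmup length depends on dataset size and global batch size. The sketch below estimates both numbers. The dataset size of 10000 samples and the global batch of 64 are made-up inputs, the function name is hypothetical, and rounding both quantities up is an assumption about how the trainer counts steps.

```shell
#!/bin/bash
# Estimate total optimizer steps and warmup steps.
# Assumption: steps per epoch and warmup steps are both rounded up.
warmup_steps() {
    local samples=$1 global_batch=$2 epochs=$3 ratio=$4
    awk -v s="$samples" -v b="$global_batch" -v e="$epochs" -v r="$ratio" '
        function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
        BEGIN {
            total = ceil(s / b) * e
            printf "%d %d\n", total, ceil(total * r)
        }'
}

# Prints "<total steps> <warmup steps>" for 10000 samples, batch 64, 3 epochs.
warmup_steps 10000 64 3 0.03
```

With these inputs the run has 471 optimizer steps, of which the first 15 are warmup, so on small datasets a ratio of 0.03 can amount to only a handful of steps.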

Logging & Checkpoints

| Parameter | Description | Value |
| --- | --- | --- |
| --logging_steps | Interval for logging metrics | 10 |
| --save_steps | Interval for saving checkpoints | 500 |
| --save_total_limit | Maximum checkpoints to keep | 3 |
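
The Hugging Face Trainer writes each checkpoint to a checkpoint-<step> directory inside the output directory, and --save_total_limit 3 keeps only the newest three. A small helper for locating the most recent one might look like this; the function name is hypothetical and the demo runs against a temporary directory rather than a real training run.

```shell
#!/bin/bash
# Print the checkpoint-<step> directory with the highest step number, if any.
latest_checkpoint() {
    local output_dir=$1 dir step best_dir="" best_step=-1
    for dir in "$output_dir"/checkpoint-*; do
        [[ -d $dir ]] || continue
        step=${dir##*checkpoint-}
        [[ $step =~ ^[0-9]+$ ]] || continue
        if (( 10#$step > best_step )); then
            best_step=$(( 10#$step ))
            best_dir=$dir
        fi
    done
    if [[ -n $best_dir ]]; then echo "$best_dir"; fi
}

# Demo with empty placeholder directories.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/checkpoint-500" "$demo_dir/checkpoint-1000" "$demo_dir/checkpoint-1500"
basename "$(latest_checkpoint "$demo_dir")"
rm -rf "$demo_dir"
```

Comparing step numbers numerically matters here: a plain alphabetical sort would rank checkpoint-500 above checkpoint-1500.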

Advanced Options

DeepSpeed Configuration

--deepspeed zero3.json
Provide a DeepSpeed configuration file for distributed training optimization.
The Qwen3VL MoE model does not support DeepSpeed with ZeRO-3. Additionally, Hugging Face’s official implementation does not currently support the load-balancing loss.
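
This page does not show the contents of zero3.json. As a rough sketch of what a ZeRO-3 config for the Hugging Face Trainer integration commonly contains, the snippet below writes one to a separate, hypothetical file name. The selection of keys is an assumption; the values marked "auto" are filled in by the Trainer from its own arguments.

```shell
#!/bin/bash
# Write a minimal ZeRO-3 config. "auto" lets the Hugging Face Trainer fill in
# the value from its own command-line arguments.
write_zero3_config() {
    local path=$1
    cat > "$path" <<'EOF'
{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF
}

# Hypothetical output path, chosen so an existing zero3.json is not overwritten.
write_zero3_config "${TMPDIR:-/tmp}/zero3_example.json"
```

If the repository already ships its own DeepSpeed config, prefer that file over this sketch.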

Flash Attention

To enable Flash Attention 2, add the following to your model’s config.json:
config.json
{
  "_attn_implementation": "flash_attention_2",
  ...
}

Hardware Requirements

Training Qwen2.5-VL-3B

Minimum requirements:
  • 4x GPUs with 24GB VRAM (e.g., RTX 3090, RTX 4090)
  • With DeepSpeed ZeRO-3 and gradient checkpointing
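
To check a machine against the 24GB figure before launching, the per-GPU memory reported by nvidia-smi can be compared with a threshold. The sketch below reads one memory.total value in MiB per line from standard input so it can be tried without a GPU; the function name, the 24000 MiB threshold, and the demo readings are assumptions.

```shell
#!/bin/bash
# Read one memory.total value (MiB) per line and report GPUs below a threshold.
# Returns 0 if every GPU is large enough, 1 otherwise.
check_vram() {
    local min_mib=$1 index=0 status=0 total
    while read -r total; do
        if (( total < min_mib )); then
            echo "GPU $index: ${total} MiB is below ${min_mib} MiB"
            status=1
        fi
        index=$(( index + 1 ))
    done
    return "$status"
}

# Real usage (requires an NVIDIA driver):
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | check_vram 24000
# Demo with made-up readings for three GPUs:
printf '24564\n24564\n16384\n' | check_vram 24000 || echo "at least one GPU is too small"
```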

Training Qwen2.5-VL-32B

Recommended configuration:
  • 8x 80GB GPUs (e.g., A100, H100)
  • Refer to scripts/sft_32b.sh for configuration

Example Usage

Basic Training

bash launch_training.sh

Single GPU Training

NPROC_PER_NODE=1 bash launch_training.sh
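
A related case is restricting training to a subset of GPUs with CUDA_VISIBLE_DEVICES. nvidia-smi --list-gpus lists every GPU on the machine regardless of that variable, so the automatic count can exceed the number of devices the process can actually use. A sketch of a count that respects the variable follows; the function name is hypothetical, and it assumes CUDA_VISIBLE_DEVICES holds a comma-separated list.

```shell
#!/bin/bash
# Count usable GPUs: honor CUDA_VISIBLE_DEVICES if set, else ask nvidia-smi.
count_visible_gpus() {
    if [[ -n ${CUDA_VISIBLE_DEVICES:-} ]]; then
        local IFS=','
        local ids=($CUDA_VISIBLE_DEVICES)   # split the list on commas
        echo "${#ids[@]}"
    else
        nvidia-smi --list-gpus | wc -l
    fi
}

# Demo: two devices selected out of however many the machine has.
CUDA_VISIBLE_DEVICES=0,2 count_visible_gpus
```

The result can then be passed on the command line, for example NPROC_PER_NODE=$(count_visible_gpus) bash launch_training.sh.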

Multi-node Training

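On the master node:
MASTER_ADDR="192.168.1.1" bash launch_training.sh
On worker nodes (MASTER_ADDR still points at the master node):
MASTER_ADDR="192.168.1.1" bash launch_training.sh
Every node must also use the same MASTER_PORT, so replace the random port with a fixed value for multi-node runs.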
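
torchrun tells nodes apart through --nnodes and --node_rank, which the launch script does not pass. The sketch below prints, without running it, the torchrun prefix each node would need. The two-node layout, port 29500, and 8 GPUs per node are illustrative assumptions, the function name is hypothetical, and the training arguments from the launch script would follow the script path.

```shell
#!/bin/bash
# Print (not run) the torchrun command prefix for one node of a multi-node job.
multinode_cmd() {
    local nnodes=$1 node_rank=$2 master_addr=$3 master_port=$4 nproc=$5
    echo torchrun \
        --nnodes="$nnodes" \
        --node_rank="$node_rank" \
        --nproc_per_node="$nproc" \
        --master_addr="$master_addr" \
        --master_port="$master_port" \
        qwenvl/train/train_qwen.py
}

multinode_cmd 2 0 192.168.1.1 29500 8   # master node (rank 0)
multinode_cmd 2 1 192.168.1.1 29500 8   # worker node (rank 1)
```

Only --node_rank differs between the two commands; the address, port, and node count must match on every node.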

Monitoring Training

Monitor your training progress:
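# View logs (assumes the script's output was redirected to this file,
# e.g. bash launch_training.sh 2>&1 | tee checkpoints/training.log)
tail -f checkpoints/training.log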

# Monitor GPU usage
watch -n 1 nvidia-smi
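
The Hugging Face Trainer prints a dictionary-style line every --logging_steps steps, so a captured log can be reduced to its loss values with standard tools. The sketch below assumes lines of the form shown in the demo and uses a hypothetical function name; the sample line is made up, not real training output.

```shell
#!/bin/bash
# Extract the training loss from Trainer-style log lines read on standard input.
extract_loss() {
    grep -o "'loss': [0-9.]*" | awk '{print $2}'
}

# Demo on a made-up log line; real usage would be:
#   extract_loss < checkpoints/training.log
printf "%s\n" "{'loss': 1.2345, 'grad_norm': 0.98, 'learning_rate': 2e-07, 'epoch': 0.01}" | extract_loss
```

The leading quote in the pattern keeps keys such as 'train_loss' or 'eval_loss' from matching.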

Troubleshooting

If training runs out of GPU memory, try these solutions:
  1. Reduce --per_device_train_batch_size
  2. Increase --gradient_accumulation_steps to keep the effective batch size unchanged
  3. Reduce --model_max_length
  4. Enable gradient checkpointing in DeepSpeed config
  5. Use DeepSpeed ZeRO-3 for larger models
If training is slow, optimize performance:
  1. Enable Flash Attention 2 in config.json
  2. Use --data_packing True with preprocessed data
  3. Ensure --bf16 is enabled on Ampere+ GPUs
  4. Check if GPU utilization is at 100%
  5. Increase batch size if memory allows
If you see NaN losses or diverging training:
  1. Lower the learning rate (try 1e-7 or 5e-8)
  2. Set --tune_mm_vision False when using mixed image/video data
  3. Increase warmup ratio to 0.05 or 0.1
  4. Check data for corrupted images or invalid annotations
  5. Reduce --max_pixels if processing very large images
