ModelArguments
Configuration for model initialization and component fine-tuning.
Path to a pretrained model, or a model identifier from the Hugging Face Hub. Supports:
- Qwen/Qwen2-VL-*
- Qwen/Qwen2.5-VL-*
- Qwen/Qwen3-VL-*
- Qwen/Qwen3-VL-MoE-*
- Local paths to saved models
Whether to fine-tune the language model (LLM) component. When False, the LLM parameters are frozen during training.
Whether to fine-tune the multimodal projector (MLP/merger) component. This module projects vision features into the language model space.
Whether to fine-tune the vision tower component. When False, the vision encoder parameters are frozen.
Usage Example
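The following is a minimal sketch of how the three component flags might translate into frozen and trainable parameters. The field names (`model_name_or_path`, `tune_mm_llm`, `tune_mm_mlp`, `tune_mm_vision`), the local path, and the parameter-name prefixes are assumptions made for illustration, since this section does not show them; check the training script for the actual names.

```python
from dataclasses import dataclass


@dataclass
class ModelArguments:
    # All field names here are assumed for illustration.
    model_name_or_path: str
    tune_mm_llm: bool = False     # language model
    tune_mm_mlp: bool = False     # multimodal projector (merger)
    tune_mm_vision: bool = False  # vision tower


def plan_requires_grad(param_names, args):
    """Map each parameter name to True (trainable) or False (frozen).

    The prefixes "visual.merger." and "visual." are hypothetical; a real
    model may organize its modules differently.
    """
    plan = {}
    for name in param_names:
        if name.startswith("visual.merger."):
            plan[name] = args.tune_mm_mlp
        elif name.startswith("visual."):
            plan[name] = args.tune_mm_vision
        else:
            plan[name] = args.tune_mm_llm
    return plan


# Train the LLM and the projector, keep the vision encoder frozen.
args = ModelArguments(
    model_name_or_path="./checkpoints/my-model",  # hypothetical local path
    tune_mm_llm=True,
    tune_mm_mlp=True,
)
names = [
    "visual.blocks.0.attn.qkv.weight",
    "visual.merger.mlp.0.weight",
    "model.layers.0.mlp.up_proj.weight",
]
plan = plan_requires_grad(names, args)
for name, trainable in plan.items():
    print(f"{name}: {'trainable' if trainable else 'frozen'}")
```

In a real training loop the resulting booleans would be assigned to each parameter's `requires_grad` attribute.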
DataArguments
Configuration for data processing and vision-language inputs.
Comma-separated list of dataset names to use for training. Each name must be registered in the data list configuration. Example: "vqa,caption,ocr"
Whether to flatten data sequences for packed training. When enabled, multiple sequences can be packed into a single training example for efficiency.
Whether to enable data packing. Packs multiple examples together to minimize padding and improve GPU utilization.
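To illustrate why packing reduces padding, here is a sketch of one common strategy, greedy first-fit-decreasing bin packing over sequence lengths. This is an assumption for illustration and not necessarily the algorithm the training code uses; `pack_lengths` and the example lengths are hypothetical.

```python
def pack_lengths(lengths, max_len):
    """Group sequence indices into bins whose total length is at most max_len.

    Uses first-fit-decreasing: visit sequences from longest to shortest and
    place each one in the first bin that still has room.
    """
    bins = []   # each bin is a list of indices into `lengths`
    loads = []  # current token count of each bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} has {n} tokens, more than {max_len}")
        for b, load in enumerate(loads):
            if load + n <= max_len:
                bins[b].append(idx)
                loads[b] += n
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([idx])
            loads.append(n)
    return bins


lengths = [700, 300, 512, 200, 100]
packed = pack_lengths(lengths, max_len=1024)
print(packed)  # → [[0, 1], [2, 3, 4]]

# Slots used when every sequence is padded to max_len versus when packed.
print(len(lengths) * 1024, "vs", len(packed) * 1024)  # → 5120 vs 2048
```

With these example lengths, five padded rows shrink to two packed rows holding the same 1,812 tokens.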
Base interval for vision processing grid calculations.
Image Processing Parameters
Maximum number of pixels for image inputs. Default is 28 * 28 * 576 = 451,584 pixels. Controls the maximum resolution after dynamic resolution processing.
Minimum number of pixels for image inputs. Default is 28 * 28 * 16 = 12,544 pixels. Controls the minimum resolution for image processing.
Video Processing Parameters
Maximum number of frames to extract from video inputs. Videos that would yield more frames than this are subsampled.
Minimum number of frames to extract from video inputs.
Maximum number of pixels per frame for video inputs. Default is 1024 * 28 * 28 = 802,816 pixels.
Minimum number of pixels per frame for video inputs. Default is 256 * 28 * 28 = 200,704 pixels.
Target frames per second for video sampling. Videos are resampled to this FPS before frame extraction.
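The interaction between the target FPS and the frame limits can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's sampling code: the function name, its parameters, and the uniform-spacing strategy are all hypothetical.

```python
def sample_frame_indices(total_frames, native_fps, target_fps, min_frames, max_frames):
    """Pick evenly spaced frame indices for a video.

    The frame count implied by target_fps is clamped to [min_frames, max_frames]
    and can never exceed the number of frames the video actually has.
    """
    if total_frames < 1 or min_frames < 1:
        raise ValueError("total_frames and min_frames must be at least 1")
    duration_s = total_frames / native_fps
    wanted = round(duration_s * target_fps)
    n = max(min_frames, min(max_frames, wanted))
    n = min(n, total_frames)
    if n == 1:
        return [0]
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]


# A 10-second clip at 30 FPS sampled at 2 FPS would yield 20 frames,
# but the (illustrative) max_frames=8 caps it.
indices = sample_frame_indices(
    total_frames=300, native_fps=30.0, target_fps=2.0, min_frames=4, max_frames=8
)
print(len(indices), indices[0], indices[-1])  # → 8 0 299
```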
Usage Example
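A minimal sketch of a data configuration using the pixel defaults stated above. The field names (`dataset_use`, `data_packing`, `max_pixels`, and so on) are assumptions for illustration, since the section does not show them; only the numeric defaults come from the text.

```python
from dataclasses import dataclass

BLOCK = 28  # side length, in pixels, of the 28 x 28 block used in the stated defaults


@dataclass
class DataArguments:
    # All field names here are assumed for illustration.
    dataset_use: str = ""
    data_flatten: bool = False
    data_packing: bool = False
    max_pixels: int = BLOCK * BLOCK * 576
    min_pixels: int = BLOCK * BLOCK * 16
    video_max_pixels: int = 1024 * BLOCK * BLOCK
    video_min_pixels: int = 256 * BLOCK * BLOCK


args = DataArguments(dataset_use="vqa,caption,ocr", data_packing=True)

# Split the comma-separated dataset list into individual names.
datasets = [name.strip() for name in args.dataset_use.split(",") if name.strip()]
print(datasets)  # → ['vqa', 'caption', 'ocr']

# Express each pixel budget as a count of 28 x 28 blocks.
for label, pixels in [
    ("image max", args.max_pixels),
    ("image min", args.min_pixels),
    ("video max", args.video_max_pixels),
    ("video min", args.video_min_pixels),
]:
    print(f"{label}: {pixels:,} pixels = {pixels // (BLOCK * BLOCK)} blocks")
```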
TrainingArguments
Extends transformers.TrainingArguments with additional parameters for vision-language model training.
Base Parameters
Directory in which to cache downloaded models and datasets.
Optimizer to use. Options include:
- adamw_torch - PyTorch AdamW
- adamw_hf - Hugging Face AdamW
- sgd - Stochastic Gradient Descent
- adafactor - Memory-efficient Adafactor
Maximum sequence length. Sequences are right-padded, and truncated if necessary, to this length. Consider increasing it for long-form VQA or detailed image descriptions.
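Right-padding and truncation to a fixed length can be sketched as below. The helper name and the pad token ID of 0 are hypothetical; a real tokenizer supplies its own pad ID and usually does this internally.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Longer sequences are cut at max_length; shorter ones are padded on the
    right, with the mask marking real tokens as 1 and padding as 0.
    """
    clipped = list(token_ids[:max_length])
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask


print(pad_or_truncate([5, 6, 7], max_length=5))
# → ([5, 6, 7, 0, 0], [1, 1, 1, 0, 0])
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4))
# → ([1, 2, 3, 4], [1, 1, 1, 1])
```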
Component-Specific Learning Rates
Learning rate for the multimodal projector (merger) module. When set, overrides the base learning rate for projector parameters. Typical values: 1e-4 to 5e-4 (often higher than base LR).
Learning rate for the vision tower. When set, overrides the base learning rate for vision encoder parameters. Typical values: 1e-6 to 1e-5 (often lower than base LR to preserve pretrained features).
LoRA Configuration
Whether to use LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
LoRA rank. Higher values provide more capacity but increase trainable parameters. Typical values: 8, 16, 32, 64, 128
LoRA scaling parameter. Controls the magnitude of LoRA updates. Often set to 2x the LoRA rank (e.g., lora_alpha = 2 * lora_r).
Dropout probability for LoRA layers. Can help with regularization. Typical values: 0.0 to 0.1
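To make the rank and scaling trade-off concrete, the sketch below assumes the standard LoRA formulation, in which the low-rank update is multiplied by `lora_alpha / lora_r` and each adapted weight matrix of shape `d_out x d_in` gains `lora_r * (d_in + d_out)` trainable parameters. The helper names and the 4096 x 4096 layer size are illustrative, not taken from any particular model.

```python
def lora_scaling(lora_alpha, lora_r):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_param_count(d_in, d_out, lora_r):
    """Trainable parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return lora_r * (d_in + d_out)


lora_r = 64
lora_alpha = 2 * lora_r  # the "2x rank" convention mentioned above

d_in = d_out = 4096  # illustrative layer size
full = d_in * d_out
added = lora_param_count(d_in, d_out, lora_r)

print(lora_scaling(lora_alpha, lora_r))  # → 2.0
print(f"{added:,} adapter params vs {full:,} full params ({added / full:.3%})")
# → 524,288 adapter params vs 16,777,216 full params (3.125%)
```

Under this formulation, keeping `lora_alpha = 2 * lora_r` holds the scaling factor at 2.0 as the rank changes, so the rank can be tuned without also changing the update magnitude.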