JSON Data Structure
All training data must follow a standardized JSON format with the following components:
Required Fields
- Media field: either image or video, containing the path to the media files
- conversations: an array of conversation turns, each with from and value fields
Use special tokens in prompts to indicate media placement:
- <image> for image understanding tasks
- <video> for video understanding tasks
Important formatting rules:
- One <image> tag must correspond to exactly one image file
- Similarly, <video> tags must correspond to video files
- These special tokens should NOT appear in the answer text
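The required fields and formatting rules above can be checked mechanically before training. The following is a minimal sketch, not part of the repository: the function name validate_sample is hypothetical, and it assumes that answer turns are the ones marked "gpt", as in the examples on this page.

```python
MEDIA_KEYS = ("image", "video")
MEDIA_TOKENS = ("<image>", "<video>")


def validate_sample(sample):
    """Return a list of human-readable problems found in one sample."""
    problems = []

    # A sample needs a media field and a conversations array.
    if not any(key in sample for key in MEDIA_KEYS):
        problems.append("missing media field ('image' or 'video')")
    turns = sample.get("conversations")
    if not isinstance(turns, list) or not turns:
        problems.append("'conversations' must be a non-empty list")
        return problems

    for index, turn in enumerate(turns):
        # Every turn carries a speaker ('from') and its text ('value').
        if not isinstance(turn, dict) or "from" not in turn or "value" not in turn:
            problems.append(f"turn {index} lacks 'from' or 'value'")
            continue
        # Media tokens belong in prompts, never in answers (assumes "gpt" = answer).
        if turn["from"] == "gpt" and any(tok in turn["value"] for tok in MEDIA_TOKENS):
            problems.append(f"turn {index} (answer) contains a media token")
    return problems
```

An empty list means the sample passed these checks. The checks cover only the rules listed on this page, so a sample that passes may still be rejected by the training code for other reasons.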
Single Image Example
Basic single-image question-answering format:
{
  "image": "images/001.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat's the main object in this picture?"
    },
    {
      "from": "gpt",
      "value": "A red apple on a wooden table"
    }
  ]
}
Multi-Image Example
For tasks requiring comparison or reasoning across multiple images:
{
  "image": ["cats/001.jpg", "cats/002.jpg"],
  "conversations": [
    {
      "from": "human",
      "value": "<image>\n<image>\nWhat are the differences between these two cats?"
    },
    {
      "from": "gpt",
      "value": "The first cat is an orange tabby with short fur and green eyes, while the second is a gray Siamese with blue eyes and pointed coloration. They also appear to be in different environments - the first is indoors on a couch, the second is outdoors in a garden."
    }
  ]
}
The number of <image> tags in the prompt must match the number of images in the image array.
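The tag-count rule lends itself to an automated check. The sketch below is illustrative only: media_count_mismatches is a hypothetical helper, it counts tags in "human" turns, and applying the same count check to video is an assumption, since this page states the exact-count rule only for images.

```python
def media_count_mismatches(sample):
    """Compare media-token counts in human prompts with the listed media files.

    Returns a dict mapping each mismatched key to (tokens_found, files_listed).
    """
    mismatches = {}
    prompts = [
        turn["value"]
        for turn in sample.get("conversations", [])
        if turn.get("from") == "human"
    ]
    for key in ("image", "video"):
        media = sample.get(key, [])
        # A single path is stored as a string, several paths as a list.
        files = [media] if isinstance(media, str) else list(media)
        tokens = sum(prompt.count(f"<{key}>") for prompt in prompts)
        if tokens != len(files):
            mismatches[key] = (tokens, len(files))
    return mismatches
```

For the two-cat sample above the function returns an empty dict. Dropping one of the two tags from the prompt would yield {"image": (1, 2)}.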
Video Example
For video understanding and temporal reasoning tasks:
{
  "video": "videos/005.mp4",
  "conversations": [
    {
      "from": "human",
      "value": "<video>\nWhat caused the blue object to move?\nOptions:\n(A) Gravity\n(B) Collision\n(C) Magnetic force"
    },
    {
      "from": "gpt",
      "value": "Answer: (B) Collision"
    }
  ]
}
Grounding Example
For object detection and localization tasks:
{
  "image": "demo/COCO_train2014_000000580957.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nLocate house in this image and output the bbox coordinates in JSON format."
    },
    {
      "from": "gpt",
      "value": "{\n\"bbox_2d\": [135, 114, 1016, 672]\n}"
    }
  ]
}
If you have grounding data in a different format, use the tools/process_bbox.ipynb notebook to transform your data into the QwenVL format.
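When writing grounding samples by hand, the answer string has to be valid JSON embedded inside a JSON string, which is easy to get wrong. The sketch below reproduces the answer layout of the example above. The helper name format_bbox_answer is hypothetical, rounding to integers is an assumption, and this page does not state the coordinate order or scale, so the function passes the four values through in the order given.

```python
import json


def format_bbox_answer(bbox):
    """Serialize one box as the answer string used in the grounding example."""
    if len(bbox) != 4:
        raise ValueError("expected four coordinates")
    # Assumption: coordinates are stored as integers, as in the example.
    coords = [int(round(value)) for value in bbox]
    return '{\n"bbox_2d": ' + json.dumps(coords) + "\n}"


answer = format_bbox_answer([135, 114, 1016, 672])
turn = {"from": "gpt", "value": answer}
```

Serializing the whole sample with json.dumps then escapes the inner quotes and newlines automatically, producing the escaped form shown in the example.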
Packed Data Example
For improved training efficiency, multiple samples can be packed into a single sequence:
[
  {
    "image": "images/001.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat's the main object in this picture?"
      },
      {
        "from": "gpt",
        "value": "A red apple on a wooden table"
      }
    ]
  },
  {
    "image": "images/002.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat's the main object in this picture?"
      },
      {
        "from": "gpt",
        "value": "A green orange on a plastic table"
      }
    ]
  }
]
To use packed data, preprocess your dataset with tools/pack_data.py and set data_packing=True in your training configuration.
File Organization
Your dataset can be organized in two ways:
Option 1: Relative Paths with data_path
# In your dataset config
{
    "annotation_path": "/data/annotations.json",
    "data_path": "/data/images/"  # Base path for media files
}

// In your JSON file
{
    "image": "001.jpg",  // Will be resolved to /data/images/001.jpg
    ...
}
Option 2: Absolute Paths
# In your dataset config
{
    "annotation_path": "/data/annotations.json",
    "data_path": ""  # Empty if using absolute paths
}

// In your JSON file
{
    "image": "/absolute/path/to/images/001.jpg",
    ...
}
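Both options are consistent with a plain path join, which may help when debugging unresolved files. The sketch below illustrates that behaviour and is not the loader's actual code: resolve_media_paths is a hypothetical name, and the assumption is that media entries are joined onto data_path with standard POSIX path semantics.

```python
import posixpath


def resolve_media_paths(data_path, media):
    """Join data_path with each media entry; absolute entries pass through unchanged."""
    # A single path is stored as a string, several paths as a list.
    entries = [media] if isinstance(media, str) else list(media)
    return [posixpath.join(data_path, entry) for entry in entries]


# Option 1: relative entry resolved against data_path.
relative = resolve_media_paths("/data/images/", "001.jpg")
# Option 2: empty data_path with an absolute entry.
absolute = resolve_media_paths("", "/absolute/path/to/images/001.jpg")
```

Under these assumptions, relative holds /data/images/001.jpg and absolute holds the entry unchanged, matching the two options described above.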
Data Validation
For open-source datasets that might have missing images or other issues, verify data completeness using:
python tools/check_image.py --annotation_path /path/to/annotations.json
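To illustrate what a completeness check involves, here is a minimal sketch of a missing-file scan. It is not the implementation of tools/check_image.py: find_missing_media is a hypothetical name, and the code assumes the annotation file is a JSON array of samples whose media entries are joined onto data_path.

```python
import json
import os


def find_missing_media(annotation_path, data_path=""):
    """Return (sample_index, path) pairs for media files that do not exist."""
    with open(annotation_path, encoding="utf-8") as handle:
        samples = json.load(handle)

    missing = []
    for index, sample in enumerate(samples):
        for key in ("image", "video"):
            media = sample.get(key, [])
            # A single path is stored as a string, several paths as a list.
            entries = [media] if isinstance(media, str) else list(media)
            for entry in entries:
                path = os.path.join(data_path, entry)
                if not os.path.isfile(path):
                    missing.append((index, path))
    return missing
```

The returned indices can be used to drop or repair incomplete samples before training. The scan only tests that each file exists, not that it can be decoded.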
Available Public Datasets
You can use these datasets directly without additional processing:
- nyu-visionx/Cambrian-10M
- lmms-lab/LLaVA-NeXT-Data
- FreedomIntelligence/ALLaVA-4V
- TIGER-Lab/VisualWebInstruct
Example Files
The repository includes example datasets:
- demo/single_images.json: single-image examples
- demo/video.json: video understanding examples
These files can be used directly for training or as templates for your own datasets.