Qwen3-VL provides precise 2D object grounding capabilities using relative position coordinates. The model supports both bounding boxes and points, enabling diverse combinations of positioning and labeling tasks.
Capability Overview
The 2D grounding feature enables you to:
- Locate objects using bounding boxes
- Pinpoint specific locations with points
- Use relative position coordinates
- Combine multiple grounding formats
- Perform object detection and localization
- Handle diverse positioning and labeling tasks
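Because the coordinates are relative rather than in pixels, they usually need to be mapped back onto the image before drawing or cropping. The sketch below shows one way to do that. It assumes the relative values lie on a 0 to 1000 scale; that scale, the `RELATIVE_SCALE` constant, and both helper names are illustrative assumptions, not part of the Qwen3-VL API, so check the cookbook for the exact convention.

```python
from typing import Sequence, Tuple

# Assumption: relative coordinates run from 0 to 1000 on each axis.
RELATIVE_SCALE = 1000


def relative_point_to_pixels(
    point: Sequence[float], width: int, height: int, scale: int = RELATIVE_SCALE
) -> Tuple[int, int]:
    """Map a relative (x, y) point onto an image of size width x height."""
    x, y = point
    px = round(x * width / scale)
    py = round(y * height / scale)
    # Clamp so slightly out-of-range values stay inside the image.
    return (min(max(px, 0), width), min(max(py, 0), height))


def relative_box_to_pixels(
    box: Sequence[float], width: int, height: int, scale: int = RELATIVE_SCALE
) -> Tuple[int, int, int, int]:
    """Map a relative (x1, y1, x2, y2) box to pixel (left, top, right, bottom)."""
    x1, y1, x2, y2 = box
    ax, ay = relative_point_to_pixels((x1, y1), width, height, scale)
    bx, by = relative_point_to_pixels((x2, y2), width, height, scale)
    # Order the corners in case the two points arrive swapped.
    left, right = sorted((ax, bx))
    top, bottom = sorted((ay, by))
    return (left, top, right, bottom)


if __name__ == "__main__":
    # A hypothetical box on a 1920x1080 image.
    print(relative_box_to_pixels([100, 200, 500, 800], 1920, 1080))
    # A hypothetical point on a 640x480 image.
    print(relative_point_to_pixels([500, 500], 640, 480))
```

Keeping the scale as a parameter makes it easy to adapt the helpers if the model or prompt uses a different normalization.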
Example Usage
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# A single user turn containing one image and the grounding instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/image.jpg",
            },
            {"type": "text", "text": "Locate all the objects in this image with bounding boxes."},
        ],
    }
]

# Build the model inputs from the chat template.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from each output sequence.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
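The decoded `output_text` is a list with one string per input, so the grounding results still have to be extracted from `output_text[0]`. The sketch below is one possible parser. It assumes the model answers with a JSON list of objects that carry a `bbox_2d` or `point_2d` key plus a `label`, possibly surrounded by extra text; those key names and the `parse_grounding_output` helper are assumptions for illustration and should be checked against real model output.

```python
import json
from typing import Any, Dict, List


def parse_grounding_output(text: str) -> List[Dict[str, Any]]:
    """Extract grounding entries from a model response.

    Assumed format: a JSON list such as
    [{"bbox_2d": [x1, y1, x2, y2], "label": "dog"}], optionally wrapped
    in other text. Returns an empty list if no valid JSON list is found.
    """
    # Take the widest span that looks like a JSON list.
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    results = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # Assumed keys: "bbox_2d" holds 4 numbers, "point_2d" holds 2.
        for key, size in (("bbox_2d", 4), ("point_2d", 2)):
            coords = item.get(key)
            if isinstance(coords, list) and len(coords) == size:
                results.append({"label": item.get("label", ""), key: coords})
                break
    return results


if __name__ == "__main__":
    # Hypothetical response text, not real model output.
    sample = 'Here are the objects:\n[{"bbox_2d": [100, 200, 500, 800], "label": "dog"}]'
    print(parse_grounding_output(sample))
```

Returning an empty list on malformed output keeps downstream code simple, at the cost of hiding parse failures; log the raw text if that matters for your use case.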
Try it Yourself
Explore the full 2D grounding cookbook, with interactive examples, on GitHub.