Qwen3-VL provides advanced spatial perception: the model can judge object positions, viewpoints, and occlusions, and reason about the spatial relationships between objects in an image.
Capability Overview
With spatial understanding, the model can:
- Judge object positions and locations
- Understand viewpoints and perspectives
- Detect and reason about occlusions
- Analyze spatial relationships between objects
- Estimate depth and distance
- Support embodied AI and robotics applications
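As a rough illustration, each capability above can be exercised by changing only the text prompt sent alongside the image. The sketch below is a minimal, hypothetical helper: the `SPATIAL_PROMPTS` wording and the `build_spatial_messages` name are assumptions made for this example, not part of Qwen3-VL or Transformers. It only builds a chat message list in the same shape as the one used in the example on this page.

```python
# Hypothetical prompt templates, one per capability; the wording is illustrative only.
SPATIAL_PROMPTS = {
    "position": "Where is the {target} located in the image?",
    "viewpoint": "From what viewpoint was this image taken?",
    "occlusion": "Which objects are partially hidden behind the {target}?",
    "relationship": "Describe the spatial relationship between the {target} and the {reference}.",
    "depth": "Which is closer to the camera: the {target} or the {reference}?",
}


def build_spatial_messages(image, task, **slots):
    """Return a single-turn chat message list for one spatial question.

    `image` is a path or URL, `task` is a key of SPATIAL_PROMPTS, and
    `slots` fills the placeholders in the chosen template.
    """
    prompt = SPATIAL_PROMPTS[task].format(**slots)
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_spatial_messages(
    "path/to/scene.jpg", "depth", target="mug", reference="laptop"
)
print(messages[0]["content"][1]["text"])
```

The resulting `messages` list can be passed to `processor.apply_chat_template` in place of a hand-written one. How well a given phrasing works is something to verify empirically.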
Example Usage
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor; device_map="auto" spreads weights across available devices.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# A single user turn containing one image and one spatial question.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/scene.jpg",
            },
            {"type": "text", "text": "Describe the spatial relationships between objects in this scene."},
        ],
    }
]

# Apply the chat template and tokenize text and image inputs in one step.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so only the newly generated answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
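The trimming step in the example above is needed because `generate` returns each prompt followed by the newly generated tokens, so decoding the raw output would repeat the prompt. The sketch below reproduces that slicing on plain Python lists. The `trim_prompt_tokens` name and the token IDs are made up for illustration and stand in for the real tensors.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs (made-up values); each output row starts with its prompt row.
prompt_batch = [[101, 7, 8], [101, 9]]
output_batch = [[101, 7, 8, 42, 43], [101, 9, 55]]

trimmed = trim_prompt_tokens(prompt_batch, output_batch)
print(trimmed)  # [[42, 43], [55]]
```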
Try it Yourself
Explore the full spatial understanding cookbook with interactive examples:
View on GitHub