Qwen3-VL supports 3D grounding: given an image, it predicts 3D bounding boxes for objects in both indoor and outdoor scenes. These boxes carry the spatial information that spatial reasoning tasks and embodied AI applications build on.

Capability Overview

With 3D grounding, you can:
  • Generate 3D bounding boxes for objects in an image
  • Work with both indoor and outdoor scenes
  • Support spatial reasoning tasks
  • Build embodied AI applications
  • Reason about depth and spatial relationships
  • Obtain position, viewpoint, and occlusion information

Example Usage

from transformers import AutoModelForImageTextToText, AutoProcessor

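# Load the model and its processor; device_map="auto" spreads the weights across available devices.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")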

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/scene.jpg",
            },
            {"type": "text", "text": "Provide 3D bounding boxes for the objects in this scene."},
        ],
    }
]

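# Apply the chat template and tokenize the text and image in a single call.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# generate() returns prompt + completion, so drop the prompt tokens from each sequence.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# Decode the remaining tokens; output_text is a list with one string per input.
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)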
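
The example prints the model's reply as plain text, so using the boxes in code requires a parsing step. The following is a minimal sketch of one way to do that. It assumes the prompt asked the model to answer with a JSON list in which each object has a `label` string and a `bbox_3d` list of nine numbers (center, size, and three rotation angles). That layout and the helper name `parse_boxes` are assumptions made for illustration; this page does not specify the output format, so adjust the field names to whatever your prompt requests.

```python
import json
import re

# Markdown code fence, built from single backticks so this example nests cleanly.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_boxes(reply: str) -> list[dict]:
    """Extract 3D boxes from a model reply (hypothetical helper, assumed format)."""
    # If the reply wraps its JSON in a Markdown fence, keep only the fenced part.
    match = FENCED_BLOCK.search(reply)
    payload = match.group(1) if match else reply

    # Keep the outermost [...] span so any surrounding prose is ignored.
    start, end = payload.find("["), payload.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(payload[start : end + 1])
    except json.JSONDecodeError:
        return []

    boxes = []
    for item in items:
        if not isinstance(item, dict):
            continue
        values = item.get("bbox_3d")
        # Assumed layout: exactly nine numeric values per box.
        if (
            isinstance(values, list)
            and len(values) == 9
            and all(isinstance(v, (int, float)) for v in values)
        ):
            boxes.append(
                {"label": str(item.get("label", "")), "bbox_3d": [float(v) for v in values]}
            )
    return boxes
```

With the example above, `parse_boxes(output_text[0])` would return a list of dictionaries, or an empty list when the reply contains no parseable boxes. Returning an empty list rather than raising keeps a batch loop running when one reply is malformed.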

Try it Yourself

Explore the full 3D grounding cookbook, with interactive examples, on Colab or GitHub.
