Qwen3-VL supports precise 2D object grounding using relative position coordinates. It can return both bounding boxes and points, so positioning and labeling tasks can be combined in different ways.
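Because the coordinates are relative, they usually need to be mapped back to pixels before drawing or cropping. The following is a minimal sketch of that mapping. It assumes the relative values lie on a 0 to 1000 grid and that boxes are given as [x1, y1, x2, y2]; neither is stated on this page, so check both against the cookbook. The helper names `bbox_to_pixels` and `point_to_pixels` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumption: relative coordinates lie on a 0-1000 grid (not stated on this page).
RELATIVE_SCALE = 1000


def bbox_to_pixels(
    bbox: Sequence[float], width: int, height: int, scale: int = RELATIVE_SCALE
) -> Tuple[int, int, int, int]:
    """Map a relative [x1, y1, x2, y2] box to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    # Order the corners so the result is always (left, top, right, bottom).
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return (
        round(left * width / scale),
        round(top * height / scale),
        round(right * width / scale),
        round(bottom * height / scale),
    )


def point_to_pixels(
    point: Sequence[float], width: int, height: int, scale: int = RELATIVE_SCALE
) -> Tuple[int, int]:
    """Map a relative [x, y] point to pixel coordinates."""
    x, y = point
    return (round(x * width / scale), round(y * height / scale))
```

For a 1920x1080 image, `bbox_to_pixels([100, 200, 500, 800], 1920, 1080)` gives `(192, 216, 960, 864)` under the 0 to 1000 assumption.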

Capability Overview

The 2D grounding feature enables you to:
  • Locate objects using bounding boxes
  • Pinpoint specific locations with points
  • Use relative position coordinates
  • Combine multiple grounding formats
  • Perform object detection and localization
  • Support diverse positioning tasks

Example Usage

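from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model and its processor. device_map="auto" places the weights
# on the available devices; dtype="auto" uses the checkpoint's dtype.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")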

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/image.jpg",
            },
            {"type": "text", "text": "Locate all the objects in this image with bounding boxes."},
        ],
    }
]

# Apply the chat template and tokenize the image and text into model inputs.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
# output_text is a list with one decoded string per input in the batch.
print(output_text)
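The decoded text still has to be turned into structured results. The sketch below assumes the model answers with a JSON list of objects carrying a `bbox_2d` or `point_2d` key plus a `label`, possibly wrapped in a Markdown code fence. That output format is an assumption not stated on this page, and `parse_grounding_output` is a hypothetical helper, so adapt both to what the model actually returns.

```python
import json
from typing import Any, Dict, List

# Markdown fence marker, built programmatically to keep this snippet fence-safe.
FENCE = "`" * 3


def parse_grounding_output(text: str) -> List[Dict[str, Any]]:
    """Extract grounding entries from decoded model text.

    Assumption: the text is a JSON list (or single object) whose entries
    contain "bbox_2d" or "point_2d", optionally wrapped in a Markdown fence.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, then everything from the closing fence on.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Truncated or non-JSON output: return nothing rather than guess.
        return []
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return []
    # Keep only dict entries that carry one of the assumed coordinate keys.
    return [
        item
        for item in data
        if isinstance(item, dict) and ("bbox_2d" in item or "point_2d" in item)
    ]
```

With the example above, this could be called as `parse_grounding_output(output_text[0])`. Returning an empty list on malformed output is a design choice for this sketch; raising an error may be preferable when truncation caused by `max_new_tokens` needs to be detected.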

Try it Yourself

Explore the full 2D grounding cookbook with interactive examples, available as a Colab notebook and on GitHub.