Multi-Image Inference

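Qwen3-VL can process multiple images in a single request, which suits tasks such as image comparison and cross-image analysis. Add one image entry per image to the message's content list, followed by the text query: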
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", 
    dtype="auto", 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
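When the number of images varies, the messages list can be built programmatically. The following is a minimal sketch: build_multi_image_message is a hypothetical helper name (not part of transformers or Qwen3-VL), and it only mirrors the message layout used in the example above.

```python
from pathlib import Path


def build_multi_image_message(image_paths, prompt):
    """Return a single-turn `messages` list with one image entry per path.

    Hypothetical helper for illustration; it reproduces the message
    structure shown above and does not call any model code.
    """
    content = [
        # as_uri() needs an absolute path, so resolve() first.
        {"type": "image", "image": Path(p).resolve().as_uri()}
        for p in image_paths
    ]
    # The text query goes after the images, as in the example above.
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_multi_image_message(
    ["/path/to/image1.jpg", "/path/to/image2.jpg", "/path/to/image3.jpg"],
    "Identify the similarities between these images.",
)
```

The resulting messages can be passed to processor.apply_chat_template in the same way as the hand-written list.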

Image Input Formats

The image source is passed in the "image" field of an image entry. For a local file, use a file:// URI:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
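Besides local files, the Qwen model documentation also describes HTTP(S) URLs and base64 data URIs as image sources. The sketch below only constructs such entries; the helper names are hypothetical, and whether a given form is accepted depends on the installed transformers and qwen-vl-utils versions, so treat it as an assumption to verify.

```python
import base64


def image_entry_from_url(url):
    """Image entry that points at a remote image (assumed to be fetched by the loader)."""
    return {"type": "image", "image": url}


def image_entry_from_bytes(raw_bytes):
    """Image entry that embeds the image as a base64 data URI.

    The "data:image;base64," prefix follows the form shown in the Qwen
    READMEs; it is an assumption that your loader accepts it.
    """
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"type": "image", "image": f"data:image;base64,{encoded}"}


messages = [
    {
        "role": "user",
        "content": [
            image_entry_from_url("https://example.com/image.jpg"),  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```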

Resolution Control with Processor

Control image resolution by setting the size parameter on the processor's image processor:
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# Visual token budget for the image processor.
# Qwen3-VL uses 32× spatial compression, so one visual token covers a
# 32×32 pixel area. These limits keep each image at 256-1280 visual tokens.
processor.image_processor.size = {
    "longest_edge": 1280*32*32, 
    "shortest_edge": 256*32*32
}

Understanding Size Parameters

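  • longest_edge (max_pixels): Maximum total pixel count per image (H × W ≤ max_pixels). Despite the key name, this is a pixel count, not an edge length.
  • shortest_edge (min_pixels): Minimum total pixel count per image (H × W ≥ min_pixels).
  • Compression ratio: Qwen3-VL uses 32× spatial compression, so the number of visual tokens is roughly H × W / (32 × 32).
The size parameters control the visual token budget. Higher limits preserve more detail but use more GPU memory; lower limits reduce memory use at the cost of detail.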
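To get a feel for the budget, the token count for a given image size can be estimated by hand. The sketch below is a simplified approximation and not the library's actual resizing code: estimate_visual_tokens is a hypothetical name, and the rounding rules are an assumption chosen so that the result stays inside the configured limits.

```python
import math

PATCH = 32  # one visual token per 32x32 pixel area (32x spatial compression)


def estimate_visual_tokens(height, width,
                           min_pixels=256 * PATCH * PATCH,
                           max_pixels=1280 * PATCH * PATCH):
    """Roughly estimate how many visual tokens an image will use.

    Simplified sketch: scale the image so its pixel count falls within
    [min_pixels, max_pixels], then count 32x32 patches.
    """
    pixels = height * width
    if pixels > max_pixels:
        # Too large: shrink, rounding down so the cap is not exceeded.
        scale = math.sqrt(max_pixels / pixels)
        rows = math.floor(height * scale / PATCH)
        cols = math.floor(width * scale / PATCH)
    elif pixels < min_pixels:
        # Too small: enlarge, rounding up so the floor is reached.
        scale = math.sqrt(min_pixels / pixels)
        rows = math.ceil(height * scale / PATCH)
        cols = math.ceil(width * scale / PATCH)
    else:
        rows = round(height / PATCH)
        cols = round(width / PATCH)
    return max(1, rows) * max(1, cols)


print(estimate_visual_tokens(1024, 1024))  # 1024
```

With the limits shown above, a 1024 × 1024 image is left at its original size, while a much larger image is scaled down until it fits under the 1280-token cap.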

Adding Vision IDs

To make individual images easier to refer to in multi-image prompts, pass add_vision_id=True so that each image is prefixed with a numbered label such as "Picture 1":
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "These are from my vacation."},
        ],
    },
]

# Add vision IDs for better reference
prompt_with_id = processor.apply_chat_template(
    conversation, 
    add_generation_prompt=True, 
    add_vision_id=True
)
# The rendered prompt contains: "Can you describe these images?Picture 1: <|vision_start|>..."
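The labeling itself is simple: images are numbered in the order they appear, and the label is written immediately before each image placeholder. The following sketch imitates that behavior in plain Python for illustration only. It is not the actual chat template; render_user_content is a hypothetical name, and the `<|image_pad|><|vision_end|>` tokens after `<|vision_start|>` are an assumption.

```python
# Assumed placeholder for one image; only <|vision_start|> appears in the text above.
IMAGE_PLACEHOLDER = "<|vision_start|><|image_pad|><|vision_end|>"


def render_user_content(content, add_vision_id=False):
    """Render a user turn's content list to text, optionally labeling images."""
    parts = []
    image_count = 0
    for item in content:
        if item["type"] == "image":
            image_count += 1
            if add_vision_id:
                # Label precedes the placeholder: "Picture 1: ", "Picture 2: ", ...
                parts.append(f"Picture {image_count}: ")
            parts.append(IMAGE_PLACEHOLDER)
        elif item["type"] == "text":
            parts.append(item["text"])
    return "".join(parts)


content = [
    {"type": "text", "text": "Can you describe these images?"},
    {"type": "image"},
    {"type": "image"},
    {"type": "text", "text": "These are from my vacation."},
]
print(render_user_content(content, add_vision_id=True))
```

With labels in place, a follow-up question can name a specific image, for example "What is in Picture 2?".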

Performance Tips

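For multi-image scenarios, enable flash_attention_2 to reduce memory use and speed up attention. This requires the flash-attn package to be installed and a half-precision dtype such as torch.bfloat16: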
import torch

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
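If the same script has to run on machines where flash-attn may not be installed, the attention backend can be chosen at runtime. This is a minimal sketch: pick_attn_implementation is a hypothetical helper, and it only checks whether the flash_attn package is importable, not whether the GPU supports it.

```python
import importlib.util


def pick_attn_implementation():
    """Return "flash_attention_2" if flash-attn is installed, else "sdpa".

    Hypothetical helper; "sdpa" selects PyTorch's scaled dot-product
    attention, which transformers supports without extra packages.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
# Pass the result when loading the model, for example:
# AutoModelForImageTextToText.from_pretrained(..., attn_implementation=attn_implementation)
print(attn_implementation)
```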

Next Steps

  • Pixel Control: Advanced resolution control with qwen-vl-utils
  • Batch Inference: Process multiple requests efficiently
