Overview

Qwen3-VL offers two ways to control the resolution of image and video inputs: limits set on the official processor, and per-input settings supplied through qwen-vl-utils. Both let you trade visual detail against computational cost.

Official Processor Control

Image Processor Configuration

Despite their names, size['longest_edge'] and size['shortest_edge'] are total pixel counts rather than edge lengths: longest_edge corresponds to max_pixels, and shortest_edge corresponds to min_pixels.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# Pixel budget for the image processor.
# Each visual token covers a 32×32 pixel region in Qwen3-VL, so the limits
# below allow 256 to 1280 visual tokens per image.
processor.image_processor.size = {
    "longest_edge": 1280*32*32,  # max_pixels
    "shortest_edge": 256*32*32   # min_pixels
}
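
Because each visual token covers a 32 × 32 pixel region, the token cost of an image can be estimated from its resized dimensions. The helper below is a minimal sketch, not part of transformers or qwen-vl-utils: the name estimate_image_tokens is hypothetical, and it assumes the height and width are already multiples of 32.

```python
def estimate_image_tokens(height: int, width: int, ratio: int = 32) -> int:
    """Estimate the visual tokens for an image whose sides are multiples of `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError(f"height and width must be multiples of {ratio}")
    # One token per (ratio x ratio) pixel region.
    return (height // ratio) * (width // ratio)


print(estimate_image_tokens(1024, 1280))  # 32 * 40 = 1280
```

An image of 1024 × 1280 pixels therefore sits exactly at the 1280-token upper limit configured above.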

Video Processor Configuration

For videos, the same two keys bound the total pixel count summed over all frames (T × H × W), not the size of a single frame.
# Budget for video processor
# Set the number of visual tokens to 256-16384 
# (32× spatial compression + 2× temporal compression)
processor.video_processor.size = {
    "longest_edge": 16384*32*32*2,  # T×H×W must not exceed this
    "shortest_edge": 256*32*32*2    # Minimum total pixel budget
}
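Compression ratios for Qwen3-VL:
  • Images: 32× spatial compression per side, so one visual token covers 32 × 32 pixels
  • Videos: 32× spatial compression plus 2× temporal compression, so one visual token covers 32 × 32 pixels across 2 frames
For example, a budget of 1280 visual tokens for an image means max_pixels = 1280 × 32 × 32 = 1,310,720.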
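
The arithmetic above can be wrapped in two small helpers that convert a visual-token budget into the pixel values expected by the size dictionaries. This is a sketch; image_pixel_budget and video_pixel_budget are hypothetical names, not library functions.

```python
SPATIAL_RATIO = 32   # pixels per token along each side
TEMPORAL_RATIO = 2   # frames merged per token (videos only)


def image_pixel_budget(tokens: int) -> int:
    """Pixels corresponding to `tokens` visual tokens for a single image."""
    return tokens * SPATIAL_RATIO * SPATIAL_RATIO


def video_pixel_budget(tokens: int) -> int:
    """Total T x H x W pixels corresponding to `tokens` visual tokens for a video."""
    return tokens * SPATIAL_RATIO * SPATIAL_RATIO * TEMPORAL_RATIO


image_size = {
    "longest_edge": image_pixel_budget(1280),   # 1,310,720
    "shortest_edge": image_pixel_budget(256),   # 262,144
}
video_size = {
    "longest_edge": video_pixel_budget(16384),  # 33,554,432
    "shortest_edge": video_pixel_budget(256),   # 524,288
}
print(image_size, video_size)
```

The resulting dictionaries hold the same values as the literal expressions used in the processor configuration above.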

Using qwen-vl-utils

The qwen-vl-utils library provides per-input resolution control.

Installation

pip install qwen-vl-utils==0.0.14
# Recommended: install with the decord extra for faster video loading.
# The quotes keep shells such as zsh from expanding the square brackets.
pip install "qwen-vl-utils[decord]"

Basic Setup for Qwen3-VL

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", 
    dtype="auto", 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")
For Qwen3-VL, pass image_patch_size=16 to every process_vision_info call, and add return_video_metadata=True when the messages contain video.

Image Resolution Control

Method 1: Exact Dimensions

Set resized_height and resized_width to request exact dimensions. Each value is rounded to the nearest multiple of 32:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, image_patch_size=16)

# qwen-vl-utils has already resized the images/videos,
# so pass do_resize=False to keep the processor from resizing them again.
inputs = processor(text=text, images=images, videos=videos, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)
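
To see what rounding to a multiple of 32 does to the 280 × 420 request above, the calculation can be sketched as follows. The helper round_to_multiple is hypothetical and only illustrates the arithmetic. It relies on Python's built-in round, which rounds exact halves to the nearest even integer, so the library may treat those edge cases differently.

```python
def round_to_multiple(value: int, factor: int = 32) -> int:
    """Round `value` to the nearest multiple of `factor`, never below `factor`."""
    return max(factor, round(value / factor) * factor)


print(round_to_multiple(280), round_to_multiple(420))  # 288 416
```

Under this assumption, the image is resized to 288 × 416 rather than exactly 280 × 420.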

Method 2: Min/Max Pixels

Set min_pixels and max_pixels to bound the total pixel count. The image is resized to fit within these bounds while maintaining its aspect ratio:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, image_patch_size=16)

inputs = processor(text=text, images=images, videos=videos, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)
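
The following sketch shows one way an aspect-ratio-preserving resize within a pixel range can work. It is an approximation written for illustration, not the qwen-vl-utils implementation: the function name fit_to_pixel_range and the exact rounding choices are assumptions.

```python
import math


def fit_to_pixel_range(height, width, min_pixels, max_pixels, factor=32):
    """Return (height, width) as multiples of `factor` within the pixel range."""
    # Start from the nearest multiples of `factor`.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same scale, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


print(fit_to_pixel_range(1000, 1500, 50176, 50176))  # (160, 256)
```

With min_pixels and max_pixels both set to 50176, a 1000 × 1500 image comes out at 160 × 256 in this sketch, which is 40,960 pixels or about 40 visual tokens. Rounding down to multiples of 32 keeps the result at or below the maximum rather than exactly on it.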

Video Resolution Control

Per-Video Configuration

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
                "min_pixels": 4 * 32 * 32,
                "max_pixels": 256 * 32 * 32,
                "total_pixels": 20480 * 32 * 32,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos, video_kwargs = process_vision_info(
    messages, 
    image_patch_size=16, 
    return_video_kwargs=True, 
    return_video_metadata=True
)

# Split videos and metadata
if videos is not None:
    videos, video_metadatas = zip(*videos)
    videos, video_metadatas = list(videos), list(video_metadatas)
else:
    video_metadatas = None

inputs = processor(
    text=text, 
    images=images, 
    videos=videos, 
    video_metadata=video_metadatas, 
    return_tensors="pt", 
    do_resize=False, 
    **video_kwargs
)
inputs = inputs.to(model.device)

Video Parameters

  • min_pixels: Minimum number of pixels per frame
  • max_pixels: Maximum number of pixels per frame
  • total_pixels: Total pixel budget across all frames (recommended to stay below 24576 × 32 × 32)
  • fps: Frame sampling rate in frames per second (e.g., 2.0 samples 2 frames per second)
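
As a rough guide when choosing these values, the visual-token cost of a video can be estimated from the compression ratios described earlier (32× spatial, 2× temporal). The helper below is a hypothetical sketch that assumes frame dimensions are multiples of 32 and the frame count is even. It does not reproduce how the library distributes total_pixels across frames.

```python
def estimate_video_tokens(num_frames: int, height: int, width: int) -> int:
    """Estimate visual tokens as T x H x W / (32 * 32 * 2)."""
    if height % 32 or width % 32 or num_frames % 2:
        raise ValueError("expected sides divisible by 32 and an even frame count")
    return (num_frames // 2) * (height // 32) * (width // 32)


# 16 frames at 352 x 640: 8 frame pairs * 11 * 20 regions
print(estimate_video_tokens(16, 352, 640))  # 1760
```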

Frame Sampling with qwen-vl-utils

Frame List as Video

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "sample_fps": "1",  # Frame sampling rate (frames per second), used to compute each frame's timestamp
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
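
To illustrate how a sampling rate maps a frame list onto timestamps, the sketch below assigns each frame a time in seconds. It assumes the first frame sits at t = 0 and that frames are evenly spaced. The function frame_timestamps is hypothetical, and the library's own timestamp computation may differ.

```python
def frame_timestamps(num_frames: int, sample_fps: float) -> list[float]:
    """Timestamp in seconds for each frame, assuming even spacing from t = 0."""
    if sample_fps <= 0:
        raise ValueError("sample_fps must be positive")
    return [index / sample_fps for index in range(num_frames)]


print(frame_timestamps(4, 1.0))  # [0.0, 1.0, 2.0, 3.0]
```

Under these assumptions, the four frames in the example above would be placed at 0, 1, 2, and 3 seconds.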

Video File with FPS Control

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
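
When picking an fps value, it helps to estimate how many frames will be sampled from a clip. The sketch below multiplies duration by fps and keeps the count even, on the assumption that frames are consumed in pairs because of the 2× temporal compression. The name estimate_sampled_frames is hypothetical, and the library's actual rounding and frame limits are not covered on this page.

```python
def estimate_sampled_frames(duration_s: float, fps: float, frame_factor: int = 2) -> int:
    """Approximate sampled frame count, rounded down to a multiple of `frame_factor`."""
    raw = int(duration_s * fps)
    return max(frame_factor, raw // frame_factor * frame_factor)


print(estimate_sampled_frames(30.0, 1.0))  # 30
print(estimate_sampled_frames(7.0, 1.0))   # 6
```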

Important: Avoiding Duplicate Resizing

When using qwen-vl-utils, always set do_resize=False in the processor to avoid duplicate resizing:
inputs = processor(
    text=text, 
    images=images, 
    videos=videos, 
    do_resize=False,  # Critical!
    return_tensors="pt"
)
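
Since do_resize=False hands the inputs to the processor unchanged, a quick sanity check on their dimensions can catch inputs that skipped the qwen-vl-utils resize. The helper below is a hypothetical sketch that works on plain (height, width) pairs and assumes the multiple-of-32 alignment described earlier.

```python
def misaligned_sizes(sizes, factor=32):
    """Return the (height, width) pairs that are not multiples of `factor`."""
    return [(h, w) for h, w in sizes if h % factor or w % factor]


print(misaligned_sizes([(288, 416), (280, 420)]))  # [(280, 420)]
```

An empty result means every size in the list is aligned.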

Complete Example

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", 
    dtype="auto", 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference with qwen-vl-utils
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages, image_patch_size=16)

# Set do_resize=False to avoid duplicate resizing
inputs = processor(text=text, images=images, videos=videos, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
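
The trimming step in the example above removes the prompt tokens from each generated sequence so that only newly generated tokens are decoded. The same slicing can be shown on plain lists. The token IDs here are made up for illustration, and trim_prompt_tokens is a hypothetical helper name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


prompt = [[1, 2, 3], [0, 4, 5]]                      # made-up prompt token IDs
generated = [[1, 2, 3, 10, 11], [0, 4, 5, 12, 13]]   # prompt followed by new tokens
print(trim_prompt_tokens(prompt, generated))  # [[10, 11], [12, 13]]
```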

Next Steps

  • Generation Parameters: Configure temperature, top_p, and sampling parameters
  • Batch Inference: Process multiple requests efficiently
