Function Signature

def process_vision_info(
    conversations: Union[List[Dict[str, Any]], List[List[Dict[str, Any]]]],
    return_video_kwargs: bool = False,
    return_video_metadata: bool = False,
    image_patch_size: int = 14,
) -> Tuple[Optional[List[Image.Image]], Optional[List[Union[torch.Tensor, List[Image.Image]]]], Optional[Dict[str, Any]]]

Description

The main function for extracting and processing all vision information (images and videos) from conversation messages. It automatically detects vision content, loads the media, applies smart resizing, and returns processed inputs ready for model processors.
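
The detection step can be pictured as a walk over each message's content list. The sketch below is illustrative only and is not the library's implementation: `collect_vision_entries` is a hypothetical helper that accepts either a single conversation or a batch and returns the image and video entries in order of appearance.

```python
from typing import Any, Dict, List, Union

Message = Dict[str, Any]


def collect_vision_entries(
    conversations: Union[List[Message], List[List[Message]]],
) -> List[Dict[str, Any]]:
    """Return image/video content entries in order (hypothetical helper)."""
    # Normalise a single conversation into a batch of one.
    if conversations and isinstance(conversations[0], dict):
        conversations = [conversations]

    entries: List[Dict[str, Any]] = []
    for conversation in conversations:
        for message in conversation:
            content = message.get("content")
            # Plain-string content carries no vision entries.
            if not isinstance(content, list):
                continue
            for item in content:
                if isinstance(item, dict) and item.get("type") in ("image", "video"):
                    entries.append(item)
    return entries
```

Text entries are skipped, so a text-only conversation yields an empty list.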

Parameters

conversations
Union[List[Dict[str, Any]], List[List[Dict[str, Any]]]]
required
Conversation messages containing vision content. Can be:
  • A single conversation: [{"role": "user", "content": [...]}]
  • Multiple conversations: [[{"role": "user", "content": [...]}], ...]
Each message's content is a list of dictionaries, each with a type key and the matching media key:
  • Images: {"type": "image", "image": "path/url/base64"}
  • Videos: {"type": "video", "video": "path/url"}
return_video_kwargs
bool
default:"False"
Whether to return video-specific keyword arguments (like fps) for the processor. Required for Qwen2.5VL models that need video processing parameters.
return_video_metadata
bool
default:"False"
Whether to return video metadata (fps, frame indices, total frames, backend) alongside video tensors. Required for Qwen3VL models that need detailed video metadata.
image_patch_size
int
default:"14"
The patch size used by the vision encoder. Affects resizing calculations. Common values:
  • 14 for Qwen2VL and Qwen2.5VL
  • 16 for Qwen3VL
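
To see why the patch size affects resizing, here is a simplified sketch of a smart_resize-style calculation. It rests on assumptions not stated on this page: target dimensions are rounded to a multiple of `patch_size * merge_size`, with `merge_size=2` standing for a 2x2 spatial merge, and the default pixel budgets are placeholders. `resize_dims_sketch` is a hypothetical name; the actual smart_resize in qwen_vl_utils may differ in its details.

```python
import math


def resize_dims_sketch(
    height: int,
    width: int,
    patch_size: int = 14,
    merge_size: int = 2,             # assumed spatial merge factor
    min_pixels: int = 4 * 28 * 28,     # placeholder lower budget
    max_pixels: int = 1280 * 28 * 28,  # placeholder upper budget
) -> tuple:
    """Round (height, width) to multiples of patch_size * merge_size within a pixel budget."""
    factor = patch_size * merge_size
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w
```

Under these assumptions a 480x640 image maps to 476x644 with `patch_size=14` and stays 480x640 with `patch_size=16`, so passing the wrong patch size can produce dimensions the vision encoder cannot tile evenly.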

Returns

image_inputs
Optional[List[PIL.Image.Image]]
List of processed PIL Image objects, or None if no images found. Images are resized using smart_resize to optimal dimensions based on min_pixels and max_pixels.
video_inputs
Optional[List[Union[torch.Tensor, Tuple[torch.Tensor, Dict]]]]
List of processed video tensors with shape (T, C, H, W), or None if no videos found. If return_video_metadata=True, each item is a tuple of (video_tensor, metadata_dict).
video_kwargs
Optional[Dict[str, Any]]
Dictionary of video processing parameters. Returned only if return_video_kwargs=True; otherwise the function returns just (image_inputs, video_inputs). Contains:
  • do_sample_frames: Always False
  • fps: List of sample fps values (only if return_video_metadata=False)
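
When return_video_metadata=True, the (video_tensor, metadata_dict) tuples usually need to be split into two parallel lists before they are passed to a processor. A minimal sketch of that step follows; `split_videos_and_metadata` is a hypothetical helper, not part of qwen_vl_utils, and it makes no assumption about the tensor type.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def split_videos_and_metadata(
    video_inputs: Optional[Sequence[Tuple[Any, Dict[str, Any]]]],
) -> Tuple[Optional[List[Any]], Optional[List[Dict[str, Any]]]]:
    """Split [(video, metadata), ...] into two parallel lists (hypothetical helper)."""
    # No videos in the conversation: keep both outputs as None.
    if video_inputs is None:
        return None, None
    videos = [video for video, _ in video_inputs]
    metadatas = [metadata for _, metadata in video_inputs]
    return videos, metadatas
```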

Usage Examples

Qwen2VL

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

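# model_path is not defined on this page; set it to a Qwen2VL checkpoint (Hub ID or local path).
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# With the default flags, only (images, videos) is returned.
images, videos = process_vision_info(messages)
inputs = processor(text=text, images=images, videos=videos, padding=True, return_tensors="pt")
# Move the inputs to the model's device before generating.
inputs = inputs.to(model.device)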

generated_ids = model.generate(**inputs)

Qwen2.5VL

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 2.0},
            {"type": "text", "text": "Describe this video."}
        ]
    }
]

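# model_path is not defined on this page; set it to a Qwen2.5VL checkpoint (Hub ID or local path).
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# return_video_kwargs=True adds a third return value to forward to the processor.
images, videos, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=text, images=images, videos=videos,
    padding=True, return_tensors="pt", **video_kwargs
)
# Move the inputs to the model's device before generating.
inputs = inputs.to(model.device)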

generated_ids = model.generate(**inputs)

Qwen3VL

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4"},
            {"type": "text", "text": "Describe this video."}
        ]
    }
]

# model_path is not defined on this page; set it to a Qwen3VL checkpoint (Hub ID or local path).
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path, dtype="auto", device_map="auto"
)

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos, video_kwargs = process_vision_info(
    messages, image_patch_size=16, return_video_kwargs=True, return_video_metadata=True
)

if videos is not None:
    videos, video_metadatas = zip(*videos)
    videos, video_metadatas = list(videos), list(video_metadatas)
else:
    video_metadatas = None

inputs = processor(
    text=text, images=images, videos=videos, 
    video_metadata=video_metadatas, return_tensors="pt", 
    do_resize=False, **video_kwargs
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs)

Vision Content Formats

Image Content

# Local file
{"type": "image", "image": "file:///path/to/image.jpg"}

# URL
{"type": "image", "image": "http://example.com/image.jpg"}

# Base64
{"type": "image", "image": "data:image;base64,/9j/..."}

# PIL Image
{"type": "image", "image": pil_image}

# With custom dimensions
{
    "type": "image",
    "image": "file:///path/to/image.jpg",
    "resized_height": 280,
    "resized_width": 420
}
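
For the Base64 form, the entry can be built from raw image bytes with the standard library. This is a small sketch: `to_base64_image_entry` is a hypothetical helper, and the `data:image;base64,` prefix is taken from the example above.

```python
import base64


def to_base64_image_entry(image_bytes: bytes) -> dict:
    """Wrap raw image bytes in a Base64 image entry (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "image": f"data:image;base64,{encoded}"}
```

JPEG files start with the bytes FF D8 FF, which is why Base64-encoded JPEG payloads begin with `/9j/`.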

Video Content

# Video file
{"type": "video", "video": "file:///path/to/video.mp4"}

# Frame sequence
{
    "type": "video",
    "video": [
        "file:///path/to/frame1.jpg",
        "file:///path/to/frame2.jpg",
        "file:///path/to/frame3.jpg"
    ]
}

# With custom parameters
{
    "type": "video",
    "video": "file:///path/to/video.mp4",
    "fps": 2.0,
    "resized_height": 280,
    "resized_width": 280
}
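
A frame-sequence entry can be assembled from a directory of extracted frames. The sketch below assumes the frames have zero-padded file names so that lexicographic sorting matches temporal order; `frames_to_video_entry` is a hypothetical helper, not part of qwen_vl_utils.

```python
from pathlib import Path


def frames_to_video_entry(frame_dir: str, pattern: str = "*.jpg") -> dict:
    """Build a frame-sequence video entry from image files (hypothetical helper)."""
    # resolve() makes the paths absolute, which Path.as_uri() requires.
    frames = sorted(Path(frame_dir).resolve().glob(pattern))
    if not frames:
        raise ValueError(f"no frames matching {pattern!r} in {frame_dir!r}")
    # as_uri() yields file:/// URIs like the ones shown above.
    return {"type": "video", "video": [frame.as_uri() for frame in frames]}
```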
