Documentation Index
Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-VL/llms.txt
Use this file to discover all available pages before exploring further.
AutoProcessor.from_pretrained
Load the processor for tokenizing text and preprocessing images/videos.Parameters
Model identifier from Hugging Face Hub or local path. Should match the model being used.
Returns
Returns aQwen3VLProcessor instance that handles both text tokenization and image/video preprocessing.
processor.apply_chat_template
Format conversation messages into model inputs with proper tokenization and vision preprocessing.Parameters
Conversation messages in chat format. Each message contains Content Types:
role and content.Message Format:{"type": "text", "text": "..."}{"type": "image", "image": "url|path|base64"}{"type": "video", "video": "url|path"}
- URL:
"https://example.com/image.jpg" - Local file:
"file:///path/to/image.jpg" - Base64:
"data:image;base64,/9j/..."
- URL:
"https://example.com/video.mp4" - Local file:
"file:///path/to/video.mp4" - Frame list:
["file:///frame1.jpg", "file:///frame2.jpg", ...]
Whether to tokenize the text into input IDs.
True: Returns tokenized tensors ready for model inputFalse: Returns formatted text string
Whether to add the assistant prompt token at the end for generation.Should be
True when generating responses, False when formatting training data.Whether to return a dictionary with named fields.
True: Returns dict withinput_ids,attention_mask,pixel_values, etc.False: Returns tuple
Type of tensors to return.
"pt": PyTorch tensors (most common)"np": NumPy arraysNone: Python lists
Padding strategy for batch processing.
True: Pad to longest sequence in batchFalse: No padding"max_length": Pad to model’s max length
Whether to add labels (“Picture 1:”, “Video 1:”) to visual inputs.Helpful when processing multiple images/videos for better reference.
Frame sampling rate for videos (frames per second).
Exact number of frames to sample from video. Overrides
fps if set.Returns
Tokenized input text.Shape:
(batch_size, sequence_length)Mask indicating which tokens should be attended to.Shape:
(batch_size, sequence_length)Preprocessed image data (if images present in messages).Shape:
(num_images, channels, height, width)Video grid dimensions: temporal, height, width (if videos present).
Complete Examples
Advanced: Controlling Visual Token Budget
Control the number of visual tokens by adjusting pixel budgets:Notes
Message Format Requirements:
- Each message must have
"role"("user","assistant", or"system") - Content can be a string or list of content items
- Content items must specify
"type"("text","image", or"video")