AutoProcessor.from_pretrained

Load the processor for tokenizing text and preprocessing images/videos.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

Parameters

pretrained_model_name_or_path
string
required
Model identifier from Hugging Face Hub or local path. Should match the model being used.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

Returns

Returns a Qwen3VLProcessor instance that handles both text tokenization and image/video preprocessing.

processor.apply_chat_template

Format conversation messages into model inputs with proper tokenization and vision preprocessing.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

Parameters

messages
list[dict]
required
Conversation messages in chat format. Each message contains a role and content.
Message Format:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://example.com/image.jpg"
            },
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
Content Types:
  • {"type": "text", "text": "..."}
  • {"type": "image", "image": "url|path|base64"}
  • {"type": "video", "video": "url|path"}
Image Sources:
  • URL: "https://example.com/image.jpg"
  • Local file: "file:///path/to/image.jpg"
  • Base64: "data:image;base64,/9j/..."
Video Sources:
  • URL: "https://example.com/video.mp4"
  • Local file: "file:///path/to/video.mp4"
  • Frame list: ["file:///frame1.jpg", "file:///frame2.jpg", ...]
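As an illustration of the content types and sources listed above, the following sketch builds a messages list around a base64-encoded image. The helper names (image_bytes_to_data_uri, build_image_message) and the placeholder bytes are assumptions made for this example; they are not part of the Qwen3-VL API.

```python
import base64


def image_bytes_to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes in the "data:image;base64,..." form shown above."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:image;base64,{encoded}"


def build_image_message(image_source: str, prompt: str) -> list:
    """Build a single-turn user message with one image and one text item."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_source},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder bytes standing in for a real JPEG file (JPEG data starts with FF D8 FF).
placeholder_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 8

data_uri = image_bytes_to_data_uri(placeholder_jpeg)
messages = build_image_message(data_uri, "Describe this image.")
```

In practice the bytes would come from reading an image file in binary mode. The same message shape applies to URL and file:// sources; only the string in the "image" field changes.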
tokenize
bool
default:"True"
Whether to tokenize the text into input IDs.
  • True: Returns tokenized tensors ready for model input
  • False: Returns formatted text string
# Get tokenized inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt"
)

# Get text only
text = processor.apply_chat_template(
    messages,
    tokenize=False
)
add_generation_prompt
bool
default:"False"
Whether to add the assistant prompt token at the end for generation.
Should be True when generating responses, False when formatting training data.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True  # Add <|im_start|>assistant\n
)
return_dict
bool
default:"False"
Whether to return a dictionary with named fields.
  • True: Returns dict with input_ids, attention_mask, pixel_values, etc.
  • False: Returns tuple
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True
)
# Access: inputs.input_ids, inputs.pixel_values, etc.
return_tensors
string
Type of tensors to return.
  • "pt": PyTorch tensors (most common)
  • "np": NumPy arrays
  • None: Python lists
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt"
)
padding
bool | string
default:"False"
Padding strategy for batch processing.
  • True: Pad to longest sequence in batch
  • False: No padding
  • "max_length": Pad to model’s max length
Required for batch generation.
# Batch processing
processor.tokenizer.padding_side = 'left'
inputs = processor.apply_chat_template(
    batch_messages,
    tokenize=True,
    return_tensors="pt",
    padding=True
)
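A minimal sketch of what batch_messages might look like, assuming a batch is passed as a list of conversations, each following the message format described earlier. The helper make_conversation and the example URLs are hypothetical.

```python
def make_conversation(image_url: str, question: str) -> list:
    """Build one single-turn conversation in the chat message format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


# One entry per sample in the batch; each entry is a full conversation.
batch_messages = [
    make_conversation("https://example.com/image1.jpg", "Describe this image."),
    make_conversation("https://example.com/image2.jpg", "What objects are visible?"),
]
```

Because the two prompts will usually tokenize to different lengths, padding=True is what lets them be stacked into a single tensor.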
add_vision_id
bool
default:"False"
Whether to add labels (“Picture 1:”, “Video 1:”) to visual inputs.
Helpful when a conversation contains multiple images or videos, so each one can be referenced by its label.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_vision_id=True
)
# Output includes: "Picture 1: <image> Picture 2: <image> Video 1: <video>"
fps
float
default:"2"
Frame sampling rate for videos (frames per second).
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    fps=4  # Sample 4 frames per second
)
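To get a feel for how fps affects input size, the sketch below estimates the number of sampled frames as duration times sampling rate. This is a back-of-the-envelope approximation: the function name is hypothetical, and the real video processor may round differently or clamp the result to minimum and maximum frame counts.

```python
import math


def estimate_sampled_frames(duration_seconds: float, fps: float = 2.0) -> int:
    """Approximate frame count for a clip sampled at `fps` frames per second.

    Ignores any clamping or rounding rules the actual processor applies.
    """
    return max(1, math.floor(duration_seconds * fps))


frames_default = estimate_sampled_frames(30)        # 30 s clip at the default fps=2
frames_dense = estimate_sampled_frames(30, fps=4)   # same clip at fps=4
```

Doubling fps roughly doubles the number of frames, and with it the number of visual tokens the model has to process.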
num_frames
int
Exact number of frames to sample from the video. Use it instead of fps: pass fps=None whenever num_frames is set.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    num_frames=128,
    fps=None  # Must set fps=None when using num_frames
)

Returns

input_ids
torch.Tensor
Tokenized input text. Shape: (batch_size, sequence_length)
attention_mask
torch.Tensor
Mask indicating which tokens should be attended to. Shape: (batch_size, sequence_length)
pixel_values
torch.Tensor
Preprocessed image data (if images present in messages). Shape: (num_images, channels, height, width)
video_grid_thw
torch.Tensor
Video grid dimensions: temporal, height, width (if videos present).

Complete Examples

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            },
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

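# Move the tensors to the model's device.
# `model` is assumed to have been loaded separately; it is not created in this snippet.
inputs = inputs.to(model.device)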

Advanced: Controlling Visual Token Budget

Control the number of visual tokens by adjusting pixel budgets:
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

# Image processor: 256-1280 visual tokens (32x spatial compression)
processor.image_processor.size = {
    "longest_edge": 1280 * 32 * 32,
    "shortest_edge": 256 * 32 * 32
}

# Video processor: 256-16384 visual tokens (32x spatial + 2x temporal compression)
processor.video_processor.size = {
    "longest_edge": 16384 * 32 * 32 * 2,
    "shortest_edge": 256 * 32 * 32 * 2
}
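The arithmetic behind these settings can be sketched as follows, assuming (as the comments above indicate) that each visual token covers a 32x32 pixel area and that video adds a 2x temporal factor. The helper names are hypothetical, and the token estimate ignores any resizing the processor performs before patching.

```python
import math

SPATIAL_FACTOR = 32  # pixels per visual token along each side, per the comments above


def pixel_budget(num_tokens: int, temporal_factor: int = 1) -> int:
    """Convert a visual token budget into a pixel budget."""
    return num_tokens * SPATIAL_FACTOR * SPATIAL_FACTOR * temporal_factor


def estimate_image_tokens(height: int, width: int) -> int:
    """Rough token count for an image, before any resizing by the processor."""
    return math.ceil(height / SPATIAL_FACTOR) * math.ceil(width / SPATIAL_FACTOR)


image_size = {
    "longest_edge": pixel_budget(1280),
    "shortest_edge": pixel_budget(256),
}
video_size = {
    "longest_edge": pixel_budget(16384, temporal_factor=2),
    "shortest_edge": pixel_budget(256, temporal_factor=2),
}

tokens_1024 = estimate_image_tokens(1024, 1024)
```

Under these assumptions a 1024x1024 image lands at about 1024 tokens, inside the 256-1280 range configured for the image processor.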

Notes

Message Format Requirements:
  • Each message must have "role" ("user", "assistant", or "system")
  • Content can be a string or list of content items
  • Content items must specify "type" ("text", "image", or "video")
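The requirements above can be checked before calling apply_chat_template. The following is a minimal sketch of such a check; validate_messages is a hypothetical helper written for this page, not something the processor provides.

```python
VALID_ROLES = {"user", "assistant", "system"}
VALID_TYPES = {"text", "image", "video"}


def validate_messages(messages: list) -> list:
    """Return a list of human-readable problems; an empty list means the input looks valid."""
    problems = []
    for i, message in enumerate(messages):
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: invalid or missing role")
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain string content is allowed
        if not isinstance(content, list):
            problems.append(f"message {i}: content must be a string or a list")
            continue
        for j, item in enumerate(content):
            if not isinstance(item, dict) or item.get("type") not in VALID_TYPES:
                problems.append(f"message {i}, item {j}: invalid or missing type")
    return problems


good = [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]
bad = [{"role": "robot", "content": [{"type": "audio", "audio": "clip.wav"}]}]

good_problems = validate_messages(good)
bad_problems = validate_messages(bad)
```

Here good_problems is empty, while bad_problems reports both the unknown role and the unsupported content type.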
For batch generation:
  1. Set processor.tokenizer.padding_side = 'left'
  2. Include padding=True in apply_chat_template()
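Left padding matters for generation because new tokens are appended at the right end of each sequence, so the real prompt tokens should sit flush against that end. The sketch below illustrates the idea on plain Python lists; pad_batch and the pad ID of 0 are hypothetical and only mimic what the tokenizer does internally.

```python
def pad_batch(sequences: list, pad_id: int, side: str = "left") -> tuple:
    """Pad token ID lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        real = [1] * len(seq)   # 1 marks real tokens
        fake = [0] * len(pad)   # 0 marks padding
        if side == "left":
            padded.append(pad + seq)
            masks.append(fake + real)
        else:
            padded.append(seq + pad)
            masks.append(real + fake)
    return padded, masks


input_ids, attention_mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
```

With left padding the shorter prompt becomes [0, 0, 8], so generation continues directly after its last real token rather than after padding.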
When using add_vision_id=True, images and videos are labeled as “Picture 1:”, “Picture 2:”, “Video 1:”, etc. This helps the model reference specific visual inputs in its response.
