Processor

AutoProcessor.from_pretrained

Load the processor for tokenizing text and preprocessing images/videos.

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

Parameters

pretrained_model_name_or_path

string

required

Model identifier from Hugging Face Hub or local path. Should match the model being used.

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

Returns

Returns a Qwen3VLProcessor instance that handles both text tokenization and image/video preprocessing.

processor.apply_chat_template

Format conversation messages into model inputs with proper tokenization and vision preprocessing.

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

Parameters

messages

list[dict]

required

Conversation messages in chat format. Each message contains a role and content.

Message Format:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://example.com/image.jpg"
            },
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

Content Types:

{"type": "text", "text": "..."}
{"type": "image", "image": "url|path|base64"}
{"type": "video", "video": "url|path"}

Image Sources:

URL: "https://example.com/image.jpg"
Local file: "file:///path/to/image.jpg"
Base64: "data:image;base64,/9j/..."

Video Sources:

URL: "https://example.com/video.mp4"
Local file: "file:///path/to/video.mp4"
Frame list: ["file:///frame1.jpg", "file:///frame2.jpg", ...]
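The source forms listed above can be assembled with the standard library. The following is a minimal sketch: `local_image_uri`, `image_data_uri`, and `user_message` are hypothetical helper names (not part of transformers), and the data URI prefix simply follows the `data:image;base64,` form shown above.

```python
import base64
from pathlib import Path


def local_image_uri(path: str) -> str:
    """Turn a filesystem path into a file:// URI (hypothetical helper)."""
    return Path(path).resolve().as_uri()


def image_data_uri(raw: bytes) -> str:
    """Encode raw image bytes as a base64 data URI (hypothetical helper)."""
    return "data:image;base64," + base64.b64encode(raw).decode("ascii")


def user_message(image: str, prompt: str) -> dict:
    """Build one user turn with an image item followed by a text item."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }


# The first three bytes of a JPEG file encode to "/9j/", as in the Base64 example.
jpeg_header = bytes([0xFF, 0xD8, 0xFF])
messages = [user_message(image_data_uri(jpeg_header), "Describe this image.")]
```

In practice the bytes would come from reading a real image file; the three-byte header is used here only to keep the sketch self-contained.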

tokenize

bool

default:"True"

Whether to tokenize the text into input IDs.

True: Returns tokenized tensors ready for model input
False: Returns formatted text string

# Get tokenized inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt"
)

# Get text only
text = processor.apply_chat_template(
    messages,
    tokenize=False
)

add_generation_prompt

bool

default:"False"

Whether to append the assistant prompt tokens at the end for generation. Should be True when generating responses and False when formatting training data.

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True  # Add <|im_start|>assistant\n
)
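To make the effect of the flag concrete, the sketch below renders a text-only conversation in a simplified ChatML layout. It is an illustration only: `render_chatml` is a hypothetical function, and the chat template shipped with the model also handles vision placeholders and other details that are omitted here.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Simplified, text-only ChatML rendering (hypothetical, for illustration)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Hello"}]
for_training = render_chatml(chat)
for_generation = render_chatml(chat, add_generation_prompt=True)
```

The only difference between the two strings is the trailing open assistant turn, which is why the flag is left off when the conversation already contains the assistant's reply.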

return_dict

bool

default:"False"

Whether to return a dictionary with named fields.

True: Returns dict with input_ids, attention_mask, pixel_values, etc.
False: Returns only the input_ids (other fields such as pixel_values are not included)

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True
)
# Access: inputs.input_ids, inputs.pixel_values, etc.

return_tensors

string

Type of tensors to return.

"pt": PyTorch tensors (most common)
"np": NumPy arrays
None: Python lists

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt"
)

padding

bool | string

default:"False"

Padding strategy for batch processing.

True: Pad to longest sequence in batch
False: No padding
"max_length": Pad to model’s max length

Required for batch generation.

# Batch processing
processor.tokenizer.padding_side = 'left'
inputs = processor.apply_chat_template(
    batch_messages,
    tokenize=True,
    return_tensors="pt",
    padding=True
)
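Left padding matters for generation because a decoder-only model appends new tokens at the right end of each row, so the real tokens need to sit next to that position. The sketch below shows the idea on plain Python lists; `pad_batch` is a hypothetical helper written for illustration, and the processor performs the real padding internally.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to the longest length; return ids and attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        filler = [pad_id] * (longest - len(seq))
        mask_pad = [0] * len(filler)
        real = [1] * len(seq)
        if side == "left":
            input_ids.append(filler + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + filler)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask


# With left padding, every row ends with its last real token.
ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=0, side="left")
```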

add_vision_id

bool

default:"False"

Whether to add labels (“Picture 1:”, “Video 1:”) to visual inputs. Helpful when processing multiple images/videos, so each one can be referenced by number.

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_vision_id=True
)
# Output includes: "Picture 1: <image> Picture 2: <image> Video 1: <video>"
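The numbering can be previewed without running the processor. The sketch below is an approximation: `vision_labels` is a hypothetical helper, and it assumes that pictures and videos are counted separately and that the counters run across the whole conversation, which matches the example output above but is not confirmed by this page.

```python
def vision_labels(messages):
    """Return 'Picture N' / 'Video N' labels in order of appearance (illustrative)."""
    counts = {"image": 0, "video": 0}
    names = {"image": "Picture", "video": "Video"}
    labels = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            continue  # plain-string content carries no visual items
        for item in content:
            kind = item.get("type")
            if kind in counts:
                counts[kind] += 1
                labels.append(f"{names[kind]} {counts[kind]}")
    return labels


example = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/a.jpg"},
            {"type": "image", "image": "file:///path/to/b.jpg"},
            {"type": "video", "video": "file:///path/to/clip.mp4"},
            {"type": "text", "text": "Compare these."},
        ],
    }
]
labels = vision_labels(example)
```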

fps

float

default:"2"

Frame sampling rate for videos (frames per second).

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    fps=4  # Sample 4 frames per second
)

num_frames

int

Exact number of frames to sample from the video. Use it instead of fps: set fps=None when passing num_frames.

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_tensors="pt",
    num_frames=128,
    fps=None  # Must set fps=None when using num_frames
)
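As a rough guide to how the two sampling options relate, the sketch below estimates how many frames would be drawn from a clip. `estimate_frame_count` is a hypothetical helper; it assumes simple `duration * fps` sampling and ignores any minimum, maximum, or rounding that the video processor may apply.

```python
def estimate_frame_count(duration_s, fps=2.0, num_frames=None):
    """Rough number of sampled frames; ignores any processor-side limits."""
    if num_frames is not None:
        if fps is not None:
            # Mirrors the note above: the two options are not combined.
            raise ValueError("set fps=None when passing num_frames")
        return num_frames
    return max(1, round(duration_s * fps))


default_rate = estimate_frame_count(30)                       # 30 s at 2 fps
denser_rate = estimate_frame_count(30, fps=4)                 # 30 s at 4 fps
fixed_count = estimate_frame_count(30, fps=None, num_frames=128)
```

A fixed `num_frames` keeps the visual token cost independent of clip length, while `fps` keeps the temporal resolution constant and lets the cost grow with duration.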

Returns

input_ids

torch.Tensor

Tokenized input text. Shape: (batch_size, sequence_length)

attention_mask

torch.Tensor

Mask indicating which tokens should be attended to. Shape: (batch_size, sequence_length)

pixel_values

torch.Tensor

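Preprocessed image data (if images are present in messages). Qwen-VL image processors flatten images into patches, so the shape is (num_patches, patch_dim) rather than (num_images, channels, height, width).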

video_grid_thw

torch.Tensor

Video grid dimensions: temporal, height, width (if videos present).

Complete Examples

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            },
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

# `model` is assumed to be loaded already; move the inputs to its device
inputs = inputs.to(model.device)

Advanced: Controlling Visual Token Budget

Control the number of visual tokens by adjusting pixel budgets:

processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct"
)

# Image processor: 256-1280 visual tokens (32x spatial compression)
processor.image_processor.size = {
    "longest_edge": 1280 * 32 * 32,
    "shortest_edge": 256 * 32 * 32
}

# Video processor: 256-16384 visual tokens (32x spatial + 2x temporal compression)
processor.video_processor.size = {
    "longest_edge": 16384 * 32 * 32 * 2,
    "shortest_edge": 256 * 32 * 32 * 2
}
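The numbers in the `size` dictionaries are pixel counts derived from token budgets. The sketch below spells out that arithmetic. `pixel_budget` and `approx_image_tokens` are hypothetical helpers, and the token estimate assumes one token per 32x32 pixel area with no resizing, so it is only a rough guide.

```python
SPATIAL_FACTOR = 32   # each visual token covers a 32x32 pixel area
TEMPORAL_FACTOR = 2   # video tokens additionally merge 2 frames


def pixel_budget(tokens, video=False):
    """Convert a visual-token budget into the pixel count used in `size`."""
    pixels = tokens * SPATIAL_FACTOR * SPATIAL_FACTOR
    return pixels * TEMPORAL_FACTOR if video else pixels


def approx_image_tokens(width, height):
    """Rough token count for an image, before any resizing by the processor."""
    return (width // SPATIAL_FACTOR) * (height // SPATIAL_FACTOR)


image_size = {
    "longest_edge": pixel_budget(1280),
    "shortest_edge": pixel_budget(256),
}
video_size = {
    "longest_edge": pixel_budget(16384, video=True),
    "shortest_edge": pixel_budget(256, video=True),
}
```

Under these assumptions a 1024x768 image costs about 32 * 24 = 768 tokens, which falls inside the 256-1280 image budget above.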

Notes

Message Format Requirements:

Each message must have "role" ("user", "assistant", or "system")
Content can be a string or list of content items
Content items must specify "type" ("text", "image", or "video")
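These requirements can be checked before calling the processor. The sketch below is one possible validator; `validate_messages` is a hypothetical helper that enforces only the three rules listed above, so passing it does not guarantee the processor will accept the input.

```python
VALID_ROLES = {"user", "assistant", "system"}
VALID_TYPES = {"text", "image", "video"}


def validate_messages(messages):
    """Raise ValueError if a message breaks the format rules listed above."""
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in VALID_ROLES:
            raise ValueError(f"message {index}: invalid role {role!r}")
        content = message.get("content")
        if isinstance(content, str):
            continue  # plain-string content is allowed as-is
        if not isinstance(content, list):
            raise ValueError(f"message {index}: content must be a string or a list")
        for item in content:
            if not isinstance(item, dict) or item.get("type") not in VALID_TYPES:
                raise ValueError(f"message {index}: invalid content item {item!r}")


validate_messages(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [{"type": "text", "text": "Hi"}]},
    ]
)
```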

For batch generation:

Set processor.tokenizer.padding_side = 'left'
Include padding=True in apply_chat_template()

When using add_vision_id=True, images and videos are labeled as “Picture 1:”, “Picture 2:”, “Video 1:”, etc. This helps the model reference specific visual inputs in its response.
