Overview

Batch inference processes several requests in a single generate call, which usually improves throughput compared with running the same requests one at a time.

Basic Batch Inference

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", 
    dtype="auto", 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# For batch generation, padding_side should be set to left!
processor.tokenizer.padding_side = 'left'

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]

messages2 = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
]

# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True  # padding should be set for batch generation!
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
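The trimming step can be illustrated without loading a model. The sketch below uses toy token IDs in plain Python lists (the values and the pad ID 0 are made up for illustration) to show why slicing each output row by the length of its padded input row leaves only the newly generated tokens.

```python
# Toy token IDs standing in for real tensors; 0 is a hypothetical pad ID.
PAD = 0
input_ids = [
    [PAD, PAD, 11, 12],  # short prompt, left-padded to length 4
    [21, 22, 23, 24],    # long prompt, no padding needed
]
# generate() returns each padded prompt followed by its new tokens.
generated_ids = [
    [PAD, PAD, 11, 12, 101, 102],
    [21, 22, 23, 24, 201, 202],
]

# Same slicing as in the example above: drop the prompt prefix from each row.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[101, 102], [201, 202]]
```

Because every input row has the same padded length, the slice removes the padding and the prompt together.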

Key Considerations

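Important: For batch generation, you must:
  1. Set padding_side = 'left' on the tokenizer, so that every prompt ends at the same position and generation continues directly from its last real token
  2. Enable padding=True in apply_chat_template, so that prompts of different lengths can be stacked into one tensor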
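To make the difference between the two padding sides concrete, here is a minimal sketch in plain Python. The helper pad_batch is hypothetical and is not part of transformers; it only mimics what the tokenizer does when it pads a batch and builds the attention mask.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to equal length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (width - len(seq))
        mask_pad = [0] * len(pad)  # padding positions are masked out
        real = [1] * len(seq)      # real tokens are attended to
        if side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask


prompts = [[11, 12], [21, 22, 23, 24]]
left_ids, left_mask = pad_batch(prompts, pad_id=0, side="left")
right_ids, _ = pad_batch(prompts, pad_id=0, side="right")
print(left_ids)   # [[0, 0, 11, 12], [21, 22, 23, 24]]
print(right_ids)  # [[11, 12, 0, 0], [21, 22, 23, 24]]
```

With right padding, the short prompt would end in pad tokens, so new tokens would be appended after padding rather than after the prompt's last real token.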

Padding Configuration

# Required: Set padding to left side
processor.tokenizer.padding_side = 'left'

# Required: Enable padding in template
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True  # Must be True for batching
)

Mixed Content Batching

You can batch requests with different content types:

# Request 1: Multiple images
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Compare these images."},
        ],
    }
]

# Request 2: Single image
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image3.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Request 3: Text only
messages3 = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "What is the capital of France?"}],
    }
]

# Process all three requests together as one batch,
# then call apply_chat_template with padding=True as shown in Padding Configuration
messages = [messages1, messages2, messages3]

Batch with Video Content

# Batch with video and image
messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
            },
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]

messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

processor.tokenizer.padding_side = 'left'
messages = [messages1, messages2]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True
)

Performance Tips

Optimization Recommendations:
  1. Group similar requests: Batch requests with similar lengths to minimize padding overhead
  2. Use flash_attention_2: Can speed up generation and reduce memory use, especially for multi-image and video inputs; it requires the flash-attn package and a half-precision dtype (float16 or bfloat16)
  3. Adjust batch size: Larger batches raise throughput but also increase memory use
  4. Monitor GPU memory: Larger batches require more VRAM
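The first recommendation can be sketched as a small bucketing helper. Everything here is illustrative: estimate_length, make_batches, and the MEDIA_COST weights are assumptions, not part of the Qwen3-VL tooling, and the real token count of an image or video depends on its resolution and duration.

```python
# Hypothetical per-item weights; real token counts depend on resolution and duration.
MEDIA_COST = {"image": 1000, "video": 10000}


def estimate_length(conversation):
    """Rough cost proxy: text characters plus a flat cost per media item."""
    total = 0
    for message in conversation:
        for part in message["content"]:
            if part["type"] == "text":
                total += len(part["text"])
            else:
                total += MEDIA_COST.get(part["type"], 0)
    return total


def make_batches(conversations, batch_size):
    """Sort conversation indices by estimated length, then chunk them."""
    order = sorted(range(len(conversations)),
                   key=lambda i: estimate_length(conversations[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def user(*parts):
    """Build a single-turn conversation in the message format used on this page."""
    return [{"role": "user", "content": list(parts)}]


conversations = [
    user({"type": "text", "text": "Hi"}),
    user({"type": "image", "image": "file:///path/to/image1.jpg"},
         {"type": "text", "text": "Describe this image."}),
    user({"type": "text", "text": "What is the capital of France?"}),
    user({"type": "image", "image": "file:///path/to/image3.jpg"},
         {"type": "text", "text": "Compare."}),
]
batches = make_batches(conversations, batch_size=2)
print(batches)  # [[0, 2], [3, 1]]
```

The helper returns indices rather than conversations so that outputs can be mapped back to the original request order after generation.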

Enable Flash Attention

import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

Memory Considerations

Batch size impacts memory usage. As rough starting points:
  • Small batches (2-4): Better for mixed content types
  • Medium batches (4-8): Good balance for similar requests
  • Large batches (8+): Best for uniform, text-only requests

Video content uses significantly more memory than images. Reduce batch size when processing videos.
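One way to apply these guidelines is a small heuristic that inspects the content types in a set of conversations. The function suggest_batch_size and the specific numbers it returns are assumptions chosen from the lower end of the ranges above; treat them as starting points to tune against your own GPU memory, not as measured values.

```python
def content_types(conversation):
    """Collect the set of content types ('text', 'image', 'video') in one conversation."""
    return {part["type"] for message in conversation for part in message["content"]}


def suggest_batch_size(conversations):
    """Pick a starting batch size from the content mix (hypothetical heuristic)."""
    per_conversation = [frozenset(content_types(c)) for c in conversations]
    all_types = set().union(*per_conversation) if per_conversation else set()
    if "video" in all_types:
        return 2  # video is the most memory-hungry input, so stay small
    if all_types == {"text"}:
        return 8  # uniform, text-only requests tolerate large batches
    if len(set(per_conversation)) == 1:
        return 4  # similar requests, e.g. all image plus text
    return 2      # mixed content types


text_only = [[{"role": "user", "content": [{"type": "text", "text": "Who are you?"}]}]]
image_req = [[{"role": "user", "content": [
    {"type": "image", "image": "file:///path/to/image.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]]
video_req = [[{"role": "user", "content": [
    {"type": "video", "video": "file:///path/to/video.mp4"},
    {"type": "text", "text": "Summarize this video."},
]}]]

print(suggest_batch_size(text_only * 3))          # 8
print(suggest_batch_size(image_req * 3))          # 4
print(suggest_batch_size(text_only + image_req))  # 2
print(suggest_batch_size(video_req + image_req))  # 2
```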

Error Handling

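try:
    generated_ids = model.generate(**inputs, max_new_tokens=128)
except RuntimeError as e:
    # CUDA out-of-memory errors are raised as a RuntimeError subclass
    if "out of memory" in str(e):
        print("Reduce batch size or use smaller images/videos")
    raise  # re-raise so the caller still sees the failure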
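Instead of only reporting the failure, a caller can retry with a smaller batch. The sketch below is one possible approach, not part of the Qwen3-VL API: generate_with_backoff and its run_batch callback are hypothetical names, and run_batch stands in for whatever function builds inputs and calls model.generate for a list of conversations. In real GPU code you would likely also call torch.cuda.empty_cache() before retrying.

```python
def generate_with_backoff(items, run_batch, batch_size, min_batch_size=1):
    """Run run_batch over items in chunks, halving the chunk size on OOM errors."""
    results = []
    start = 0
    while start < len(items):
        chunk = items[start:start + batch_size]
        try:
            results.extend(run_batch(chunk))
        except RuntimeError as e:
            is_oom = "out of memory" in str(e).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise  # unrelated error, or nothing left to shrink
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same chunk start with a smaller batch
        start += len(chunk)
    return results


def fake_run_batch(chunk):
    """Stand-in for model.generate: pretends batches larger than 2 do not fit."""
    if len(chunk) > 2:
        raise RuntimeError("CUDA out of memory")
    return [x * 2 for x in chunk]


outputs = generate_with_backoff(list(range(5)), fake_run_batch, batch_size=4)
print(outputs)  # [0, 2, 4, 6, 8]
```

Results stay in input order because each chunk is only advanced past once it has completed successfully.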

Next Steps

  • Generation Parameters: Configure sampling parameters for better outputs
  • Pixel Control: Optimize memory usage with resolution control
