Batch Inference

Overview

Batch inference sends several conversations through a single model.generate call instead of one call per request, which typically improves throughput and GPU utilization.

Basic Batch Inference

from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct", 
    dtype="auto", 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")

# For batch generation, padding_side should be set to left!
processor.tokenizer.padding_side = 'left'

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]

messages2 = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": "Who are you?"}]},
]

# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True  # padding should be set for batch generation!
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
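The trimming step works because generate returns each prompt followed by its continuation, and all padded prompts in a batch share the same length. The following is a toy sketch with plain Python lists; the token IDs and the pad ID 0 are invented for illustration and are not real tokenizer output.

```python
# Toy illustration of prompt trimming; all token IDs are invented.
PAD = 0  # hypothetical pad token ID

# Two left-padded prompts of equal length, as a padded batch would contain.
input_ids = [
    [11, 12, 13, 14],
    [PAD, PAD, 21, 22],
]

# generate() returns the prompt followed by the new tokens for each row.
generated_ids = [
    [11, 12, 13, 14, 101, 102],
    [PAD, PAD, 21, 22, 201, 202],
]

# Same slicing as in the example above: drop the prompt prefix from each row.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[101, 102], [201, 202]]
```

Only the newly generated tokens remain, so batch_decode returns the answers without echoing the prompts.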

Key Considerations

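Important: Batch generation requires two settings:

Set padding_side = 'left' on the tokenizer, so that every prompt ends at the last position and generation continues from real tokens rather than from padding
Enable padding=True in apply_chat_template, so that prompts of different lengths can be stacked into one tensor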
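To see why the padding side matters, here is a minimal sketch using plain Python lists. The helper pad_batch is hypothetical (it is not part of transformers), and the token IDs are invented. A decoder-only model predicts the next token from the last position of each row, so that position should hold a real token.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad token-ID lists to a common length (illustrative helper, not a library API)."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded


prompts = [[5, 6, 7, 8], [9, 10]]  # invented token IDs

left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")

# Generation continues from the last column of each row.
print([row[-1] for row in left])   # [8, 10] -> both rows end in a real token
print([row[-1] for row in right])  # [8, 0]  -> second row ends in padding
```

With right padding, the shorter prompt would be continued from a pad token, which usually degrades the output for that row.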

Padding Configuration

# Required: Set padding to left side
processor.tokenizer.padding_side = 'left'

# Required: Enable padding in template
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True  # Must be True for batching
)

Mixed Content Batching

You can batch requests with different content types:

# Request 1: multiple images
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Compare these images."},
        ],
    }
]

# Request 2: single image
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image3.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Request 3: text only
messages3 = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "What is the capital of France?"}],
    }
]

# Process all together
messages = [messages1, messages2, messages3]
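Before sending a mixed batch, it can help to check what each conversation contains, for example to keep media-heavy requests out of large batches. The sketch below is one possible way to do that; content_types is a hypothetical helper, and it assumes every turn's content is a list of typed parts, as in the messages above.

```python
from collections import Counter


def content_types(conversation):
    """Count content-part types across all turns of one conversation."""
    counts = Counter()
    for turn in conversation:
        for part in turn["content"]:
            counts[part["type"]] += 1
    return counts


conversations = [
    [{"role": "user", "content": [
        {"type": "image", "image": "file:///path/to/image1.jpg"},
        {"type": "image", "image": "file:///path/to/image2.jpg"},
        {"type": "text", "text": "Compare these images."},
    ]}],
    [{"role": "user", "content": [
        {"type": "text", "text": "What is the capital of France?"},
    ]}],
]

for index, conversation in enumerate(conversations):
    print(index, dict(content_types(conversation)))
```

The first conversation reports two image parts and one text part, the second only text.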

Batch with Video Content

# Batch with video and image
messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
            },
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]

messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

processor.tokenizer.padding_side = 'left'
messages = [messages1, messages2]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    padding=True
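)
# Move the tensors to the model's device, then generate as in the basic example
inputs = inputs.to(model.device)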

Performance Tips

Optimization Recommendations:

Group similar requests: Batch requests with similar lengths to minimize padding overhead
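Use flash_attention_2: Can speed up batch processing and reduce memory use, particularly with multiple images or video; it requires the flash-attn package and a half-precision dtype (float16 or bfloat16)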
Adjust batch size: Balance between throughput and memory usage
Monitor GPU memory: Larger batches require more VRAM
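Grouping by length could be sketched as follows. The helper bucket_by_length is hypothetical, and word count is only a crude stand-in for sequence length: in practice, token counts (including image and video tokens) would be a better measure.

```python
def bucket_by_length(items, lengths, batch_size):
    """Sort items by length, then cut them into batches of at most batch_size.

    Returns lists of original indices so outputs can be mapped back to requests.
    """
    order = sorted(range(len(items)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


prompts = ["a b c d e f g h", "a b", "a b c d e f g", "a"]
lengths = [len(p.split()) for p in prompts]  # crude proxy: word count

batches = bucket_by_length(prompts, lengths, batch_size=2)
print(batches)  # [[3, 1], [2, 0]]
```

The two short prompts end up in one batch and the two long ones in another, so neither batch pads a short prompt up to the length of a long one.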

Enable Flash Attention

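import torch
from transformers import AutoModelForImageTextToText

# flash_attention_2 needs a half-precision dtype such as bfloat16
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-235B-A22B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)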

Memory Considerations

Memory use grows with batch size. The ranges below are rough starting points, not hard limits:

Small batches (2-4): Better for mixed content types
Medium batches (4-8): Good balance for similar requests
Large batches (8+): Best for uniform, text-only requests

Video content uses significantly more memory than images. Reduce batch size when processing videos.

Error Handling

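try:
    generated_ids = model.generate(**inputs, max_new_tokens=128)
except RuntimeError as e:
    # CUDA out-of-memory errors surface as a RuntimeError with this message
    if "out of memory" in str(e):
        print("Reduce batch size or use smaller images/videos")
    # Re-raise so the caller still sees the failure
    raise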
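Instead of only reporting the error, a caller could retry with smaller batches. The sketch below shows one such approach under stated assumptions: generate_with_backoff, run_batch, and is_oom are hypothetical names, and fake_run_batch simulates the real preprocessing and generation pipeline so that the example runs without a GPU.

```python
def generate_with_backoff(requests, run_batch, is_oom):
    """Run run_batch on requests, halving the batch whenever is_oom(error) is true."""
    try:
        return run_batch(requests)
    except RuntimeError as err:
        if not is_oom(err) or len(requests) == 1:
            raise  # not an out-of-memory error, or nothing left to split
        mid = len(requests) // 2
        return (generate_with_backoff(requests[:mid], run_batch, is_oom)
                + generate_with_backoff(requests[mid:], run_batch, is_oom))


# Stand-in for the real pipeline: fails on batches larger than 2.
def fake_run_batch(requests):
    if len(requests) > 2:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [f"answer to {r}" for r in requests]


outputs = generate_with_backoff(
    ["q1", "q2", "q3", "q4", "q5"],
    fake_run_batch,
    is_oom=lambda err: "out of memory" in str(err),
)
print(outputs)
```

Results come back in the original request order. In a real setup, run_batch would wrap apply_chat_template, generate, and batch_decode, and calling torch.cuda.empty_cache() before each retry may help release cached memory.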

Next Steps

Generation Parameters

Pixel Control
