Get up and running with Qwen3-VL in just a few steps. This guide will walk you through your first image inference.

Installation

1. Install Transformers

Qwen3-VL requires transformers 4.57.0 or higher:
pip install "transformers>=4.57.0"
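2. Install Optional Dependencies

The inference example below also relies on PyTorch (it requests return_tensors="pt") and Accelerate (required for device_map="auto"). Install them together with torchvision, which is recommended: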
pip install accelerate torch torchvision
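
To confirm that the installed transformers meets the 4.57.0 minimum stated above, a small check like the following can help. It is a minimal sketch using only the standard library; the helper names (version_tuple, meets_minimum, check_transformers) are illustrative and not part of any Qwen or Transformers API. It compares only the leading numeric part of the version, so a pre-release such as 4.57.0.dev0 is treated as 4.57.0.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.57.0"  # minimum version stated in this guide


def version_tuple(version: str) -> tuple:
    """Return the leading numeric components, padded to three parts.

    '4.57.0.dev0' -> (4, 57, 0); '4.57' -> (4, 57, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return (0, 0, 0)
    parts = tuple(int(part) for part in match.group(0).split("."))
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_transformers() -> str:
    """Describe whether the installed transformers is recent enough."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} is new enough"
    return f"transformers {installed} is older than {MIN_TRANSFORMERS}; upgrade it"


if __name__ == "__main__":
    print(check_transformers())
```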

Your First Inference

Here’s a complete example to perform image understanding with Qwen3-VL:
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model on available devices
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", 
    dtype="auto", 
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

# Prepare your message with an image
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)
print(output_text)
We recommend starting with the 8B model for development. For production, see the Model Variants guide.
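
The trimming step in the example exists because generate returns each prompt's tokens followed by the newly generated ones. Slicing off the first len(in_ids) tokens leaves only the model's answer. The sketch below reproduces that slicing with plain Python lists standing in for tensors; the token IDs are made up for illustration and trim_prompt_tokens is a hypothetical helper name.

```python
def trim_prompt_tokens(prompt_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence, row by row."""
    return [out[len(inp):] for inp, out in zip(prompt_ids, generated_ids)]


# Two prompts of three tokens each (illustrative IDs, not real vocabulary entries).
prompt_ids = [[101, 7, 8], [101, 9, 10]]
# Each output row repeats its prompt and then appends new tokens.
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44]]

print(trim_prompt_tokens(prompt_ids, generated_ids))  # [[42, 43], [44]]
```

Without this step, the decoded text would begin with the chat-template prompt rather than the answer.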

Image Input Formats

Qwen3-VL supports multiple image input formats. An image referenced by URL is written as:
{
    "type": "image",
    "image": "https://example.com/image.jpg"
}
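
When messages are built programmatically, a small helper can produce this entry. The sketch below is illustrative only: image_entry and user_message are hypothetical names, and passing a plain local file path in the "image" field is an assumption about the processor rather than something this page confirms. Only the URL form is shown above.

```python
from pathlib import Path


def image_entry(source: str) -> dict:
    """Build an image content entry from a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        return {"type": "image", "image": source}
    path = Path(source).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No image file at {path}")
    # Assumption: the processor accepts a plain local path in this field.
    return {"type": "image", "image": str(path.resolve())}


def user_message(source: str, prompt: str) -> list:
    """Wrap one image and one text prompt in the chat message layout used above."""
    return [
        {
            "role": "user",
            "content": [image_entry(source), {"type": "text", "text": prompt}],
        }
    ]


messages = user_message("https://example.com/image.jpg", "Describe this image.")
print(messages[0]["content"][0]["image"])
```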

Model Selection

Choose a model size based on your use case:

2B / 4B Models

Edge deployment - Run on consumer GPUs or mobile devices

8B Model

Balanced - Best for most applications (24GB VRAM)

32B Model

High performance - Production use cases (80GB VRAM)

235B MoE Model

Maximum capability - Research and specialized tasks (8x80GB)
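
As a rough sanity check on the VRAM figures above, the memory taken by the weights alone can be estimated as parameter count times bytes per parameter. The sketch below uses the nominal sizes from the model names; actual parameter counts differ somewhat, and activations, the KV cache, and image or video tokens need additional memory on top. That headroom is consistent with the figures above exceeding these estimates.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_memory_gb(params_billions: float, dtype: str = "bfloat16") -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


# Nominal sizes taken from the model names; for the MoE model, all experts
# are counted because every expert's weights must be resident in memory.
for name, size in [("2B", 2), ("4B", 4), ("8B", 8), ("32B", 32), ("235B MoE", 235)]:
    print(f"{name}: ~{weight_memory_gb(size):.0f} GB of weights in bfloat16")
```

For example, the 8B model needs about 16 GB for bfloat16 weights, which leaves roughly 8 GB of a 24GB card for everything else.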

Performance Optimization

For better performance, enable Flash Attention 2:
pip install flash-attn --no-build-isolation
Then load the model in half precision (Flash Attention 2 supports only float16 and bfloat16):
import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    dtype=torch.bfloat16,  # same argument name as in the first example
    attn_implementation="flash_attention_2",
    device_map="auto"
)
Flash Attention 2 is especially beneficial for multi-image and video scenarios, reducing memory usage and improving speed.
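
Because flash-attn does not build or install on every machine, it can be convenient to choose the attention backend at runtime. The sketch below is one possible approach, not an official recommendation: pick_attn_implementation is a hypothetical helper that falls back to "sdpa" (PyTorch's scaled dot-product attention, another value accepted by attn_implementation) when the flash_attn package is not importable. It checks only that the package is installed, not that the GPU supports it.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is installed, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn = pick_attn_implementation()
print(f"Using attention implementation: {attn}")

# The result can then be passed to the loading call shown above, e.g.
#   AutoModelForImageTextToText.from_pretrained(..., attn_implementation=attn)
```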

Next Steps

Image Processing

Learn about multi-image inference and resolution control

Video Processing

Process videos with frame sampling

Capabilities

Explore OCR, grounding, document parsing, and more

Deployment

Deploy with vLLM or SGLang for production
