Quick Start

Get up and running with Qwen3-VL in a few steps. This guide walks you through installation and your first image inference.

Installation

Install Transformers

Qwen3-VL requires transformers 4.57.0 or higher:

pip install "transformers>=4.57.0"
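To confirm that the installed release meets the 4.57.0 minimum stated above, a small check along these lines can help. This is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helper names, not part of Transformers, and the parser only reads the leading integers of each version component, so a pre-release such as 4.57.0.dev0 is treated the same as 4.57.0.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum release stated in this guide.
MIN_TRANSFORMERS = (4, 57, 0)


def parse_version(text):
    """Turn '4.57.0' or '4.57.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            # Stop at the first component with no leading number (e.g. 'dev0').
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_TRANSFORMERS):
    """Return True if the installed version string is at least `minimum`."""
    parsed = parse_version(installed)
    # Pad short versions such as '5.0' so the tuple comparison is fair.
    padded = parsed + (0,) * (len(minimum) - len(parsed))
    return padded >= minimum


try:
    installed = version("transformers")
    status = "OK" if meets_minimum(installed) else "too old for Qwen3-VL"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```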

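Install Additional Dependencies

The examples in this guide also need PyTorch, and loading with device_map="auto" requires accelerate. Install both, along with the recommended torchvision package:

pip install accelerate torch torchvision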

Your First Inference

The following complete example loads the model, sends it one image with a text prompt, and prints the generated description:

from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the model on available devices
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", 
    dtype="auto", 
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

# Prepare your message with an image
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, 
    skip_special_tokens=True, 
    clean_up_tokenization_spaces=False
)
print(output_text)
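The trimming step in the example above is needed because generate returns the prompt tokens followed by the new tokens, so the prompt prefix is sliced off before decoding. The same slicing can be sketched on plain Python lists. The token IDs below are made up for illustration, and `trim_prompt_tokens` is a hypothetical helper name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop each sequence's prompt prefix, keeping only newly generated tokens."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Toy batch with one sequence: four prompt tokens, then three generated tokens.
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]

print(trim_prompt_tokens(prompt_batch, generated_batch))  # [[42, 43, 2]]
```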

We recommend starting with the 8B model for development. For production, see the Model Variants guide.

Image Input Formats

Qwen3-VL supports multiple image input formats:

URL

{
    "type": "image",
    "image": "https://example.com/image.jpg"
}

Local File

{
    "type": "image",
    "image": "file:///path/to/image.jpg"
}

Base64

{
    "type": "image",
    "image": "data:image;base64,/9j/4AAQSkZJRg..."
}
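The three formats above can be produced programmatically. The sketch below uses only the standard library; the helper names (`image_from_url`, `image_from_path`, `image_from_bytes`) are hypothetical and not part of Qwen3-VL or Transformers. It assumes the processor accepts exactly the string shapes shown above.

```python
import base64
from pathlib import Path


def image_from_url(url):
    """Content entry for a remote image."""
    return {"type": "image", "image": url}


def image_from_path(path):
    """Content entry for a local file, written as an absolute file:// URI."""
    return {"type": "image", "image": Path(path).resolve().as_uri()}


def image_from_bytes(data):
    """Content entry for raw image bytes, written as a Base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image", "image": "data:image;base64," + encoded}


# The first three bytes of a JPEG file encode to the '/9j/' prefix seen above.
print(image_from_bytes(b"\xff\xd8\xff")["image"])
print(image_from_path("image.jpg")["image"])
```

Any of these entries can be placed in the "content" list of a user message alongside a text entry.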

Model Selection

Choose a model size based on your use case:

2B / 4B Models

Edge deployment - Run on consumer GPUs or mobile devices

8B Model

Balanced - Best for most applications (24GB VRAM)

32B Model

High performance - Production use cases (80GB VRAM)

235B MoE Model

Maximum capability - Research and specialized tasks (8x80GB)
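As a rough sanity check on the VRAM figures above, the memory needed for the weights alone is the parameter count multiplied by the bytes per parameter. The sketch below assumes the nominal parameter counts implied by the model names; real checkpoints differ somewhat. It ignores activations, the KV cache, and framework overhead, so treat the results as lower bounds. `weight_memory_gib` is a hypothetical helper name.

```python
# Bytes used to store one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Nominal parameter counts taken from the model names (an assumption).
NOMINAL_SIZES = [("2B", 2e9), ("4B", 4e9), ("8B", 8e9), ("32B", 32e9), ("235B", 235e9)]

for label, params in NOMINAL_SIZES:
    print(f"{label}: about {weight_memory_gib(params):.0f} GiB of weights in bfloat16")
```

For the MoE model, all experts must be resident in memory even though only some are active per token, which is why the estimate uses the total parameter count.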

Performance Optimization

For better performance, enable Flash Attention 2:

pip install flash-attn --no-build-isolation

Then load the model with:

import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    # Flash Attention 2 requires half precision (float16 or bfloat16)
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)

Flash Attention 2 is especially beneficial for multi-image and video scenarios, reducing memory usage and improving speed.
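If the same script has to run on machines where flash-attn may not be installed, one option is to pick the attention backend at load time. The sketch below falls back to "sdpa", PyTorch's built-in scaled dot-product attention, which Transformers also accepts for attn_implementation. `pick_attn_implementation` is a hypothetical helper name, and the check only detects whether the package is importable, not whether the GPU supports it.

```python
import importlib.util


def pick_attn_implementation():
    """Return 'flash_attention_2' if flash-attn is installed, else 'sdpa'."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(f"Using attention backend: {attn_implementation}")
```

The returned string can be passed as attn_implementation= in the from_pretrained call shown above.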

Next Steps

Image Processing

Learn about multi-image inference and resolution control

Video Processing

Process videos with frame sampling

Capabilities

Explore OCR, grounding, document parsing, and more

Deployment

Deploy with vLLM or SGLang for production
