vLLM Deployment

Overview

We recommend using vLLM for fast Qwen3-VL deployment and inference. You need to install vllm>=0.11.0 to enable Qwen3-VL support. For more details, check the vLLM official documentation for multimodal models.

Installation

Install the required dependencies:

pip install accelerate
pip install qwen-vl-utils==0.0.14
# Install the latest version of vLLM 'vllm>=0.11.0'
uv pip install -U vllm

Online Serving

You can start a vLLM server to serve models efficiently with an OpenAI-style API.

The following launch command is applicable to H100/H200. For more efficient deployment or deployment on other GPUs, please refer to the vLLM community guide.

Starting the vLLM Server

# Efficient inference with FP8 checkpoint
# Requires NVIDIA H100+ and CUDA 12+
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-expert-parallel \
  --async-scheduling \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --host 0.0.0.0 \
  --port 22002

Image Request Example

Once the server is running, you can make requests using the OpenAI client:

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:22002/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct-FP8",
    messages=messages,
    max_tokens=2048
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")

Video Request Example

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:22002/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
                }
            },
            {
                "type": "text",
                "text": "How long is this video?"
            }
        ]
    }
]

start = time.time()

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct-FP8",
    messages=messages,
    max_tokens=2048,
    extra_body={"mm_processor_kwargs": {"fps": 2, "do_sample_frames": True}}
)

print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")

Offline Inference

You can also use vLLM for local offline inference without running a server:

# -*- coding: utf-8 -*-
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

import os
os.environ['VLLM_WORKER_MULTIPROC_METHOD'] = 'spawn'

def prepare_inputs_for_vllm(messages, processor):
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # qwen_vl_utils 0.0.14+ required
    image_inputs, video_inputs, video_kwargs = process_vision_info(
        messages,
        image_patch_size=processor.image_processor.patch_size,
        return_video_kwargs=True,
        return_video_metadata=True
    )
    print(f"video_kwargs: {video_kwargs}")

    mm_data = {}
    if image_inputs is not None:
        mm_data['image'] = image_inputs
    if video_inputs is not None:
        mm_data['video'] = video_inputs

    return {
        'prompt': text,
        'multi_modal_data': mm_data,
        'mm_processor_kwargs': video_kwargs
    }


if __name__ == '__main__':
    messages = [
        {
            "role": "user",
            "content": [
              {
                  "type": "image",
                  "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
              },
              {"type": "text", "text": "Read all the text in the image."},
            ],
        }
    ]

    # TODO: change to your own checkpoint path
    checkpoint_path = "Qwen/Qwen3-VL-235B-A22B-Instruct-FP8"
    processor = AutoProcessor.from_pretrained(checkpoint_path)
    inputs = [prepare_inputs_for_vllm(message, processor) for message in [messages]]

    llm = LLM(
        model=checkpoint_path,
        mm_encoder_tp_mode="data",
        enable_expert_parallel=True,
        tensor_parallel_size=torch.cuda.device_count(),
        seed=0
    )

    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=1024,
        top_k=-1,
        stop_token_ids=[],
    )

    for i, input_ in enumerate(inputs):
        print()
        print('=' * 40)
        print(f"Inputs[{i}]: {input_['prompt']=!r}")
    print('\n' + '>' * 40)

    outputs = llm.generate(inputs, sampling_params=sampling_params)
    for i, output in enumerate(outputs):
        generated_text = output.outputs[0].text
        print()
        print('=' * 40)
        print(f"Generated text: {generated_text!r}")

Extended Context Length with YaRN

The current config.json supports context length up to 256K tokens. To handle extensive inputs exceeding 256K tokens, you can enable YaRN for enhanced length extrapolation.

Configuration Method

Modify max_position_embeddings and rope_scaling in config.json:

{
    "max_position_embeddings": 1000000,
    "rope_scaling": {
        "rope_type": "yarn",
        "mrope_section": [24, 20, 20],
        "mrope_interleaved": true,
        "factor": 3.0,
        "original_max_position_embeddings": 262144
    }
}

vLLM Server Arguments

When using vLLM for serving, enable YaRN with additional arguments:

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}' \
  --max-model-len 1000000

Because Interleaved-MRoPE’s position IDs grow more slowly than vanilla RoPE, use a smaller scaling factor. For example, to support 1M context with 256K context length, set factor=2 or 3 — not 4.

Next Steps

Explore SGLang deployment as an alternative
Try the Docker deployment for quick setup
Learn about the DashScope API service

Get Started

Core Concepts

Inference

Deployment

Fine-tuning

Capabilities

Overview

Installation

Online Serving

Starting the vLLM Server

Image Request Example

Video Request Example

Offline Inference

Extended Context Length with YaRN

Configuration Method

vLLM Server Arguments

Next Steps

Build docs developers (and LLMs) love

Get Started

Core Concepts

Inference

Deployment

Fine-tuning

Capabilities

Documentation Index

​Overview

​Installation

​Online Serving

​Starting the vLLM Server

​Image Request Example

​Video Request Example

​Offline Inference

​Extended Context Length with YaRN

​Configuration Method

​vLLM Server Arguments

​Next Steps

Build docs developers (and LLMs) love

Overview

Installation

Online Serving

Starting the vLLM Server

Image Request Example

Video Request Example

Offline Inference

Extended Context Length with YaRN

Configuration Method

vLLM Server Arguments

Next Steps