SGLang Deployment

Overview

SGLang is an alternative high-performance serving option for Qwen3-VL models. Start an SGLang server to serve a model behind an OpenAI-compatible API.

Starting the SGLang Server

Launch the SGLang server with the following command:

python -m sglang.launch_server \
   --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
   --host 0.0.0.0 \
   --port 22002 \
   --tp 4

Server Parameters

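--model-path: Path to the model (local path or Hugging Face model ID)
--host: Address the server binds to (0.0.0.0 in the command above, which listens on all network interfaces; SGLang's own default is 127.0.0.1)
--port: Port the server listens on (22002 in the command above; SGLang's own default is 30000)
--tp: Tensor parallelism size, i.e. the number of GPUs the model is sharded across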
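
As a rough illustration of how these flags fit together, the sketch below assembles the same launch command from Python values. `build_launch_command` is a hypothetical helper written for this page, not part of SGLang, and it covers only the four flags listed above.

```python
import shlex
import sys


def build_launch_command(model_path, host="0.0.0.0", port=22002, tp=4):
    """Return the argv list for the launch command shown above.

    Hypothetical helper for illustration; the defaults mirror the example
    command on this page, not SGLang's built-in defaults.
    """
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", host,
        "--port", str(port),
        "--tp", str(tp),
    ]


cmd = build_launch_command("Qwen/Qwen3-VL-235B-A22B-Instruct")
# Print a copy-pasteable shell line; nothing is launched here.
print(shlex.join(cmd))
```

To actually start the server from Python, the resulting list could be passed to `subprocess.Popen(cmd)`.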

Making Requests

Once the server is running, you can make requests using the OpenAI-compatible API.
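
Loading a large model can take several minutes, so a client script may want to wait until the server answers before sending its first request. The following is a minimal sketch of such a readiness check using only the standard library. It assumes the server exposes the standard `/v1/models` route of the OpenAI-style API; `http_ok` and `wait_until_ready` are hypothetical helper names, not SGLang functions.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(url, max_wait=600.0, interval=5.0,
                     probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `max_wait` seconds have elapsed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With the server from the launch command above running on the same machine, `wait_until_ready("http://127.0.0.1:22002/v1/models")` returns `True` once the route responds and `False` if it does not respond within `max_wait` seconds.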

Image Request Example

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:22002/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    messages=messages,
    max_tokens=2048
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
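
The example above references an image by remote URL. For an image on local disk, one common approach with OpenAI-style APIs is to embed the file as a base64 data URL. The sketch below assumes the server accepts data URLs in `image_url.url`, which this page does not state explicitly; `image_to_data_url` and `local_image_part` are hypothetical helpers.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def local_image_part(path):
    """Build an image_url content part that can stand in for the remote URL."""
    return {"type": "image_url", "image_url": {"url": image_to_data_url(path)}}
```

The returned dictionary can replace the first element of the `content` list in the request above; the rest of the request stays the same.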

Video Request Example

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:22002/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
                }
            },
            {
                "type": "text",
                "text": "How long is this video?"
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    messages=messages,
    max_tokens=2048
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
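
The image and video requests differ only in the type of the media part (`image_url` versus `video_url`). A small helper can make that explicit; `build_media_message` below is a hypothetical convenience function, shown only to highlight the shared structure of the two examples.

```python
def build_media_message(kind, url, prompt):
    """Build a single-turn user message with one media item and a text prompt.

    `kind` is "image" or "video"; the part type and its key become
    "image_url" or "video_url", matching the two examples above.
    """
    if kind not in ("image", "video"):
        raise ValueError(f"unsupported media kind: {kind!r}")
    key = f"{kind}_url"
    return [
        {
            "role": "user",
            "content": [
                {"type": key, key: {"url": url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


video_messages = build_media_message(
    "video",
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
    "How long is this video?",
)
```

The result has the same shape as the `messages` list in the video example and can be passed to `client.chat.completions.create` in its place.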

Offline Inference

You can also use SGLang for local offline inference without running a server:

import time
from sglang import Engine
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor


if __name__ == "__main__":
    # TODO: change to your own checkpoint path
    checkpoint_path = "Qwen/Qwen3-VL-235B-A22B-Instruct"
    processor = AutoProcessor.from_pretrained(checkpoint_path)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png",
                },
                {"type": "text", "text": "Read all the text in the image."},
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    image_inputs, _ = process_vision_info(messages, image_patch_size=processor.image_processor.patch_size)

    llm = Engine(
        model_path=checkpoint_path,
        enable_multimodal=True,
        mem_fraction_static=0.8,
        tp_size=4,
        attention_backend="fa3",
        context_length=10240,
        disable_cuda_graph=True,
    )

    start = time.time()
    sampling_params = {"max_new_tokens": 1024}
    response = llm.generate(prompt=text, image_data=image_inputs, sampling_params=sampling_params)
    print(f"Response costs: {time.time() - start:.2f}s")
    print(f"Generated text: {response['text']}")

Engine Configuration

The SGLang Engine supports various configuration options:

model_path: Path to the model checkpoint
enable_multimodal: Enable multimodal support (required for Qwen3-VL)
mem_fraction_static: Fraction of GPU memory reserved for static allocation, i.e. model weights and the KV cache pool (0.0-1.0)
tp_size: Tensor parallelism size for multi-GPU deployment
attention_backend: Attention implementation (e.g., "fa3" for FlashAttention 3)
context_length: Maximum context length in tokens
disable_cuda_graph: Disable CUDA graph optimization (may be needed for stability)
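
One way to keep these options in a single place and catch obviously invalid values before a slow model load is to collect them in a small dataclass. The sketch below is an assumption-laden illustration: `EngineSettings` is a hypothetical class, not part of SGLang, and its defaults simply copy the values from the offline inference example.

```python
from dataclasses import asdict, dataclass


@dataclass
class EngineSettings:
    """Hypothetical container for the Engine options listed above."""

    model_path: str
    enable_multimodal: bool = True
    mem_fraction_static: float = 0.8
    tp_size: int = 4
    attention_backend: str = "fa3"
    context_length: int = 10240
    disable_cuda_graph: bool = True

    def __post_init__(self):
        # Basic sanity checks; SGLang performs its own validation as well.
        if not 0.0 < self.mem_fraction_static <= 1.0:
            raise ValueError("mem_fraction_static must be in (0.0, 1.0]")
        if self.tp_size < 1:
            raise ValueError("tp_size must be at least 1")
        if self.context_length < 1:
            raise ValueError("context_length must be at least 1")

    def to_kwargs(self):
        """Return the settings as keyword arguments for Engine(...)."""
        return asdict(self)


settings = EngineSettings(model_path="Qwen/Qwen3-VL-235B-A22B-Instruct")
engine_kwargs = settings.to_kwargs()
```

With SGLang installed, the dictionary can be unpacked into the constructor as `Engine(**engine_kwargs)`, which is equivalent to the explicit call in the offline inference example.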

Next Steps

Compare with vLLM deployment
Try the Docker deployment for quick setup
Learn about the DashScope API service
