Streaming Responses

Streaming lets you receive model output incrementally as it is generated, which is useful for giving users real-time feedback. Qwen3-VL supports streaming when served through vLLM or SGLang, both of which expose an OpenAI-compatible API.

vLLM Server Setup

Launch a vLLM server with streaming support:

# Efficient inference with FP8 checkpoint
# Requires NVIDIA H100+ and CUDA 12+
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-expert-parallel \
  --async-scheduling \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --host 0.0.0.0 \
  --port 22002

SGLang Server Setup

Alternatively, launch an SGLang server:

python -m sglang.launch_server \
   --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
   --host 0.0.0.0 \
   --port 22002 \
   --tp 4

Streaming Client Example

Once your server is running, you can use the OpenAI client to stream responses:

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:22002/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

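start = time.time()
response = client.chat.completions.create(
    # Must match the model the server is serving; with the SGLang command
    # above, use "Qwen/Qwen3-VL-235B-A22B-Instruct" instead.
    model="Qwen/Qwen3-VL-235B-A22B-Instruct-FP8",
    messages=messages,
    max_tokens=2048,
    stream=True  # Enable streaming
)

# Print each text delta as it arrives
for chunk in response:
    # Skip chunks that carry no choices or no text
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

print(f"\n\nResponse took: {time.time() - start:.2f}s")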
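
The client example above prints deltas as they arrive and reports only the total time. For streaming, the delay before the first piece of text is often the more relevant number. The following is a minimal sketch of a helper that collects the full text and records both timings. The name collect_stream and the _fake_chunk stand-in are hypothetical and not part of the OpenAI client; the helper only assumes chunks shaped like the ones in the example above (chunk.choices[0].delta.content).

```python
import time
from types import SimpleNamespace
from typing import Iterable, Optional, Tuple


def collect_stream(
    chunks: Iterable, start: Optional[float] = None
) -> Tuple[str, Optional[float], float]:
    """Consume a stream; return (full_text, seconds_to_first_text, total_seconds).

    Pass `start` as a time.perf_counter() reading taken just before the request
    was sent, so both timings include the time the server spends before
    producing output. If omitted, timing starts when consumption starts.
    """
    if start is None:
        start = time.perf_counter()
    first_text_at = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            if first_text_at is None:
                first_text_at = time.perf_counter() - start
            parts.append(piece)
    return "".join(parts), first_text_at, time.perf_counter() - start


def _fake_chunk(content):
    """Hypothetical stand-in shaped like a streamed chunk, for an offline demo."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline demo: no server needed.
demo = [_fake_chunk("Hello"), _fake_chunk(None), SimpleNamespace(choices=[]), _fake_chunk(", world")]
text, ttft, total = collect_stream(demo)
print(text)
```

Against a live server, take start = time.perf_counter() just before calling client.chat.completions.create(..., stream=True) and pass the returned stream together with start to collect_stream.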

Video Streaming Example

You can also stream responses for video inputs:

import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:22002/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4"
                }
            },
            {
                "type": "text",
                "text": "How long is this video?"
            }
        ]
    }
]

start = time.time()

response = client.chat.completions.create(
    # Must match the model the server is serving
    model="Qwen/Qwen3-VL-235B-A22B-Instruct-FP8",
    messages=messages,
    max_tokens=2048,
    stream=True,
    # Configure video frame sampling (vLLM only)
    extra_body={"mm_processor_kwargs": {"fps": 2, "do_sample_frames": True}}
)

# Print each text delta as it arrives
for chunk in response:
    # Skip chunks that carry no choices or no text
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

print(f"\n\nResponse took: {time.time() - start:.2f}s")

Server Configuration Options

vLLM Options

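--tensor-parallel-size: Number of GPUs to shard the model across with tensor parallelism
--mm-encoder-tp-mode: How the multimodal encoder is parallelized; data runs it data-parallel instead of sharding its weights
--enable-expert-parallel: Enable expert parallelism for MoE models
--async-scheduling: Enable async scheduling for better throughput
--media-io-kwargs: Options for loading media inputs; the launch command sets num_frames to -1 for video, which loads all frames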

SGLang Options

--tp: Tensor parallel size
--model-path: Path to model checkpoint
--host and --port: Server address configuration
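
The two launch commands above serve different checkpoints (FP8 for vLLM, non-FP8 for SGLang), and the model field in a request should name a model the server is actually serving; vLLM typically rejects unknown names. The following is a minimal sketch that asks the server which models it serves through the OpenAI-compatible models endpoint and picks one. The helper name pick_model_id and the fake client are hypothetical; the only client call assumed is client.models.list().

```python
from types import SimpleNamespace


def pick_model_id(client, preferred=None):
    """Return `preferred` if the server serves it, else the first served model id."""
    served = [model.id for model in client.models.list()]
    if not served:
        raise RuntimeError("server reports no models")
    if preferred in served:
        return preferred
    return served[0]


# Offline demo with a hypothetical fake client standing in for an SGLang server
# launched with the non-FP8 checkpoint.
fake_client = SimpleNamespace(
    models=SimpleNamespace(
        list=lambda: [SimpleNamespace(id="Qwen/Qwen3-VL-235B-A22B-Instruct")]
    )
)
chosen = pick_model_id(fake_client, preferred="Qwen/Qwen3-VL-235B-A22B-Instruct-FP8")
print(chosen)
```

With a real server, pass the OpenAI client from the examples above and use the returned value as the model argument.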

Installation Requirements

pip install accelerate
pip install qwen-vl-utils==0.0.14
# Install vLLM (requires version >= 0.11.0)
uv pip install -U vllm
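
Since the commands above require vLLM 0.11.0 or newer, it can help to verify the installed version before launching a server. The following is a rough sketch using only the standard library. The helper names are hypothetical, and the comparison looks only at the leading numeric components, so a pre-release such as 0.11.0rc1 is treated the same as 0.11.0.

```python
import re
from importlib import metadata


def version_tuple(version: str):
    """Leading numeric release components, e.g. "0.11.0rc1" -> (0, 11, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric components, padding the shorter tuple with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_vllm_ok(minimum: str = "0.11.0") -> bool:
    """True if vLLM is installed and at least `minimum`; False if it is missing."""
    try:
        return meets_minimum(metadata.version("vllm"), minimum)
    except metadata.PackageNotFoundError:
        return False


print(meets_minimum("0.10.2", "0.11.0"))
print(installed_vllm_ok())
```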

Benefits of Streaming

Real-Time Feedback: Users see responses as they’re generated
Better UX: Reduces perceived latency
Early Termination: Can stop generation early if needed
Progress Indication: Shows the model is actively processing
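
Early termination, listed above, can be sketched as a loop that stops reading once enough output has arrived. This is an illustration under assumptions rather than a prescribed pattern: stream_until, stop_marker, and max_chars are hypothetical names, and whether the server stops generating once the client closes the stream depends on the server.

```python
from types import SimpleNamespace
from typing import Iterable


def stream_until(chunks: Iterable, stop_marker: str, max_chars: int = 4000) -> str:
    """Accumulate streamed text; stop once stop_marker appears or max_chars is reached."""
    text = ""
    try:
        for chunk in chunks:
            if not chunk.choices:
                continue
            piece = chunk.choices[0].delta.content
            if not piece:
                continue
            text += piece
            if stop_marker in text or len(text) >= max_chars:
                break
    finally:
        # Close the stream if it supports it (the OpenAI client's stream object
        # does); plain lists, as in the demo below, have no close().
        close = getattr(chunks, "close", None)
        if callable(close):
            close()
    return text


def _fake_chunk(content):
    """Hypothetical stand-in shaped like a streamed chunk, for an offline demo."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline demo: stop after the first complete line.
demo_chunks = [_fake_chunk("Total: "), _fake_chunk("42.00\n"), _fake_chunk("Thank you")]
result = stream_until(demo_chunks, stop_marker="\n")
print(repr(result))
```

The returned text includes everything received up to and including the chunk that triggered the stop, so it may extend slightly past the marker.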

Additional Resources

For more details on deployment and serving options, refer to the vLLM documentation and the vLLM community guide for Qwen3-VL.
