GPU and Memory Issues

CUDA Out of Memory

Problem: Getting CUDA out of memory errors when loading or running models.
Solutions:
  1. Use Quantized Models
    • FP8 quantization for H100/H200 GPUs (requires CUDA 12+)
    • Check the HuggingFace collection for quantized versions
  2. Adjust Precision
    # Use bfloat16 instead of float32
    model = AutoModelForImageTextToText.from_pretrained(
        "Qwen/Qwen3-VL-8B-Instruct",
        dtype=torch.bfloat16,
        device_map="auto"
    )
    
  3. Enable Flash Attention 2
    # Install flash-attn first
    # pip install -U flash-attn --no-build-isolation
    
    model = AutoModelForImageTextToText.from_pretrained(
        "Qwen/Qwen3-VL-8B-Instruct",
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto"
    )
    
  4. Reduce Visual Token Budget
    # For images - reduce max_pixels
    processor.image_processor.size = {
        "longest_edge": 512*32*32,  # Reduced from default
        "shortest_edge": 256*32*32
    }
    
    # For videos - reduce frame count or resolution
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        fps=1  # Reduce from default fps=2
    )
    
  5. Use a Smaller Model
    • Qwen3-VL-2B or 4B for edge/consumer GPUs
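    • Qwen3-VL-30B-A3B (MoE) activates only 3B parameters per token, which speeds up inference, but all 30B parameters must still fit in memory (see the table below)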
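
The pixel budgets in step 4 are written as a token count multiplied by 32*32, which suggests that one visual token covers roughly a 32x32 pixel area. Under that assumption (inferred from the expressions above, not from a documented API), a small helper can convert a target token budget into the values passed to the processor. The function names here are hypothetical.

```python
# Assumption: one visual token corresponds to a 32x32 pixel area,
# as implied by expressions such as 512*32*32 above.
PIXELS_PER_TOKEN = 32 * 32


def pixel_budget(max_tokens: int) -> int:
    """Return the pixel budget that corresponds to max_tokens visual tokens."""
    return max_tokens * PIXELS_PER_TOKEN


def approx_visual_tokens(width: int, height: int) -> int:
    """Roughly estimate the visual tokens for an image of the given size."""
    # Round each side up to a whole number of 32-pixel cells.
    cells_w = -(-width // 32)
    cells_h = -(-height // 32)
    return cells_w * cells_h


size = {"longest_edge": pixel_budget(512), "shortest_edge": pixel_budget(256)}
print(size)
print(approx_visual_tokens(1920, 1080))
```

The resulting dictionary has the same shape as the one assigned to processor.image_processor.size above. The estimate ignores any resizing the processor applies, so treat it as an upper-level sanity check rather than an exact count.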

Minimum VRAM Requirements

Actual memory usage is typically 1.2-1.5x the theoretical minimum due to activations and intermediate tensors.
Estimated VRAM by model size (BF16 and INT8 precision):

| Model | BF16 VRAM | INT8 VRAM | Notes |
| --- | --- | --- | --- |
| Qwen3-VL-2B | ~4-5 GB | ~2-3 GB | Suitable for consumer GPUs |
| Qwen3-VL-4B | ~8-10 GB | ~4-5 GB | RTX 3090/4090 |
| Qwen3-VL-8B | ~16-20 GB | ~8-10 GB | A100 40GB, H100 |
| Qwen3-VL-32B | ~64-80 GB | ~32-40 GB | Multi-GPU required |
| Qwen3-VL-30B-A3B | ~60-75 GB | ~30-38 GB | MoE model |
| Qwen3-VL-235B-A22B | ~450-550 GB | ~225-275 GB | 8x H100 recommended |
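
The figures above follow from simple arithmetic: weight memory is the parameter count times the bytes per parameter (2 for BF16, 1 for INT8), and the 1.2-1.5x factor accounts for activations and intermediate tensors. The sketch below is a rough estimator under those assumptions; the function name is hypothetical, and real usage also depends on context length, batch size, and the visual token budget.

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: tuple = (1.2, 1.5)) -> tuple:
    """Return a (low, high) VRAM estimate in GB.

    Weights take params * bytes_per_param; the overhead range is the
    1.2-1.5x factor quoted above for activations and intermediate tensors.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return (weights_gb * overhead[0], weights_gb * overhead[1])


# BF16 uses 2 bytes per parameter, INT8 uses 1.
low, high = estimate_vram_gb(8, 2)
print(f"8B in BF16: {low:.1f}-{high:.1f} GB")
```

The lower bound in the table matches the weight-only figure (for example 16 GB for 8B parameters in BF16), so this estimate is somewhat more conservative than the table.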

Multi-GPU Setup

Problem: Model doesn't fit on a single GPU.
Solution: Shard the model across GPUs. With Transformers, device_map="auto" places different layers on different GPUs; vLLM supports tensor parallelism.
# Automatic device mapping
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-32B-Instruct",
    dtype=torch.bfloat16,
    device_map="auto"  # Automatically splits across GPUs
)
For vLLM:
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-expert-parallel
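
When device_map="auto" fills one GPU too aggressively, Transformers (through Accelerate) also accepts a max_memory mapping that caps how much each device may hold. The helper below is a minimal sketch that builds such a mapping; the function name and the 20 GiB value are illustrative assumptions, not recommendations from this guide.

```python
def build_max_memory(num_gpus: int, per_gpu_gib: int, cpu_gib: int = 0) -> dict:
    """Build a max_memory mapping for from_pretrained(device_map="auto")."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    limits = {gpu: f"{per_gpu_gib}GiB" for gpu in range(num_gpus)}
    if cpu_gib > 0:
        limits["cpu"] = f"{cpu_gib}GiB"  # allow offloading to CPU RAM
    return limits


max_memory = build_max_memory(num_gpus=4, per_gpu_gib=20)
print(max_memory)

# Hypothetical usage (not executed here, since it downloads the model):
# model = AutoModelForImageTextToText.from_pretrained(
#     "Qwen/Qwen3-VL-32B-Instruct",
#     dtype=torch.bfloat16,
#     device_map="auto",
#     max_memory=max_memory,
# )
```

Leaving a few GiB of headroom per GPU gives activations and the KV cache room to grow during generation.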

Installation Issues

Transformers Version

Problem: Model loading fails or behaves unexpectedly.
Solution: Ensure you have the correct transformers version:
# Qwen3-VL requires transformers >= 4.57.0
pip install "transformers>=4.57.0"
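
To confirm which version is actually installed in the active environment, the check can be scripted. This is a minimal standard-library sketch; the helper names are hypothetical, and the comparison deliberately ignores pre-release suffixes such as rc1 or dev0, so use a full version parser if that distinction matters.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '4.57.0.dev0' into (4, 57, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '4.57' compares equal to '4.57.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


try:
    installed = metadata.version("transformers")
    print(installed, "OK" if meets_minimum(installed, "4.57.0") else "too old")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```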

Flash Attention Installation

Problem: Flash Attention compilation fails.
Solutions:
  1. Check CUDA compatibility
    • Flash Attention 2 requires CUDA 11.6+
    • Check GPU compatibility (Ampere, Ada, Hopper architectures)
  2. Install pre-built wheels
    pip install flash-attn --no-build-isolation
    
  3. Build from source (if wheels fail)
    git clone https://github.com/Dao-AILab/flash-attention
    cd flash-attention
    python setup.py install
    

Video Processing Dependencies

Problem: Video loading fails or hangs.
Solutions:
  1. Install with video support
    # Recommended: Use torchcodec (fastest, most compatible)
    # See https://github.com/pytorch/torchcodec for installation
    
    # Or use decord (Linux only from PyPI)
    pip install "qwen-vl-utils[decord]"  # quotes keep shells such as zsh from expanding the brackets
    
    # Fallback: torchvision (slowest but most compatible)
    pip install qwen-vl-utils
    
  2. Video URL compatibility
    Backend | HTTP | HTTPS
    torchvision >= 0.19.0
    torchvision < 0.19.0
    decord
    torchcodec
  3. Force specific backend
    export FORCE_QWENVL_VIDEO_READER=torchcodec  # or decord, torchvision
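
The same variable can be set from Python, which is convenient in notebooks where exporting shell variables is awkward. The sketch below assumes the variable is read when qwen_vl_utils is imported or first used, so it should run before that import; the helper name is hypothetical, and the backend names are the ones listed in the export line above.

```python
import os

# Backend names taken from the export line above.
KNOWN_BACKENDS = {"torchcodec", "decord", "torchvision"}


def force_video_reader(backend: str) -> None:
    """Set FORCE_QWENVL_VIDEO_READER for the current process."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(KNOWN_BACKENDS)}"
        )
    os.environ["FORCE_QWENVL_VIDEO_READER"] = backend


force_video_reader("torchvision")
# Import qwen_vl_utils only after this point so it sees the variable.
```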
    

Context Length Issues

Input Too Long

Problem: Sequence length exceeds the model's context window.
Default context length: 256K tokens
Solutions:
  1. Reduce Visual Tokens
    # Reduce image resolution
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "path/to/image.jpg",
                    "min_pixels": 50176,    # Lower resolution
                    "max_pixels": 50176,
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    
  2. Enable YaRN for Extended Context (up to 1M tokens)
    Modify config.json:
    {
        "max_position_embeddings": 1000000,
        "rope_scaling": {
            "rope_type": "yarn",
            "mrope_section": [24, 20, 20],
            "mrope_interleaved": true,
            "factor": 3.0,
            "original_max_position_embeddings": 262144
        }
    }
    
    For vLLM:
    vllm serve Qwen/Qwen3-VL-8B-Instruct \
      --rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings":262144,"mrope_section":[24,20,20],"mrope_interleaved":true}' \
      --max-model-len 1000000
    
Because Interleaved-MRoPE’s position IDs grow more slowly than vanilla RoPE, use a smaller scaling factor. For 1M context with 256K base, use factor=2 or 3, not 4.
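
The config change above can also be generated programmatically, which makes the chosen factor explicit. The sketch below builds the same rope_scaling block and shows the naive length ratio for comparison; the helper names are hypothetical, and where exactly these keys live inside a given checkpoint's config.json is not covered here.

```python
import json


def yarn_rope_scaling(factor: float, original_max: int = 262144) -> dict:
    """Build the rope_scaling block shown above for a given factor."""
    return {
        "rope_type": "yarn",
        "mrope_section": [24, 20, 20],
        "mrope_interleaved": True,
        "factor": factor,
        "original_max_position_embeddings": original_max,
    }


def extend_context(config: dict, target_len: int, factor: float) -> dict:
    """Return a copy of config with YaRN enabled; the input is not modified."""
    updated = dict(config)
    updated["max_position_embeddings"] = target_len
    updated["rope_scaling"] = yarn_rope_scaling(factor)
    return updated


# Naive ratio for comparison: 1,000,000 / 262,144 is about 3.8, which would
# round up to 4. The guidance above is to use 2 or 3 instead.
naive_ratio = 1_000_000 / 262_144
config = extend_context({"model_type": "example"}, 1_000_000, factor=3.0)
print(round(naive_ratio, 2))
print(json.dumps(config["rope_scaling"]))
```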

Video Too Long

Problem: Long videos exceed the token budget.
Solutions:
  1. Reduce FPS
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        fps=1  # Lower FPS for longer videos
    )
    
  2. Set Frame Limit
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        num_frames=128,  # Maximum frames
        fps=None  # Override the default fps so that num_frames is used
    )
    
  3. Use total_pixels limit with qwen-vl-utils
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": "path/to/video.mp4",
                    "total_pixels": 20480 * 32 * 32,  # Limit total tokens
                },
                {"type": "text", "text": "Describe this video."},
            ],
        }
    ]
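
Options 1 and 2 are related: for a video of known duration, a frame cap implies a maximum sampling rate. The sketch below picks the highest fps that stays within a frame limit, using the default of fps=2 mentioned earlier on this page; the function name is hypothetical, and it does not account for how the processor rounds frame counts.

```python
def pick_fps(duration_s: float, max_frames: int, default_fps: float = 2.0) -> float:
    """Return the highest fps (capped at default_fps) that stays within max_frames."""
    if duration_s <= 0 or max_frames <= 0:
        raise ValueError("duration_s and max_frames must be positive")
    return min(default_fps, max_frames / duration_s)


# A 10-minute video with a 128-frame cap needs a much lower sampling rate.
print(pick_fps(600, 128))
# A 30-second video fits within the cap at the default rate.
print(pick_fps(30, 128))
```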
    

Model Loading Issues

Download Errors

Problem: Model download fails or is very slow.
Solutions:
  1. For users in mainland China: Use ModelScope
    from modelscope import snapshot_download
    
    model_dir = snapshot_download('qwen/Qwen3-VL-8B-Instruct')
    
  2. Resume interrupted downloads
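    from huggingface_hub import snapshot_download
    
    # Re-running the call resumes partial downloads; recent huggingface_hub
    # releases do this automatically and deprecate resume_download
    snapshot_download("Qwen/Qwen3-VL-8B-Instruct")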
    
  3. Use HF mirror (set environment variable)
    export HF_ENDPOINT=https://hf-mirror.com
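
On unreliable connections it can help to wrap the download call in a retry loop so that transient failures do not require manual restarts. This is a generic sketch of retry with exponential backoff, not part of huggingface_hub or ModelScope; the retry helper and its parameters are hypothetical.

```python
import time


def retry(func, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # broad on purpose for a sketch; narrow in real code
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Hypothetical usage (not executed here, since it downloads the model):
# from huggingface_hub import snapshot_download
# path = retry(lambda: snapshot_download("Qwen/Qwen3-VL-8B-Instruct"))
```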
    

Import Errors

Problem: ImportError or ModuleNotFoundError.
Solutions:
  1. Check all dependencies
    pip install "transformers>=4.57.0" accelerate qwen-vl-utils
    
  2. For vLLM
    pip install "vllm>=0.11.0"
    
  3. For web demo
    pip install -r requirements_web_demo.txt
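
Before reinstalling anything, it is worth checking which modules the current interpreter can actually find, since import errors often come from using a different environment than the one pip installed into. The sketch below uses only the standard library; the import names are assumptions (in particular that qwen-vl-utils imports as qwen_vl_utils).

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above
# (qwen-vl-utils is assumed to import as qwen_vl_utils).
REQUIRED = ["transformers", "accelerate", "qwen_vl_utils"]


def missing_modules(names):
    """Return the subset of names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("missing:", missing if missing else "none")
```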
    

Inference Issues

Slow Inference

Problem: Generation is very slow.
Solutions:
  1. Use vLLM for production
    vllm serve Qwen/Qwen3-VL-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000
    
  2. Enable Flash Attention
    model = AutoModelForImageTextToText.from_pretrained(
        "Qwen/Qwen3-VL-8B-Instruct",
        dtype=torch.bfloat16,  # Flash Attention 2 requires fp16 or bf16
        attn_implementation="flash_attention_2",
        device_map="auto"
    )
    
  3. Use FP8 quantization (H100/H200)
    vllm serve Qwen/Qwen3-VL-8B-Instruct-FP8
    
  4. Batch inference
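    # Process multiple conversations together; left padding keeps each
    # prompt's last token adjacent to the generated tokens
    processor.tokenizer.padding_side = 'left'
    inputs = processor.apply_chat_template(
        [messages1, messages2, messages3],
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True
    )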
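
The batch snippet sets padding_side to 'left' because decoder-only models continue generating from the last position of each sequence; padding on the right would place pad tokens between the prompt and the generated text. The pure-Python sketch below illustrates what left padding does to a batch of token IDs. It is an illustration only, not the tokenizer's actual implementation, and the pad ID of 0 is arbitrary.

```python
def left_pad(sequences, pad_id=0):
    """Left-pad token id lists to equal length and return (ids, attention_mask)."""
    width = max(len(seq) for seq in sequences)
    ids, mask = [], []
    for seq in sequences:
        gap = width - len(seq)
        ids.append([pad_id] * gap + list(seq))
        mask.append([0] * gap + [1] * len(seq))
    return ids, mask


ids, mask = left_pad([[11, 12, 13], [21]])
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

Every row now ends with a real token, and the attention mask marks the padded positions so the model can ignore them.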
    

Unexpected Outputs

Problem: Model generates incorrect or unexpected results.
Solutions:
  1. Check input format
    • Verify image/video paths are correct
    • Ensure proper message structure
  2. Adjust generation parameters
    # For more deterministic outputs
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,  # Lower temperature
        top_p=0.9,
        do_sample=True
    )
    
  3. Use appropriate model edition
    • Instruct: General-purpose tasks
    • Thinking: Complex reasoning, STEM, math
  4. Verify processor settings
    # Reset to defaults if customized
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen3-VL-8B-Instruct"
    )
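
Malformed message structure is easy to miss by eye, so a quick structural check before calling the processor can save time. The sketch below validates only the shape used in the examples on this page (a role, plus content parts of type text, image, or video, each carrying a key named after its type). It is a hypothetical helper, not part of transformers or qwen-vl-utils, and it does not check that files exist.

```python
ALLOWED_TYPES = {"text", "image", "video"}  # types used in the examples on this page


def check_messages(messages):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            problems.append(f"message {i}: expected a dict with 'role' and 'content'")
            continue
        content = msg["content"]
        if isinstance(content, str):
            continue  # plain-string content is assumed to be acceptable
        if not isinstance(content, list):
            problems.append(f"message {i}: 'content' should be a list of parts")
            continue
        for j, part in enumerate(content):
            kind = part.get("type") if isinstance(part, dict) else None
            if kind not in ALLOWED_TYPES:
                problems.append(f"message {i}, part {j}: unknown type {kind!r}")
            elif kind not in part:
                problems.append(f"message {i}, part {j}: missing '{kind}' key")
    return problems


good = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/image.jpg"},
    {"type": "text", "text": "Describe this image."},
]}]
bad = [{"role": "user", "content": [{"type": "image", "path": "x.jpg"}]}]
print(check_messages(good))  # []
print(check_messages(bad))
```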
    

Docker Issues

Container Won’t Start

Problem: Docker container fails to start.
Solutions:
  1. Check GPU availability
    docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
    
  2. Install NVIDIA Container Toolkit
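    # Debian/Ubuntu: add NVIDIA's current repository (replaces the deprecated nvidia-docker repo and apt-key)
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
      sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker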
    
  3. Use official image
    docker run --gpus all --ipc=host --network=host --rm \
      -it qwenllm/qwenvl:qwen3vl-cu128 bash
    

API Issues

API Authentication Errors

Problem: API calls fail with authentication errors.
Solution: Set the API key correctly:
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
See the API documentation for more details.
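
Authentication errors are often caused by an empty or missing key rather than a wrong one, and hardcoding keys in source files makes that harder to spot. The sketch below reads the key from an environment variable and fails early with a clear message. The variable name DASHSCOPE_API_KEY and the helper name are assumptions; use whatever name your deployment defines.

```python
import os


def load_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set or is empty")
    return key


# Hypothetical usage with the client shown above:
# client = OpenAI(
#     api_key=load_api_key(),
#     base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
# )
```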

Getting Help

If you’re still experiencing issues:
  1. Check the GitHub Issues: Qwen3-VL Issues
  2. Join the Community:
  3. Consult the Documentation:
