Overview

Qwen3-VL has been evaluated on a wide range of benchmarks covering both visual understanding and text processing. The sections below present the benchmark results for each model variant.

Visual Tasks Performance

Large Models (235B & 30B MoE)

Instruct Models - Visual Tasks

[Figure: Qwen3-VL Large Instruct Models - Visual Tasks]
Qwen3-VL-235B-A22B-Instruct and Qwen3-VL-30B-A3B-Instruct demonstrate state-of-the-art performance across visual benchmarks.

Thinking Models - Visual Tasks

[Figure: Qwen3-VL Large Thinking Models - Visual Tasks]
Qwen3-VL-235B-A22B-Thinking and Qwen3-VL-30B-A3B-Thinking show enhanced reasoning capabilities on complex visual tasks.

MoE Models Comparison

[Figure: Qwen3-VL-30B-A3B Instruct]
[Figure: Qwen3-VL-30B-A3B Thinking]

Smaller Models (2B-32B Dense)

2B & 32B Models - Visual Tasks

[Figure: Qwen3-VL 2B & 32B Instruct - Visual]
[Figure: Qwen3-VL 2B & 32B Thinking - Visual]
Comparison of the Qwen3-VL-2B and Qwen3-VL-32B models, in both Instruct and Thinking editions, across visual understanding benchmarks.

Text-Centric Tasks Performance

Large Models (235B & 30B MoE)

Instruct Models - Text Tasks

[Figure: Qwen3-VL Large Instruct - Text Tasks]
Qwen3-VL demonstrates text understanding on par with pure LLMs, showing seamless text-vision fusion.

Thinking Models - Text Tasks

[Figure: Qwen3-VL Large Thinking - Text Tasks]
Thinking editions show enhanced performance on reasoning-heavy text tasks.

MoE Model - Text Tasks

[Figure: Qwen3-VL-30B-A3B Text Performance]

Smaller Models (4B & 8B Dense)

4B & 8B Models - Text Tasks

[Figure: Qwen3-VL 4B & 8B Instruct - Text]
[Figure: Qwen3-VL 4B & 8B Thinking - Text]
Performance comparison of the Qwen3-VL-4B and Qwen3-VL-8B models on text-centric benchmarks.

Key Capabilities

Visual Understanding

  • Image Recognition: State-of-the-art performance on standard vision benchmarks
  • OCR: Support for 32 languages with robustness to challenging conditions
  • Document Parsing: Advanced layout understanding and structure extraction
  • Object Grounding: Precise 2D and 3D object localization
  • Video Understanding: Long-form video comprehension with temporal reasoning

Text Processing

  • Pure Text Tasks: Performance comparable to dedicated LLMs
  • Multimodal Reasoning: Seamless integration of visual and textual information
  • STEM/Math: Enhanced reasoning capabilities, especially in Thinking editions
  • Multilingual: Strong performance across multiple languages

Specialized Capabilities

  • Spatial Understanding: Advanced 3D reasoning and spatial relationships
  • Coding: Visual coding from screenshots to HTML/CSS/JavaScript
  • Agent Tasks: GUI interaction and tool use
  • Long Context: Native 256K-token context window, expandable to 1M tokens

Evaluation Settings

To ensure reproducibility, we provide our official evaluation configuration:

Inference & Evaluation

Generation Hyperparameters

Instruct Models

export greedy='false'
export seed=3407
export top_p=0.8
export top_k=20
export temperature=0.7
export repetition_penalty=1.0
export presence_penalty=1.5
export out_seq_length=32768

Thinking Models

export greedy='false'
export seed=1234
export top_p=0.95
export top_k=20
export repetition_penalty=1.0
export presence_penalty=0.0
export temperature=0.6
export out_seq_length=40960
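
The two hyperparameter sets above differ only in seed, top_p, temperature, presence_penalty, and out_seq_length. As a minimal sketch of how an evaluation script might pick the right set from a model name, the helper below groups the same values by edition. The function name load_generation_preset and the name-matching rule (looking for "Instruct" or "Thinking" in the checkpoint name) are assumptions for illustration, not part of the official Qwen3-VL tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Qwen3-VL repository): exports the
# generation hyperparameters listed above for the given model edition.
load_generation_preset() {
  local model_name="$1"

  # Values shared by the Instruct and Thinking presets.
  export greedy='false'
  export top_k=20
  export repetition_penalty=1.0

  # Assumption: the edition can be read from the checkpoint name.
  case "$model_name" in
    *Thinking*)
      export seed=1234
      export top_p=0.95
      export temperature=0.6
      export presence_penalty=0.0
      export out_seq_length=40960
      ;;
    *Instruct*)
      export seed=3407
      export top_p=0.8
      export temperature=0.7
      export presence_penalty=1.5
      export out_seq_length=32768
      ;;
    *)
      # Refuse to guess when the edition is not recognizable.
      echo "unknown model edition: $model_name" >&2
      return 1
      ;;
  esac
}

load_generation_preset "Qwen3-VL-30B-A3B-Thinking"
echo "top_p=$top_p temperature=$temperature out_seq_length=$out_seq_length"
```

Keeping the shared values in one place makes it harder for the two presets to drift apart, and returning a non-zero status for an unrecognized name avoids silently evaluating a model with the wrong sampling settings.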

Notes on Evaluation

  • For certain benchmarks, evaluation prompts were slightly modified for better performance
  • Some benchmarks are internally constructed; reproduction code will be released
  • Detailed methodology will be documented in the technical report

Benchmark Categories

Visual Tasks Evaluated

  • General vision understanding
  • OCR and document analysis
  • Object detection and grounding
  • Video question answering
  • Spatial reasoning
  • Visual coding
  • Agent tasks (GUI interaction)

Text-Centric Tasks Evaluated

  • Natural language understanding
  • Mathematical reasoning
  • Code understanding and generation
  • Logical reasoning
  • Common sense reasoning
  • Multilingual comprehension

Performance Highlights

Qwen3-VL-235B-A22B

  • Best-in-class performance across most benchmarks
  • State-of-the-art visual reasoning
  • Comparable to pure LLMs on text tasks

Qwen3-VL-30B-A3B

  • Excellent performance-to-cost ratio
  • MoE architecture for efficient inference
  • Strong across both visual and text tasks

Qwen3-VL-2B to 32B

  • Scalable performance across model sizes
  • 2B suitable for edge deployment
  • 32B competitive with larger models
  • Thinking editions show consistent improvements
