Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-VL/llms.txt

Use this file to discover all available pages before exploring further.

Mobile Agent

Qwen3-VL powers intelligent mobile agent capabilities that enable the model to understand, interact with, and control mobile phone interfaces. The model can locate UI elements, understand their functions, invoke tools, and complete complex tasks on mobile devices.

Capabilities

Visual Agent for Mobile

Qwen3-VL’s mobile agent can:
  • Element Recognition: Identify buttons, menus, icons, and UI components
  • Function Understanding: Comprehend what each UI element does
  • Tool Invocation: Execute actions through detected UI elements
  • Task Completion: Perform multi-step tasks automatically
  • Screen Understanding: Analyze entire mobile interfaces holistically

GUI Interaction

  • Tap Actions: Identify and execute tap interactions
  • Swipe Gestures: Understand and perform swipe operations
  • Text Input: Fill forms and enter text in appropriate fields
  • Navigation: Move through app screens and menus
  • State Understanding: Track app state across interactions

How It Works

Visual Understanding

The mobile agent:
  1. Analyzes Screenshots: Processes mobile screen images
  2. Identifies Elements: Detects interactive UI components
  3. Understands Context: Comprehends app state and functionality
  4. Plans Actions: Determines sequence of steps to accomplish goals
  5. Executes Tasks: Performs actions to complete objectives

Grounding & Localization

  • Uses 2D grounding to locate UI elements precisely
  • Identifies clickable areas and interaction zones
  • Understands spatial layout of mobile interfaces
  • Tracks element positions across screen changes

Use Cases

Automated Testing

  • UI Testing: Automate mobile app testing procedures
  • Regression Testing: Verify app functionality across updates
  • User Flow Testing: Test complete user journeys
  • Cross-platform Testing: Test apps on different devices

Accessibility

  • Screen Reader Enhancement: Improve mobile accessibility
  • Voice Control: Enable voice-based mobile control
  • Assistive Technology: Help users with disabilities interact with mobile devices

Task Automation

  • Workflow Automation: Automate repetitive mobile tasks
  • Data Entry: Fill forms and input information automatically
  • App Integration: Connect multiple apps for automated workflows
  • Scheduled Tasks: Perform tasks at specific times

User Assistance

  • Tutorial Creation: Generate step-by-step guides from demonstrations
  • Help Systems: Build interactive help for mobile apps
  • Onboarding: Create automated onboarding experiences

Try It Out

Explore mobile agent capabilities with our interactive cookbook:

Mobile Agent Cookbook

Locate and think for mobile phone control.
Open In Colab

Key Features

Visual Agent Capabilities

Qwen3-VL’s mobile agent operates through:
  • GUI Recognition: Identifies mobile UI elements and functions
  • Tool Invocation: Executes interactions with UI components
  • Task Understanding: Comprehends complex multi-step objectives
  • Contextual Awareness: Maintains understanding across screens

Intelligent Interaction

  • Natural Language Control: Describe tasks in plain language
  • Error Handling: Adapt when expected UI elements are missing
  • Multi-app Workflows: Navigate across multiple applications
  • State Management: Track and remember app states

Technical Approach

Element Detection

  • Uses advanced visual perception to identify UI elements
  • Applies 2D grounding for precise localization
  • Recognizes common mobile UI patterns
  • Handles various screen sizes and resolutions

Action Planning

  • Breaks down complex tasks into steps
  • Determines optimal action sequences
  • Adapts to changing UI states
  • Validates action completion

Example Tasks

  • “Open the settings app and enable dark mode”
  • “Find and book a restaurant for tonight”
  • “Check my calendar and send a meeting invite”
  • “Download and install a specific app”
  • “Navigate to a webpage and fill out a form”

Safety & Privacy

When using mobile agent capabilities:
  • Always obtain proper authorization
  • Respect user privacy and data
  • Follow app terms of service
  • Implement appropriate security measures
  • Use for legitimate purposes only

Build docs developers (and LLMs) love