Documentation Index
Fetch the complete documentation index at: https://mintlify.com/QwenLM/Qwen3-VL/llms.txt
Use this file to discover all available pages before exploring further.
Mobile Agent
Qwen3-VL powers intelligent mobile agent capabilities that enable the model to understand, interact with, and control mobile phone interfaces. The model can locate UI elements, understand their functions, invoke tools, and complete complex tasks on mobile devices.Capabilities
Visual Agent for Mobile
Qwen3-VL’s mobile agent can:- Element Recognition: Identify buttons, menus, icons, and UI components
- Function Understanding: Comprehend what each UI element does
- Tool Invocation: Execute actions through detected UI elements
- Task Completion: Perform multi-step tasks automatically
- Screen Understanding: Analyze entire mobile interfaces holistically
GUI Interaction
- Tap Actions: Identify and execute tap interactions
- Swipe Gestures: Understand and perform swipe operations
- Text Input: Fill forms and enter text in appropriate fields
- Navigation: Move through app screens and menus
- State Understanding: Track app state across interactions
How It Works
Visual Understanding
The mobile agent:- Analyzes Screenshots: Processes mobile screen images
- Identifies Elements: Detects interactive UI components
- Understands Context: Comprehends app state and functionality
- Plans Actions: Determines sequence of steps to accomplish goals
- Executes Tasks: Performs actions to complete objectives
Grounding & Localization
- Uses 2D grounding to locate UI elements precisely
- Identifies clickable areas and interaction zones
- Understands spatial layout of mobile interfaces
- Tracks element positions across screen changes
Use Cases
Automated Testing
- UI Testing: Automate mobile app testing procedures
- Regression Testing: Verify app functionality across updates
- User Flow Testing: Test complete user journeys
- Cross-platform Testing: Test apps on different devices
Accessibility
- Screen Reader Enhancement: Improve mobile accessibility
- Voice Control: Enable voice-based mobile control
- Assistive Technology: Help users with disabilities interact with mobile devices
Task Automation
- Workflow Automation: Automate repetitive mobile tasks
- Data Entry: Fill forms and input information automatically
- App Integration: Connect multiple apps for automated workflows
- Scheduled Tasks: Perform tasks at specific times
User Assistance
- Tutorial Creation: Generate step-by-step guides from demonstrations
- Help Systems: Build interactive help for mobile apps
- Onboarding: Create automated onboarding experiences
Try It Out
Explore mobile agent capabilities with our interactive cookbook:Mobile Agent Cookbook
Locate and think for mobile phone control.
Key Features
Visual Agent Capabilities
Qwen3-VL’s mobile agent operates through:- GUI Recognition: Identifies mobile UI elements and functions
- Tool Invocation: Executes interactions with UI components
- Task Understanding: Comprehends complex multi-step objectives
- Contextual Awareness: Maintains understanding across screens
Intelligent Interaction
- Natural Language Control: Describe tasks in plain language
- Error Handling: Adapt when expected UI elements are missing
- Multi-app Workflows: Navigate across multiple applications
- State Management: Track and remember app states
Technical Approach
Element Detection
- Uses advanced visual perception to identify UI elements
- Applies 2D grounding for precise localization
- Recognizes common mobile UI patterns
- Handles various screen sizes and resolutions
Action Planning
- Breaks down complex tasks into steps
- Determines optimal action sequences
- Adapts to changing UI states
- Validates action completion
Example Tasks
- “Open the settings app and enable dark mode”
- “Find and book a restaurant for tonight”
- “Check my calendar and send a meeting invite”
- “Download and install a specific app”
- “Navigate to a webpage and fill out a form”
Safety & Privacy
When using mobile agent capabilities:- Always obtain proper authorization
- Respect user privacy and data
- Follow app terms of service
- Implement appropriate security measures
- Use for legitimate purposes only
Related Capabilities
- Computer Use - Desktop and web control
- 2D Grounding - Locate UI elements
- Omni Recognition - Identify mobile UI components
- Spatial Understanding - Understand mobile interface layouts