Page Agent is a purely web-based GUI agent that lives inside your webpage. Unlike server-side automation tools such as browser-use or Playwright, Page Agent runs entirely in the browser as an in-page JavaScript library: no Python runtime, no headless browser, no browser extension required. Web developers drop it into an existing site, and their users can immediately describe what they want in plain English (or Chinese) and watch the page respond.
Core Features
Smart DOM Analysis
Reads and reasons about your page through its DOM structure, with no screenshots and no multimodal models. High-intensity dehydration produces a compact, text-based representation that standard LLMs can process quickly and accurately.
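As a rough illustration of what a compact, text-based representation could look like, the sketch below flattens a simplified element tree into one indexed line per interactive element. The UiNode type, the INTERACTIVE_TAGS set, and the output format are assumptions made for this example; they are not Page Agent's actual dehydration format.

```typescript
// Hypothetical, simplified stand-in for a DOM element (not a Page Agent type).
interface UiNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: UiNode[];
}

// Assumed set of tags worth showing to the LLM in this sketch.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Walk the tree depth-first and emit one indexed line per interactive element.
function dehydrate(root: UiNode): string {
  const lines: string[] = [];
  const visit = (node: UiNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      const attrs = Object.entries(node.attrs ?? {})
        .map(([key, value]) => ` ${key}="${value}"`)
        .join("");
      lines.push(`[${lines.length}]<${node.tag}${attrs}>${node.text ?? ""}</${node.tag}>`);
    }
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return lines.join("\n");
}

// Example tree: a heading (dropped), an input, and a nested button (both kept).
const page: UiNode = {
  tag: "div",
  children: [
    { tag: "h1", text: "Expenses" },
    { tag: "input", attrs: { name: "amount" } },
    { tag: "div", children: [{ tag: "button", text: "Submit" }] },
  ],
};

const snapshot = dehydrate(page);
// snapshot is two lines:
// [0]<input name="amount"></input>
// [1]<button>Submit</button>
```

The indices give the model a short, stable handle for each element, which is one plausible reason a text representation can stay small even on large pages.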
Zero Backend
Import via CDN or npm. Point the agent at any OpenAI-compatible LLM endpoint. There is nothing new to deploy server-side: the agent calls the LLM directly from the browser.
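To make "OpenAI-compatible endpoint" concrete, here is a minimal sketch of the HTTP request such an endpoint expects, built from a baseURL, a model, and an apiKey. The LlmConfig and ChatMessage types, the buildChatRequest helper, and the example values are hypothetical and only illustrate the wire format; they are not Page Agent's internal code.

```typescript
// The three settings an OpenAI-compatible client needs.
interface LlmConfig {
  baseURL: string;
  model: string;
  apiKey: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Build (but do not send) a chat completions request for an OpenAI-compatible endpoint.
function buildChatRequest(config: LlmConfig, messages: ChatMessage[]) {
  // Strip trailing slashes so the path is joined exactly once.
  const url = `${config.baseURL.replace(/\/+$/, "")}/chat/completions`;
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${config.apiKey}`,
      },
      body: JSON.stringify({ model: config.model, messages }),
    },
  };
}

// Placeholder values; a local Ollama server exposes its compatible API under /v1.
const request = buildChatRequest(
  { baseURL: "http://localhost:11434/v1/", model: "example-model", apiKey: "sk-placeholder" },
  [{ role: "user", content: "Fill in the expense form" }],
);
// In a browser this could be sent with: fetch(request.url, request.init)
```

Because the request is a plain HTTPS call, any provider or proxy that accepts this shape can sit behind the same three settings.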
Bring Your Own LLM
Works with any OpenAI-compatible API: Alibaba Qwen, OpenAI, Anthropic (via proxy), Ollama, LM Studio, and more. You supply the baseURL, model, and apiKey; Page Agent does the rest.
Secure & Controllable
Supports operation allowlists and blocklists, data masking via transformPageContent, and custom system instructions. Make the agent follow your product's rules rather than acting on whatever the DOM happens to contain.
Accessible Intelligence
Provides a natural-language interface for complex B2B systems and admin panels, making software approachable for every user, including those relying on voice commands or screen readers.
Optional Multi-Page Extension
For tasks that span multiple browser tabs, the optional Chrome Extension (PageAgentExt) gives the agent browser-level control (open, switch, and close tabs) without changing a line of your core integration.
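Returning to the data masking mentioned under Secure & Controllable: the sketch below shows the kind of function that could redact sensitive values before page text reaches the LLM. It assumes the masking hook receives the page content as a string and returns a string; that signature, the maskSensitive name, and the two patterns are assumptions for illustration, so check the transformPageContent reference before relying on them.

```typescript
// Hypothetical masking function. Whether transformPageContent receives and
// returns a plain string is an assumption of this sketch.
function maskSensitive(content: string): string {
  return content
    // Replace anything shaped like an email address.
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[email]")
    // Replace runs of 13 to 19 digits, optionally separated by spaces or hyphens
    // (the typical length range of payment card numbers).
    .replace(/\b\d(?:[ -]?\d){12,18}\b/g, "[number]");
}

const masked = maskSensitive("Contact: jane@example.com, card 4111 1111 1111 1111");
// masked === "Contact: [email], card [number]"
```

Pattern-based masking is deliberately conservative here; a production rule set would likely be driven by the fields your product already knows to be sensitive.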
Page Agent vs. browser-use
Page Agent builds on concepts pioneered by browser-use, but it solves a different problem. The table below captures the key differences:
| | page-agent | browser-use |
|---|---|---|
| Deployment | Embedded component: ships inside your webpage | External tool: runs alongside a Python script |
| Scope | Current page (designed for SPAs) | Entire browser, multiple tabs |
| Target Users | Web developers building products | Scraper & agent developers |
| Primary Use Case | UX enhancement for end-users | Automated data extraction & task runners |
| Runtime | Browser JavaScript | Python + Playwright |
| Multimodal | No (text/DOM only) | Yes (screenshots) |
Page Agent is intentionally scoped to client-side web enhancement, not server-side automation. For tasks that need to cross browser tabs, pair it with the optional Chrome Extension.
Use Cases
- SaaS AI Copilot: Ship an AI copilot in your product in a few lines of code, with no backend rewrite. Users describe a goal; the agent drives the UI.
- Smart Form Filling: Turn 20-click ERP or CRM workflows into a single sentence. Well suited to admin panels and data-entry-heavy applications.
- Accessibility: Give visually impaired or elderly users a natural-language interface to any web app. Connect to a screen reader or voice assistant as the input channel.
- Interactive Training: Let AI demonstrate complete workflows in real time (e.g., "show me how to submit an expense report") so users learn by watching.
- Multi-Page Automation: Extend your in-page agent across browser tabs with the Chrome Extension for end-to-end, multi-step workflows.
Architecture: The Re-Act Loop
Page Agent follows a Re-Act (Reason + Act) loop inspired by browser-use. Each step consists of four stages:
- Observe: PageController reads the live DOM, extracts a compact text representation of all interactive elements, and captures the current URL and scroll state.
- Think: The LLM receives the page snapshot along with the task description and the full history of previous steps. It reflects on what happened, updates its short-term memory, decides on a next goal, and chooses an action tool to call.
- Act: The chosen tool is executed (e.g., click, type, scroll, done). The result is appended to history.
- Loop: Steps repeat until the agent calls done, the abort signal fires, or maxSteps is reached.
This reflection structure (evaluation_previous_goal, memory, next_goal) is enforced in the LLM's tool call schema, keeping the agent self-correcting across every step.
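The loop described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Page Agent's source: the AgentStep, LoopDeps, and Outcome types and the runLoop function are hypothetical, the observe/think/act callbacks stand in for PageController and the LLM, and history here stores each step rather than each tool result. Only the three reflection field names and the three stop conditions come from the text.

```typescript
// Hypothetical shapes; only evaluation_previous_goal, memory, and next_goal
// are field names taken from the documentation.
interface AgentStep {
  evaluation_previous_goal: string;
  memory: string;
  next_goal: string;
  action: { tool: string; args?: Record<string, unknown> };
}

interface LoopDeps {
  observe: () => Promise<string>; // page snapshot as text
  think: (task: string, snapshot: string, history: AgentStep[]) => Promise<AgentStep>;
  act: (action: AgentStep["action"]) => Promise<void>;
}

type Outcome = "done" | "aborted" | "max_steps";

async function runLoop(
  task: string,
  deps: LoopDeps,
  maxSteps: number,
  signal?: { readonly aborted: boolean }, // any object with this flag, e.g. an AbortSignal
): Promise<{ outcome: Outcome; history: AgentStep[] }> {
  const history: AgentStep[] = [];
  for (let i = 0; i < maxSteps; i++) {
    if (signal?.aborted) return { outcome: "aborted", history };
    const snapshot = await deps.observe();                  // Observe
    const step = await deps.think(task, snapshot, history); // Think
    history.push(step);
    if (step.action.tool === "done") return { outcome: "done", history };
    await deps.act(step.action);                            // Act
  }
  return { outcome: "max_steps", history };                 // step budget exhausted
}

// Scripted stand-in for the LLM: click once, then call done.
const demo = runLoop(
  "Submit the form",
  {
    observe: async () => "[0]<button>Submit</button>",
    think: async (_task, _snapshot, history) => ({
      evaluation_previous_goal: history.length === 0 ? "n/a" : "clicked Submit",
      memory: `steps so far: ${history.length}`,
      next_goal: history.length === 0 ? "click Submit" : "finish",
      action: { tool: history.length === 0 ? "click" : "done" },
    }),
    act: async () => {},
  },
  10,
);
```

Requiring the reflection fields on every step means the model has to state how the previous goal went before it is allowed to choose the next action, which is where the self-correcting behavior would come from.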
Page Agent uses a text-only DOM representation, with no screenshots and no multimodal models. This means standard chat/completion models work out of the box, and page content never leaves the browser as an image.
Next Steps
Quickstart
Get a working agent on your page in under five minutes with a CDN one-liner or npm install.
Models
Browse the tested LLM list, including a free testing API for evaluation.