Page Agent is a purely web-based GUI agent that lives inside your webpage. Unlike server-side automation tools such as browser-use or Playwright, Page Agent runs entirely in the browser as an in-page JavaScript library: no Python runtime, no headless browser, and no browser extension required. Web developers drop it into an existing site, and their users can immediately describe what they want in plain English (or Chinese) and watch the page respond.

Core Features

🧠 Smart DOM Analysis

Reads and reasons about your page through its DOM structure, with no screenshots and no multimodal models. Aggressive DOM "dehydration" strips the page down to a compact, text-based representation that standard LLMs can process quickly and accurately.

⚡ Zero Backend

Import via CDN or npm. Point the agent at any OpenAI-compatible LLM endpoint. There is nothing new to deploy server-side: the agent calls the LLM directly from the browser.

🔑 Bring Your Own LLM

Works with any OpenAI-compatible API: Alibaba Qwen, OpenAI, Anthropic (via proxy), Ollama, LM Studio, and more. You supply the baseURL, model, and apiKey, and Page Agent does the rest.
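
As a rough illustration of what "OpenAI-compatible" means in practice, the sketch below builds the HTTP request for a chat completion from the three values you supply. The `LLMConfig` interface and `buildChatRequest` helper are hypothetical names used only for illustration and are not part of Page Agent's API. The `/chat/completions` path and bearer-token header follow the usual OpenAI convention, and the endpoint, model name, and key are placeholders.

```typescript
// Hypothetical config shape: the three fields mirror the values named above.
interface LLMConfig {
  baseURL: string; // root of an OpenAI-compatible endpoint
  model: string;
  apiKey: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a chat completion request.
function buildChatRequest(config: LLMConfig, messages: ChatMessage[]): ChatRequest {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const root = config.baseURL.replace(/\/+$/, "");
  return {
    url: `${root}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${config.apiKey}`,
    },
    body: JSON.stringify({ model: config.model, messages }),
  };
}

// Placeholder values: a local endpoint, a made-up model name, and a dummy key.
const request = buildChatRequest(
  { baseURL: "http://localhost:11434/v1/", model: "example-model", apiKey: "sk-placeholder" },
  [{ role: "user", content: "Click the login button" }],
);
```

The resulting `url`, `method`, `headers`, and `body` can be handed to `fetch` as-is, which is why swapping providers only requires changing the three config values.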

🔒 Secure & Controllable

Supports operation allowlists and blocklists, data masking via transformPageContent, and custom system instructions. Make the agent follow your product's rules rather than acting on whatever the DOM happens to contain.
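
To make the data-masking idea concrete, here is a minimal sketch of a function that redacts sensitive values from the page's text snapshot before it reaches the LLM. The `maskSensitiveContent` name and the string-in, string-out shape are assumptions for illustration; check the transformPageContent reference for the actual signature. The two patterns are deliberately rough and would need tuning for real data.

```typescript
// Hypothetical masking hook: takes the text snapshot of the page and returns
// a redacted copy. The (string) => string shape is an assumption.
function maskSensitiveContent(pageContent: string): string {
  return pageContent
    // Redact anything that looks like an email address.
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[email]")
    // Redact runs of 12 to 19 digits, optionally separated by spaces or
    // dashes (a rough card-number shape).
    .replace(/\b\d(?:[ -]?\d){11,18}\b/g, "[number]");
}

const masked = maskSensitiveContent(
  "Contact: jane.doe@example.com, card 4111 1111 1111 1111",
);
// masked === "Contact: [email], card [number]"
```

Masking on the text snapshot keeps the rule in one place: whatever the DOM contains, the model only ever sees the redacted version.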

♿ Accessible Intelligence

Provides a natural-language interface for complex B2B systems and admin panels, making software approachable for every user, including those relying on voice commands or screen readers.

🐙 Optional Multi-Page Extension

For tasks that span multiple browser tabs, the optional Chrome Extension (PageAgentExt) gives the agent browser-level control (open, switch, and close tabs) without changing a line of your core integration.

Page Agent vs. browser-use

Page Agent builds on concepts pioneered by browser-use, but it solves a different problem. The table below captures the key differences:

| | page-agent | browser-use |
| --- | --- | --- |
| Deployment | Embedded component: ships inside your webpage | External tool: runs alongside a Python script |
| Scope | Current page (designed for SPAs) | Entire browser, multiple tabs |
| Target Users | Web developers building products | Scraper & agent developers |
| Primary Use Case | UX enhancement for end-users | Automated data extraction & task runners |
| Runtime | Browser JavaScript | Python + Playwright |
| Multimodal | No (text/DOM only) | Yes (screenshots) |

Page Agent is intentionally scoped to client-side web enhancement, not server-side automation. For tasks that need to cross browser tabs, pair it with the optional Chrome Extension.

Use Cases

  • SaaS AI Copilot: Ship an AI copilot in your product in a few lines of code, with no backend rewrite. Users describe a goal; the agent drives the UI.
  • Smart Form Filling: Turn 20-click ERP or CRM workflows into a single sentence. Well suited to admin panels and data-entry-heavy applications.
  • Accessibility: Give visually impaired or elderly users a natural-language interface to any web app. Connect a screen reader or voice assistant as the input channel.
  • Interactive Training: Let AI demonstrate complete workflows in real time (e.g., "show me how to submit an expense report") so users learn by watching.
  • Multi-Page Automation: Extend your in-page agent across browser tabs with the Chrome Extension for end-to-end, multi-step workflows.

Architecture: The Re-Act Loop

Page Agent follows a Re-Act (Reason + Act) loop inspired by browser-use. Each step consists of four stages:
Observe → Think → Act → Loop
  1. Observe: PageController reads the live DOM, extracts a compact text representation of all interactive elements, and captures the current URL and scroll state.
  2. Think: The LLM receives the page snapshot along with the task description and the full history of previous steps. It reflects on what happened, updates its short-term memory, decides on a next goal, and chooses an action tool to call.
  3. Act: The chosen tool is executed (e.g., click, type, scroll, done). The result is appended to history.
  4. Loop: Steps repeat until the agent calls done, the abort signal fires, or maxSteps is reached.
The reflection-before-action structure (fields evaluation_previous_goal, memory, next_goal) is enforced in the LLM's tool call schema, keeping the agent self-correcting across every step.
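
The four stages and the reflection fields can be sketched as a small loop. This is a simplified, synchronous illustration with stubbed stages, not Page Agent's actual implementation: a real agent would await the LLM and the DOM. Only the field names `evaluation_previous_goal`, `memory`, `next_goal` and the `maxSteps` limit come from the description above; `runLoop`, the type names, and the `action` shape are hypothetical.

```typescript
// Reflection fields named in the text, plus the chosen action (assumed shape).
interface AgentStep {
  evaluation_previous_goal: string;
  memory: string;
  next_goal: string;
  action: { tool: string; input?: string };
}

interface HistoryEntry {
  step: AgentStep;
  result: string;
}

type Observe = () => string;
type Think = (task: string, snapshot: string, past: HistoryEntry[]) => AgentStep;
type Act = (action: AgentStep["action"]) => string;

// Runs Observe -> Think -> Act until `done`, an abort, or maxSteps.
function runLoop(
  task: string,
  observe: Observe,
  think: Think,
  act: Act,
  maxSteps: number,
  isAborted: () => boolean = () => false,
): HistoryEntry[] {
  const past: HistoryEntry[] = [];
  for (let i = 0; i < maxSteps && !isAborted(); i++) {
    const snapshot = observe();               // 1. Observe
    const step = think(task, snapshot, past); // 2. Think
    const result = act(step.action);          // 3. Act
    past.push({ step, result });              // result is appended to history
    if (step.action.tool === "done") break;   // 4. Loop ends on `done`
  }
  return past;
}

// Stubs standing in for the DOM reader, the LLM, and the tool runner.
const trace = runLoop(
  "Submit the form",
  () => "[0]<button>Submit</button>",
  (_task, _snapshot, past) =>
    past.length === 0
      ? { evaluation_previous_goal: "n/a", memory: "", next_goal: "Click Submit", action: { tool: "click", input: "0" } }
      : { evaluation_previous_goal: "Clicked Submit", memory: "Form submitted", next_goal: "Finish", action: { tool: "done" } },
  (action) => `ran ${action.tool}`,
  10,
);
```

With these stubs the loop records two entries, a `click` followed by `done`, and stops well before the `maxSteps` limit of 10.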
Page Agent uses a text-only DOM representation, with no screenshots and no multimodal models. This means standard chat/completion models work out of the box, and page content never leaves the browser as an image.
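
For intuition, a text-only representation of this kind might look like the output of the sketch below, which serializes interactive elements into indexed one-line entries. The `InteractiveElement` type, the `dehydrate` function, and the `[index]<tag>` line format are assumptions chosen for illustration; the format Page Agent actually emits may differ.

```typescript
// Hypothetical, DOM-free stand-in for an interactive element.
interface InteractiveElement {
  tag: string;
  text: string;
  attributes?: Record<string, string>;
}

// Serializes elements as indexed one-line entries an LLM can refer to by number.
function dehydrate(elements: InteractiveElement[]): string {
  return elements
    .map((el, index) => {
      const attrs = Object.entries(el.attributes ?? {})
        .map(([attrName, attrValue]) => ` ${attrName}="${attrValue}"`)
        .join("");
      return `[${index}]<${el.tag}${attrs}>${el.text.trim()}</${el.tag}>`;
    })
    .join("\n");
}

const snapshot = dehydrate([
  { tag: "input", text: "", attributes: { placeholder: "Email" } },
  { tag: "button", text: "  Sign in " },
]);
// snapshot:
// [0]<input placeholder="Email"></input>
// [1]<button>Sign in</button>
```

Because each element carries a stable index, the model can name a target such as "element 1" in its tool call instead of describing a position on screen.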

Next Steps

Quickstart

Get a working agent on your page in under five minutes, using either a CDN one-liner or npm install.

Models

Browse the tested LLM list, including a free testing API for evaluation.
