This example demonstrates browser automation environments using two different modes: DOM mode (natural language actions via Stagehand) and CUA mode (vision-based control with low-level primitives). Both integrate with Browserbase for cloud browser management.
Overview
Verifiers provides two browser automation approaches:
| Mode | Control Method | Tools | Best For |
|---|---|---|---|
| DOM | Natural language → Stagehand SDK | act, observe, extract, navigate | Semantic interactions, form filling |
| CUA | Vision + coordinates | click, type_text, scroll, screenshot | Precise control, visual tasks |
Both modes support:
- Cloud browsers via Browserbase
- Local browser automation
- Sandbox deployment (CUA mode)
- Screenshot capture and vision model integration
DOM Mode Example
DOM mode uses Stagehand’s AI-driven browser control for natural language interactions.
Complete Implementation
Main Code and Judge Evaluation
import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset
DOM_SYSTEM_PROMPT = """You are a browser automation agent using Stagehand's AI-driven tools.
Available tools:
- navigate(url): Navigate to a URL
- observe(instruction): Find possible actions matching the instruction
- act(instruction): Execute an action described in natural language
- extract(instruction, schema_json): Extract structured data from the page
Use natural language to describe what you want to do. Stagehand will intelligently
find elements and execute actions without needing CSS selectors or coordinates.
Complete the given task efficiently."""
def create_example_dataset() -> Dataset:
    return Dataset.from_dict({
        "question": [
            "What does the headline say on the primeintellect.ai homepage?"
        ],
        "answer": ["The Open Superintelligence Stack"],
        "start_url": ["https://primeintellect.ai"],
        "task_id": ["dom-example-0"],
    })
def load_environment(
    project_id: str,
    max_turns: int = 10,
    judge_model: str = "gpt-4o-mini",
    system_prompt: str = DOM_SYSTEM_PROMPT,
    browserbase_api_key_var: str = "BROWSERBASE_API_KEY",
    stagehand_model: str = "openai/gpt-4o-mini",
    model_api_key_var: str = "MODEL_API_KEY",
    proxy_model_to_stagehand: bool = False,
    **kwargs,
) -> vf.Environment:
    import os

    # Check required env vars
    missing = []
    if not os.getenv(browserbase_api_key_var):
        missing.append(browserbase_api_key_var)
    if not os.getenv(model_api_key_var):
        missing.append(model_api_key_var)
    if missing:
        raise ValueError(
            f"Missing required environment variables: {', '.join(missing)}"
        )

    dataset = create_example_dataset()

    # Create judge rubric
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    return BrowserEnv(
        mode="dom",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        system_prompt=system_prompt,
        project_id=project_id,
        browserbase_api_key_var=browserbase_api_key_var,
        stagehand_model=stagehand_model,
        model_api_key_var=model_api_key_var,
        proxy_model_to_stagehand=proxy_model_to_stagehand,
        **kwargs,
    )
JUDGE_PROMPT = """You are evaluating a browser automation agent's answer to a question.
Question:
Does the agent's response contain the correct answer? The answer may be embedded
in a longer response or phrased differently, but should convey the same information
as the expected answer.
Respond "yes" if the agent's response contains the correct answer, "no" if it does not."""
async def judge_answer(
    judge,
    prompt: str | list,
    completion: str | list,
    answer: str,
    state: vf.State,
) -> float:
    judge_response = await judge(prompt, completion, answer, state)
    is_correct = "yes" in judge_response.lower()
    return 1.0 if is_correct else 0.0
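The reward above comes from a substring check on the judge's reply. The sketch below isolates that check so its behavior is easy to see, and adds a stricter variant for comparison. Both helper names (`verdict_to_reward`, `strict_verdict_to_reward`) are hypothetical and not part of verifiers; the stricter rule is an illustration, not the library's behavior.

```python
def verdict_to_reward(judge_response: str) -> float:
    """Substring check, mirroring the logic in judge_answer."""
    return 1.0 if "yes" in judge_response.lower() else 0.0


def strict_verdict_to_reward(judge_response: str) -> float:
    """Hypothetical stricter variant: only the first word of the reply counts."""
    words = judge_response.strip().lower().split()
    first = words[0].strip('".,!:;') if words else ""
    return 1.0 if first == "yes" else 0.0


# A reply that mentions "yes" while rejecting the answer separates the two rules.
reply = "No. The agent says yes but quotes the wrong headline."
print(verdict_to_reward(reply), strict_verdict_to_reward(reply))
```

The substring rule is more tolerant of verbose judges, while the first-word rule avoids false positives when the judge explains itself.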
navigate(url) - Navigate to a URL
navigate("https://example.com")
act(instruction) - Perform action via natural language
act("click the login button")
act("type 'hello@example.com' into the email field")
act("select 'United States' from the country dropdown")
observe(instruction) - Find elements matching description
observe("find all product cards on the page")
# Returns: List of matching elements with metadata
extract(instruction, schema_json) - Extract structured data
schema = '{"title": "string", "price": "number"}'
extract("extract product information", schema)
# Returns: {"title": "Product Name", "price": 29.99}
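Since `extract` takes its schema as a JSON string, it can help to build that string from a dict and to sanity-check the returned data. This is a minimal sketch under assumptions: the `matches_schema` helper is hypothetical, and mapping the type names `"string"` and `"number"` to Python types is a guess about how the schema is meant to be read.

```python
import json

# Schema as a dict; json.dumps produces the string form passed to extract().
schema = {"title": "string", "price": "number"}
schema_json = json.dumps(schema)

# Assumed mapping from schema type names to Python types.
TYPE_MAP = {"string": str, "number": (int, float)}


def matches_schema(result: dict, schema: dict) -> bool:
    """Return True if result has exactly the schema's keys with matching types."""
    if set(result) != set(schema):
        return False
    for key, type_name in schema.items():
        value = result[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[type_name]):
            return False
    return True


print(schema_json)
print(matches_schema({"title": "Product Name", "price": 29.99}, schema))
```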
DOM Mode Interaction Example
Successful Task
Task: What does the headline say on the primeintellect.ai homepage?
Assistant: I’ll navigate to the homepage and extract the headline.
Tool Call: navigate("https://primeintellect.ai")
Tool Result: Navigation successful
Assistant: Now I’ll extract the main headline.
Tool Call: act("read the main headline text")
Tool Result: Found headline: “The Open Superintelligence Stack”
Assistant: The headline says “The Open Superintelligence Stack”.
Judge: “yes”
Result: ✓ Correct (reward = 1.0)
Natural Language Actions
DOM mode excels at semantic tasks:
# Login form
act("click the 'Sign In' button")
act("type 'user@example.com' in the email field")
act("type 'password123' in the password field")
act("click the submit button")
# Search
act("type 'machine learning' in the search box")
act("press Enter")
# Navigation
act("click on the first search result")
act("scroll down to the comments section")
Stagehand intelligently finds elements without selectors.
CUA Mode Example
CUA (Computer Use Agent) mode provides low-level vision-based browser control.
Complete Implementation
from typing import Literal
import verifiers as vf
from verifiers.envs.integrations.browser_env import BrowserEnv
from datasets import Dataset
CUA_SYSTEM_PROMPT = """You are a browser automation agent. You can control a web browser using the provided tools.
Available tools:
- click(x, y, button): Click at coordinates
- double_click(x, y): Double-click at coordinates
- type_text(text): Type text into focused element
- keypress(keys): Press keyboard keys
- scroll(x, y, scroll_x, scroll_y): Scroll at position
- goto(url): Navigate to URL
- back(): Go back in history
- forward(): Go forward in history
- wait(time_ms): Wait for specified milliseconds
- screenshot(): Capture current page state
After each action, you will receive a screenshot showing the current page state.
Analyze the screenshot to determine your next action.
Complete the given task efficiently using the minimum number of actions necessary."""
def load_environment(
    max_turns: int = 15,
    judge_model: str = "gpt-4o-mini",
    system_prompt: str = CUA_SYSTEM_PROMPT,
    # CUA mode configuration
    use_sandbox: bool = True,
    server_url: str = "http://localhost:3000",
    # Browserbase configuration
    browserbase_api_key: str | None = None,
    browserbase_project_id: str | None = None,
    env: Literal["LOCAL", "BROWSERBASE"] = "BROWSERBASE",
    # Pre-built image (fastest startup)
    use_prebuilt_image: bool = True,
    prebuilt_image: str = "deepdream19/cua-server:latest",
    **kwargs,
) -> vf.Environment:
    dataset = create_example_dataset()

    # JUDGE_PROMPT and judge_answer are the same as in the DOM mode example
    rubric = vf.JudgeRubric(
        judge_model=judge_model,
        judge_prompt=JUDGE_PROMPT,
    )
    rubric.add_reward_func(judge_answer, weight=1.0)

    return BrowserEnv(
        mode="cua",
        dataset=dataset,
        rubric=rubric,
        max_turns=max_turns,
        system_prompt=system_prompt,
        use_sandbox=use_sandbox,
        server_url=server_url,
        env=env,
        browserbase_api_key=browserbase_api_key,
        browserbase_project_id=browserbase_project_id,
        use_prebuilt_image=use_prebuilt_image,
        prebuilt_image=prebuilt_image,
        **kwargs,
    )
def create_example_dataset() -> Dataset:
    """Create example dataset for CUA mode."""
    return Dataset.from_dict({
        "question": [
            "What does the headline say on the primeintellect.ai homepage?"
        ],
        "answer": ["The Open Superintelligence Stack"],
        "start_url": ["https://primeintellect.ai"],
        "task_id": ["cua-example-0"],
    })
click(x, y, button="left") - Click at coordinates
click(150, 300) # Click at (150, 300)
click(200, 400, button="right") # Right-click
type_text(text) - Type text into focused element
type_text("hello@example.com")
keypress(keys) - Press keyboard keys
keypress("Enter")
keypress("Ctrl+C")
keypress("Tab")
scroll(x, y, scroll_x, scroll_y) - Scroll at position
scroll(500, 500, 0, -100) # Scroll up 100 pixels
scroll(500, 500, 0, 100) # Scroll down
goto(url) - Navigate to URL
goto("https://example.com")
screenshot() - Capture current state
screenshot() # Returns base64 encoded image
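Coordinate-based tools only make sense inside the visible page, and the scroll examples above use a negative `scroll_y` for scrolling up. The sketch below captures both conventions, assuming the 1024x768 default viewport listed under Configuration Options. The helpers `in_viewport` and `scroll_direction` are hypothetical and are not part of the CUA tool set.

```python
VIEWPORT_WIDTH = 1024   # assumed default, taken from the viewport_width option
VIEWPORT_HEIGHT = 768   # assumed default, taken from the viewport_height option


def in_viewport(x: int, y: int,
                width: int = VIEWPORT_WIDTH,
                height: int = VIEWPORT_HEIGHT) -> bool:
    """Return True if (x, y) lies inside the viewport rectangle."""
    return 0 <= x < width and 0 <= y < height


def scroll_direction(scroll_y: int) -> str:
    """Name the vertical direction implied by the sign of scroll_y."""
    if scroll_y < 0:
        return "up"
    if scroll_y > 0:
        return "down"
    return "none"


print(in_viewport(150, 300), scroll_direction(-100))
```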
CUA Mode Interaction Example
Vision-Based Control
Task: Click the search button
Assistant: Let me take a screenshot to see the page.
Tool Call: screenshot()
Tool Result: [Image showing page with search button at (300, 150)]
Assistant: I can see the search button at coordinates (300, 150). I’ll click it.
Tool Call: click(300, 150)
Tool Result: Click successful
Assistant: Now I’ll type the search query.
Tool Call: type_text("machine learning")
Tool Result: Text entered
Assistant: Pressing Enter to search.
Tool Call: keypress("Enter")
Low-Level Actions
CUA mode provides precise control:
# Login sequence
click(200, 100) # Click email field
type_text("user@example.com")
keypress("Tab") # Move to password field
type_text("password123")
click(300, 200) # Click login button
# Form interaction
click(150, 300) # Focus dropdown
keypress("Down")
keypress("Down")
keypress("Enter")
# Scrolling and navigation
scroll(500, 500, 0, -200) # Scroll up
wait(1000) # Wait 1 second
screenshot() # Capture state
Mode Comparison
| Feature | DOM Mode | CUA Mode |
|---|---|---|
| Control | Natural language | Coordinates + vision |
| Setup | Requires Browserbase project | Requires sandbox or local server |
| Startup | Fast (~2-5s) | Slower (~5-30s depending on mode) |
| Best for | Semantic tasks, form filling | Precise control, visual inspection |
| Tools | 4 high-level | 10 low-level |
| Vision | Not required | Required for coordinates |
| Selectors | Not needed (AI-driven) | Not needed (coordinate-based) |
| Reliability | Good for standard elements | Excellent for any clickable item |
Running Browser Environments
Installation
# Install browser environments
prime env install browser-dom-example
prime env install browser-cua-example
DOM Mode Execution
# Requires: BROWSERBASE_API_KEY, MODEL_API_KEY, Browserbase project ID
prime eval run browser-dom-example \
-m openai/gpt-4o-mini \
-b https://api.openai.com/v1 \
-k OPENAI_API_KEY \
-a '{"project_id": "YOUR_PROJECT_ID"}' \
-n 10 \
-r 3
CUA Mode Execution
# Default: Pre-built image (fastest)
prime eval run browser-cua-example \
-m openai/gpt-4.1-mini \
-b https://api.openai.com/v1 \
-k OPENAI_API_KEY \
-n 10 \
-r 3
# Binary upload mode (custom server)
prime eval run browser-cua-example \
-m openai/gpt-4.1-mini \
-a '{"use_prebuilt_image": false}' \
-n 10
# Manual mode (local development)
# Terminal 1: cd cua-server && ./start.sh
# Terminal 2:
prime eval run browser-cua-example \
-m openai/gpt-4.1-mini \
-a '{"use_sandbox": false, "server_url": "http://localhost:3000"}' \
-n 10
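The `-a` flag in the commands above carries environment arguments as a JSON object, so Python booleans must become lowercase `true`/`false`. A small sketch of generating that argument programmatically follows; the `build_env_args` helper is hypothetical, and the command list simply mirrors the manual-mode invocation shown above.

```python
import json
import shlex


def build_env_args(**overrides) -> str:
    """Serialize keyword overrides into the JSON string expected by -a."""
    return json.dumps(overrides)


args = build_env_args(use_sandbox=False, server_url="http://localhost:3000")

command = [
    "prime", "eval", "run", "browser-cua-example",
    "-m", "openai/gpt-4.1-mini",
    "-a", args,
    "-n", "10",
]

# shlex.join quotes the JSON so it survives the shell as a single argument.
print(shlex.join(command))
```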
Configuration Options
DOM Mode
| Parameter | Default | Description |
|---|---|---|
| project_id | Required | Browserbase project ID |
| max_turns | 10 | Maximum interactions |
| judge_model | "gpt-4o-mini" | Judge model |
| browserbase_api_key_var | "BROWSERBASE_API_KEY" | Browserbase API key env var |
| stagehand_model | "openai/gpt-4o-mini" | Model for Stagehand |
| model_api_key_var | "MODEL_API_KEY" | Model API key env var |
| proxy_model_to_stagehand | False | Route Stagehand through eval model |
CUA Mode
| Parameter | Default | Description |
|---|---|---|
| max_turns | 15 | Maximum interactions |
| use_sandbox | True | Auto-deploy to sandbox |
| use_prebuilt_image | True | Use pre-built Docker image (fastest) |
| prebuilt_image | "deepdream19/cua-server:latest" | Docker image |
| server_url | "http://localhost:3000" | Server URL (manual mode) |
| env | "BROWSERBASE" | Browser env (LOCAL/BROWSERBASE) |
| viewport_width | 1024 | Browser width |
| viewport_height | 768 | Browser height |
| save_screenshots | False | Save screenshots to disk |
| keep_recent_screenshots | 2 | Screenshots in context |
Key Features
Browserbase Integration
Both modes support cloud browsers via Browserbase:
- No local browser installation needed
- Scalable cloud infrastructure
- Session recording and debugging
- Proxy support for geo-targeting
Setup:
- Sign up at browserbase.com
- Get API key and project ID
- Set environment variables
Vision Model Integration
CUA mode integrates screenshots with vision models:
# Screenshot automatically included in context
screenshot() # Returns base64 image
# Vision model analyzes image to determine coordinates
# Example: "I see the login button at approximately (250, 180)"
Sandbox Deployment
CUA mode supports three deployment modes:
1. Pre-built image (default, fastest)
BrowserEnv(
    mode="cua",
    use_prebuilt_image=True,  # ~5-10s startup
    prebuilt_image="deepdream19/cua-server:latest",
)
2. Binary upload (custom server)
BrowserEnv(
    mode="cua",
    use_prebuilt_image=False,  # ~30-60s startup
    use_binary=True,
)
3. Manual (local development)
# Terminal 1
cd cua-server && ./start.sh
# Terminal 2
BrowserEnv(
    mode="cua",
    use_sandbox=False,
    server_url="http://localhost:3000",
)
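The three deployment modes above are selected by two flags. The sketch below spells out that mapping as read from the examples; the `deployment_mode` function and its return labels are hypothetical, and the actual selection logic inside `BrowserEnv` may differ.

```python
def deployment_mode(use_sandbox: bool = True,
                    use_prebuilt_image: bool = True) -> str:
    """Map the two flags onto the three deployment modes described above."""
    if not use_sandbox:
        # Manual mode: the caller runs the CUA server and passes server_url.
        return "manual"
    if use_prebuilt_image:
        # Default: sandbox started from the pre-built Docker image.
        return "prebuilt_image"
    # Sandbox without the pre-built image: the server binary is uploaded.
    return "binary_upload"


print(deployment_mode())
print(deployment_mode(use_prebuilt_image=False))
print(deployment_mode(use_sandbox=False))
```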
Metrics Tracked
DOM Mode
- judge_reward: Answer correctness (0.0 or 1.0)
- num_turns: Interaction count
- total_tool_calls: Tools used
- Per-tool counts: navigate_calls, act_calls, etc.
CUA Mode
- judge_reward: Answer correctness (0.0 or 1.0)
- num_turns: Interaction count
- total_tool_calls: Tools used
- Per-tool counts: click_calls, screenshot_calls, etc.
- sandbox_ready_wait_time: Sandbox startup time
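To make the tool-call metrics concrete, here is a sketch of how a total and per-tool counts with the `<tool>_calls` naming could be derived from a list of tool names. The `tool_call_metrics` helper is hypothetical and only illustrates the naming pattern; it is not how verifiers computes these values.

```python
from collections import Counter


def tool_call_metrics(tool_names: list[str]) -> dict[str, int]:
    """Build total_tool_calls plus one '<tool>_calls' entry per tool used."""
    metrics = {"total_tool_calls": len(tool_names)}
    for name, count in Counter(tool_names).items():
        metrics[f"{name}_calls"] = count
    return metrics


# Tool names from a short CUA rollout (illustrative data).
print(tool_call_metrics(["screenshot", "click", "type_text", "click"]))
```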
Advanced Usage
Custom Datasets
Create task-specific datasets:
def create_custom_dataset() -> Dataset:
    return Dataset.from_dict({
        "question": [
            "Find the price of the first product",
            "What is the company's contact email?",
        ],
        "answer": [
            "$29.99",
            "contact@example.com",
        ],
        "start_url": [
            "https://shop.example.com",
            "https://example.com/contact",
        ],
    })
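`Dataset.from_dict` expects every column to have the same length, which is easy to break when editing parallel lists by hand. One option is to write tasks as records and convert them to columns, as in this sketch. The `build_columns` helper and the auto-generated `task_id` values are assumptions for illustration; the example datasets earlier set `task_id` explicitly.

```python
REQUIRED_KEYS = ("question", "answer", "start_url")


def build_columns(tasks: list[dict]) -> dict[str, list]:
    """Convert task records into equal-length columns, validating required keys."""
    columns = {key: [] for key in (*REQUIRED_KEYS, "task_id")}
    for index, task in enumerate(tasks):
        missing = [key for key in REQUIRED_KEYS if key not in task]
        if missing:
            raise ValueError(f"Task {index} is missing: {', '.join(missing)}")
        for key in REQUIRED_KEYS:
            columns[key].append(task[key])
        # Hypothetical default ID when the record does not provide one.
        columns["task_id"].append(task.get("task_id", f"custom-{index}"))
    return columns


tasks = [
    {"question": "Find the price of the first product",
     "answer": "$29.99",
     "start_url": "https://shop.example.com"},
    {"question": "What is the company's contact email?",
     "answer": "contact@example.com",
     "start_url": "https://example.com/contact"},
]
columns = build_columns(tasks)
print(columns["task_id"])
```

The resulting dict can then be passed to `Dataset.from_dict(columns)`.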
Proxy Configuration
BrowserEnv(
    mode="dom",
    proxies=True,  # Enable Browserbase proxies
    # Proxies allow geo-targeting and IP rotation
)
Screenshot Management
BrowserEnv(
    mode="cua",
    save_screenshots=True,  # Save to disk
    keep_recent_screenshots=3,  # Keep last 3 in context
    # Balances context size with visual information
)
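The idea behind `keep_recent_screenshots` can be sketched as pruning older images from the conversation while leaving the latest ones in place. The message format below (dicts with a `type` field) and the `prune_screenshots` helper are assumptions made for illustration; the environment's internal representation may look different.

```python
def prune_screenshots(messages: list[dict],
                      keep_recent: int = 2,
                      placeholder: str = "[screenshot removed]") -> list[dict]:
    """Replace all but the last keep_recent image messages with a text placeholder."""
    image_indices = [i for i, m in enumerate(messages) if m.get("type") == "image"]
    if keep_recent > 0:
        to_drop = set(image_indices[:-keep_recent])
    else:
        to_drop = set(image_indices)
    return [
        {"type": "text", "text": placeholder} if i in to_drop else message
        for i, message in enumerate(messages)
    ]


history = [
    {"type": "image", "data": "a"},
    {"type": "text", "text": "clicked the search box"},
    {"type": "image", "data": "b"},
    {"type": "image", "data": "c"},
]
pruned = prune_screenshots(history, keep_recent=2)
print([m["type"] for m in pruned])
```

Replacing rather than deleting old screenshots keeps the turn structure intact, which is one plausible way to trade context size against visual history.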