Stagehand uses advanced caching strategies to reduce latency and token costs. This includes prompt caching for repeated content and conversation history compression for long-running agents.
Overview
Caching strategies in Stagehand:
- Prompt caching - Cache system prompts and static content
- Image compression - Reduce token usage in conversation history
- Conversation management - Maintain context while minimizing tokens
- Provider-specific optimizations - Leverage native caching features
Prompt Caching
Anthropic Prompt Caching
Anthropic supports prompt caching via cache_control blocks. Stagehand applies this automatically to system prompts and accessibility trees.
How it works:
// System prompt with caching
const messages = [
{
role: "system",
content: [
{
type: "text",
text: systemPrompt,
cache_control: { type: "ephemeral" }, // Cache this content
},
],
},
// User messages...
];
Benefits:
- System prompts are cached across requests
- Reduces input token costs by ~90% for cached content
- Cache persists for 5 minutes of inactivity
- Particularly effective for accessibility trees
Accessibility Tree Caching:
Location: Various act/extract implementations
const ariaTree = await page.getAriaTree();
const messages = [
{
role: "user",
content: [
{
type: "text",
text: `Accessibility tree:\n${ariaTree}`,
cache_control: { type: "ephemeral" }, // Cache the tree
},
{
type: "text",
text: instruction,
},
],
},
];
Token Savings Example:
// First request: 5000 input tokens
// Subsequent requests with cache: 500 input tokens (90% reduction)
// Cache hit charges: ~10% of uncached cost
OpenAI Prompt Caching
OpenAI does not expose explicit cache-control markers; its prompt caching is applied automatically to repeated prompt prefixes. Stagehand optimizes requests by:
- Reusing system prompts across calls
- Minimizing message history
- Structuring requests for potential future caching support
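Since the optimization here is structural rather than an API call, a minimal sketch of the idea: keep the static system prompt byte-identical and first in every request so an automatic prefix cache can hit. `SYSTEM_PROMPT` and `buildMessages` are illustrative names, not part of Stagehand.

```typescript
// Illustrative constant: the long, static part of the prompt.
const SYSTEM_PROMPT = "You are a browser automation agent...";

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Stable prefix first, variable content last: an automatic prefix
// cache can only match content that is identical from the start.
function buildMessages(instruction: string): ChatMessage[] {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: instruction },
  ];
}
```

Any per-request detail (the instruction, the current page state) goes after the stable prefix, never inside it.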
Google Prompt Caching
Google’s caching is handled automatically on the server side. Stagehand optimizes by:
- Structuring system instructions consistently
- Reusing conversation history format
- Minimizing changes to cached content
Image Compression
Anthropic Image Compression
Location: packages/core/lib/v3/agent/utils/imageCompression.ts
Strategy:
- Keep first 2 images in conversation at full quality
- Compress all subsequent images to 25% quality
- Reduces token usage while maintaining context
Implementation:
import sharp from "sharp";

export async function compressConversationImages(
  items: ResponseInputItem[],
  keepFirstN = 2,
): Promise<void> {
  let imageCount = 0;
  for (const item of items) {
    if ("role" in item && item.role === "user") {
      const content = item.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "image") {
            imageCount++;
            if (imageCount > keepFirstN) {
              // Re-encode this image as a low-quality JPEG
              const buffer = Buffer.from(block.source.data, "base64");
              const compressed = await sharp(buffer)
                .jpeg({ quality: 25 })
                .toBuffer();
              block.source.data = compressed.toString("base64");
              block.source.media_type = "image/jpeg"; // payload is now JPEG
            }
          }
        }
      }
    }
  }
}
Usage in CUA:
// In AnthropicCUAClient.ts
const nextInputItems: ResponseInputItem[] = [...inputItems];
// Compress images before adding the new messages
await compressConversationImages(nextInputItems);
nextInputItems.push(assistantMessage);
nextInputItems.push(userToolResultsMessage);
Token Savings:
// Full quality image: ~1500 tokens
// 25% quality image: ~400 tokens
// Savings: ~73% per compressed image
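The per-image figures above can be turned into a rough budget estimate for a whole conversation. This `imageTokenEstimate` helper is hypothetical, hard-coding the ~1500/~400 token figures from the text:

```typescript
// Rough token budget for the images in a conversation, assuming
// ~1500 tokens per full-quality image and ~400 per 25%-quality JPEG
// (approximate figures; actual counts vary with image content).
function imageTokenEstimate(totalImages: number, keepFirstN = 2): number {
  const full = Math.min(totalImages, keepFirstN);
  const compressed = Math.max(totalImages - keepFirstN, 0);
  return full * 1500 + compressed * 400;
}
```

For a 10-image conversation this gives 2 × 1500 + 8 × 400 = 6,200 tokens, matching the savings example in the Cost Optimization section below.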
Google Image Compression
Location: packages/core/lib/v3/agent/utils/imageCompression.ts
Implementation:
export async function compressGoogleConversationImages(
  items: Content[],
  keepFirstN = 2,
): Promise<{ items: Content[]; compressed: boolean }> {
  let imageCount = 0;
  let compressed = false;
  for (const item of items) {
    if (item.role === "user" && item.parts) {
      for (const part of item.parts) {
        if (part.inlineData?.mimeType === "image/png") {
          imageCount++;
          if (imageCount > keepFirstN) {
            // Re-encode as JPEG at 25% quality
            const buffer = Buffer.from(part.inlineData.data, "base64");
            const compressedBuffer = await sharp(buffer)
              .jpeg({ quality: 25 })
              .toBuffer();
            part.inlineData.data = compressedBuffer.toString("base64");
            part.inlineData.mimeType = "image/jpeg";
            compressed = true;
          }
        }
      }
    }
  }
  return { items, compressed };
}
Usage:
// In GoogleCUAClient.ts:executeStep()
const compressedResult = await compressGoogleConversationImages(
  this.history,
  2, // Keep first 2 images
);
const compressedHistory = compressedResult.items;
const response = await this.client.models.generateContent({
model: this.modelName,
contents: compressedHistory,
config: this.generateContentConfig,
});
Conversation History Management
CUA Conversation History
All CUA clients maintain conversation history to preserve context:
Anthropic Pattern:
private async executeStep(
inputItems: ResponseInputItem[],
logger: (message: LogLine) => void,
): Promise<{ /* ... */ }> {
// Get model response
const result = await this.getAction(inputItems);
// Build next input items
const nextInputItems: ResponseInputItem[] = [...inputItems];
// Compress images
await compressConversationImages(nextInputItems);
// Add assistant message
nextInputItems.push(assistantMessage);
// Add tool results
if (toolResults.length > 0) {
nextInputItems.push(userToolResultsMessage);
}
return { nextInputItems, /* ... */ };
}
Google Pattern:
private history: Content[] = [];
async executeStep(logger: (message: LogLine) => void) {
// Compress history before request
const compressedResult = await compressGoogleConversationImages(this.history, 2);
const compressedHistory = compressedResult.items;
// Get response
const response = await this.client.models.generateContent({
contents: compressedHistory,
// ...
});
// Add to history
this.history.push(sanitizedContent);
if (functionResponses.length > 0) {
this.history.push({
role: "user",
parts: functionResponses,
});
}
}
OpenAI Pattern:
private reasoningItems: Map<string, ResponseItem> = new Map();
async executeStep(
inputItems: ResponseInputItem[],
previousResponseId: string | undefined,
) {
// Use previous_response_id for history
const requestParams = {
model: this.modelName,
input: inputItems,
previous_response_id: previousResponseId,
};
const response = await this.client.responses.create(requestParams);
// Track reasoning items
for (const item of response.output) {
if (item.type === "reasoning") {
this.reasoningItems.set(item.id, item);
}
}
return { responseId: response.id };
}
History Truncation Strategies
Keep recent messages:
function truncateHistory(
history: ResponseInputItem[],
maxMessages = 10,
): ResponseInputItem[] {
// Always keep system message
const systemMessages = history.filter((m) => m.role === "system");
const otherMessages = history.filter((m) => m.role !== "system");
// Keep last N messages
const recentMessages = otherMessages.slice(-maxMessages);
return [...systemMessages, ...recentMessages];
}
Token-based truncation:
function truncateByTokens(
history: ResponseInputItem[],
maxTokens = 100000,
): ResponseInputItem[] {
const systemMessages = history.filter((m) => m.role === "system");
const otherMessages = history.filter((m) => m.role !== "system").reverse();
let tokenCount = estimateTokens(systemMessages);
const keptMessages: ResponseInputItem[] = [];
for (const message of otherMessages) {
const messageTokens = estimateTokens([message]);
if (tokenCount + messageTokens > maxTokens) break;
keptMessages.unshift(message);
tokenCount += messageTokens;
}
return [...systemMessages, ...keptMessages];
}
Provider-Specific Optimizations
Anthropic Cache Control
// Mark content for caching
const messages = [
{
role: "system",
content: [
{
type: "text",
text: longSystemPrompt,
cache_control: { type: "ephemeral" },
},
],
},
];
// First request: Full token count
// Subsequent requests: Cache hit (10% cost)
Google Content Reuse
// Structure content consistently for better caching
this.generateContentConfig = {
temperature: 1,
topP: 0.95,
topK: 40,
maxOutputTokens: 8192,
tools: [{
computerUse: { environment: this.environment },
}],
};
// Reuse config across requests
const response = await this.client.models.generateContent({
model: this.modelName,
contents: compressedHistory,
config: this.generateContentConfig, // Consistent config
});
OpenAI Response Chaining
// Use previous_response_id to chain requests
let previousResponseId: string | undefined;
for (let step = 0; step < maxSteps; step++) {
const response = await this.client.responses.create({
model: this.modelName,
input: inputItems,
previous_response_id: previousResponseId, // Link to previous
});
previousResponseId = response.id;
}
Performance Monitoring
Track Token Usage
let totalInputTokens = 0;
let totalOutputTokens = 0;
let totalCachedTokens = 0;
while (!completed && currentStep < maxSteps) {
const result = await this.executeStep(inputItems, logger);
totalInputTokens += result.usage.input_tokens;
totalOutputTokens += result.usage.output_tokens;
if (result.usage.cached_input_tokens) {
totalCachedTokens += result.usage.cached_input_tokens;
}
currentStep++;
}
console.log("Token usage:", {
input: totalInputTokens,
output: totalOutputTokens,
cached: totalCachedTokens,
  savings: `${((totalCachedTokens / Math.max(totalInputTokens, 1)) * 100).toFixed(1)}%`,
});
Log Compression Results
const before = estimateSize(inputItems);
compressConversationImages(inputItems);
const after = estimateSize(inputItems);
logger({
category: "caching",
message: `Compressed images: ${before}KB → ${after}KB (${((1 - after / before) * 100).toFixed(1)}% reduction)`,
level: 2,
});
Best Practices
- Use prompt caching: Mark static content with cache_control
- Compress images: Keep first 2 at full quality, compress rest
- Truncate history: Don’t let conversation grow unbounded
- Monitor token usage: Track input/output/cached tokens
- Structure consistently: Consistent structure improves caching
- Batch operations: Fewer requests = better cache utilization
- Use appropriate models: cheaper, faster models are often sufficient for repetitive, cache-heavy steps
Cost Optimization
Example savings with caching:
// Without caching:
// 10 requests × 5000 input tokens = 50,000 tokens
// Cost: $0.15 (at $3/1M tokens)
// With prompt caching (4000 tokens cached):
// Request 1: 5000 input tokens = $0.015
// Requests 2-10: 1000 new + 400 cached = 1400 tokens each
// Cost: $0.015 + (9 × $0.0042) = $0.053
// Savings: 65%
With image compression:
// Full quality: 10 images × 1500 tokens = 15,000 tokens
// Compressed: 2 full + 8 compressed (400 tokens) = 6,200 tokens
// Savings: 59%
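The prompt-caching arithmetic above can be packaged as a small cost calculator. Prices here are illustrative ($3 per 1M input tokens, cache reads billed at 10%), and `requestCost` is a hypothetical helper:

```typescript
// Cost of one request: new tokens at full price, cached tokens at 10%.
function requestCost(
  newTokens: number,
  cachedTokens: number,
  pricePerMTok = 3,
): number {
  const effectiveTokens = newTokens + cachedTokens * 0.1;
  return (effectiveTokens / 1_000_000) * pricePerMTok;
}

// Reproducing the worked example above:
const first = requestCost(5000, 0);       // first request, nothing cached yet
const rest = 9 * requestCost(1000, 4000); // requests 2-10: 1000 new + 4000 cached
const total = first + rest;               // ≈ $0.053 vs $0.15 uncached
```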
References
- Image Compression:
packages/core/lib/v3/agent/utils/imageCompression.ts
- Anthropic CUA:
packages/core/lib/v3/agent/AnthropicCUAClient.ts:351
- Google CUA:
packages/core/lib/v3/agent/GoogleCUAClient.ts:357
- OpenAI CUA:
packages/core/lib/v3/agent/OpenAICUAClient.ts:420