
Overview

Libretto supports four distinct approaches to capturing data and automating web interactions. Each makes different trade-offs among detection risk, setup complexity, data quality, and control; understanding those trade-offs helps you pick the right tool for your target site.
| Approach | Bot detection risk | Best for |
| --- | --- | --- |
| Regular Playwright | Moderate | Simple DOM extraction, server-rendered sites |
| Passive interception (page.on('response')) | Low | SPAs that load data via API calls during navigation |
| In-browser fetch (pageRequest()) | Low to moderate | Deep pagination, bulk queries without UI clicking |
| Direct HTTP from Node.js | Very high | Public/documented APIs with no bot detection |
Recommended hybrid: Combine Regular Playwright for navigation with passive page.on('response') interception for data capture. This gives you browser-based reliability with structured API data quality at minimal detection risk.

Approach details

Standard Playwright usage — navigate pages, click elements, fill forms, and read DOM content using selectors and page.evaluate().
// Navigate and interact
await page.goto('https://example.com/search');
await page.fill('#query', 'search term');
await page.click('#submit');
await page.waitForSelector('.results');

// Extract data from the DOM
const results = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('.result-item')).map(el => ({
    title: el.querySelector('h2')?.textContent,
    price: el.querySelector('.price')?.textContent,
  }));
});
Pros:
  • Simplest approach — uses Playwright as intended
  • No need to understand the site’s API structure
  • Works with any site regardless of how data is rendered (server-side, client-side, or hybrid)
  • Data extraction is visual/DOM-based, which maps naturally to what a user sees
  • Easy to debug with headless: false and Playwright’s trace viewer
  • Integrates directly with Libretto’s step-based workflow, recovery, and extraction features
Cons:
  • Slower than API-based approaches — requires full page rendering
  • Fragile against DOM changes — selectors break when the site updates its markup
  • Harder to get structured data — you’re scraping rendered HTML rather than clean API responses
  • Cannot access data that isn’t rendered in the DOM (e.g., API responses with fields the UI doesn’t display)
Bot detection risk: MODERATE. Plain Playwright is detectable by browser fingerprinting (Layer 1). Sites with any enterprise bot protection will likely flag it; sites without active detection won’t notice.
Use playwright-extra with the stealth plugin to patch common fingerprint leaks, or run Playwright with a persistent browser context that looks more like a real browser profile.
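A minimal sketch of both mitigations, assuming the playwright-extra and puppeteer-extra-plugin-stealth packages are installed (and that playwright-extra forwards launchPersistentContext, which you should verify for your version). The helper name is ours, not part of Libretto’s API:

```javascript
// Hypothetical helper: launch a hardened browser for scraping.
// Requires are lazy so this file still loads if the packages are absent.
async function launchStealthBrowser({ userDataDir } = {}) {
  const { chromium } = require('playwright-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  chromium.use(StealthPlugin()); // patches navigator.webdriver and other common leaks

  if (userDataDir) {
    // A persistent profile (cookies, history, local storage) looks more like
    // a real user's browser than a fresh incognito context.
    return chromium.launchPersistentContext(userDataDir, { headless: false });
  }
  return chromium.launch({ headless: false });
}
```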

Comparison matrix

| Criteria | Regular Playwright | Passive interception | In-browser fetch | Direct HTTP |
| --- | --- | --- | --- | --- |
| Bot detection risk | Moderate | Low | Low–Moderate | Very high |
| Browser fingerprint risk | Yes | Yes | Yes | N/A (wrong fingerprint) |
| Network fingerprint risk | None (browser requests) | None (browser requests) | None (browser requests) | Very high |
| API monitoring risk | None | None | Low (fetch patching) | N/A |
| Data quality | DOM-dependent | Structured JSON | Structured JSON | Structured JSON |
| Setup complexity | Low | Medium | Medium–High | Low–Medium |
| API reverse-engineering needed | No | Partial (identify endpoints) | Yes (full) | Yes (full) |
| Control over data fetching | Low | Low | High | High |
| Speed | Slow | Medium | Medium–Fast | Fast |
| Resource usage | High | High | High | Low |
| Resilience to DOM changes | Low | High | High | High |
| Resilience to API changes | Medium | Low | Low | Low |

Decision guide

Use Regular Playwright when:
  • The data you need is visible in the DOM and straightforward to extract with selectors
  • The site doesn’t have aggressive bot protection, or you’re using stealth plugins
  • You want the simplest implementation that integrates with Libretto’s recovery and extraction features
  • The data is rendered server-side and doesn’t come from a separate API call
Use passive interception (page.on('response')) when:
  • The site loads data via API calls during normal navigation (most modern SPAs)
  • You want structured JSON data without reverse-engineering the full API
  • Minimizing detection risk is important
  • You’re already navigating through the UI and want to passively capture data along the way
Use in-browser fetch (pageRequest()) when:
  • You need data from API endpoints that the UI doesn’t naturally trigger (e.g., deep pagination, bulk exports)
  • You’ve verified the site doesn’t monkey-patch fetch (or you can work around it)
  • You want maximum control over which data you fetch and when
  • You’ve already reverse-engineered the relevant API endpoints
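Conceptually, an in-browser fetch (what a helper like pageRequest() does) can be sketched with plain Playwright’s page.evaluate, so the request carries the site’s cookies, origin, and browser network fingerprint. The endpoint shape and parameter names below are assumptions — substitute whatever you reverse-engineered for your target:

```javascript
// Build a paginated API URL. The path and query parameters are hypothetical.
function buildPageUrl(base, page, pageSize = 100) {
  const url = new URL(base);
  url.searchParams.set('page', String(page));
  url.searchParams.set('pageSize', String(pageSize));
  return url.toString();
}

// Run fetch *inside* the page context. `page` is a Playwright Page; the
// request is indistinguishable at the network layer from the site's own calls.
async function fetchFromPage(page, url) {
  return page.evaluate(async (u) => {
    const res = await fetch(u, { headers: { accept: 'application/json' } });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return res.json();
  }, url);
}
```

For deep pagination, loop over buildPageUrl(base, n) and call fetchFromPage once per page, ideally with a small delay between requests.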
Use Direct Node.js HTTP when:
  • The target site has zero bot detection
  • Speed and resource efficiency are the primary concerns
  • You’re hitting a public/documented API (not scraping a website)
  • You need to make thousands of concurrent requests
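A minimal sketch of the direct approach using Node 18+’s built-in fetch; the endpoint and pagination parameters are placeholders for a real documented API:

```javascript
// Query a documented JSON API directly from Node.js — no browser involved.
// The base URL and query parameter names are placeholders.
async function fetchJsonPage(baseUrl, page, perPage = 50) {
  const url = new URL(baseUrl);
  url.searchParams.set('page', String(page));
  url.searchParams.set('per_page', String(perPage));

  const res = await fetch(url, { headers: { accept: 'application/json' } });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.json();
}
```

Because there is no page rendering, you can run many of these concurrently (e.g. with Promise.all over a batch of page numbers) at a fraction of the resource cost of a browser.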

For most browser automation workflows, combine Approach 1 and Approach 2: use Regular Playwright to navigate and interact with the site (handling popups, login flows, and anything requiring UI interaction with Libretto’s recovery features), and passively intercept API responses with page.on('response') to capture structured data. This gives you the reliability of browser-based navigation with the data quality of API responses, at minimal detection risk.
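A minimal sketch of this hybrid, assuming the target’s data endpoints live under a /api/ path (the URL filter and content-type check are assumptions — tune both to your site):

```javascript
// Heuristic: which responses are worth capturing?
function isCapturableResponse(url, contentType, urlFilter = '/api/') {
  return url.includes(urlFilter) && /\bapplication\/json\b/.test(contentType || '');
}

// Attach a passive listener before navigating. It records matching JSON
// bodies into `captured` while ordinary Playwright steps drive the UI.
function captureApiResponses(page, captured, urlFilter) {
  page.on('response', async (response) => {
    const ct = response.headers()['content-type'];
    if (!isCapturableResponse(response.url(), ct, urlFilter)) return;
    try {
      captured.push({ url: response.url(), body: await response.json() });
    } catch {
      // Body may be unavailable (e.g. for redirects) — skip quietly.
    }
  });
}
```

In practice you would call captureApiResponses(page, results) before page.goto(...), run your normal navigation steps, then read results once the page settles.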
