Python API

The Python API provides two main functions for fetching web content: fetch_as_markdown() for general webpage fetching and fetch_api_spec() for API documentation.

Importing

Import the functions directly from the script:

from scripts.fetch_as_markdown import fetch_as_markdown, fetch_api_spec

fetch_as_markdown

Function Signature

def fetch_as_markdown(url: str, playwright_first: bool = False) -> str:
    """
    Fetch a URL and return clean markdown.

    Args:
        url:              Full URL including scheme (https://...)
        playwright_first: Skip static fetch; go straight to headless browser.
                          Use for known JS-heavy targets (SPAs, Swagger UI, etc.)

    Returns:
        Clean markdown string, or an error message prefixed with "ERROR:"
    """

Parameters

url (str, required)
Full URL including the scheme (must start with https:// or http://).

playwright_first (bool, default: False)
When True, skips the fast static HTTP request and goes directly to the headless browser. Use this for known JavaScript-heavy targets like SPAs, Swagger UI, or React documentation sites.
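Since url must carry an explicit scheme, a caller may want to check it before fetching. The following is a minimal sketch using only the standard library; has_http_scheme is a hypothetical helper name and is not part of the documented API.

```python
from urllib.parse import urlparse


def has_http_scheme(url: str) -> bool:
    """Return True if url has an http(s) scheme and a host part.

    Hypothetical pre-check; the library's own validation may differ.
    """
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(has_http_scheme("https://docs.example.com/api"))
print(has_http_scheme("docs.example.com/api"))  # no scheme, so rejected
```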

Return Value

Returns a string containing:

Clean markdown on success (60-80% fewer tokens than raw HTML)
Error message prefixed with "ERROR:" on failure

Errors are returned as strings rather than raised as exceptions, so agents can handle them inline without try/except blocks.
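Callers that prefer structured results or exceptions can wrap the string convention themselves. The sketch below assumes only the documented "ERROR:" prefix; split_result, raise_on_error, and FetchError are hypothetical names introduced for illustration.

```python
ERROR_PREFIX = "ERROR:"


class FetchError(RuntimeError):
    """Hypothetical exception type; not part of the documented API."""


def split_result(result: str) -> tuple[bool, str]:
    """Return (ok, payload): markdown on success, the bare message on failure."""
    if result.startswith(ERROR_PREFIX):
        return False, result[len(ERROR_PREFIX):].strip()
    return True, result


def raise_on_error(result: str) -> str:
    """Convert an "ERROR:" string into an exception for code that prefers raising."""
    ok, payload = split_result(result)
    if not ok:
        raise FetchError(payload)
    return payload
```

An agent loop can keep using the inline string check, while library-style code can call raise_on_error() on the same return value.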

Basic Usage

from scripts.fetch_as_markdown import fetch_as_markdown

markdown = fetch_as_markdown("https://docs.example.com/api")
print(markdown)

How It Works

The function uses a two-stage fetch strategy:

1. Static fetch (default, ~1s)
   - Fast HTTP request with browser-like User-Agent
   - Applies readability algorithm to strip nav/ads/sidebars
   - Converts to markdown
   - If ≥200 chars of real text, returns immediately
2. Playwright fallback (if content is thin, ~5-8s)
   - Launches headless Chromium
   - Waits for network idle + 3 seconds for JS frameworks
   - Same readability → markdown pipeline
   - If still thin, returns error message

The 200-character threshold (after whitespace collapse) catches JavaScript-gated shells without falsely flagging legitimately short pages.
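The control flow described above can be sketched as follows. This is an illustration under stated assumptions, not the library's actual source: static_fetch and browser_fetch are hypothetical injected callables standing in for the two stages, and the error strings are abbreviated placeholders.

```python
import re
from typing import Callable, Optional

THIN_THRESHOLD = 200  # characters of text after whitespace collapse, per the docs


def is_thin(text: str, threshold: int = THIN_THRESHOLD) -> bool:
    """Collapse runs of whitespace, then compare the length to the threshold."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return len(collapsed) < threshold


def two_stage_fetch(
    url: str,
    static_fetch: Callable[[str], str],             # hypothetical stage 1
    browser_fetch: Optional[Callable[[str], str]],  # hypothetical stage 2, None if unavailable
    playwright_first: bool = False,
) -> str:
    """Try the cheap static fetch first; fall back to the browser if content is thin."""
    if not playwright_first:
        markdown = static_fetch(url)
        if not is_thin(markdown):
            return markdown  # enough real text, skip the browser entirely
    if browser_fetch is None:
        return "ERROR: Page appears JavaScript-rendered but Playwright is not installed."
    markdown = browser_fetch(url)
    if is_thin(markdown):
        return f"ERROR: Fetched {url} but content is still thin after rendering."
    return markdown
```

Collapsing whitespace before measuring means a page that is mostly blank lines and indentation still counts as thin.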

Error Messages

Missing Playwright

ERROR: Page appears JavaScript-rendered but Playwright is not installed.
Run: pip install playwright && playwright install chromium

Returned when static fetch gets thin content but Playwright isn’t available for the browser fallback.

Fetch Failed

ERROR: Could not fetch https://example.com. The page may require
authentication or block automated access.

Returned when the headless browser cannot access the page (network error, bot detection, etc.).

Login Wall

ERROR: Fetched https://example.com but content appears to be behind a
login wall or requires user interaction that cannot be automated.

Returned when Playwright successfully loads the page but still gets thin/empty content after rendering.
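An agent that wants to react differently to each failure can classify the returned string. The sketch below matches on substrings of the three messages shown above; it assumes that wording is stable, which the documentation does not promise, and FetchErrorKind and classify_error are hypothetical names.

```python
from enum import Enum
from typing import Optional


class FetchErrorKind(Enum):
    MISSING_PLAYWRIGHT = "missing_playwright"
    FETCH_FAILED = "fetch_failed"
    LOGIN_WALL = "login_wall"
    UNKNOWN = "unknown"


# Substrings taken from the documented messages (assumed stable).
_MARKERS = (
    ("Playwright is not installed", FetchErrorKind.MISSING_PLAYWRIGHT),
    ("Could not fetch", FetchErrorKind.FETCH_FAILED),
    ("login wall", FetchErrorKind.LOGIN_WALL),
)


def classify_error(result: str) -> Optional[FetchErrorKind]:
    """Return None for successful results, otherwise the best-matching error kind."""
    if not result.startswith("ERROR:"):
        return None
    # Messages may wrap across lines, so normalize whitespace before matching.
    normalized = " ".join(result.split())
    for marker, kind in _MARKERS:
        if marker in normalized:
            return kind
    return FetchErrorKind.UNKNOWN
```

A caller might, for example, prompt for installation on MISSING_PLAYWRIGHT and skip the URL on LOGIN_WALL.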

fetch_api_spec

Function Signature

def fetch_api_spec(url: str) -> str:
    """
    Fetch API documentation or an OpenAPI/Swagger spec.

    Checks the Content-Type header first — if the server returns raw JSON or YAML,
    that's returned directly since agents can often work with OpenAPI specs natively
    without needing markdown conversion. Falls back to fetch_as_markdown otherwise.

    Args:
        url: URL of the API docs page or raw spec file

    Returns:
        Raw spec (JSON/YAML) or clean markdown of the docs page
    """

Parameters

url (str, required)
URL of the API documentation page or raw OpenAPI/Swagger spec file.

Return Value

Returns a string containing:

Raw JSON/YAML if the server returns Content-Type: application/json, application/yaml, or text/yaml
Clean markdown from fetch_as_markdown() otherwise
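The Content-Type decision can be illustrated with a small check. This is a sketch of the rule as described, not the library's code: is_raw_spec is a hypothetical name, and how the real function treats header parameters such as charset is an assumption.

```python
# Media types the docs list as "raw spec" responses.
RAW_SPEC_TYPES = {"application/json", "application/yaml", "text/yaml"}


def is_raw_spec(content_type: str) -> bool:
    """Return True if the header's media type is one of the raw spec types.

    Strips parameters such as "; charset=utf-8" and ignores case (assumed behavior).
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in RAW_SPEC_TYPES


print(is_raw_spec("application/json; charset=utf-8"))
print(is_raw_spec("text/html"))  # HTML would go through markdown conversion instead
```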

Usage Examples

import json

from scripts.fetch_as_markdown import fetch_api_spec

# Server returns application/json → gets raw JSON
spec = fetch_api_spec("https://api.example.com/openapi.json")

data = json.loads(spec)  # Direct parsing works when the response is raw JSON
print(f"API version: {data['info']['version']}")
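Because fetch_api_spec() may return markdown or an "ERROR:" string instead of JSON, calling json.loads() unconditionally can raise. A defensive variant might look like the sketch below; try_parse_json_spec is a hypothetical helper, and YAML specs are deliberately left as text because parsing them needs a third-party library.

```python
import json
from typing import Any, Optional


def try_parse_json_spec(result: str) -> Optional[dict[str, Any]]:
    """Return the parsed spec if result is a JSON object, otherwise None.

    None means the caller should treat result as markdown, YAML, or an error string.
    """
    if result.startswith("ERROR:"):
        return None
    try:
        data = json.loads(result)
    except json.JSONDecodeError:
        return None
    # An OpenAPI document is a JSON object at the top level.
    return data if isinstance(data, dict) else None


spec = try_parse_json_spec('{"info": {"version": "1.0.0"}}')
if spec is not None:
    print(f"API version: {spec['info']['version']}")
```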

When to Use

Use fetch_api_spec() instead of fetch_as_markdown() when:

Fetching OpenAPI/Swagger specifications
The target URL might return raw JSON/YAML
You want agents to work with native spec formats
Dealing with API documentation that may be in multiple formats

If fetch_api_spec() falls back to markdown conversion (because the server returned HTML), it automatically uses playwright_first=True behavior since API docs are often JavaScript-rendered.

Advanced Examples

Batch Processing

from scripts.fetch_as_markdown import fetch_as_markdown
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/guides",
]

def fetch_and_save(url: str) -> None:
    filename = url.rstrip("/").split("/")[-1] + ".md"  # tolerate a trailing slash
    markdown = fetch_as_markdown(url)
    
    if not markdown.startswith("ERROR:"):
        with open(filename, "w", encoding="utf-8") as f:
            f.write(markdown)
        print(f"✓ Saved {filename}")
    else:
        print(f"✗ Failed: {url} - {markdown}")

with ThreadPoolExecutor(max_workers=3) as executor:
    list(executor.map(fetch_and_save, urls))  # consume results so worker exceptions surface
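Deriving the filename from the last path segment alone can collide when two URLs end in the same segment. One possible alternative, sketched here with a hypothetical url_to_filename helper, builds the name from the host and the full path; query strings are ignored in this version.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url: str, suffix: str = ".md") -> str:
    """Build a filesystem-friendly name from a URL's host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Replace anything outside a conservative character set with a dash.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", raw).strip("-")
    return (slug or "index") + suffix


print(url_to_filename("https://docs.example.com/guides/"))
```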

Integration with Agent Systems

from scripts.fetch_as_markdown import fetch_as_markdown, fetch_api_spec

class WebResearchAgent:
    def fetch_documentation(self, url: str, is_api: bool = False) -> dict:
        """
        Fetch documentation and return structured result.
        """
        if is_api:
            content = fetch_api_spec(url)
        else:
            content = fetch_as_markdown(url)
        
        return {
            "url": url,
            "success": not content.startswith("ERROR:"),
            "content": content,
            # Whitespace-separated word count: a rough proxy, not a real tokenizer count
            "tokens": len(content.split()),
        }

agent = WebResearchAgent()
result = agent.fetch_documentation("https://docs.python.org/3/library/asyncio.html")

if result["success"]:
    print(f"Fetched {result['tokens']} tokens from {result['url']}")

Custom Content Validation

from scripts.fetch_as_markdown import fetch_as_markdown

def fetch_with_validation(url: str, required_keywords: list[str]) -> str:
    """
    Fetch markdown and validate it contains expected content.
    """
    markdown = fetch_as_markdown(url)
    
    if markdown.startswith("ERROR:"):
        return markdown
    
    # Check for required keywords
    missing = [kw for kw in required_keywords if kw.lower() not in markdown.lower()]
    
    if missing:
        return f"ERROR: Content missing required keywords: {', '.join(missing)}"
    
    return markdown

# Validate API docs contain authentication info
docs = fetch_with_validation(
    "https://api.example.com/docs",
    required_keywords=["authentication", "API key", "authorization"]
)

Dependencies

Ensure these packages are installed:

pip install requests readability-lxml html2text playwright
playwright install chromium  # ~200MB one-time download

Playwright is optional. If a JavaScript-rendered page is encountered without it, you’ll get a clear error message telling you exactly what to install.
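To find out ahead of time whether the browser fallback is available, a caller can check for the package without importing it. This sketch uses the standard library's importlib.util.find_spec; module_available is a hypothetical helper, and the check covers only the Python package, not whether the Chromium binary has been downloaded.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if not module_available("playwright"):
    print("Playwright not installed; JavaScript-rendered pages will return an ERROR string.")
```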
