Error Handling

Overview

web-to-markdown uses an error-as-string pattern instead of raising exceptions. All errors are returned as strings prefixed with "ERROR:", so agents can detect and handle them inline without try/except blocks.

The Error-as-String Pattern

How It Works

Instead of this:

try:
    markdown = fetch_as_markdown(url)
    process(markdown)
except FetchError as e:
    print(f"Failed: {e}")

You write this:

markdown = fetch_as_markdown(url)
if markdown.startswith("ERROR:"):
    print(f"Failed: {markdown}")
else:
    process(markdown)

Why This Pattern?

This design is optimized for agent workflows where:

Agents don’t handle exceptions well: Most LLM-based agents struggle with try/except control flow
Errors are data: Error messages are useful context that agents can reason about
No silent failures: Unlike None or empty strings, “ERROR:” strings are explicit and detectable
Simplified integration: Framework tools can return error strings without special exception handling

The error-as-string pattern follows the principle that errors should be values, not exceptions, when the caller always needs to handle them.
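To keep the prefix check in one place, a caller could wrap it in a small helper that splits a result into a success flag and a payload. This is a minimal sketch only: `split_result` and `ERROR_PREFIX` are hypothetical names and are not part of web-to-markdown.

```python
ERROR_PREFIX = "ERROR:"  # prefix documented above; the constant name is hypothetical


def split_result(result: str) -> tuple[bool, str]:
    """Return (ok, payload) for a string that follows the error-as-string convention.

    On success the payload is the markdown, unchanged. On failure it is the
    error message with the "ERROR:" prefix and surrounding whitespace removed.
    """
    if result.startswith(ERROR_PREFIX):
        return False, result[len(ERROR_PREFIX):].strip()
    return True, result


ok, payload = split_result("ERROR: Could not fetch https://example.com.")
print(ok, payload)
```

Only a leading "ERROR:" counts as a failure, so markdown that merely mentions the word somewhere in its body is still treated as a success.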

Implementation Details

From fetch_as_markdown.py:

def fetch_as_markdown(url: str, playwright_first: bool = False) -> str:
    """
    Returns:
        Clean markdown string, or an error message prefixed with "ERROR:"
    """
    html = None

    if not playwright_first:
        html = _static_fetch(url)
        if html:
            md = _clean_markdown(_html_to_markdown(_extract_main_content(html)))
            if not _is_thin_content(md):
                return md  # ✓ Success case
            html = None

    html = _playwright_fetch(url)

    if not html:
        if not PLAYWRIGHT_AVAILABLE:
            return (  # ✗ Error case 1
                "ERROR: Page appears JavaScript-rendered but Playwright is not installed. "
                "Run: pip install playwright && playwright install chromium"
            )
        return f"ERROR: Could not fetch {url}. The page may require authentication or block automated access."  # ✗ Error case 2

    md = _clean_markdown(_html_to_markdown(_extract_main_content(html)))

    if _is_thin_content(md):
        md = _clean_markdown(_html_to_markdown(html))

    if _is_thin_content(md):
        return (  # ✗ Error case 3
            f"ERROR: Fetched {url} but content appears to be behind a login wall "
            "or requires user interaction that cannot be automated."
        )

    return md  # ✓ Success case

Notice:

No exceptions raised: All error paths return strings
Consistent prefix: All errors start with "ERROR:"
Actionable messages: Errors explain what happened and what to do

Common Error Scenarios

1. Playwright Not Installed

When it happens:

Page requires JavaScript rendering
Static fetch returns thin content (<200 chars)
Playwright library not installed or Chromium not downloaded

Error message:

ERROR: Page appears JavaScript-rendered but Playwright is not installed. Run: pip install playwright && playwright install chromium

Code path:

html = _playwright_fetch(url)

if not html:
    if not PLAYWRIGHT_AVAILABLE:  # ← Triggers here
        return "ERROR: Page appears JavaScript-rendered but Playwright is not installed..."

How to fix:

pip install playwright
playwright install chromium  # ~200MB download

Example:

from scripts.fetch_as_markdown import fetch_as_markdown

result = fetch_as_markdown("https://petstore.swagger.io")

if result.startswith("ERROR:"):
    if "Playwright is not installed" in result:
        print("Please install Playwright to fetch JS-rendered pages")
        print("Run: pip install playwright && playwright install chromium")
    else:
        print(f"Fetch failed: {result}")  # Any other error
else:
    print(result)  # Success: process markdown

Playwright is optional for static pages but required for JavaScript-rendered content. By default, the library attempts a static fetch first and falls back to Playwright only when that fetch fails or returns thin content.

2. Login Walls and Authentication

When it happens:

Page requires user login
Content behind authentication
Both static and Playwright fetches return thin content
Page loaded but shows “Please sign in” message

Error message:

ERROR: Fetched {url} but content appears to be behind a login wall or requires user interaction that cannot be automated.

Code path:

html = _playwright_fetch(url)  # Successfully fetches page

if html:
    md = _clean_markdown(_html_to_markdown(_extract_main_content(html)))
    
    if _is_thin_content(md):  # Less than 200 chars
        md = _clean_markdown(_html_to_markdown(html))  # Try raw HTML
    
    if _is_thin_content(md):  # ← Still thin, triggers error
        return "ERROR: ... content appears to be behind a login wall ..."

Example scenarios:

# Private GitHub repo
fetch_as_markdown("https://github.com/private-org/private-repo")
# ERROR: ... behind a login wall ...

# Paywalled article
fetch_as_markdown("https://premium-news.com/article")
# ERROR: ... behind a login wall ...

# Admin dashboard
fetch_as_markdown("https://app.example.com/admin")
# ERROR: ... behind a login wall ...

Workarounds:

Use authenticated HTTP requests (not supported by default):

# Custom implementation with authentication
import requests
from scripts.fetch_as_markdown import _extract_main_content, _html_to_markdown, _clean_markdown

def fetch_authenticated(url: str, auth_token: str) -> str:
    r = requests.get(url, headers={"Authorization": f"Bearer {auth_token}"})
    r.raise_for_status()
    html = r.text
    return _clean_markdown(_html_to_markdown(_extract_main_content(html)))

Request public documentation instead of authenticated pages
Use API endpoints that don’t require browser-based auth

The library intentionally does not support authentication to keep the API simple and secure. For authenticated content, wrap the underlying functions with your own auth logic.
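A custom wrapper like `fetch_authenticated` raises on HTTP errors, which breaks the error-as-string convention for callers. One way to restore it is a decorator that converts exceptions into "ERROR:" strings. The sketch below is an assumption-laden illustration: `errors_as_strings` and `fetch_authenticated_stub` are hypothetical names, and the stub stands in for a real authenticated request so the example runs without network access.

```python
import functools
from typing import Callable


def errors_as_strings(func: Callable[..., str]) -> Callable[..., str]:
    """Wrap func so that any exception becomes an "ERROR:"-prefixed string."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> str:
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # broad on purpose, like the library's own fetch helpers
            return f"ERROR: {type(exc).__name__}: {exc}"

    return wrapper


@errors_as_strings
def fetch_authenticated_stub(url: str, auth_token: str) -> str:
    # Hypothetical stand-in for fetch_authenticated: raises the way
    # r.raise_for_status() would on a 401 response.
    if not auth_token:
        raise PermissionError(f"401 Unauthorized for {url}")
    return "# Private page"


print(fetch_authenticated_stub("https://app.example.com/admin", ""))
```

Applied to a real authenticated fetcher, the same decorator would let agents treat custom auth failures exactly like the library's built-in errors.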

3. Bot Detection and Rate Limiting

When it happens:

Website blocks automated requests
Cloudflare or similar bot protection
Rate limiting after multiple requests
CAPTCHA challenge presented

Error message:

ERROR: Could not fetch {url}. The page may require authentication or block automated access.

Code path:

html = _playwright_fetch(url)

if not html:  # ← Playwright returned None (fetch failed)
    if not PLAYWRIGHT_AVAILABLE:
        return "ERROR: ... Playwright is not installed ..."
    return f"ERROR: Could not fetch {url}. The page may require authentication or block automated access."

Example scenarios:

# Cloudflare protected site
fetch_as_markdown("https://protected.example.com")
# ERROR: Could not fetch ... may require authentication or block automated access.

# Rate limited after many requests
for i in range(100):
    fetch_as_markdown(f"https://api.example.com/docs/endpoint-{i}")
# Eventually: ERROR: Could not fetch ... block automated access.

Mitigation strategies:

Add delays between requests:

import time

for url in urls:
    markdown = fetch_as_markdown(url)
    time.sleep(2)  # Be respectful of rate limits

Use playwright_first for consistent User-Agent:

fetch_as_markdown(url, playwright_first=True)
# Playwright's Chromium appears more like a real browser

Request API access instead of scraping:

# Better: Use official API
fetch_api_spec("https://api.example.com/openapi.json")

If you’re repeatedly fetching from the same domain, add a 1-2 second delay between requests to avoid triggering rate limits.
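When a fixed delay is not enough, retrying with exponentially growing waits is a common alternative. The following is a sketch under stated assumptions, not library behavior: `fetch_with_backoff` is a hypothetical helper, and the fetcher and sleep function are passed in as parameters so the logic can be tested without network access or real waiting.

```python
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url) until it succeeds or the retries are exhausted.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts and
    returns the last result, which may still be an "ERROR:" string.
    """
    result = fetch(url)
    for attempt in range(retries):
        if not result.startswith("ERROR:"):
            break
        sleep(base_delay * (2 ** attempt))
        result = fetch(url)
    return result
```

Typical usage would be `fetch_with_backoff(fetch_as_markdown, url)`. Retrying cannot help with the "Playwright is not installed" error, so a caller may want to check for that message before retrying.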

4. Network and Timeout Errors

When it happens:

DNS resolution fails
Connection timeout (>15s for static, >30s for Playwright)
SSL/TLS certificate errors
Server returns 4xx/5xx status codes

Error message:

ERROR: Could not fetch {url}. The page may require authentication or block automated access.

Implementation:

def _static_fetch(url: str, timeout: int = 15) -> str | None:
    try:
        r = requests.get(url, headers={...}, timeout=timeout)
        r.raise_for_status()  # Raises for 4xx/5xx
        return r.text
    except Exception:  # ← Catches all network errors
        return None

def _playwright_fetch(url: str, wait_ms: int = 3000) -> str | None:
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle", timeout=30000)  # 30s timeout
            # ...
    except Exception:  # ← Catches all Playwright errors
        return None

Example scenarios:

# Invalid URL
fetch_as_markdown("https://this-domain-does-not-exist-12345.com")
# ERROR: Could not fetch ...

# 404 Not Found
fetch_as_markdown("https://github.com/user/nonexistent-repo")
# ERROR: Could not fetch ...

# SSL error
fetch_as_markdown("https://expired-ssl-cert.example.com")
# ERROR: Could not fetch ...

The library intentionally groups all network errors into a single error message to keep the API simple. Detailed error logging would require exception handling, which breaks the error-as-string pattern.
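Because every network failure maps to the same message, callers who want a more specific diagnosis can validate the URL themselves before fetching. The sketch below uses only the standard library; `precheck_url` is a hypothetical helper and its error wording is invented for illustration, not produced by web-to-markdown.

```python
from typing import Optional
from urllib.parse import urlsplit


def precheck_url(url: str) -> Optional[str]:
    """Return an "ERROR:" string for obviously malformed URLs, else None."""
    try:
        parts = urlsplit(url)
    except ValueError as exc:  # e.g. a malformed IPv6 host
        return f"ERROR: Malformed URL {url!r}: {exc}"
    if parts.scheme not in ("http", "https"):
        return f"ERROR: Unsupported or missing URL scheme in {url!r}. Use http:// or https://."
    if not parts.hostname:
        return f"ERROR: No hostname found in {url!r}."
    return None


# Run the cheap check first; only fetch when it passes.
problem = precheck_url("example.com/docs")
print(problem if problem else "URL looks fetchable")
```

This catches only syntactic problems. DNS failures, timeouts, and HTTP status errors still surface as the generic "Could not fetch" message.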

Error Detection Patterns

Simple Check

markdown = fetch_as_markdown(url)

if markdown.startswith("ERROR:"):
    handle_error(markdown)
else:
    process_markdown(markdown)

Specific Error Handling

markdown = fetch_as_markdown(url)

# Retry once if the only problem was a missing Playwright install
if markdown.startswith("ERROR:") and "Playwright is not installed" in markdown:
    install_playwright()
    markdown = fetch_as_markdown(url)

if not markdown.startswith("ERROR:"):
    process_markdown(markdown)
elif "login wall" in markdown:
    log_skipped(url, "requires authentication")
else:
    log_error(url, markdown)

Framework Integration

LangChain

from langchain.tools import tool
from scripts.fetch_as_markdown import fetch_as_markdown

@tool
def fetch_page(url: str) -> str:
    """Fetch webpage and return markdown. Errors are returned as strings."""
    result = fetch_as_markdown(url)
    # Agent can see and reason about error messages
    return result

CrewAI

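from crewai.tools import BaseTool
from scripts.fetch_as_markdown import fetch_as_markdown

class FetchTool(BaseTool):
    # BaseTool requires a name and a description
    name: str = "fetch_page"
    description: str = "Fetch a webpage and return it as markdown."

    def _run(self, url: str) -> str:
        result = fetch_as_markdown(url)
        if result.startswith("ERROR:"):
            # Optionally log or transform errors
            return f"Failed to fetch {url}: {result}"
        return result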

Agno

from agno.agent import Agent
from scripts.agno_toolkit import WebToMarkdownTools

agent = Agent(tools=[WebToMarkdownTools()])

# Tool returns error strings directly
# Agent can read and respond to error messages

Best Practices

1. Always Check for Errors

# ✓ Good
markdown = fetch_as_markdown(url)
if markdown.startswith("ERROR:"):
    return None
process(markdown)

# ✗ Bad — assumes success
markdown = fetch_as_markdown(url)
process(markdown)  # Might process "ERROR: ..." string

2. Log Errors for Debugging

import logging

markdown = fetch_as_markdown(url)
if markdown.startswith("ERROR:"):
    logging.warning(f"Failed to fetch {url}: {markdown}")
    return None

3. Retry with Different Strategy

def fetch_with_retry(url: str) -> str:
    # Try default strategy first
    markdown = fetch_as_markdown(url)

    if not markdown.startswith("ERROR:"):
        return markdown
    if "Playwright is not installed" in markdown:
        # Playwright needed but not available, so a retry cannot help
        return markdown
    # Other error: try playwright_first
    return fetch_as_markdown(url, playwright_first=True)

4. Graceful Degradation

urls = [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3",
]

results = []
for url in urls:
    markdown = fetch_as_markdown(url)
    if not markdown.startswith("ERROR:"):
        results.append(markdown)
    else:
        print(f"Skipping {url}: {markdown}")

# Continue processing with whatever succeeded
process_batch(results)
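The loop above discards which URLs failed and why. A variant that keeps both outcomes, keyed by URL, makes it easier to report or retry failures later. This is a sketch with hypothetical names: `fetch_batch` is not part of the library, and the fetcher is passed in as a parameter so the example is self-contained.

```python
from typing import Callable


def fetch_batch(
    fetch: Callable[[str], str], urls: list[str]
) -> tuple[dict[str, str], dict[str, str]]:
    """Fetch every URL and return (successes, failures), both keyed by URL."""
    successes: dict[str, str] = {}
    failures: dict[str, str] = {}
    for url in urls:
        result = fetch(url)
        # Route each result by the documented "ERROR:" prefix
        target = failures if result.startswith("ERROR:") else successes
        target[url] = result
    return successes, failures
```

With the real library this would be called as `fetch_batch(fetch_as_markdown, urls)`, after which the `failures` dict can be logged or passed to a retry step.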

Error Message Reference

Error	Cause	Solution
`ERROR: Page appears JavaScript-rendered but Playwright is not installed...`	JS-rendered page, Playwright missing	`pip install playwright && playwright install chromium`
`ERROR: Could not fetch {url}. The page may require authentication or block automated access.`	Network error, bot block, or rate limit	Check URL, add delays, or use API
`ERROR: Fetched {url} but content appears to be behind a login wall...`	Authentication required	Use public docs or implement custom auth
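The three messages in the table can be mapped to an enum so that calling code branches on a symbol instead of repeating substring checks. The sketch below assumes the message wording shown above stays stable; `ErrorKind` and `classify` are hypothetical names, not part of web-to-markdown.

```python
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    # Values are substrings taken from the documented error messages
    PLAYWRIGHT_MISSING = "Playwright is not installed"
    LOGIN_WALL = "behind a login wall"
    FETCH_FAILED = "Could not fetch"
    UNKNOWN = ""


def classify(result: str) -> Optional[ErrorKind]:
    """Map a result string to an ErrorKind, or None if it is not an error."""
    if not result.startswith("ERROR:"):
        return None
    for kind in (ErrorKind.PLAYWRIGHT_MISSING, ErrorKind.LOGIN_WALL, ErrorKind.FETCH_FAILED):
        if kind.value in result:
            return kind
    return ErrorKind.UNKNOWN
```

Checking the "ERROR:" prefix first matters: a successfully fetched page that happens to quote one of these phrases is not misclassified as a failure.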

Debugging Tips

Enable Verbose Logging

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Wrap fetch_as_markdown with debug logging
def fetch_with_debug(url: str, **kwargs) -> str:
    logger.debug("Fetching %s with %s", url, kwargs)
    result = fetch_as_markdown(url, **kwargs)
    if result.startswith("ERROR:"):
        logger.debug("Error: %s", result)
    else:
        logger.debug("Success: %d chars", len(result))
    return result

Test in CLI First

# See raw output and errors
python scripts/fetch_as_markdown.py https://problematic-url.com

# Try playwright_first
python scripts/fetch_as_markdown.py https://problematic-url.com --playwright-first

Check Character Count

markdown = fetch_as_markdown(url)
if not markdown.startswith("ERROR:"):
    chars = len(markdown.replace(" ", "").replace("\n", ""))
    print(f"Content: {chars} chars (threshold: 200)")
