The Python API provides two main functions for fetching web content: fetch_as_markdown() for general webpage fetching and fetch_api_spec() for API documentation.

Importing

Import the functions directly from the script:
from scripts.fetch_as_markdown import fetch_as_markdown, fetch_api_spec

fetch_as_markdown

Function Signature

def fetch_as_markdown(url: str, playwright_first: bool = False) -> str:
    """
    Fetch a URL and return clean markdown.

    Args:
        url:              Full URL including scheme (https://...)
        playwright_first: Skip static fetch; go straight to headless browser.
                          Use for known JS-heavy targets (SPAs, Swagger UI, etc.)

    Returns:
        Clean markdown string, or an error message prefixed with "ERROR:"
    """

Parameters

url
str
required
Full URL including the scheme (must start with https:// or http://)
playwright_first
bool
default: False
When True, skips the fast static HTTP request and goes directly to the headless browser. Use this for known JavaScript-heavy targets like SPAs, Swagger UI, or React documentation sites.
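
One way to decide when to set playwright_first is to keep a small set of hosts known to render client-side. The sketch below is only an illustration: `JS_HEAVY_HOSTS` and `should_use_browser_first` are hypothetical names that are not part of this library, and the hostnames are placeholders.

```python
from urllib.parse import urlparse

# Hypothetical set of hosts; replace with sites you know are JavaScript-rendered.
JS_HEAVY_HOSTS = frozenset({"app.example.com", "swagger.example.com"})


def should_use_browser_first(url: str, js_heavy_hosts=JS_HEAVY_HOSTS) -> bool:
    """Return True when the URL's host is in the known JS-heavy set."""
    # urlparse(...).hostname is already lower-cased; it is None for malformed URLs.
    host = urlparse(url).hostname or ""
    return host in js_heavy_hosts
```

The result could then be passed along as `fetch_as_markdown(url, playwright_first=should_use_browser_first(url))`, so only the listed hosts pay the slower browser startup cost.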

Return Value

Returns a string containing:
  • Clean markdown on success (60-80% fewer tokens than raw HTML)
  • Error message prefixed with "ERROR:" on failure
Errors are returned as strings rather than raised as exceptions, so agents can handle them inline without try/except blocks.
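
Because success and failure share one return type, callers may find it convenient to split the string into a flag and a payload. This is a minimal sketch that relies only on the documented "ERROR:" prefix; `FetchResult` and `parse_result` are hypothetical helpers, not part of the library.

```python
from typing import NamedTuple

ERROR_PREFIX = "ERROR:"  # prefix documented for failed fetches


class FetchResult(NamedTuple):
    ok: bool
    content: str  # markdown on success; error text without the prefix on failure


def parse_result(raw: str) -> FetchResult:
    """Split a returned string into a success flag and its payload."""
    if raw.startswith(ERROR_PREFIX):
        return FetchResult(False, raw[len(ERROR_PREFIX):].strip())
    return FetchResult(True, raw)
```

A caller would wrap the fetch, for example `result = parse_result(fetch_as_markdown(url))`, and branch on `result.ok`.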

Basic Usage

from scripts.fetch_as_markdown import fetch_as_markdown

markdown = fetch_as_markdown("https://docs.example.com/api")
print(markdown)

How It Works

The function uses a two-stage fetch strategy:
  1. Static fetch (default, ~1s)
    • Fast HTTP request with browser-like User-Agent
    • Applies readability algorithm to strip nav/ads/sidebars
    • Converts to markdown
    • If ≥200 chars of real text, returns immediately
  2. Playwright fallback (if content is thin, ~5-8s)
    • Launches headless Chromium
    • Waits for network idle + 3 seconds for JS frameworks
    • Same readability → markdown pipeline
    • If still thin, returns error message
The 200-character threshold (after whitespace collapse) catches JavaScript-gated shells without falsely flagging legitimately short pages.
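
The two-stage strategy and the thin-content check can be sketched roughly as follows. This is not the library's actual implementation: the fetchers are injected as plain callables, the threshold is measured on the markdown string (an assumption), and the error wording is placeholder text.

```python
import re
from typing import Callable, Optional

MIN_TEXT_CHARS = 200  # threshold quoted in the text above

Fetcher = Callable[[str], str]  # hypothetical: takes a URL, returns markdown


def has_enough_text(markdown: str, minimum: int = MIN_TEXT_CHARS) -> bool:
    """Collapse whitespace runs, then compare the remaining length to the threshold."""
    collapsed = re.sub(r"\s+", " ", markdown).strip()
    return len(collapsed) >= minimum


def two_stage_fetch(
    url: str,
    static_fetch: Fetcher,
    browser_fetch: Optional[Fetcher] = None,
    browser_first: bool = False,
) -> str:
    """Try the cheap fetcher first; fall back to the browser fetcher if the result is thin."""
    if not browser_first:
        markdown = static_fetch(url)
        if has_enough_text(markdown):
            return markdown
    if browser_fetch is None:
        # Placeholder wording, not the library's actual message
        return "ERROR: content is thin and no browser fetcher is available"
    markdown = browser_fetch(url)
    if has_enough_text(markdown):
        return markdown
    return "ERROR: content is still thin after browser rendering"
```

Collapsing whitespace before measuring keeps a page that is mostly blank lines and indentation from passing the check on raw length alone.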

Error Messages

ERROR: Page appears JavaScript-rendered but Playwright is not installed. 
Run: pip install playwright && playwright install chromium
Returned when static fetch gets thin content but Playwright isn’t available for the browser fallback.
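
To avoid hitting this error partway through a run, a script could check up front whether the Playwright package is importable. The sketch below uses only the standard library; `module_available` and `playwright_available` are hypothetical helper names. It checks only for the Python package and does not verify that the Chromium browser binary has been downloaded.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be located for import."""
    return importlib.util.find_spec(name) is not None


def playwright_available() -> bool:
    """Return True if the `playwright` Python package is installed."""
    return module_available("playwright")
```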

fetch_api_spec

Function Signature

def fetch_api_spec(url: str) -> str:
    """
    Fetch API documentation or an OpenAPI/Swagger spec.

    Checks the Content-Type header first — if the server returns raw JSON or YAML,
    that's returned directly since agents can often work with OpenAPI specs natively
    without needing markdown conversion. Falls back to fetch_as_markdown otherwise.

    Args:
        url: URL of the API docs page or raw spec file

    Returns:
        Raw spec (JSON/YAML) or clean markdown of the docs page
    """

Parameters

url
str
required
URL of the API documentation page or raw OpenAPI/Swagger spec file

Return Value

Returns a string containing:
  • Raw JSON/YAML if the server returns Content-Type: application/json, application/yaml, or text/yaml
  • Clean markdown from fetch_as_markdown() otherwise
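
The Content-Type decision described above might look something like this. It is a sketch under assumptions: the real function may compare headers differently (for example by substring), and `RAW_SPEC_TYPES` and `is_raw_spec` are hypothetical names.

```python
from typing import Optional

# Media types listed in the return-value description above
RAW_SPEC_TYPES = frozenset({"application/json", "application/yaml", "text/yaml"})


def is_raw_spec(content_type: Optional[str]) -> bool:
    """Return True when a Content-Type header names one of the raw spec formats."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in RAW_SPEC_TYPES
```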

Usage Examples

import json

from scripts.fetch_as_markdown import fetch_api_spec

# Server returns application/json → gets raw JSON
spec = fetch_api_spec("https://api.example.com/openapi.json")

# Failures come back as "ERROR:" strings, so check before parsing
if not spec.startswith("ERROR:"):
    data = json.loads(spec)  # Direct parsing works
    print(f"API version: {data['info']['version']}")

When to Use

Use fetch_api_spec() instead of fetch_as_markdown() when:
  • Fetching OpenAPI/Swagger specifications
  • The target URL might return raw JSON/YAML
  • You want agents to work with native spec formats
  • Dealing with API documentation that may be in multiple formats
If fetch_api_spec() falls back to markdown conversion (because the server returned HTML), it automatically uses playwright_first=True behavior since API docs are often JavaScript-rendered.
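
Since the same call may return a raw JSON spec, YAML, markdown, or an error string, a caller that wants structured data may need to probe the result first. The helper below is a hypothetical sketch that handles only the JSON case with the standard library; YAML would need a third-party parser and is left out.

```python
import json
from typing import Any, Dict, Optional


def try_parse_json_spec(text: str) -> Optional[Dict[str, Any]]:
    """Return the parsed spec if `text` is a JSON object, otherwise None."""
    if text.startswith("ERROR:"):
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Markdown or YAML input lands here
        return None
    # A spec is expected to be a JSON object, not a bare list or scalar.
    return data if isinstance(data, dict) else None
```

When the helper returns None, the original string can still be passed along as markdown or YAML text.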

Advanced Examples

Batch Processing

from scripts.fetch_as_markdown import fetch_as_markdown
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://docs.example.com/intro",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/guides",
]

def fetch_and_save(url: str) -> None:
    # Strip any trailing slash so ".../guides/" does not produce an empty name
    filename = url.rstrip("/").split("/")[-1] + ".md"
    markdown = fetch_as_markdown(url)
    
    if not markdown.startswith("ERROR:"):
        with open(filename, "w", encoding="utf-8") as f:
            f.write(markdown)
        print(f"✓ Saved {filename}")
    else:
        print(f"✗ Failed: {url} - {markdown}")

with ThreadPoolExecutor(max_workers=3) as executor:
    # Consume the iterator so exceptions raised in workers are not silently dropped
    list(executor.map(fetch_and_save, urls))
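
Deriving a filename from the last path segment can be fragile when a URL carries a query string or has no path at all. A somewhat more defensive variant, using only `urllib.parse`, might look like this; `filename_for` is a hypothetical helper and the fallback to the host name is an arbitrary choice.

```python
from urllib.parse import urlparse


def filename_for(url: str, extension: str = ".md") -> str:
    """Build a filename from the last non-empty path segment of a URL."""
    parsed = urlparse(url)
    # parsed.path excludes the query string and fragment
    segments = [segment for segment in parsed.path.split("/") if segment]
    # Fall back to the host name when the URL has no path segments
    stem = segments[-1] if segments else parsed.netloc
    return (stem or "index") + extension
```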

Integration with Agent Systems

from scripts.fetch_as_markdown import fetch_as_markdown, fetch_api_spec

class WebResearchAgent:
    def fetch_documentation(self, url: str, is_api: bool = False) -> dict:
        """
        Fetch documentation and return structured result.
        """
        if is_api:
            content = fetch_api_spec(url)
        else:
            content = fetch_as_markdown(url)
        
        return {
            "url": url,
            "success": not content.startswith("ERROR:"),
            "content": content,
            # Rough estimate: whitespace-separated words, not model tokens
            "tokens": len(content.split()),
        }

agent = WebResearchAgent()
result = agent.fetch_documentation("https://docs.python.org/3/library/asyncio.html")

if result["success"]:
    print(f"Fetched {result['tokens']} tokens from {result['url']}")

Custom Content Validation

from scripts.fetch_as_markdown import fetch_as_markdown

def fetch_with_validation(url: str, required_keywords: list[str]) -> str:
    """
    Fetch markdown and validate it contains expected content.
    """
    markdown = fetch_as_markdown(url)
    
    if markdown.startswith("ERROR:"):
        return markdown
    
    # Check for required keywords
    missing = [kw for kw in required_keywords if kw.lower() not in markdown.lower()]
    
    if missing:
        return f"ERROR: Content missing required keywords: {', '.join(missing)}"
    
    return markdown

# Validate API docs contain authentication info
docs = fetch_with_validation(
    "https://api.example.com/docs",
    required_keywords=["authentication", "API key", "authorization"]
)

Dependencies

Install the required packages:
pip install requests readability-lxml html2text
Playwright is optional and only needed for the headless browser fallback:
pip install playwright
playwright install chromium  # ~200MB one-time download
If a JavaScript-rendered page is encountered without Playwright, you’ll get a clear error message telling you exactly what to install.
