Quick Start

Learn how to fetch webpages and convert them to clean markdown, illustrated with three common use cases.

Basic Usage

Import the core functions from the script:
from scripts.fetch_as_markdown import fetch_as_markdown, fetch_api_spec

Example 1: Fetch a Static Page

For traditional server-rendered pages, the static fetch (about 1 second) is enough on its own:
from scripts.fetch_as_markdown import fetch_as_markdown

# Fetch any page — static first, headless browser fallback if needed
markdown = fetch_as_markdown("https://docs.example.com/api")
print(markdown)
The two-stage strategy means you don’t have to know whether a page is static or JavaScript-rendered: fetch_as_markdown tries the fast path first and falls back to the browser automatically if needed.

Example 2: Fetch a JavaScript-Heavy Page

For known SPAs, React documentation, or Swagger UI instances, skip straight to the browser:
from scripts.fetch_as_markdown import fetch_as_markdown

# Known JS-heavy target (SPA, Swagger UI, React docs) — skip straight to browser
markdown = fetch_as_markdown(
    "https://app.example.com/swagger",
    playwright_first=True
)
print(markdown)
Setting playwright_first=True skips the static HTTP request entirely and goes directly to the headless Chromium browser. Use this when you know the target is JavaScript-rendered, so you don't spend time on a static request that would come back thin anyway.
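
If you fetch from a mix of sites, one possible way to decide when to set playwright_first is to keep a small list of hosts you already know are JavaScript-rendered. The sketch below is only an illustration: needs_browser and JS_HEAVY_HOSTS are hypothetical names that are not part of the script.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts known to be JavaScript-rendered (an assumption,
# maintained by you; the script does not ship such a list).
JS_HEAVY_HOSTS = {"app.example.com"}


def needs_browser(url: str, js_hosts: set[str] = JS_HEAVY_HOSTS) -> bool:
    """Return True if the URL's host is on the known JS-heavy list."""
    host = urlparse(url).hostname or ""
    return host in js_hosts


url = "https://app.example.com/swagger"
use_browser = needs_browser(url)
# The flag could then be passed through, for example:
# markdown = fetch_as_markdown(url, playwright_first=use_browser)
```

For hosts that are not on the list, leaving playwright_first at its default keeps the fast static path as the first attempt.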

Example 3: Fetch an API Specification

For OpenAPI/Swagger specs, use fetch_api_spec, which checks the Content-Type header first:
from scripts.fetch_as_markdown import fetch_api_spec

# API docs — returns raw JSON/YAML if the server provides it, markdown otherwise
spec = fetch_api_spec("https://api.example.com/openapi.json")
print(spec)
If the server returns application/json or application/yaml in the Content-Type header, you’ll get the raw spec directly. This is useful because many agents can parse OpenAPI specs natively without needing a markdown representation. If the URL points to an HTML documentation page instead of a raw spec file, fetch_api_spec falls back to fetch_as_markdown automatically.
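
Because the return value can be either a raw spec or markdown, callers may want to check which one they received. A minimal sketch, assuming the raw spec is JSON: parse_spec is a hypothetical helper, not part of the script, and YAML specs would need a third-party parser that is not shown here.

```python
import json


def parse_spec(text: str):
    """Return a dict if text is a JSON object, otherwise None.

    Hypothetical helper: None covers markdown fallbacks, "ERROR:" strings,
    and anything else that is not a JSON object.
    """
    if text.startswith("ERROR:"):
        return None
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Stand-in for the string fetch_api_spec might return for a JSON spec.
sample = '{"openapi": "3.0.0", "paths": {}}'
spec = parse_spec(sample)
```

When parse_spec returns None, the original string can still be used as markdown, provided it does not start with "ERROR:".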

Error Handling

Errors are returned as strings prefixed with "ERROR:" rather than raised as exceptions:
from scripts.fetch_as_markdown import fetch_as_markdown

result = fetch_as_markdown("https://login-required.example.com")

if result.startswith("ERROR:"):
    print(f"Failed to fetch: {result}")
else:
    print(f"Success! Got {len(result)} characters of markdown")
This design means agents can handle errors inline without try/except blocks.
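
If the surrounding application prefers exceptions, the string convention can be converted at the boundary. The following is a sketch under that assumption; FetchError and unwrap are hypothetical names introduced for illustration and are not provided by the script.

```python
class FetchError(RuntimeError):
    """Hypothetical exception type; not defined by the script."""


def unwrap(result: str) -> str:
    """Return the markdown unchanged, or raise FetchError for "ERROR:" strings."""
    prefix = "ERROR:"
    if result.startswith(prefix):
        # Drop the prefix so the exception message carries only the explanation.
        raise FetchError(result[len(prefix):].strip())
    return result


# Usage with a stand-in string; in practice this would wrap fetch_as_markdown(url).
markdown = unwrap("# Getting Started\n\nSome content")
```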

Common Error Messages

ERROR: Page appears JavaScript-rendered but Playwright is not installed.
Run: pip install playwright && playwright install chromium
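
To catch this case before fetching, you could check up front whether the Playwright package can be imported. This is a rough sketch using the standard library; module_available is a hypothetical helper, and it only detects the Python package, not whether the Chromium browser has been downloaded with playwright install chromium.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not module_available("playwright"):
    print("Playwright is not installed; JavaScript-rendered pages may return an ERROR string.")
```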

Using the CLI

You can also use the script from the command line without writing any Python code:
python scripts/fetch_as_markdown.py https://docs.example.com/getting-started

How the Two-Stage Fetch Works

Under the hood, fetch_as_markdown() implements this flow:
1. Static Fetch (Fast Path)

Sends a standard HTTP request with browser-like headers (~1 second)
  • Runs the HTML through readability to strip navigation, ads, sidebars
  • Converts to markdown with html2text
  • If the result has ≥200 characters of real text, returns it immediately

2. Content Validation

Checks if the markdown is “thin” (fewer than 200 characters after whitespace normalization). This threshold catches JavaScript-gated shells that return empty <div id="app"></div> elements without falsely flagging legitimately short pages.

3. Playwright Fallback (Slow Path)

If static fetch returned thin content, automatically launches headless Chromium (~5-8 seconds)
  • Waits for networkidle event plus 3 seconds for JavaScript frameworks to finish rendering
  • Runs the fully-rendered HTML through the same readability → html2text pipeline
  • Returns the result if it has enough content

4. Error Detection

If even Playwright returns thin content, returns an error string explaining the page is likely behind a login wall or blocking automated access
You never have to think about this flow — just call fetch_as_markdown() and it handles everything automatically.
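
For readers who want to see the control flow spelled out, here is a simplified sketch of the four steps. It is not the script's actual implementation: is_thin, two_stage_fetch, and the two fetcher arguments are hypothetical stand-ins, the whitespace normalization is a guess at what the script does, and the error text is a placeholder.

```python
import re
from typing import Callable

MIN_CHARS = 200  # the threshold described in the steps above


def is_thin(markdown: str, min_chars: int = MIN_CHARS) -> bool:
    """True if the text is shorter than min_chars after collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", markdown).strip()
    return len(normalized) < min_chars


def two_stage_fetch(
    url: str,
    static_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
    playwright_first: bool = False,
) -> str:
    """Try the static fetcher first, then the browser fetcher, then give up."""
    if not playwright_first:
        markdown = static_fetch(url)
        if not is_thin(markdown):
            return markdown  # fast path succeeded
    markdown = browser_fetch(url)
    if not is_thin(markdown):
        return markdown  # slow path succeeded
    # Placeholder message; the script's real wording may differ.
    return "ERROR: page returned thin content even after browser rendering"


# Stub fetchers standing in for the HTTP request and headless Chromium.
def shell(url: str) -> str:
    return "Loading..."


def rendered(url: str) -> str:
    return "word " * 100


result = two_stage_fetch("https://app.example.com", shell, rendered)
```

Passing the fetchers in as arguments keeps the sketch runnable without network access and makes the fallback order easy to exercise with stubs.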

Next Steps

Framework Integration

Learn how to integrate with Agno, LangChain, CrewAI, and other agent frameworks

API Reference

Detailed documentation of all functions and parameters
