Web-to-Markdown

An agent skill that fetches any webpage and returns clean, content-focused markdown — handling JavaScript-rendered pages automatically so your agent never has to.

The Problem

Raw HTML is a terrible format for agents. It’s bloated with nav menus, cookie banners, sidebars, and scripts that have nothing to do with the actual content. And on modern sites, a plain HTTP request often returns an empty JavaScript shell — the agent sees nothing useful and either hallucinates or gives up.

The Solution

This skill solves both problems with a two-stage fetch strategy, followed by content extraction and markdown conversion:

Fast static request first (~1s) — works for most traditional sites
Automatic headless browser fallback (~5-8s) — handles JavaScript-rendered content when needed
Intelligent content extraction — strips boilerplate using the same algorithm as Firefox Reader Mode
Clean markdown conversion — returns only the content that matters

The result: 60–80% fewer tokens than raw HTML.

Key Features

Two-Stage Fetch Strategy

Fast static HTTP first, automatic Playwright fallback for JS-heavy pages. Your agent never has to decide which method to use.

Battle-Tested Content Extraction

Uses the Firefox Reader Mode algorithm (readability-lxml), an approach proven on millions of real-world pages, to strip navigation, ads, sidebars, and other boilerplate.

Massive Token Reduction

60-80% fewer tokens than raw HTML by removing scripts, styles, navigation menus, cookie banners, and other noise.
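
To get a feel for where the savings come from, the sketch below strips a few obviously noisy elements with the standard library and reports how much of the page disappears. It is a deliberately crude stand-in, not the readability algorithm the skill uses: the tag list, the `visible_text` and `reduction_ratio` helpers, and the use of character counts as a proxy for tokens are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical blacklist; the real extraction scores content rather than
# relying on a fixed tag list.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer"}


class NoiseStripper(HTMLParser):
    """Collect text that sits outside of NOISE_TAGS elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a noisy element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text that survives after noisy elements are dropped."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def reduction_ratio(html: str) -> float:
    """Fraction of characters removed, a rough proxy for token savings."""
    if not html:
        return 0.0
    return 1 - len(visible_text(html)) / len(html)
```

Running `reduction_ratio` over a saved page gives a quick, approximate sense of how much markup and chrome surrounds the content; real token counts depend on the tokenizer in use.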

Framework Agnostic

The core script has zero framework dependencies. Wrap it in 5-10 lines for Agno, LangChain, CrewAI, OpenAI Agents SDK, or any other framework.

Graceful Error Handling

Errors are returned as strings prefixed with “ERROR:” rather than raised as exceptions, so agents can handle them inline without try/catch.
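
A caller can branch on that prefix with a plain string check. The helper below is a minimal sketch: `split_result` is a hypothetical name, and it assumes the prefix appears at the very start of the returned string.

```python
ERROR_PREFIX = "ERROR:"


def split_result(result: str) -> tuple[bool, str]:
    """Return (ok, payload).

    payload is the markdown on success, or the error message with the
    prefix removed on failure.
    """
    if result.startswith(ERROR_PREFIX):
        return False, result[len(ERROR_PREFIX):].strip()
    return True, result


# Example with a made-up error string:
ok, payload = split_result("ERROR: request timed out")
if not ok:
    print(f"fetch failed: {payload}")
```

Because the failure travels in the return value, the same code path can hand either the content or the error text back to the model.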

API-Spec Aware

Automatically detects and returns raw JSON/YAML for OpenAPI specs when the server provides them, falling back to markdown conversion only when needed.
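
One way such detection could work is to look at the response's content type and then check for the top-level `openapi` or `swagger` key. The sketch below is an assumption about the approach, not the skill's actual logic; `looks_like_api_spec` and the list of media types are hypothetical, and the YAML check is a regular expression rather than a real YAML parse.

```python
import json
import re

# Hypothetical set of media types treated as candidate specs.
SPEC_CONTENT_TYPES = (
    "application/json",
    "application/yaml",
    "application/x-yaml",
    "text/yaml",
)
# Matches a top-level "openapi:" or "swagger:" key at the start of a line.
YAML_SPEC_KEY = re.compile(r"^(openapi|swagger)\s*:", re.MULTILINE)


def looks_like_api_spec(content_type: str, body: str) -> bool:
    """Heuristically decide whether a response is an OpenAPI/Swagger document."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in SPEC_CONTENT_TYPES:
        return False
    if media_type == "application/json":
        try:
            data = json.loads(body)
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            return False
        return isinstance(data, dict) and ("openapi" in data or "swagger" in data)
    return bool(YAML_SPEC_KEY.search(body))
```

When the check passes, the body can be returned unchanged; otherwise the response goes through the normal markdown conversion.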

How It Works

The two-stage fetch strategy intelligently handles both static and JavaScript-rendered pages:

fetch_as_markdown(url)
  │
  ├─ Static fetch (fast, ~1s)
  │    └─ readability → html2text → clean markdown
  │         ├─ ≥200 chars of real text? → return it
  │         └─ Thin/empty?              → fall through ↓
  │
  └─ Playwright fetch (headless Chromium, ~5-8s)
       └─ readability → html2text → clean markdown
            ├─ Enough content? → return it
            └─ Still empty?    → ERROR: login wall or bot block

“Thin content” means fewer than 200 characters after whitespace normalization. This catches JS-gated shells without falsely flagging legitimately short pages.
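
The flow above can be sketched with the two fetchers passed in as plain callables, which keeps the control logic testable without a network or a browser. The names `is_thin` and `two_stage_fetch` are hypothetical, the error wording is invented, and applying the same 200-character threshold to the browser result is an assumption; the real `fetch_as_markdown` may differ on all of these points.

```python
import re
from typing import Callable

THIN_THRESHOLD = 200  # characters, per the thin-content rule above


def is_thin(markdown: str, threshold: int = THIN_THRESHOLD) -> bool:
    """True when the text is shorter than the threshold after collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", markdown).strip()
    return len(normalized) < threshold


def two_stage_fetch(
    url: str,
    static_fetch: Callable[[str], str],
    browser_fetch: Callable[[str], str],
) -> str:
    """Try the cheap fetcher first; fall back to the browser fetcher on thin output."""
    markdown = static_fetch(url)
    if not is_thin(markdown):
        return markdown
    # Static result was thin or empty: retry with a rendered page.
    markdown = browser_fetch(url)
    if not is_thin(markdown):
        return markdown
    return f"ERROR: no readable content at {url} (possible login wall or bot block)"


# Usage with stand-in fetchers: the static one returns an empty shell,
# so the browser result is used instead.
article = "word " * 100
result = two_stage_fetch("https://example.com", lambda u: "  \n ", lambda u: article)
```

Injecting the fetchers also makes it easy to verify that the slower browser path is skipped whenever the static result is already substantial.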
