
Overview

The Page Ripper endpoint captures any public web page using headless Chrome and returns a downloadable ZIP archive containing:
  • Self-contained HTML (via SingleFile)
  • Categorized assets (CSS, JavaScript, images, fonts, media)
The entire capture happens synchronously within a single HTTP request (no background jobs).
This endpoint enforces SSRF protection at multiple layers: hostname validation, DNS resolution checks, and post-redirect validation. Requests to private/internal addresses are blocked.

Endpoint

POST /api/download-page

Authentication

Requires a Supabase access token:
Authorization: Bearer <supabase_access_token>

Request Body

url
string
required
Full HTTP/HTTPS URL of the page to capture.
Restrictions:
  • Must use http:// or https:// protocol
  • Cannot resolve to private/internal IP addresses
  • Cannot be localhost or reserved IP ranges (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16, etc.)

Example Request

{
  "url": "https://example.com/landing-page"
}

Response

Success (200 OK)

Returns a ZIP file download:
HTTP/1.1 200 OK
Content-Type: application/zip
Content-Disposition: attachment; filename="example.com-2026-03-02T14-30-15.zip"
Content-Length: 4567890

[ZIP binary data]

ZIP Archive Structure

example.com-2026-03-02T14-30-15.zip
├── page.html                    # Self-contained HTML (SingleFile output)
└── assets/
    ├── css/                     # Stylesheets
    │   ├── main.css
    │   └── styles_1.css
    ├── js/                      # JavaScript files
    │   ├── app.js
    │   └── vendor.js
    ├── images/                  # PNG, JPG, SVG, WebP, AVIF, ICO, GIF
    │   ├── logo.png
    │   └── hero.jpg
    ├── fonts/                   # WOFF, WOFF2, TTF, OTF, EOT
    │   └── font.woff2
    └── media/                   # MP4, WebM, MP3, OGG, WAV
        └── video.mp4
HTML documents are not duplicated in the assets folder — page.html at the root is the only HTML file. The SingleFile library inlines critical resources directly into this HTML.

Error Codes

Status  Condition                 Response Body
400     Missing url field         {"error": "Missing required field: url"}
400     Invalid URL format        {"error": "Invalid URL."}
400     Non-HTTP/HTTPS protocol   {"error": "Invalid URL. Must use http or https protocol."}
400     Private/internal address  {"error": "Requests to private/internal addresses are not allowed."}
400     DNS rebinding detected    {"error": "Requests to private/internal addresses are not allowed."}
401     Missing bearer token      {"error": "Missing Authorization bearer token."}
401     Invalid/expired token     {"error": "Invalid or expired session."}
405     Non-POST method           {"error": "Method not allowed."}
429     Rate limit exceeded       {"error": "Rate limit exceeded. Maximum 10 page captures per 15 minutes."}
500     Capture failed            {"error": "Page capture failed: <details>"}
504     Timeout                   {"error": "Page capture timed out."}

Rate Limiting

Each authenticated user is limited to:
  • 10 captures per 15-minute window
  • Tracked via page_rip_log table keyed by user_id
Rate-limited responses include a Retry-After header (seconds until reset).
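The sliding-window check can be sketched as a pure function. This is a hypothetical sketch: the real handler counts rows in page_rip_log per user_id, while `checkRateLimit` and the `timestamps` array here are illustrative stand-ins for those logged capture times.

```javascript
// Hypothetical sketch of the 15-minute sliding-window rate limit.
// `timestamps` stands in for the capture times logged in page_rip_log
// for one user_id; names and shapes here are illustrative.
const WINDOW_MS = 15 * 60 * 1000
const MAX_CAPTURES = 10

function checkRateLimit(timestamps, now = Date.now()) {
  const recent = timestamps.filter(t => now - t < WINDOW_MS)
  if (recent.length < MAX_CAPTURES) {
    return { allowed: true }
  }
  // Retry-After: seconds until the oldest in-window capture leaves the window
  const oldest = Math.min(...recent)
  const retryAfter = Math.ceil((oldest + WINDOW_MS - now) / 1000)
  return { allowed: false, retryAfter }
}
```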

Rate Limit Response

HTTP/1.1 429 Too Many Requests
Retry-After: 900

{
  "error": "Rate limit exceeded. Maximum 10 page captures per 15 minutes."
}

SSRF Protection

The endpoint implements defense-in-depth SSRF protection:

1. Hostname Validation

Rejects obviously private hostnames (source:api/download-page.js:104-107):
Private Hostname Patterns
const PRIVATE_IP_PATTERNS = [
  /^127\./,              // Loopback (127.0.0.0/8)
  /^10\./,               // Private class A (10.0.0.0/8)
  /^172\.(1[6-9]|2\d|3[01])\./, // Private class B (172.16.0.0/12)
  /^192\.168\./,         // Private class C (192.168.0.0/16)
  /^169\.254\./,         // Link-local (169.254.0.0/16)
  /^0\.0\.0\.0$/,        // Unspecified
  /^::1$/,               // IPv6 loopback
  /^::ffff:(127\.|10\.|...)/, // IPv4-mapped IPv6 private ranges
  /^fc00:/i,             // IPv6 unique local (fc00::/7)
  /^fd00:/i,
  /^fe80:/i,             // IPv6 link-local
  /^localhost$/i
]
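These patterns feed a hostname check; a minimal sketch, assuming the `isPrivateHostname` helper name referenced later on this page (the pattern list is abbreviated here for illustration — the full list is shown above):

```javascript
// Minimal sketch of the hostname check built on the patterns above.
// (Abbreviated pattern list; the complete list appears above.)
const PRIVATE_IP_PATTERNS = [
  /^127\./, /^10\./, /^172\.(1[6-9]|2\d|3[01])\./, /^192\.168\./,
  /^169\.254\./, /^0\.0\.0\.0$/, /^::1$/, /^fc00:/i, /^fd00:/i,
  /^fe80:/i, /^localhost$/i
]

function isPrivateHostname(hostname) {
  const h = hostname.trim().toLowerCase()
  return PRIVATE_IP_PATTERNS.some(pattern => pattern.test(h))
}
```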

2. DNS Resolution Check

Resolves hostname via DNS and validates all returned IPs (source:api/download-page.js:114-134):
DNS Validation
import dns from 'node:dns/promises'

async function assertPublicDns(hostname) {
  const addresses = await dns.resolve4(hostname).catch(() => [])
  const addresses6 = await dns.resolve6(hostname).catch(() => [])
  const allAddresses = [...addresses, ...addresses6]
  
  for (const ip of allAddresses) {
    if (PRIVATE_IP_PATTERNS.some(pattern => pattern.test(ip))) {
      throw new Error(`DNS for ${hostname} resolved to private address ${ip}`)
    }
  }
}
This prevents DNS rebinding attacks where a public hostname resolves to an internal IP.

3. Post-Redirect Validation

After Puppeteer navigation, the final URL (post-redirects) is re-validated (source:api/download-page.js:451-460):
Post-Navigation Check
const finalUrl = new URL(page.url())
if (isPrivateHostname(finalUrl.hostname)) {
  throw new Error('Redirect to private/internal address detected.')
}
await assertPublicDns(finalUrl.hostname)

Resource Limits

To prevent memory exhaustion:
Limit               Value      Behavior When Exceeded
Max total size      100 MB     Stop capturing additional resources (source:api/download-page.js:46)
Max resource count  500 items  Ignore additional resources (source:api/download-page.js:49)
These limits apply to captured network resources only. The SingleFile HTML can be larger as it’s generated separately.

Timeouts

Timeout                    Value        Purpose
Navigation timeout         60 seconds   Puppeteer page load (source:api/download-page.js:13)
Hard timeout               110 seconds  Total request duration (source:api/download-page.js:12)
Auto-scroll timeout        15 seconds   Lazy-load trigger (source:api/download-page.js:332)
Network idle after scroll  2 seconds    Wait for lazy resources (source:api/download-page.js:16)
The 110-second hard timeout is just under Vercel’s 120-second function limit. Long-running captures will return 504 Gateway Timeout.
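One common way to enforce such a hard timeout is to race the capture against a timer. This is a sketch, not necessarily how the handler implements it; `withTimeout` is a hypothetical helper name.

```javascript
// Hypothetical sketch: race a long-running capture against a hard deadline.
// If the timer wins, the handler can respond with 504 before Vercel kills
// the function at 120 seconds.
function withTimeout(promise, ms, message = 'Page capture timed out.') {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// Example usage (capturePage is a hypothetical stand-in for the capture step):
// const zip = await withTimeout(capturePage(url), 110000)
```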

Examples

Basic Capture

const token = session.access_token // From Supabase auth

const response = await fetch('/api/download-page', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://example.com/landing-page'
  })
})

if (!response.ok) {
  const error = await response.json()
  throw new Error(error.error)
}

const blob = await response.blob()
const url = URL.createObjectURL(blob)

// Trigger download, then release the object URL
const a = document.createElement('a')
a.href = url
a.download = 'captured-page.zip'
a.click()
URL.revokeObjectURL(url)

Error Handling

Comprehensive Error Handling
async function capturePage(url) {
  const { data: { session } } = await supabase.auth.getSession()
  
  if (!session) {
    throw new Error('Not authenticated')
  }
  
  const response = await fetch('/api/download-page', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${session.access_token}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url })
  })
  
  if (!response.ok) {
    const error = await response.json()
    
    if (response.status === 429) {
      const retryAfter = response.headers.get('Retry-After')
      throw new Error(`Rate limit exceeded. Try again in ${retryAfter} seconds.`)
    }
    
    if (response.status === 400 && error.error.includes('private')) {
      throw new Error('Cannot capture internal/private URLs')
    }
    
    if (response.status === 504) {
      throw new Error('Page capture timed out. Try a simpler page.')
    }
    
    throw new Error(error.error)
  }
  
  return response.blob()
}

React Hook Example

usePageCapture Hook
import { useState } from 'react'
import { useSupabaseClient } from '@/hooks/useSupabase'

export function usePageCapture() {
  const [loading, setLoading] = useState(false)
  const [error, setError] = useState(null)
  const supabase = useSupabaseClient()
  
  const capture = async (url) => {
    setLoading(true)
    setError(null)
    
    try {
      const { data: { session } } = await supabase.auth.getSession()
      if (!session) throw new Error('Not authenticated')
      
      const response = await fetch('/api/download-page', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${session.access_token}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ url })
      })
      
      if (!response.ok) {
        const err = await response.json()
        throw new Error(err.error)
      }
      
      const blob = await response.blob()
      const downloadUrl = URL.createObjectURL(blob)
      
      // Extract filename from Content-Disposition header
      const disposition = response.headers.get('Content-Disposition')
      const filename = disposition?.match(/filename="(.+)"/)?.[1] || 'page.zip'
      
      // Trigger download
      const a = document.createElement('a')
      a.href = downloadUrl
      a.download = filename
      a.click()
      
      URL.revokeObjectURL(downloadUrl)
      
      return { success: true }
    } catch (err) {
      setError(err.message)
      return { success: false, error: err.message }
    } finally {
      setLoading(false)
    }
  }
  
  return { capture, loading, error }
}

Implementation Details

Browser Engine

  • Puppeteer Core with @sparticuz/chromium (optimized for Vercel/serverless)
  • Headless Chrome with disabled sandboxing for serverless environments
  • Viewport: 1280x800 (source:api/download-page.js:408)

Page Capture Process

  1. Launch browser (different executable paths for dev/production)
  2. Navigate to target URL with networkidle0 wait condition
  3. SSRF re-check on final URL after redirects
  4. Auto-scroll to trigger lazy-loaded content (300px steps with 100ms pauses)
  5. Network idle wait (2 seconds after scroll completes)
  6. SingleFile capture — inlines critical CSS/fonts/images into HTML
  7. Close browser immediately after HTML capture
  8. Build ZIP with captured resources organized by type
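Step 4's auto-scroll can be sketched as an environment-agnostic loop. This is a hypothetical sketch: in the real handler the equivalent logic runs inside page.evaluate, and the injected `scrollBy`/`getScrollHeight` callbacks here are illustrative (in-page they would be window.scrollBy and document.body.scrollHeight).

```javascript
// Hypothetical sketch of step 4: scroll in 300px steps with 100ms pauses
// until the bottom of the page (or the 15s auto-scroll deadline) is reached.
// Scroll callbacks are injected so the loop is testable anywhere.
async function autoScroll({ scrollBy, getScrollHeight, step = 300, pauseMs = 100, deadlineMs = 15000 }) {
  const deadline = Date.now() + deadlineMs
  let scrolled = 0
  while (scrolled < getScrollHeight() && Date.now() < deadline) {
    scrollBy(step)
    scrolled += step
    await new Promise(resolve => setTimeout(resolve, pauseMs))
  }
  return scrolled
}
```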

Resource Classification

Assets are categorized by MIME type and file extension (source:api/download-page.js:149-179):
Asset Folder Mapping
const MIME_FOLDER_MAP = [
  { test: ct => ct.startsWith('text/css'), folder: 'css' },
  { test: ct => ct.includes('javascript'), folder: 'js' },
  { test: ct => ct.startsWith('image/'), folder: 'images' },
  { test: ct => ct.includes('font'), folder: 'fonts' },
  { test: ct => ct.startsWith('video/') || ct.startsWith('audio/'), folder: 'media' }
]
HTML documents are skipped (already captured by SingleFile).
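A classifier over this mapping might look like the following sketch; `classifyAsset` and the 'other' fallback folder are illustrative assumptions, not confirmed from the source.

```javascript
// Hypothetical sketch of asset classification using the mapping above.
const MIME_FOLDER_MAP = [
  { test: ct => ct.startsWith('text/css'), folder: 'css' },
  { test: ct => ct.includes('javascript'), folder: 'js' },
  { test: ct => ct.startsWith('image/'), folder: 'images' },
  { test: ct => ct.includes('font'), folder: 'fonts' },
  { test: ct => ct.startsWith('video/') || ct.startsWith('audio/'), folder: 'media' }
]

function classifyAsset(contentType) {
  const ct = (contentType || '').toLowerCase()
  if (ct.startsWith('text/html')) return null // skipped: SingleFile already captured it
  const match = MIME_FOLDER_MAP.find(entry => entry.test(ct))
  return match ? match.folder : 'other' // fallback folder name is an assumption
}
```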

User-Agent

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

Best Practices

Validate URLs client-side

Reject private IPs and invalid protocols before sending requests to save quota
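A lightweight client-side pre-check might look like this sketch. It mirrors the server's protocol and obvious-private-host checks but cannot replace them, since only the server can validate DNS resolution and redirects; `preflightUrl` is a hypothetical helper name.

```javascript
// Hypothetical client-side pre-check: reject invalid protocols and obviously
// private hosts before spending a capture against the rate limit.
// The server remains the authority (it also validates DNS and redirects).
function preflightUrl(input) {
  let url
  try {
    url = new URL(input)
  } catch {
    return { ok: false, reason: 'Invalid URL.' }
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return { ok: false, reason: 'Must use http or https protocol.' }
  }
  const privatePatterns = [
    /^127\./, /^10\./, /^172\.(1[6-9]|2\d|3[01])\./, /^192\.168\./,
    /^169\.254\./, /^localhost$/i
  ]
  if (privatePatterns.some(p => p.test(url.hostname))) {
    return { ok: false, reason: 'Private/internal addresses are not allowed.' }
  }
  return { ok: true }
}
```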

Implement retry logic

Handle 504 timeouts with exponential backoff for large pages
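A retry wrapper with exponential backoff might look like the following sketch; `fetchWithRetry`, the delay values, and the choice to retry only on 504 are illustrative.

```javascript
// Hypothetical retry sketch: retry only on 504 (timeouts), with exponential
// backoff between attempts. 4xx errors (bad URL, auth, rate limit) are not
// retried because repeating them cannot succeed.
async function fetchWithRetry(url, options, { retries = 2, baseDelayMs = 2000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url, options)
    if (response.status !== 504 || attempt >= retries) {
      return response
    }
    // 2s, 4s, 8s, ... between attempts
    await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt))
  }
}
```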

Monitor rate limits

Track remaining captures and show warnings before hitting limits

Provide feedback

Captures can take 30-60+ seconds — show progress indicators to users