MCP Capabilities

Scrapling's MCP server provides six tools for web scraping. Each tool targets a different use case and level of anti-bot protection.

Available Tools

get

Make stealth HTTP GET requests to fetch web pages.

Best for: Low to mid protection levels, simple HTTP requests

Parameters:

url (string, required): The URL to request

impersonate (string, default: "chrome"): Browser to impersonate (chrome, firefox, safari, etc.)

extraction_type (string, default: "markdown"): Output format: markdown, html, or text

css_selector (string): CSS selector to extract specific content

main_content_only (boolean, default: true): Extract only the content inside the <body> tag

headers (object): Custom HTTP headers

cookies (object): Cookies to include in the request

proxy (string): Proxy URL (format: "http://user:pass@host:port")

timeout (number, default: 30): Request timeout in seconds

stealthy_headers (boolean, default: true): Use real browser headers

Example usage:

{
  "url": "https://example.com",
  "extraction_type": "markdown",
  "css_selector": "article.main",
  "impersonate": "chrome"
}

bulk_get

Fetch multiple URLs concurrently with HTTP GET requests.

Best for: Scraping multiple pages efficiently

Parameters: Same as get, but accepts urls (array) instead of url (string).

urls (array[string], required): List of URLs to fetch concurrently

Example usage:

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "extraction_type": "markdown",
  "impersonate": "firefox"
}

fetch

Use Playwright browser automation for JavaScript-heavy sites.

Best for: Single-page applications, sites requiring JavaScript execution

Parameters:

url (string, required): The URL to fetch

extraction_type (string, default: "markdown"): Output format: markdown, html, or text

headless (boolean, default: true): Run the browser in headless mode

disable_resources (boolean, default: false): Block images, fonts, and media to speed up loading

network_idle (boolean, default: false): Wait until there has been no network activity for 500 ms

timeout (number, default: 30000): Timeout in milliseconds

wait (number, default: 0): Additional wait time in milliseconds

wait_selector (string): CSS selector to wait for before proceeding

wait_selector_state (string, default: "attached"): State to wait for: attached, detached, visible, or hidden

real_chrome (boolean, default: false): Use a real Chrome installation instead of Chromium

google_search (boolean, default: true): Set the referer as if the request came from a Google search for the domain

Example usage:

{
  "url": "https://spa-website.com",
  "extraction_type": "markdown",
  "wait_selector": "div.content-loaded",
  "network_idle": true,
  "disable_resources": true
}
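Note that the parameter lists above document timeout in seconds for get (default 30) but in milliseconds for fetch (default 30000). The following is a minimal sketch of a conversion helper a caller might use to avoid mixing the two units. The helper name timeout_for and the grouping of tools are illustrative assumptions based on the "Same as get" and "All fetch parameters" notes on this page, not part of Scrapling.

```python
# Hypothetical helper: unit handling inferred from the parameter lists on this page.
SECONDS_TOOLS = {"get", "bulk_get"}
MILLISECOND_TOOLS = {"fetch", "bulk_fetch", "stealthy_fetch", "bulk_stealthy_fetch"}


def timeout_for(tool: str, seconds: float) -> float:
    """Convert a timeout given in seconds to the unit the named tool documents."""
    if tool in SECONDS_TOOLS:
        return seconds
    if tool in MILLISECOND_TOOLS:
        return seconds * 1000
    raise ValueError(f"unknown tool: {tool!r}")


# Build arguments for two tools from the same 30-second budget.
get_args = {"url": "https://example.com", "timeout": timeout_for("get", 30)}
fetch_args = {"url": "https://example.com", "timeout": timeout_for("fetch", 30)}
print(get_args["timeout"], fetch_args["timeout"])  # 30 30000
```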

bulk_fetch

Fetch multiple URLs concurrently with browser automation.

Best for: Scraping multiple JavaScript-heavy pages

Parameters: Same as fetch, but accepts urls (array) instead of url (string).

Example usage:

{
  "urls": [
    "https://app1.example.com",
    "https://app2.example.com"
  ],
  "headless": true,
  "network_idle": true
}

stealthy_fetch

Advanced stealth browser automation with Cloudflare bypass.

Best for: High protection sites, Cloudflare-protected pages

Parameters: All fetch parameters, plus:

solve_cloudflare (boolean, default: false): Automatically solve Cloudflare challenges

block_webrtc (boolean, default: false): Block WebRTC to prevent IP leaks

allow_webgl (boolean, default: true): Allow WebGL (recommended for stealth)

hide_canvas (boolean, default: false): Add noise to canvas fingerprinting

additional_args (object): Additional Playwright context settings

Example usage:

{
  "url": "https://protected-site.com",
  "extraction_type": "markdown",
  "solve_cloudflare": true,
  "block_webrtc": true,
  "hide_canvas": true,
  "wait": 2000
}

bulk_stealthy_fetch

Fetch multiple protected URLs with advanced stealth.

Best for: Scraping multiple Cloudflare-protected sites

Parameters: Same as stealthy_fetch, but accepts urls (array) instead of url (string).

Example usage:

{
  "urls": [
    "https://protected1.com",
    "https://protected2.com"
  ],
  "solve_cloudflare": true,
  "network_idle": true
}

Response Format

All tools return a structured response:

{
  "status": 200,
  "content": ["Extracted content in requested format"],
  "url": "https://example.com"
}

For bulk operations, an array of responses is returned:

[
  {
    "status": 200,
    "content": ["Content from URL 1"],
    "url": "https://example.com/page1"
  },
  {
    "status": 200,
    "content": ["Content from URL 2"],
    "url": "https://example.com/page2"
  }
]
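Since single-URL tools return one object and bulk tools return an array, client code may want to normalize both shapes before processing. The sketch below assumes the tool result reaches the caller as JSON text in the shapes shown above, and it treats any 2xx status as success. The function name split_by_status is hypothetical.

```python
import json


def split_by_status(raw: str):
    """Parse a tool result and split it into (ok, failed) lists of response dicts."""
    data = json.loads(raw)
    # Single-URL tools return one object; bulk tools return a list of them.
    responses = data if isinstance(data, list) else [data]
    ok = [r for r in responses if 200 <= r["status"] < 300]
    failed = [r for r in responses if not 200 <= r["status"] < 300]
    return ok, failed


# Example with a made-up bulk result in the documented shape.
raw_result = json.dumps([
    {"status": 200, "content": ["Content from URL 1"], "url": "https://example.com/page1"},
    {"status": 403, "content": [], "url": "https://example.com/page2"},
])
ok, failed = split_by_status(raw_result)
print([r["url"] for r in failed])  # ['https://example.com/page2']
```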

Extraction Types

Markdown

Converts HTML to clean Markdown format:

{"extraction_type": "markdown"}

Best for: Readable text, content processing, AI consumption

HTML

Returns raw HTML content:

{"extraction_type": "html"}

Best for: Preserving structure, further parsing, archival

Text

Extracts plain text only:

{"extraction_type": "text"}

Best for: Text analysis, search indexing, minimal data
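Because extraction_type accepts only the three values listed here, a caller could validate it before sending a request. This is a small sketch under that assumption. The helper with_extraction_type is hypothetical and only builds an arguments dict; it does not call the server.

```python
# The three values documented on this page.
EXTRACTION_TYPES = ("markdown", "html", "text")


def with_extraction_type(arguments: dict, extraction_type: str = "markdown") -> dict:
    """Return a copy of the tool arguments with a validated extraction_type."""
    if extraction_type not in EXTRACTION_TYPES:
        raise ValueError(
            f"extraction_type must be one of {EXTRACTION_TYPES}, got {extraction_type!r}"
        )
    return {**arguments, "extraction_type": extraction_type}


args = with_extraction_type({"url": "https://example.com"}, "html")
print(args)  # {'url': 'https://example.com', 'extraction_type': 'html'}
```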

CSS Selectors

All tools support CSS selectors for targeted extraction:

# Extract all articles
{"css_selector": "article.post"}

# Extract main content
{"css_selector": "main#content"}

# Extract specific elements
{"css_selector": "div.product-info"}

When css_selector matches multiple elements, all matches are returned in the content array.
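Because multiple matches arrive as separate entries in the content array, a caller may want to merge them into one string. A minimal sketch, assuming the response has already been parsed into a dict with the shape shown under Response Format; joined_content is a hypothetical helper name.

```python
def joined_content(response: dict, separator: str = "\n\n") -> str:
    """Join every entry of a response's content array into a single string."""
    return separator.join(response.get("content", []))


# Made-up response where the selector matched two elements.
response = {
    "status": 200,
    "content": ["First article", "Second article"],
    "url": "https://example.com",
}
print(joined_content(response, separator=" | "))  # First article | Second article
```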

Authentication

HTTP Basic Auth

{
  "url": "https://example.com",
  "auth": {
    "username": "user",
    "password": "pass"
  }
}

Proxy Authentication

{
  "url": "https://example.com",
  "proxy": "http://proxy.example.com:8080",
  "proxy_auth": {
    "username": "proxy_user",
    "password": "proxy_pass"
  }
}
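The proxy parameter of get is also documented as accepting credentials inline, in the form "http://user:pass@host:port". When a username or password contains reserved characters such as @ or :, they need percent-encoding for that URL to parse correctly. The sketch below builds such a URL with the Python standard library. The function name is hypothetical, and it does not handle bracketed IPv6 hosts.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def proxy_url_with_credentials(proxy: str, username: str, password: str) -> str:
    """Embed percent-encoded credentials into a proxy URL (scheme://host[:port])."""
    parts = urlsplit(proxy)
    if parts.hostname is None:
        raise ValueError(f"proxy URL has no host: {proxy!r}")
    # safe="" makes quote() encode every reserved character, including "@" and ":".
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.hostname}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(proxy_url_with_credentials("http://proxy.example.com:8080", "proxy_user", "p@ss:word"))
# http://proxy_user:p%40ss%3Aword@proxy.example.com:8080
```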

Common Patterns

Simple page fetch

{
  "url": "https://example.com",
  "extraction_type": "markdown"
}

Extract article content

{
  "url": "https://news.example.com/article",
  "css_selector": "article.content",
  "extraction_type": "markdown",
  "main_content_only": true
}

Scrape SPA application

{
  "url": "https://spa.example.com",
  "wait_selector": "div.loaded",
  "network_idle": true,
  "extraction_type": "html"
}

Bypass Cloudflare

{
  "url": "https://protected.example.com",
  "solve_cloudflare": true,
  "wait": 2000,
  "extraction_type": "markdown"
}

Bulk scraping with stealth

{
  "urls": [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com"
  ],
  "impersonate": "chrome,firefox,safari",
  "stealthy_headers": true,
  "extraction_type": "markdown"
}

Tool Selection Guide

Simple HTTP sites

Use get or bulk_get for basic HTML pages without JavaScript

JavaScript-heavy sites

Use fetch or bulk_fetch for SPAs and dynamic content

Protected sites

Use stealthy_fetch or bulk_stealthy_fetch for Cloudflare and WAF-protected sites

Multiple URLs

Use bulk variants (bulk_get, bulk_fetch, bulk_stealthy_fetch) for concurrent operations
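The guide above can be expressed as a small decision function. This is only a sketch of the selection logic as described on this page. The function choose_tool and its flags are illustrative assumptions, and it checks protection before JavaScript on the basis that stealthy_fetch accepts all fetch parameters.

```python
def choose_tool(url_count: int, needs_javascript: bool = False, is_protected: bool = False) -> str:
    """Pick a tool name following the selection guide on this page."""
    if url_count < 1:
        raise ValueError("url_count must be at least 1")
    if is_protected:
        base = "stealthy_fetch"   # Cloudflare and WAF-protected sites
    elif needs_javascript:
        base = "fetch"            # SPAs and dynamic content
    else:
        base = "get"              # basic HTML pages without JavaScript
    # Bulk variants handle several URLs concurrently.
    return f"bulk_{base}" if url_count > 1 else base


print(choose_tool(1))                           # get
print(choose_tool(3, needs_javascript=True))    # bulk_fetch
print(choose_tool(2, is_protected=True))        # bulk_stealthy_fetch
```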

Related Documentation

MCP Server

Learn about the MCP server

Setup Guide

Configure MCP server for AI clients
