The website scraper gives agents clean, readable text from any public webpage. It uses requests to fetch the page with a browser-like User-Agent header, then passes the HTML through BeautifulSoup to strip out <script>, <style>, and <noscript> tags before extracting visible text. The result is capped at 10,000 characters so it fits comfortably inside an LLM context window.
Most company home pages convey their core value proposition, product categories, and industry well within this 10,000-character budget.

Source code

website_scraper.py
import requests
from bs4 import BeautifulSoup


def scrape_website(url):
    try:
        # Present a browser-like User-Agent so simple bot filters accept the request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }

        response = requests.get(url, headers=headers, timeout=10)
        # Turn HTTP error statuses (403, 404, 5xx) into exceptions so they
        # surface below as "Scraping Error: ..." strings
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "lxml")

        # Remove tags whose contents never appear as visible text
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()

        # Collapse the remaining markup into whitespace-normalized plain text
        text = soup.get_text(separator=" ", strip=True)

        # Cap the output so it fits comfortably in an LLM context window
        return text[:10000]

    except Exception as e:
        return f"Scraping Error: {str(e)}"

scrape_website(url)

Fetches a URL and returns cleaned plain text extracted from the page body.

Parameters

url
string
required
The full URL of the page to scrape, including the scheme (e.g. "https://www.acmecorp.com").
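
For example (a hypothetical call; acmecorp.com simply stands in for any public site):

# The scheme is required: requests rejects "www.acmecorp.com" on its own
# (MissingSchema), which this function would return as a "Scraping Error: ..." string.
page_text = scrape_website("https://www.acmecorp.com")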

Return value

Returns a str containing up to 10,000 characters of whitespace-normalized plain text from the page. Words are joined with a single space separator and leading/trailing whitespace is stripped. On any exception — network error, timeout, parse failure — the function returns an error string in the format "Scraping Error: <message>" rather than raising an exception.
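
To illustrate that normalization, here is a small standalone sketch (not part of the tool) showing how get_text(separator=" ", strip=True) flattens markup after the unwanted tags are removed:

from bs4 import BeautifulSoup

html = "<h1>  Acme Corp </h1><script>track()</script><p>We build widgets.</p>"
soup = BeautifulSoup(html, "lxml")
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()
print(soup.get_text(separator=" ", strip=True))
# Prints: Acme Corp We build widgets.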

Behavior details

HTTP client: requests.get()
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Timeout: 10 seconds
HTML parser: lxml via BeautifulSoup
Tags removed: <script>, <style>, <noscript>
Text extraction: soup.get_text(separator=" ", strip=True)
Output limit: text[:10000] (first 10,000 characters)
Error handling: returns an error string, never raises
Some sites block scrapers with bot detection or Cloudflare challenges, and others render their content client-side with JavaScript that requests never executes. When scraping fails, the function returns an error string such as "Scraping Error: 403 Client Error" instead of raising an exception, so agents that call scrape_website() must check whether the return value starts with "Scraping Error:", as in the sketch below.
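
A minimal calling pattern along those lines, assuming scrape_website is imported from website_scraper.py and the target URL is illustrative:

from website_scraper import scrape_website

content = scrape_website("https://www.acmecorp.com")

if content.startswith("Scraping Error:"):
    # The scrape was blocked, timed out, or failed to parse; fall back or log it
    print(f"Could not scrape site: {content}")
else:
    # Cleaned page text (up to 10,000 characters), safe to place in an LLM prompt
    print(content[:500])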
