The website scraper gives agents clean, readable text from any public webpage. It uses requests to fetch the page with a browser-like User-Agent header, then passes the HTML through BeautifulSoup to strip out <script>, <style>, and <noscript> tags before extracting visible text. The result is capped at 10,000 characters so it fits comfortably inside an LLM context window.
Most company home pages convey their core value proposition, product categories, and industry well within this 10,000-character budget.

Source code

website_scraper.py
import requests
from bs4 import BeautifulSoup


def scrape_website(url):
    try:
        # Present a browser-like User-Agent so simple bot filters accept the request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }

        response = requests.get(url, headers=headers, timeout=10)
        # Turn HTTP error statuses (403, 404, 5xx) into exceptions so they
        # surface below as "Scraping Error: ..." strings
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "lxml")

        # Remove tags whose contents never appear as visible text
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()

        # Collapse the remaining markup into whitespace-normalized plain text
        text = soup.get_text(separator=" ", strip=True)

        # Cap the output so it fits comfortably in an LLM context window
        return text[:10000]

    except Exception as e:
        return f"Scraping Error: {str(e)}"

scrape_website(url)

Fetches a URL and returns cleaned plain text extracted from the page body.

Parameters

url
string
required
The full URL of the page to scrape, including the scheme (e.g. "https://www.acmecorp.com").
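
For example (a hypothetical call; acmecorp.com simply stands in for any public site):

# The scheme is required: requests rejects "www.acmecorp.com" on its own
# (MissingSchema), which this function would return as a "Scraping Error: ..." string.
page_text = scrape_website("https://www.acmecorp.com")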

Return value

Returns a str containing up to 10,000 characters of whitespace-normalized plain text from the page. Words are joined with a single space separator and leading/trailing whitespace is stripped. On any exception — network error, timeout, parse failure — the function returns an error string in the format "Scraping Error: <message>" rather than raising an exception.
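
To illustrate that normalization, here is a small standalone sketch (not part of the tool) showing how get_text(separator=" ", strip=True) flattens markup after the unwanted tags are removed:

from bs4 import BeautifulSoup

html = "<h1>  Acme Corp </h1><script>track()</script><p>We build widgets.</p>"
soup = BeautifulSoup(html, "lxml")
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()
print(soup.get_text(separator=" ", strip=True))
# Prints: Acme Corp We build widgets.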

Behavior details

HTTP client: requests.get()
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Timeout: 10 seconds
HTML parser: lxml via BeautifulSoup
Tags removed: <script>, <style>, <noscript>
Text extraction: soup.get_text(separator=" ", strip=True)
Output limit: text[:10000] (first 10,000 characters)
Error handling: returns an error string, never raises
Some sites block scrapers with bot detection or Cloudflare challenges, and others render their content client-side with JavaScript that requests never executes. When scraping fails, the function returns an error string such as "Scraping Error: 403 Client Error" instead of raising an exception, so agents that call scrape_website() must check whether the return value starts with "Scraping Error:", as in the sketch below.
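
A minimal calling pattern along those lines, assuming scrape_website is imported from website_scraper.py and the target URL is illustrative:

from website_scraper import scrape_website

content = scrape_website("https://www.acmecorp.com")

if content.startswith("Scraping Error:"):
    # The scrape was blocked, timed out, or failed to parse; fall back or log it
    print(f"Could not scrape site: {content}")
else:
    # Cleaned page text (up to 10,000 characters), safe to place in an LLM prompt
    print(content[:500])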
