The WebScraper class in Halgorithem.web fetches web pages and converts them to plain text for use as truth source documents. It handles standard HTML pages and gives Wikipedia URLs special treatment.
Constructor
List of URLs to scrape. Each URL is processed in order when you call scrape().
Methods
scrape
Fetches each URL in order and writes each successful result to file0.txt, file1.txt, and so on. The counter increments only on success: failed URLs print a warning and are skipped without incrementing it.
Wikipedia URLs (wikipedia.org/wiki/): instead of scraping the HTML page, scrape() calls the clean REST summary API at https://en.wikipedia.org/api/rest_v1/page/summary/{title} and extracts the extract field from the JSON response.
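The Wikipedia branch can be illustrated with a minimal sketch. The helper names `wikipedia_summary_url` and `extract_summary` are hypothetical, and the way the page title is pulled from the URL path is an assumption; only the endpoint template and the `extract` field come from the description above.

```python
from urllib.parse import quote, unquote, urlparse

# Endpoint template taken from the description above.
SUMMARY_API = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"


def wikipedia_summary_url(url):
    """Return the REST summary URL for a wikipedia.org/wiki/ link, else None.

    Hypothetical helper: the real scrape() may derive the title differently.
    """
    parsed = urlparse(url)
    is_wiki = parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/")
    if not is_wiki:
        return None
    title = parsed.path[len("/wiki/"):]
    # Normalise percent-encoding so the title is encoded exactly once.
    return SUMMARY_API.format(title=quote(unquote(title), safe=""))


def extract_summary(payload):
    """Pull the plain-text summary out of a decoded JSON response."""
    return payload.get("extract", "")
```

Keeping the URL mapping separate from the network call makes the routing decision easy to check without issuing a request.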
All other URLs: scrape() fetches the page, parses it with BeautifulSoup, removes the nav, footer, script, style, header, and aside tags, then converts the remaining HTML to markdown-style plain text via html2text.
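To show what removing those tags does to a page, here is a rough sketch that uses only the standard library. It deliberately swaps in `html.parser` for BeautifulSoup and html2text, so it yields whitespace-normalised plain text rather than markdown-style output; `BoilerplateStripper` and `strip_boilerplate` are hypothetical names, not part of Halgorithem.

```python
from html.parser import HTMLParser

# Tags whose entire contents are discarded, matching the list above.
SKIPPED_TAGS = {"nav", "footer", "script", "style", "header", "aside"}


class BoilerplateStripper(HTMLParser):
    """Collects text that sits outside any of the skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join("".join(self._chunks).split())


def strip_boilerplate(html):
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Link text survives while the link target is dropped, which is in line with the link and image stripping listed under the request details.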
Request details:
| Setting | Value |
|---|---|
| Timeout | 5 seconds per request |
| User-Agent | Mozilla/5.0 (compatible; HalgorithemBot/1.0) |
| Link/image handling | Links and images stripped from HTML output |
On a Timeout, HTTPError, or any other Exception, scrape() prints a warning message and continues to the next URL without raising.
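The skip-and-continue behaviour, together with the success-only file counter, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `save_all` and its injected `fetch` callable are hypothetical, and the built-in `TimeoutError` stands in for the library-specific Timeout and HTTPError exceptions.

```python
from pathlib import Path


def save_all(urls, fetch, out_dir):
    """Write each successfully fetched text to file0.txt, file1.txt, ...

    `fetch` is any callable that takes a URL and returns text or raises.
    """
    out_dir = Path(out_dir)
    counter = 0  # advances only after a successful write
    written = []
    for url in urls:
        try:
            text = fetch(url)
        except TimeoutError:
            print(f"Warning: timed out fetching {url}")
            continue
        except Exception as exc:  # broad on purpose: warn, never raise
            print(f"Warning: failed to fetch {url}: {exc}")
            continue
        path = out_dir / f"file{counter}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
        counter += 1
    return written
```

Because the counter moves only after a write succeeds, the output files are numbered contiguously even when some URLs fail, so a file's number does not necessarily match its URL's position in the input list.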
Direct usage
Higher-level alternative: Engine.scrape_urls()
For most use cases you should call Engine.scrape_urls() rather than using WebScraper directly. It manages a temporary directory, resets the file counter, and returns structured dicts instead of writing files you have to manage yourself.
Use Engine.scrape_urls() rather than instantiating WebScraper directly. The Engine wrapper handles temporary directory setup and counter state, and returns structured document dicts that integrate directly with compare_to_docs() and the rest of the Halgorithem pipeline.