The WebScraper class in Halgorithem.web fetches and converts web pages to plain text for use as truth source documents. It handles both standard HTML pages and Wikipedia URLs with special treatment.

Constructor

WebScraper(list_of_urls)
list_of_urls (list[str], required): List of URLs to scrape. Each URL is processed in order when you call scrape().

Methods

scrape

scrape() -> None
Iterates over all URLs provided at construction time, fetches each one, converts it to plain text, and writes the result to a numbered text file in the current working directory. Output files are named file0.txt, file1.txt, and so on; the counter increments only on success. Failed URLs print a warning and are skipped without incrementing the counter, so an output file's number does not necessarily match the URL's position in the list.
Wikipedia URLs (wikipedia.org/wiki/): instead of scraping the HTML page, scrape() calls the REST summary API at https://en.wikipedia.org/api/rest_v1/page/summary/{title} and extracts the extract field from the JSON response.
All other URLs: scrape() fetches the page, parses it with BeautifulSoup, removes nav, footer, script, style, header, and aside tags, then converts the remaining HTML to markdown-style plain text via html2text.
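The Wikipedia branch can be sketched as a small URL-mapping helper. This is a minimal illustration, not the library's code: the name wikipedia_summary_url is hypothetical, and the exact matching rule WebScraper applies is an assumption. Only the endpoint template comes from the description above.

```python
from typing import Optional
from urllib.parse import quote, unquote, urlparse

# Endpoint template quoted in the scrape() description.
SUMMARY_API = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"


def wikipedia_summary_url(url: str) -> Optional[str]:
    """Hypothetical helper: map a Wikipedia article URL to its summary endpoint.

    Returns None for any URL that is not a wikipedia.org/wiki/ page.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    if not is_wikipedia or not parsed.path.startswith("/wiki/"):
        return None
    title = parsed.path[len("/wiki/"):]
    if not title:
        return None
    # Decode, then re-encode, so the title stays one safely escaped path segment.
    return SUMMARY_API.format(title=quote(unquote(title), safe=""))


print(wikipedia_summary_url("https://en.wikipedia.org/wiki/Apollo_11"))
print(wikipedia_summary_url("https://www.britannica.com/event/Apollo-11"))
```

The sketch always targets en.wikipedia.org because that is the only endpoint the description names; how WebScraper treats other language editions is not stated.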
Content from non-Wikipedia URLs is capped at 8,000 characters. For lengthy reference pages this may truncate important information. Consider splitting long sources into multiple targeted URLs or using Wikipedia when a high-quality summary is available.
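Because the cap is a fixed length, one rough way to spot a source that was probably cut off is to check whether its text reaches 8,000 characters. The helpers below are a hypothetical sketch; they assume the cap applies to the final text that ends up in the output file, which the note above does not state explicitly.

```python
MAX_CHARS = 8_000  # cap for non-Wikipedia content, per the note above


def cap_text(text: str, limit: int = MAX_CHARS) -> str:
    """Keep at most `limit` characters, mirroring the documented cap."""
    return text[:limit]


def may_be_truncated(text: str, limit: int = MAX_CHARS) -> bool:
    """Heuristic: text that reaches the cap was probably cut off."""
    return len(text) >= limit


long_page = "x" * 12_000
capped = cap_text(long_page)
print(len(capped), may_be_truncated(capped))  # 8000 True
print(may_be_truncated("a short page"))       # False
```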
Request details:
Setting | Value
Timeout | 5 seconds per request
User-Agent | Mozilla/5.0 (compatible; HalgorithemBot/1.0)
Link/image handling | Links and images stripped from HTML output
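To reproduce the same timeout and User-Agent outside the scraper, the standard library is enough. This sketch uses urllib.request, which is an assumption made for illustration; the page does not say which HTTP client WebScraper uses internally, and build_request and fetch_html are hypothetical names.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; HalgorithemBot/1.0)"
TIMEOUT_SECONDS = 5


def build_request(url: str) -> urllib.request.Request:
    """Attach the documented User-Agent header to a GET request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url: str) -> str:
    """Fetch a page with the documented 5-second timeout (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```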
Error handling: Timeout, HTTPError, and general Exception errors each print a warning message and continue to the next URL without raising.
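The warn-and-continue behavior, together with the success-only file counter, can be sketched as follows. This is an illustration under stated assumptions, not the WebScraper source: fetch is an injected, hypothetical callable, scrape_all is a hypothetical name, and the standard-library TimeoutError and urllib.error.HTTPError stand in for whichever exception classes the library actually catches.

```python
import urllib.error
from pathlib import Path
from typing import Callable, Iterable, List


def scrape_all(urls: Iterable[str], fetch: Callable[[str], str], out_dir: str) -> List[Path]:
    """Write one fileN.txt per successfully fetched URL; skip failures with a warning."""
    written: List[Path] = []
    counter = 0
    for url in urls:
        try:
            text = fetch(url)
        except TimeoutError:
            print(f"warning: timed out fetching {url}")
            continue
        except urllib.error.HTTPError as exc:
            print(f"warning: HTTP {exc.code} for {url}")
            continue
        except Exception as exc:  # any other failure: warn and move on
            print(f"warning: failed to fetch {url}: {exc}")
            continue
        path = Path(out_dir) / f"file{counter}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
        counter += 1  # advances only after a successful write
    return written
```

With three URLs where the second one fails, this produces file0.txt and file1.txt, and file1.txt holds the third URL's text.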
Prefer Wikipedia URLs when a topic has a good Wikipedia article. The REST summary API returns clean, well-structured text without the 8,000 character cap, and avoids JavaScript-heavy or paywalled pages that may return poor HTML.

Direct usage

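import os

from Halgorithem.web import WebScraper

# scrape() writes its output files to the current working directory,
# so create the target directory and switch into it first.
os.makedirs("/tmp/my_sources", exist_ok=True)
os.chdir("/tmp/my_sources")

scraper = WebScraper([
    "https://en.wikipedia.org/wiki/Apollo_11",
    "https://www.britannica.com/event/Apollo-11",
])
scraper.scrape()
# If both URLs succeed, this creates /tmp/my_sources/file0.txt and /tmp/my_sources/file1.txt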
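Since scrape() returns None, the scraped text has to be read back from disk. A minimal sketch of that step, assuming only the fileN.txt naming described above; load_scraped_files is a hypothetical helper, not part of Halgorithem.

```python
import re
from pathlib import Path
from typing import List, Tuple

_FILE_PATTERN = re.compile(r"file(\d+)\.txt")


def load_scraped_files(directory: str) -> List[Tuple[int, str]]:
    """Return (index, text) pairs for every fileN.txt in `directory`, in numeric order."""
    found = []
    for path in Path(directory).iterdir():
        match = _FILE_PATTERN.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort on the integer index so file10.txt comes after file2.txt.
    found.sort(key=lambda item: item[0])
    return [(index, path.read_text(encoding="utf-8")) for index, path in found]
```

Sorting on the parsed integer rather than the file name avoids the lexicographic trap where file10.txt would sort before file2.txt.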

Higher-level alternative: Engine.scrape_urls()

For most use cases you should call Engine.scrape_urls() rather than using WebScraper directly. It manages a temporary directory, resets the file counter, and returns structured dicts instead of writing files you have to manage yourself.
from engine import Engine

eng = Engine()
docs = eng.scrape_urls([
    "https://en.wikipedia.org/wiki/Apollo_11"
])
# docs = [{"file_id": 1, "file_path": "https://...", "text": "..."}]
The structured document dicts returned by Engine.scrape_urls() integrate directly with compare_to_docs() and the rest of the Halgorithem pipeline.
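Assuming the returned list has the shape shown in the example comment (dicts with file_id, file_path, and text keys), downstream code can treat it as plain Python data. The sketch below works on hand-written sample dicts rather than calling Engine, and index_by_source is a hypothetical helper, not part of Halgorithem.

```python
from typing import Dict, List


def index_by_source(docs: List[dict]) -> Dict[str, str]:
    """Map each document's file_path to its text, skipping documents with no text."""
    return {doc["file_path"]: doc["text"] for doc in docs if doc["text"].strip()}


# Sample data shaped like the documented return value (contents are made up).
sample_docs = [
    {"file_id": 1, "file_path": "https://en.wikipedia.org/wiki/Apollo_11", "text": "Apollo 11 was ..."},
    {"file_id": 2, "file_path": "https://example.com/empty", "text": "   "},
]

by_source = index_by_source(sample_docs)
print(list(by_source))  # ['https://en.wikipedia.org/wiki/Apollo_11']
```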
