Halgorithem can scrape web pages, including Wikipedia, to use as truth source documents. This lets you check AI-generated claims against live web content without preparing text files in advance. You provide a list of URLs, and Halgorithem fetches, cleans, and loads the content automatically.
How scraping works
The WebScraper class in Halgorithem/web.py handles fetching. You can instantiate it directly, but in most cases you will use Engine.scrape_urls(), which wraps it and returns structured dicts ready to pass to verify().
Wikipedia URLs
When a URL contains wikipedia.org/wiki/, WebScraper uses the Wikipedia REST API (/api/rest_v1/page/summary/{title}) instead of scraping HTML. This returns clean prose text with no markup, navigation, or boilerplate.
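The mapping from an article URL to the summary endpoint can be sketched as follows. This is a minimal illustration under stated assumptions, not Halgorithem's actual code: the helper name summary_api_url is hypothetical, and it assumes the article title is everything after /wiki/ in the URL path.

```python
from typing import Optional
from urllib.parse import urlsplit


def summary_api_url(page_url: str) -> Optional[str]:
    """Map a Wikipedia article URL to its REST summary endpoint.

    Returns None for URLs that are not Wikipedia article pages.
    (Hypothetical helper for illustration; not part of Halgorithem.)
    """
    parts = urlsplit(page_url)
    marker = "/wiki/"
    if "wikipedia.org" not in parts.netloc or not parts.path.startswith(marker):
        return None
    # Everything after /wiki/ is the (already percent-encoded) article title.
    title = parts.path[len(marker):]
    if not title:
        return None
    # The REST API expects a slash inside a title to be percent-encoded.
    title = title.replace("/", "%2F")
    return f"{parts.scheme}://{parts.netloc}/api/rest_v1/page/summary/{title}"


print(summary_api_url("https://en.wikipedia.org/wiki/Alan_Turing"))
```

Keeping the language subdomain (en, de, and so on) from the original URL means the summary is requested from the same Wikipedia edition as the page.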
Non-Wikipedia URLs
For all other URLs, WebScraper fetches the raw HTML, removes nav, footer, script, style, header, and aside elements using BeautifulSoup, then converts the remaining HTML to plain text.
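To make the cleaning step concrete, here is a rough sketch of the same idea. It deliberately uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without third-party packages. The class and function names are hypothetical, and the exact whitespace of the output will differ from what Halgorithem produces.

```python
from html.parser import HTMLParser

# Elements whose contents are dropped, as listed in the text above.
SKIPPED_TAGS = {"nav", "footer", "script", "style", "header", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the plain text of an HTML document without boilerplate elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Tracking a depth counter rather than a boolean keeps the extractor correct when skipped elements are nested, for example a nav inside a header.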
Each URL has a 5-second request timeout. If a URL times out or returns an HTTP error, it is skipped and a warning is printed to stdout. No exception is raised.
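The skip-on-failure behaviour can be illustrated with a small loop. This is a sketch assuming urllib from the standard library, not the HTTP client Halgorithem actually uses. The names fetch and scrape_all are hypothetical, and the warning text is illustrative only.

```python
import urllib.error
import urllib.request

REQUEST_TIMEOUT_SECONDS = 5  # per-URL timeout stated in the text above


def fetch(url, timeout=REQUEST_TIMEOUT_SECONDS):
    """Download one URL and return its body as text (raises on failure)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetcher=fetch):
    """Return (url, text) pairs for the URLs that could be fetched.

    Failures are reported on stdout and skipped; nothing is re-raised.
    """
    results = []
    for url in urls:
        try:
            results.append((url, fetcher(url)))
        except OSError as exc:
            # URLError, HTTPError, and timeouts are all OSError subclasses.
            print(f"Warning: skipping {url}: {exc}")
    return results
```

Because failures are only printed, a caller that needs to know which URLs were dropped has to compare the returned list against the input list.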
Using Engine.scrape_urls()
Engine.scrape_urls(urls) returns a list of dicts, one per successfully scraped URL. The list has the same format as the output of load_truth_files(), so you can pass the result directly to verify() or combine it with file sources.
Running the full pipeline with URLs
Pass URLs to Engine.run() and it handles scraping, generation, and verification in one call.
result["sources"] contains the list of URLs that were successfully scraped and used as truth documents.
Mixing URLs and local files
You can combine URL sources and local files in the same call. Halgorithem merges them into a single pool of truth documents.