Before Spinney issues a single crawl request, it fetches /robots.txt from the target domain. The file is streamed into the internal ParseText class, which reads it line by line, extracts every Disallow: path listed under User-agent: *, and stores those paths in a Set<string> called forbidden. Every candidate URL is checked against this set before it is queued, so restricted paths are never fetched.
What gets parsed
ParseText processes the robots.txt stream one chunk at a time, splitting each chunk on newlines and passing each line through three handlers in sequence:
- onUserAgent(line): tests the line against the User-agent: regex. When a User-agent: * line is detected, it sets an internal isParsing = true flag. Any other User-agent: value sets isParsing = false, so only wildcard rules are collected.
- onDisallow(line): when isParsing is true and the line matches the Disallow: pattern, the path (everything from the first / onward) is added to the forbidden Set.
- onSiteMap(line): when the line matches the Sitemap: pattern and contains an http URL, isSiteMap is set to true and the URL is stored in site for use as the crawl seed.
parse.end() returns { forbidden, site, isSiteMap } and resets the parser.
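Given a robots.txt that lists Disallow: /admin and Disallow: /private/ under User-agent: *, Disallow: /staging under User-agent: Googlebot, and a Sitemap: line, the parser produces forbidden = Set { '/admin', '/private/' }. The /staging disallow is ignored because it is scoped to Googlebot, not *. The sitemap URL is captured and used as the crawl seed (see Sitemaps).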
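The three handlers can be illustrated with a minimal sketch. This is not Spinney's source: the class name RobotsSketch, the regular expressions, and the example.com sitemap URL are assumptions made for illustration. Like the behavior described above, it splits each chunk on newlines, so it assumes a line never straddles two chunks.

```typescript
// Hypothetical sketch of the documented handlers; not the real ParseText class.
interface ParseResult {
  forbidden: Set<string>;
  site: string;
  isSiteMap: boolean;
}

class RobotsSketch {
  private forbidden = new Set<string>();
  private site = "";
  private isSiteMap = false;
  private isParsing = false;

  // Split one chunk on newlines and run every line through all three handlers.
  write(chunk: string): void {
    for (const raw of chunk.split(/\r?\n/)) {
      const line = raw.trim();
      this.onUserAgent(line);
      this.onDisallow(line);
      this.onSiteMap(line);
    }
  }

  // Collect rules only while inside a "User-agent: *" group.
  private onUserAgent(line: string): void {
    const match = /^user-agent:\s*(.*)$/i.exec(line);
    if (match) this.isParsing = match[1].trim() === "*";
  }

  // Store everything from the first "/" onward.
  private onDisallow(line: string): void {
    if (!this.isParsing || !/^disallow:/i.test(line)) return;
    const start = line.indexOf("/");
    if (start !== -1) this.forbidden.add(line.slice(start).trim());
  }

  // Capture the sitemap URL as the crawl seed.
  private onSiteMap(line: string): void {
    if (!/^sitemap:/i.test(line)) return;
    const start = line.indexOf("http");
    if (start === -1) return;
    this.isSiteMap = true;
    this.site = line.slice(start).trim();
  }

  // Return the collected state and reset the parser.
  end(): ParseResult {
    const result = { forbidden: this.forbidden, site: this.site, isSiteMap: this.isSiteMap };
    this.forbidden = new Set<string>();
    this.site = "";
    this.isSiteMap = false;
    this.isParsing = false;
    return result;
  }
}

// Example input; the sitemap URL is a placeholder.
const robotsTxt = [
  "User-agent: *",
  "Disallow: /admin",
  "Disallow: /private/",
  "",
  "User-agent: Googlebot",
  "Disallow: /staging",
  "",
  "Sitemap: https://example.com/sitemap.xml",
].join("\n");

const parser = new RobotsSketch();
parser.write(robotsTxt);
const result = parser.end();
console.log(Array.from(result.forbidden), result.site, result.isSiteMap);
```

Because isParsing flips to false at the Googlebot group, /staging never reaches the forbidden Set.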
How Disallow enforcement works
Once forbidden is populated, every candidate URL goes through isForbidden() before being added to the crawl queue. Despite its name, the method returns true for URLs that may be crawled and false for URLs that are rejected.
isForbidden() first checks the seen Set. If the URL has already been processed, it returns false immediately, enforcing deduplication across the entire crawl. For new URLs it delegates to _isForbidden(), which iterates over every path in forbidden and calls isMatch() on each.
isMatch() builds a RegExp from the site’s hostname and the disallowed path pattern, then tests the candidate URL against it. If any forbidden path matches, _isForbidden() returns false and the URL is dropped from the crawl queue. URLs that pass all checks return true and proceed to crawling.
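The check can be sketched roughly as follows. Everything here is an assumption for illustration: the GateSketch class, the toPattern() helper, the exact regular expression, the translation of * wildcards, and the point at which a URL is recorded in seen are not taken from Spinney's source. The return convention follows the description above: true means the URL may be crawled.

```typescript
// Escape regex metacharacters, then turn robots.txt "*" wildcards into ".*".
function toPattern(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&").replace(/\\\*/g, ".*");
}

// Build a RegExp from the hostname and a disallowed path, then test the URL.
function isMatch(hostname: string, path: string, url: string): boolean {
  const pattern = new RegExp(`^https?://${toPattern(hostname)}${toPattern(path)}`);
  return pattern.test(url);
}

// Hypothetical stand-in for the crawler's gatekeeping methods.
class GateSketch {
  private seen = new Set<string>();
  private hostname: string;
  private forbidden: Set<string>;

  constructor(hostname: string, forbidden: Set<string>) {
    this.hostname = hostname;
    this.forbidden = forbidden;
  }

  // true = may be crawled, false = duplicate or disallowed.
  isForbidden(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url); // assumption: the URL is recorded at this point
    return this._isForbidden(url);
  }

  private _isForbidden(url: string): boolean {
    const blocked = Array.from(this.forbidden).some((path) =>
      isMatch(this.hostname, path, url)
    );
    return !blocked;
  }
}

const gate = new GateSketch("example.com", new Set(["/admin", "/private/"]));
const firstVisit = gate.isForbidden("https://example.com/about");
const secondVisit = gate.isForbidden("https://example.com/about");
const adminVisit = gate.isForbidden("https://example.com/admin/users");
console.log(firstVisit, secondVisit, adminVisit);
```

In this sketch a disallowed path acts as a prefix, so /admin also rejects /admin/users.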
The overide option
Passing options.overide = true to the constructor bypasses the entire forbidden-path check. Internally, _isForbidden() checks this.isOveride first and returns true immediately, so every URL on the origin domain is treated as allowed.
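A minimal sketch of that short-circuit is shown here. The OverideSketch class and SketchOptions interface are hypothetical, the constructor signature is not Spinney's, and a plain prefix test stands in for isMatch() to keep the example short.

```typescript
// Hypothetical options shape; only the overide flag is modeled.
interface SketchOptions {
  overide?: boolean;
}

class OverideSketch {
  private isOveride: boolean;
  private forbidden: Set<string>;

  constructor(forbidden: Set<string>, options: SketchOptions = {}) {
    this.forbidden = forbidden;
    this.isOveride = options.overide === true;
  }

  // true = allowed. With overide set, the forbidden Set is never consulted.
  _isForbidden(pathname: string): boolean {
    if (this.isOveride) return true;
    const blocked = Array.from(this.forbidden).some((path) => pathname.startsWith(path));
    return !blocked;
  }
}

const rules = new Set(["/admin"]);
const respectful = new OverideSketch(rules);
const overridden = new OverideSketch(rules, { overide: true });
console.log(respectful._isForbidden("/admin/users"), overridden._isForbidden("/admin/users"));
```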
What happens if robots.txt is missing
If the HTTP request for /robots.txt fails (network error, non-200 status, empty response), the forbidden Set is never populated and remains empty. _isForbidden() iterates over an empty set, finds no matches, and returns true for every URL, so all paths on the origin domain are eligible for crawling. Spinney will still apply its own deduplication via the seen Set and will still validate hostnames with isApproved().
URL approval flow
Every href attribute extracted from a crawled page travels through the following pipeline before it can enter the crawl queue:
- Extraction: htmlparser2's WritableStream fires onattribute for every attribute; href values are collected into a sites array.
- Resolution: getURL(pathname) converts relative paths (those beginning with /) into absolute URLs using the origin from the constructor argument.
- Hostname check: isApproved() validates that the resolved URL's hostname starts with the origin hostname, preventing Spinney from wandering off-domain.
- Deduplication + forbidden check: isForbidden() rejects URLs already present in seen and runs _isForbidden() against the forbidden Set for new ones.
- Queued: URLs that pass all checks are returned from getApproved() and passed to _setUp() for the next crawl batch.
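The pipeline after extraction can be sketched as plain functions. The signatures below are assumptions made so the example is self-contained (the real methods live on the crawler instance and take fewer arguments), the hrefs array stands in for the sites array filled by htmlparser2, and a prefix test on the pathname stands in for isMatch().

```typescript
// Resolution: turn paths beginning with "/" into absolute URLs on the origin.
function getURL(origin: string, href: string): string {
  return href.startsWith("/") ? new URL(href, origin).href : href;
}

// Hostname check: the resolved hostname must start with the origin hostname.
function isApproved(origin: string, url: string): boolean {
  try {
    return new URL(url).hostname.startsWith(new URL(origin).hostname);
  } catch {
    return false; // not an absolute URL, for example "#top"
  }
}

// Deduplication + forbidden check, returning the URLs that would be queued.
function getApproved(
  origin: string,
  hrefs: string[],
  seen: Set<string>,
  forbidden: Set<string>
): string[] {
  const approved: string[] = [];
  for (const href of hrefs) {
    const url = getURL(origin, href);
    if (!isApproved(origin, url)) continue; // off-domain or unparseable
    if (seen.has(url)) continue; // already processed
    seen.add(url);
    const pathname = new URL(url).pathname;
    const blocked = Array.from(forbidden).some((path) => pathname.startsWith(path));
    if (!blocked) approved.push(url);
  }
  return approved;
}

const queue = getApproved(
  "https://example.com",
  ["/about", "/admin/users", "https://other.org/x", "/about", "https://example.com/blog", "#top"],
  new Set<string>(),
  new Set(["/admin"])
);
console.log(queue);
```

Only /about and /blog survive in this run: the /admin link is disallowed, the other.org link is off-domain, the second /about is a duplicate, and #top is not a resolvable URL in this sketch.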