When Spinney parses a site’s robots.txt and finds a Sitemap: directive, it uses that sitemap URL as the starting point for the crawl instead of the bare origin URL. Sitemap files are parsed by the internal ParseXML class, which uses xml2js under the hood and handles both sitemap index files (containing pointers to other sitemaps) and standard URL sets (containing individual page URLs).
How sitemap detection works
During robots.txt parsing, ParseText.onSiteMap() scans each line for the pattern Sitemap: http.... When a match is found, it slices the string from the first occurrence of http onward, stores the result in this.site, and sets isSiteMap = true.
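The line scan described above can be sketched roughly as follows. This is a minimal illustration, not Spinney's source: the function name findSitemap, the exact regular expression, and the trimming of trailing whitespace are assumptions made for the example.

```javascript
// Hypothetical stand-in for the sitemap handling in ParseText (not Spinney's actual code).
function findSitemap(robotsTxt) {
  const result = { site: undefined, isSiteMap: false };
  for (const line of robotsTxt.split(/\r?\n/)) {
    // Look for a line of the form "Sitemap: http...".
    if (/^Sitemap:\s*http/.test(line)) {
      // Keep everything from the first occurrence of "http" onward.
      // In this sketch a later Sitemap line overwrites an earlier one.
      result.site = line.slice(line.indexOf('http')).trim();
      result.isSiteMap = true;
    }
  }
  return result;
}

const robots = [
  'User-agent: *',
  'Disallow: /admin',
  'Sitemap: https://example.com/sitemap.xml',
].join('\n');

console.log(findSitemap(robots));
```

Because "Sitemap:" itself does not contain the substring http, slicing from the first occurrence of http yields the URL without the directive prefix.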
parse.end() returns { forbidden, site, isSiteMap }. Back in setUp(), Spinney checks the isSiteMap flag and seeds the crawl accordingly.
When isSiteMap is true, the sitemap URL is passed to _setUp() as the single starting URL. Spinney then fetches it and routes it through ParseXML because the response Content-Type will contain xml.
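As a rough sketch of that seeding decision, assuming a result object shaped like the one parse.end() returns: the helper name chooseSeeds and its href parameter are hypothetical, and the origin fallback follows the behavior described under Fallback behavior.

```javascript
// Hypothetical helper: choose the crawl seed from a parse.end()-style result.
// `href` is the URL the crawl was started with.
function chooseSeeds(parsed, href) {
  const { site, isSiteMap } = parsed;
  if (isSiteMap && site) {
    // A Sitemap: directive was found, so the sitemap URL is the only seed.
    return [site];
  }
  // No sitemap: fall back to the bare origin of the starting URL.
  return [new URL(href).origin];
}

console.log(
  chooseSeeds(
    { forbidden: [], site: 'https://example.com/sitemap.xml', isSiteMap: true },
    'https://example.com/docs/page'
  )
);
```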
Sitemap index vs URL set
ParseXML.write() uses xml2js’s parseStringPromise to turn the raw XML string into a JavaScript object, then checks which root element is present:
sitemapindex: iterates over raw.sitemapindex.sitemap and pushes each loc[0] value into context.sites. These child sitemap URLs are returned as the next batch of sites to crawl, so each one is fetched and parsed in turn.
urlset: iterates over raw.urlset.url and pushes each loc[0] value directly into context.sites. These are individual page URLs returned to the crawl queue.
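The two branches can be illustrated with a small function over an already-parsed object. This sketch assumes the shape xml2js produces with its default options, where each child element becomes an array (hence loc[0]); the function name collectSites is hypothetical, and the XML parsing step itself is left out so the example runs without dependencies.

```javascript
// Sketch of the branch on the root element. `raw` is shaped like the output of
// xml2js with default options: child elements are arrays, so the URL is loc[0].
function collectSites(raw) {
  const sites = [];
  if (raw.sitemapindex) {
    // Sitemap index: each entry points at another sitemap file.
    for (const entry of raw.sitemapindex.sitemap || []) sites.push(entry.loc[0]);
  } else if (raw.urlset) {
    // URL set: each entry is an individual page URL.
    for (const entry of raw.urlset.url || []) sites.push(entry.loc[0]);
  }
  return sites;
}

// Hand-written object standing in for a parsed URL set (illustrative URLs).
const parsedUrlSet = {
  urlset: {
    url: [
      { loc: ['https://example.com/'] },
      { loc: ['https://example.com/about'] },
    ],
  },
};

console.log(collectSites(parsedUrlSet));
```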
Sitemap index example
URL set example
For the sitemap index, ParseXML returns ['https://example.com/sitemap-products.xml', 'https://example.com/sitemap-blog.xml']. These are fed back into _setUp(), which fetches each one and parses the resulting URL sets. For the URL set, the individual page URLs are returned directly and queued for HTML crawling.
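To make the two-level flow concrete, the following sketch resolves a sitemap index down to page URLs. It is an illustration under several assumptions: an in-memory Map stands in for fetching and parsing, the index URL https://example.com/sitemap.xml and the page URLs are made up, and the seen set is an addition of this sketch rather than a documented Spinney feature.

```javascript
// Stand-in for "fetch the URL and parse the XML": each sitemap URL maps to the
// object xml2js would produce for it. The index URL and page URLs are invented.
const parsedByUrl = new Map([
  ['https://example.com/sitemap.xml', { sitemapindex: { sitemap: [
    { loc: ['https://example.com/sitemap-products.xml'] },
    { loc: ['https://example.com/sitemap-blog.xml'] },
  ] } }],
  ['https://example.com/sitemap-products.xml', { urlset: { url: [
    { loc: ['https://example.com/products/widget'] },
  ] } }],
  ['https://example.com/sitemap-blog.xml', { urlset: { url: [
    { loc: ['https://example.com/blog/hello'] },
  ] } }],
]);

function resolvePages(seed) {
  const queue = [seed];
  const pages = [];
  const seen = new Set(); // guards against revisiting a URL (assumption of this sketch)
  while (queue.length > 0) {
    const url = queue.shift();
    if (seen.has(url)) continue;
    seen.add(url);
    const raw = parsedByUrl.get(url);
    if (!raw) {
      // Not a known sitemap, so treat it as an ordinary page URL.
      pages.push(url);
      continue;
    }
    // Child sitemaps and page URLs alike go back onto the queue.
    const entries = raw.sitemapindex ? raw.sitemapindex.sitemap : raw.urlset.url;
    for (const entry of entries) queue.push(entry.loc[0]);
  }
  return pages;
}

console.log(resolvePages('https://example.com/sitemap.xml'));
```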
Fallback behavior
If robots.txt contains no Sitemap: directive, isSiteMap remains false and setUp() falls back to this.decodeURL.origin as the single seed URL. Spinney fetches that page, parses all href attributes found in the HTML, and recursively follows approved links using the standard URL approval flow. No sitemap is required for Spinney to crawl a site. It will discover pages by following links just as a browser would.
Content-Type detection
Spinney does not rely solely on the robots.txt sitemap entry to detect XML. In httpXMLOrDocument(), every response is inspected via isHeaderXML(), which checks both the Content-Type and content-type response headers for the substring xml.
When the response is XML, it is piped into a StringWritable to collect the full body, and then ParseXML.promise() parses it. When the response is HTML (no xml in the Content-Type), it is piped through htmlparser2’s WritableStream instead. This means that even if a sitemap URL is discovered mid-crawl via an href attribute rather than through robots.txt, it will still be processed correctly.
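A header check along these lines might look like the sketch below. It reuses the name isHeaderXML from the text, but the body is an assumption, as is the plain-object headers argument; pickParser is a hypothetical helper that only names the route rather than wiring up real streams.

```javascript
// Hypothetical version of the header check. `headers` is a plain object of
// response headers; both spellings of the header name are inspected.
function isHeaderXML(headers) {
  return [headers['Content-Type'], headers['content-type']].some(
    (value) => typeof value === 'string' && value.includes('xml')
  );
}

// Hypothetical routing helper: returns which parser path a response would take.
function pickParser(headers) {
  return isHeaderXML(headers) ? 'xml' : 'html';
}

console.log(pickParser({ 'content-type': 'application/xml; charset=utf-8' }));
console.log(pickParser({ 'content-type': 'text/html; charset=utf-8' }));
```

Note that a substring test of this kind also matches values such as application/xhtml+xml, so in this sketch such a response would take the XML path.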