Spinney is a Node.js web scraping library that models the entire crawl process as an RxJS Observable stream. Instead of collecting all results upfront, Spinney emits each successfully crawled URL as a next event the moment it is processed, letting your code react in real time. It automatically fetches and parses a target site's robots.txt file to respect Disallow rules, discovers seed URLs from XML sitemaps when available, deduplicates visited pages, and retries failed requests with back-off. Because Spinney is written in TypeScript and ships its own type definitions, it slots naturally into both TypeScript and plain JavaScript projects.
What is Spinney?
Spinney is published to npm as spinney (version 1.0.2) under the MIT licence. Its central design decision is to extend RxJS Observable<any> directly: the class declaration is export default class Spinney extends Observable<any>. This means every Spinney instance is an Observable, and all standard RxJS operators and subscription semantics apply without any adaptor layer.
When you call spinney.subscribe(...), the library begins crawling and calls subscriber.next(url) for every page that is successfully fetched and parsed. When the crawl queue is exhausted it calls subscriber.complete(). If an unrecoverable error occurs it calls subscriber.error(error).
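The event contract in the paragraph above can be illustrated without installing anything. The sketch below is a stand-in, not Spinney's source: CrawlObserver, runCrawl, and the in-memory links map are hypothetical names, and a plain object of callbacks takes the place of an RxJS subscriber. It only shows the order in which next, error, and complete are expected to fire.

```typescript
// Hypothetical stand-in for the subscriber contract; this is not Spinney's code.
interface CrawlObserver {
  next: (url: string) => void;
  error: (err: unknown) => void;
  complete: () => void;
}

// Walks an in-memory link graph instead of making HTTP requests.
function runCrawl(
  seed: string,
  links: Record<string, string[]>,
  observer: CrawlObserver
): void {
  const seen = new Set<string>([seed]);
  const queue: string[] = [seed];
  try {
    while (queue.length > 0) {
      const url = queue.shift() as string;
      const found = links[url];
      if (found === undefined) {
        throw new Error(`cannot fetch ${url}`); // simulated unrecoverable failure
      }
      observer.next(url); // one event per processed page
      for (const link of found) {
        if (!seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      }
    }
  } catch (err) {
    observer.error(err); // terminal event: complete() does not follow
    return;
  }
  observer.complete(); // queue exhausted
}

const events: string[] = [];
runCrawl(
  "/",
  { "/": ["/a", "/b"], "/a": ["/"], "/b": [] },
  {
    next: (url) => events.push(`next ${url}`),
    error: () => events.push("error"),
    complete: () => events.push("complete"),
  }
);
console.log(events); // [ 'next /', 'next /a', 'next /b', 'complete' ]
```

With the real library, the same three callbacks are what you would pass to spinney.subscribe(...).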
Under the hood Spinney relies on four runtime libraries:
| Library | Version | Role |
|---|---|---|
| axios | ^0.26.1 | Makes all HTTP GET requests; responses are streamed |
| htmlparser2 | ^7.2.0 | Parses HTML pages via its streaming WritableStream API |
| xml2js | ^0.4.23 | Parses XML sitemaps into arrays of URLs |
| rxjs | ^7.5.5 | Provides the Observable base class and Subscription type |
Core concepts
Spinney is built around six interlocking ideas that together make crawling reliable and respectful.
Observable Stream
Spinney extends RxJS Observable<any> and pushes each crawled URL through subscriber.next(). You can compose it with any RxJS operators or simply subscribe with plain callbacks.
robots.txt Enforcement
On every crawl Spinney fetches /robots.txt first and stores all Disallow paths in an internal Set<string>. URLs that match a Disallow rule are silently skipped unless you pass overide: true.
Sitemap Traversal
When the robots.txt response contains a Sitemap: directive, Spinney uses that sitemap URL as the initial seed instead of the bare origin. XML sitemap entries are parsed with xml2js and fed directly into the crawl queue.
Automatic Retry
Each page fetch is wrapped in a retry loop with a maximum of 5 attempts (MAX_RETRIES = 5). On a non-404 HTTP error the timeout is increased by (retries × 1000) / 4 milliseconds per attempt. 404 responses resolve immediately without retrying.
URL Deduplication
Every URL that passes the robots.txt check is recorded in an internal seen Set<string>. Any URL already present in that set is dropped, so each page is crawled at most once per subscription.
Batched Fetching
The crawl queue is processed in batches of 4 concurrent requests using Promise.all(promises.splice(0, 4)). This keeps memory usage bounded while saturating network I/O.
How it works
Here is the full crawl lifecycle in order, from construction to completion.
- Constructor: new Spinney(site, options?, config?) stores the target URL, creates a new URL(site) for origin parsing, initialises the empty forbidden and seen sets, and creates an Axios instance with responseType: 'stream'. The Observable's subscriber function is registered but not yet executed.
- Subscribe triggers setUp: when spinney.subscribe(...) is called, RxJS executes the subscriber function, which immediately calls the private setUp() method. Any extra handlers beyond next, error, and complete (such as onattribute and ontext) are captured from the subscribe options and stored in this.cbs for use during HTML parsing.
- robots.txt fetch: setUp() calls httpText('/robots.txt'), which GETs the constructed robots.txt URL and pipes the response stream through ParseText. The resulting object exposes the parsed forbidden set of Disallow paths and an isSiteMap flag.
- Forbidden set is stored: the Disallow paths returned from ParseText are written into this.forbidden via setForbidden(). From this point any URL whose path matches an entry in forbidden will be rejected by _isForbidden(), unless the overide option is true.
- Seed selection: if context.isSiteMap is true, the sitemap URL found in robots.txt is used as the initial entry in the crawl queue; otherwise this.decodeURL.origin (e.g. https://example.com) is used. Either way, an array of one URL is passed to _setUp(sites) to start the recursive crawl loop.
- Batched page fetching: _setUp(sites) processes the current URL batch in groups of four with Promise.all. For each URL, httpXMLOrDocument(site) makes an HTTP GET request. If the response Content-Type header contains "xml", the body is parsed as a sitemap and its URLs are returned for the next round. Otherwise the body is piped through htmlparser2's WritableStream.
- href collection and filtering: the onattribute handler inside httpXMLOrDocument watches for every HTML attribute named href and pushes its value into a local sites array. When the stream finishes, getApproved(sites) runs: it resolves relative paths to full URLs, validates each with a URL regex, checks that the hostname matches the origin, and rejects any URL already in the seen set or matching a Disallow rule.
- next and complete events: subscriber.next(site) fires immediately after a page's stream finishes, passing the crawled URL string to your next handler. The approved hrefs discovered on that page are recursively queued. When _setUp is called with an empty array, subscriber.complete() fires and pause() sets isProcessing to false, cleanly ending the crawl.
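The robots.txt fetch, forbidden-set, and seed-selection steps can be sketched as a pair of pure functions. This is an illustration under assumptions, not ParseText's actual logic, which this page does not show: parseRobots, pickSeed, and the RobotsContext shape are hypothetical, every Disallow line is collected regardless of User-agent group, and only the first Sitemap: directive is kept.

```typescript
// Hypothetical sketch of the robots.txt step; not the library's ParseText.
interface RobotsContext {
  forbidden: Set<string>; // Disallow paths
  isSiteMap: boolean;     // true when a Sitemap: directive was found
  sitemap?: string;       // the sitemap URL, if any
}

function parseRobots(text: string): RobotsContext {
  const context: RobotsContext = { forbidden: new Set<string>(), isSiteMap: false };
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop trailing comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "disallow" && value !== "") {
      context.forbidden.add(value);
    } else if (field === "sitemap" && value !== "" && !context.isSiteMap) {
      context.isSiteMap = true;
      context.sitemap = value;
    }
  }
  return context;
}

// Seed selection: the sitemap URL if one was found, otherwise the bare origin.
function pickSeed(context: RobotsContext, origin: string): string[] {
  return context.sitemap !== undefined ? [context.sitemap] : [origin];
}

const robotsText = [
  "User-agent: *",
  "Disallow: /admin",
  "Disallow: /tmp/ # scratch space",
  "Disallow:",
  "Sitemap: https://example.com/sitemap.xml",
].join("\n");

const robotsContext = parseRobots(robotsText);
console.log(Array.from(robotsContext.forbidden)); // [ '/admin', '/tmp/' ]
console.log(pickSeed(robotsContext, "https://example.com")); // [ 'https://example.com/sitemap.xml' ]
```

An empty Disallow: line means "nothing is disallowed", which is why the sketch skips it rather than adding an empty string to the set.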
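The href filtering step is easier to follow with concrete inputs. The sketch below is a rough approximation of what getApproved is described as doing, with several assumptions: approve and toUrl are hypothetical names, the WHATWG URL constructor replaces the URL regex mentioned above, and a Disallow rule is treated as matching when the path starts with it (the library's real matching rule is not documented here).

```typescript
// Hypothetical sketch of href filtering; names and matching rules are assumptions.
function toUrl(href: string, base: string): URL | null {
  try {
    return new URL(href, base); // also resolves relative paths against the page
  } catch {
    return null; // not parseable as a URL
  }
}

function approve(
  hrefs: string[],
  pageUrl: string,
  seen: Set<string>,
  forbidden: Set<string>
): string[] {
  const originHost = new URL(pageUrl).hostname;
  const rules = Array.from(forbidden);
  const approved: string[] = [];
  for (const href of hrefs) {
    const url = toUrl(href, pageUrl);
    if (url === null) continue;
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    if (url.hostname !== originHost) continue; // stay on the same site
    if (seen.has(url.href)) continue;          // already crawled or queued
    const path = url.pathname;
    // Assumption: prefix match against each Disallow path.
    if (rules.some((rule) => path.startsWith(rule))) continue;
    seen.add(url.href); // only URLs that pass the robots.txt check are recorded
    approved.push(url.href);
  }
  return approved;
}

const seenUrls = new Set<string>(["https://example.com/"]);
const disallowed = new Set<string>(["/admin"]);
const approvedUrls = approve(
  [
    "/about",
    "about",
    "https://example.com/admin/users",
    "https://other.org/x",
    "/about",
    "mailto:hi@example.com",
    "/",
  ],
  "https://example.com/docs/intro",
  seenUrls,
  disallowed
);
console.log(approvedUrls);
// [ 'https://example.com/about', 'https://example.com/docs/about' ]
```

Sharing one seen set across every page is what keeps each URL to at most one crawl per subscription.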
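Batched fetching in groups of four can be sketched with a small helper built on the same splice-and-Promise.all idea. inBatches is a hypothetical name, and one detail is an assumption that may differ from how Spinney builds its promises array: the sketch takes functions that return promises, so that work only starts when its batch is reached.

```typescript
// Hypothetical helper: runs tasks in groups of `size`, awaiting each group in turn.
async function inBatches<T>(
  tasks: Array<() => Promise<T>>,
  size = 4
): Promise<T[]> {
  const pending = tasks.slice(); // copy so the caller's array is left untouched
  const results: T[] = [];
  while (pending.length > 0) {
    const batch = pending.splice(0, size); // removes and returns up to `size` tasks
    const settled = await Promise.all(batch.map((task) => task()));
    results.push(...settled);
  }
  return results;
}

// Demo: ten fake fetches that record how many are in flight at once.
let active = 0;
let peak = 0;
const fakeFetches = Array.from({ length: 10 }, (_, i) => async () => {
  active += 1;
  peak = Math.max(peak, active);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  active -= 1;
  return i;
});

const batchRun = inBatches(fakeFetches).then((results) => {
  console.log(results.length, peak); // 10 4
  return results;
});
```

Because each group is awaited as a whole, one slow request delays the start of the next group; that is the trade-off for the simplicity of this approach.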
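The retry rule from Core concepts, (retries × 1000) / 4 milliseconds with MAX_RETRIES = 5, is simple arithmetic that a worked example makes concrete. Two things below are assumptions, since this page does not state them: the retry counter is taken to start at 1, and the increments are taken to accumulate on top of a caller-supplied base timeout. timeoutIncrement and timeoutSchedule are hypothetical names.

```typescript
const MAX_RETRIES = 5;

// Extra timeout for a given retry count, per the formula (retries × 1000) / 4.
function timeoutIncrement(retries: number): number {
  return (retries * 1000) / 4;
}

// Assumption: increments accumulate on a base timeout, with retries counted from 1.
function timeoutSchedule(baseMs: number): number[] {
  const schedule: number[] = [];
  let timeout = baseMs;
  for (let retries = 1; retries <= MAX_RETRIES; retries++) {
    timeout += timeoutIncrement(retries);
    schedule.push(timeout);
  }
  return schedule;
}

console.log(timeoutSchedule(1000)); // [ 1250, 1750, 2500, 3500, 4750 ]
```

Under these assumptions the increments are 250, 500, 750, 1000, and 1250 ms, so the timeout grows quadratically rather than exponentially.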
Next steps
Installation
Add Spinney to your project with npm, yarn, or pnpm and configure TypeScript.
Quickstart
Build your first working scraper in under 5 minutes.