Spinney: Observable Web Scraper for Node.js Projects

Spinney is a Node.js web scraping library that models the entire crawl process as an RxJS Observable stream. Instead of collecting all results upfront, Spinney emits each successfully crawled URL as a next event the moment it is processed — letting your code react in real time. It automatically fetches and parses a target site’s robots.txt file to respect Disallow rules, discovers seed URLs from XML sitemaps when available, deduplicates visited pages, and retries failed requests with back-off. Because Spinney is written in TypeScript and ships its own type definitions, it slots naturally into both TypeScript and plain JavaScript projects.

What is Spinney?

Spinney is published to npm as spinney (version 1.0.2) under the MIT licence. Its central design decision is to extend RxJS Observable<any> directly: the class declaration is export default class Spinney extends Observable<any>. This means every Spinney instance is an Observable, and all standard RxJS operators and subscription semantics apply without any adaptor layer.

When you call spinney.subscribe(...), the library begins crawling and calls subscriber.next(url) for every page that is successfully fetched and parsed. When the crawl queue is exhausted it calls subscriber.complete(). If an unrecoverable error occurs it calls subscriber.error(error).

Under the hood Spinney relies on four runtime libraries:

| Library | Version | Role |
| --- | --- | --- |
| axios | `^0.26.1` | Makes all HTTP GET requests; responses are streamed |
| htmlparser2 | `^7.2.0` | Parses HTML pages via its streaming `WritableStream` API |
| xml2js | `^0.4.23` | Parses XML sitemaps into arrays of URLs |
| rxjs | `^7.5.5` | Provides the `Observable` base class and `Subscription` type |

Core concepts

Spinney is built around six interlocking ideas that together make crawling reliable and respectful.

Observable Stream

Spinney extends RxJS Observable<any> and pushes each crawled URL through subscriber.next(). You can compose it with any RxJS operators or simply subscribe with plain callbacks.

robots.txt Enforcement

On every crawl Spinney fetches /robots.txt first and stores all Disallow paths in an internal Set<string>. URLs that match a Disallow rule are silently skipped unless you pass overide: true.
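
To make the Disallow check concrete, here is a minimal, dependency-free sketch of collecting Disallow paths into a Set<string> and testing a URL against them. The helper names parseDisallow and isForbidden are hypothetical, the sketch ignores User-agent groups, and it assumes a URL is blocked when its path starts with a Disallow entry; the matching rule Spinney actually applies is not described here.

```typescript
// Hypothetical helper: collect every non-empty Disallow path from a robots.txt body.
// Simplification: User-agent groups are ignored, so all Disallow lines are treated alike.
function parseDisallow(robotsTxt: string): Set<string> {
  const forbidden = new Set<string>();
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop trailing comments
    const match = /^disallow\s*:\s*(.*)$/i.exec(line);
    const path = match?.[1] ?? "";
    if (path !== "") forbidden.add(path);
  }
  return forbidden;
}

// Assumption: a URL is forbidden when its path starts with any Disallow entry.
// The flag is spelled "overide" to mirror the option name used in this document.
function isForbidden(url: string, forbidden: Set<string>, overide = false): boolean {
  if (overide) return false;
  const path = new URL(url).pathname;
  return Array.from(forbidden).some((rule) => path.startsWith(rule));
}
```

With the rules parsed from a file containing Disallow: /admin, isForbidden("https://example.com/admin/users", rules) returns true, and passing true as the third argument switches the check off.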

Sitemap Traversal

When the robots.txt response contains a Sitemap: directive, Spinney uses that sitemap URL as the initial seed instead of the bare origin. XML sitemap entries are parsed with xml2js and fed directly into the crawl queue.
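
The seed choice and sitemap expansion can be sketched as follows. This is an illustration only: findSitemap, chooseSeed, and extractLocs are hypothetical names, and a regular expression stands in for xml2js so that the example stays dependency-free. The regex does not decode XML entities and is not a substitute for a real XML parser.

```typescript
// Returns the first Sitemap: URL found in a robots.txt body, or undefined if none exists.
function findSitemap(robotsTxt: string): string | undefined {
  const match = /^sitemap\s*:\s*(\S+)/im.exec(robotsTxt);
  return match?.[1];
}

// Seed the crawl with the advertised sitemap when there is one, else with the bare origin.
function chooseSeed(robotsTxt: string, site: string): string[] {
  return [findSitemap(robotsTxt) ?? new URL(site).origin];
}

// Swapped-in technique: pull <loc> values out with a regex instead of xml2js.
function extractLocs(sitemapXml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(sitemapXml)) !== null) {
    const loc = match[1];
    if (loc) locs.push(loc);
  }
  return locs;
}
```

Each URL returned by extractLocs would then be treated like any other queue entry, subject to the same robots.txt and deduplication checks.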

Automatic Retry

Each page fetch is wrapped in a retry loop with a maximum of 5 attempts (MAX_RETRIES = 5). On a non-404 HTTP error the timeout is increased by (retries × 1000) / 4 milliseconds per attempt. 404 responses resolve immediately without retrying.
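
A rough sketch of that retry policy, under stated assumptions: the retry counter starts at 1 on the first failure, the caller supplies the base timeout, and the last error is rethrown once all attempts are used up. None of those three details is specified above. The names nextTimeout, fetchWithRetry, and HttpLikeError are hypothetical, and attempt stands for whatever function performs the HTTP request.

```typescript
const MAX_RETRIES = 5;

// Grows the timeout by (retries * 1000) / 4 ms, the increment described above.
function nextTimeout(currentMs: number, retries: number): number {
  return currentMs + (retries * 1000) / 4;
}

// Hypothetical error shape: only the HTTP status matters for this sketch.
interface HttpLikeError {
  status?: number;
}

async function fetchWithRetry<T>(
  attempt: (timeoutMs: number) => Promise<T>,
  baseTimeoutMs: number,
): Promise<T | undefined> {
  let timeoutMs = baseTimeoutMs;
  let lastError: unknown;
  for (let retries = 1; retries <= MAX_RETRIES; retries++) {
    try {
      return await attempt(timeoutMs);
    } catch (err) {
      // A 404 resolves at once with no result and no further attempts.
      if ((err as HttpLikeError | undefined)?.status === 404) return undefined;
      lastError = err;
      timeoutMs = nextTimeout(timeoutMs, retries);
    }
  }
  throw lastError; // assumption: surface the final error after the last attempt
}
```

Starting from a 1000 ms base, the five attempts in this sketch would run with timeouts of 1000, 1250, 1750, 2500, and 3500 ms.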

URL Deduplication

Every URL that passes the robots.txt check is recorded in an internal seen Set<string>. Any URL already present in that set is dropped, so each page is crawled at most once per subscription.
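
The deduplication step amounts to a membership test against a Set. The sketch below uses a hypothetical helper, markIfNew, and compares URL strings exactly; whether Spinney normalises URLs before recording them is not stated here.

```typescript
// Returns true the first time a URL is offered and false on every repeat.
function markIfNew(seen: Set<string>, url: string): boolean {
  if (seen.has(url)) return false;
  seen.add(url);
  return true;
}

const seen = new Set<string>();
const discovered = [
  "https://example.com/",
  "https://example.com/a",
  "https://example.com/", // repeat of the first entry
];

// Only the first occurrence of each URL survives the filter.
const toCrawl = discovered.filter((url) => markIfNew(seen, url));
```

Because the set lives for the duration of one subscription, subscribing again would start from an empty set and crawl every page afresh.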

Batched Fetching

The crawl queue is processed in batches of 4 concurrent requests using Promise.all(promises.splice(0, 4)). This keeps memory usage bounded while still fetching several pages in parallel.
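
The batching pattern can be sketched as below. One assumption to flag: this version batches the inputs and only creates each promise when its batch starts, so at most four are in flight at once. Whether Spinney's promises array holds requests that have already started is not stated in this document. The names takeBatches and runInBatches are hypothetical.

```typescript
const BATCH_SIZE = 4;

// Splits a queue into consecutive groups of at most `size` items, emptying the queue.
function takeBatches<T>(queue: T[], size: number = BATCH_SIZE): T[][] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const batches: T[][] = [];
  while (queue.length > 0) {
    batches.push(queue.splice(0, size)); // same splice(0, n) idiom as above
  }
  return batches;
}

// Runs `worker` over every item, awaiting one whole batch before starting the next.
async function runInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of takeBatches([...items])) {
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}
```

A trade-off of this design is that each batch waits for its slowest request before the next batch begins, so one slow page delays the three that finished early.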

How it works

Here is the full crawl lifecycle in order, from construction to completion.

1. Constructor: new Spinney(site, options?, config?) stores the target URL, creates a new URL(site) for origin parsing, initialises the empty forbidden and seen sets, and creates an Axios instance with responseType: 'stream'. The Observable’s subscriber function is registered but not yet executed.
2. Subscribe triggers setUp: when spinney.subscribe(...) is called, RxJS executes the subscriber function, which immediately calls the private setUp() method. Any extra handlers beyond next, error, and complete (such as onattribute and ontext) are captured from the subscribe options and stored in this.cbs for use during HTML parsing.
3. robots.txt fetch: setUp() calls httpText('/robots.txt'), which GETs the constructed robots.txt URL and pipes the response stream through ParseText. The resulting object exposes the parsed forbidden set of Disallow paths and an isSiteMap flag.
4. Forbidden set is stored: the Disallow paths returned from ParseText are written into this.forbidden via setForbidden(). From this point any URL whose path matches an entry in forbidden will be rejected by _isForbidden(), unless the overide option is true.
5. Seed selection: if context.isSiteMap is true, the sitemap URL found in robots.txt is used as the initial entry in the crawl queue; otherwise this.decodeURL.origin (e.g. https://example.com) is used. Either way, an array of one URL is passed to _setUp(sites) to start the recursive crawl loop.
6. Batched page fetching: _setUp(sites) processes the current URL batch in groups of four with Promise.all. For each URL, httpXMLOrDocument(site) makes an HTTP GET request. If the response Content-Type header contains "xml", the body is parsed as a sitemap and its URLs are returned for the next round. Otherwise the body is piped through htmlparser2’s WritableStream.
7. href collection and filtering: the onattribute handler inside httpXMLOrDocument watches for every HTML attribute named href and pushes its value into a local sites array. When the stream finishes, getApproved(sites) runs: it resolves relative paths to full URLs, validates each with a URL regex, checks that the hostname matches the origin, and rejects any URL already in the seen set or matching a Disallow rule.
8. next and complete events: subscriber.next(site) fires immediately after a page’s stream finishes, passing the crawled URL string to your next handler. The approved hrefs discovered on that page are recursively queued. When _setUp is called with an empty array, subscriber.complete() fires and pause() sets isProcessing to false, cleanly ending the crawl.
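
The href filtering described in the lifecycle above can be approximated as follows. This is a sketch under assumptions, not the body of getApproved: the names approveLinks and CrawlState are hypothetical, the WHATWG URL constructor replaces the URL regex for validation, and Disallow matching is assumed to be a path-prefix test.

```typescript
// Hypothetical bundle of the per-crawl state the filter needs.
interface CrawlState {
  origin: URL;
  seen: Set<string>;
  forbidden: Set<string>;
}

function approveLinks(hrefs: string[], pageUrl: string, state: CrawlState): string[] {
  const approved: string[] = [];
  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, pageUrl); // resolves relative paths against the page
    } catch {
      continue; // not parseable as a URL (stands in for the regex validation)
    }
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue;
    if (resolved.hostname !== state.origin.hostname) continue; // stay on the target site

    const path = resolved.pathname;
    // Assumption: a Disallow entry blocks every path that starts with it.
    if (Array.from(state.forbidden).some((rule) => path.startsWith(rule))) continue;

    const url = resolved.href;
    if (state.seen.has(url)) continue; // already queued or crawled
    state.seen.add(url);
    approved.push(url);
  }
  return approved;
}
```

Recording each URL in seen at approval time, rather than after it is fetched, keeps the same link from being queued twice when it appears on several pages in one batch.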

Next steps

Installation

Quickstart
