This guide walks you through building a working web scraper that streams every crawled URL to your console and collects all href attribute values found across a target site. By the end you will have a complete, runnable TypeScript file that you can point at any website and start collecting data immediately.
1. Install Spinney

Add Spinney to your project with a single command:
yarn add spinney
Spinney pulls in its four runtime dependencies — axios, htmlparser2, rxjs, and xml2js — automatically.
2. Import and instantiate

Create a new file (e.g. scraper.ts) and import the Spinney class. Instantiate it with the URL of the site you want to crawl. No network activity happens yet — the crawl only begins when you call .subscribe().
scraper.ts
import Spinney from 'spinney';

const spinney = new Spinney('https://example.com/');
The constructor signature is:
new Spinney(site: string, options?: Options, config?: AxiosRequestConfig)
  • site — the target URL to crawl (must be a valid absolute URL).
  • options — optional { debug?: boolean; overide?: boolean } object.
  • config — optional Axios request config forwarded to the internal Axios instance.
3. Subscribe to events

Call .subscribe() to start the crawl; it begins as soon as you subscribe. Pass an object containing any combination of the handlers below.
scraper.ts
const subscription = spinney.subscribe({
  next(site) {
    console.log('Crawled:', site);
  },
  ontext(text) {
    console.log('Text:', text);
  },
  onattribute(name, value, quote) {
    console.log('Attribute:', name, value, quote);
  },
  error(error) {
    console.error('Error:', error);
  },
  complete() {
    console.log('Done');
  },
});
| Handler | When it fires | Arguments |
| --- | --- | --- |
| next(site) | After each page is fully fetched and parsed | The crawled URL string |
| onattribute(name, value, quote) | For every HTML attribute on every element of every page | Attribute name, value, and optional quote character |
| ontext(text) | For every text node encountered during HTML parsing | The raw text string |
| error(error) | On any unrecoverable crawl error | The Error object |
| complete() | When the crawl queue is exhausted | None |
The onattribute and ontext handlers are htmlparser2 handler callbacks and are called synchronously as the HTML stream is parsed. The next, error, and complete handlers follow standard RxJS Observable semantics.
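Because onattribute and ontext fire for every attribute and every text node, handlers usually need to filter what they keep. The following is a minimal sketch of a reusable collector, assuming the handler shapes listed above. The CrawlHandlers type and the createCollector function are hypothetical names used for illustration and are not part of Spinney.

```typescript
// Hypothetical handler shape, mirroring the handlers documented above.
type CrawlHandlers = {
  next(site: string): void;
  onattribute(name: string, value: string, quote?: string | null): void;
  ontext(text: string): void;
};

// Builds handlers that keep only what is usually useful: visited pages,
// distinct href values, and non-blank text nodes.
function createCollector() {
  const pages: string[] = [];
  const hrefs = new Set<string>();
  const texts: string[] = [];

  const handlers: CrawlHandlers = {
    next(site) {
      pages.push(site);
    },
    onattribute(name, value) {
      // Compare the attribute name case-insensitively as a precaution.
      if (name.toLowerCase() === 'href' && value) hrefs.add(value);
    },
    ontext(text) {
      const trimmed = text.trim();
      if (trimmed) texts.push(trimmed); // skip whitespace-only nodes
    },
  };

  return { handlers, pages, hrefs, texts };
}
```

The returned handlers object can be spread into the object passed to subscribe(), alongside your own error and complete handlers.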
4. Unsubscribe

Call subscription.unsubscribe() to stop the crawl early. This is the standard RxJS pattern for cancelling a subscription.
subscription.unsubscribe();
Call unsubscribe() when:
  • You have collected the data you need before the crawl finishes naturally.
  • You want to implement a timeout (e.g. stop after 30 seconds regardless of queue depth).
  • Your process is shutting down and you want a clean teardown.
If you do not call unsubscribe() manually, the Observable cleans up automatically when complete() or error() fires.
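
The first case above, stopping once you have enough data, can be sketched as a small wrapper around the next handler. This is an illustration only: stopAfterPages and the Unsubscribable type are hypothetical helpers, and the sketch assumes nothing about the subscription beyond its unsubscribe() method.

```typescript
// Minimal structural type: anything with unsubscribe(), such as an RxJS Subscription.
type Unsubscribable = { unsubscribe(): void };

// Returns a `next` handler that forwards pages to `onPage` and stops the
// crawl once `limit` pages have been seen. `getSubscription` is a getter
// because the subscription only exists after subscribe() has returned.
function stopAfterPages(
  limit: number,
  getSubscription: () => Unsubscribable | undefined,
  onPage: (site: string) => void,
): (site: string) => void {
  let count = 0;
  let stopped = false;
  return (site) => {
    if (stopped) return; // ignore pages that arrive after the limit
    onPage(site);
    count += 1;
    if (count >= limit) {
      stopped = true;
      getSubscription()?.unsubscribe();
    }
  };
}
```

To use it, declare `let subscription: Unsubscribable | undefined` first, then assign the result of spinney.subscribe({ next: stopAfterPages(25, () => subscription, console.log) }) to it. Whether requests already in flight are aborted at that point depends on Spinney's implementation.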

Full example

The following self-contained TypeScript file collects every crawled URL into an array, tracks all href attribute values seen during parsing, and logs a summary on completion. Copy it, update the URL, and run it with ts-node scraper.ts.
scraper.ts
import Spinney from 'spinney';

const TARGET = 'https://example.com/';

const crawledPages: string[] = [];
const collectedHrefs: string[] = [];

const spinney = new Spinney(TARGET, { debug: true });

const subscription = spinney.subscribe({
  next(site: string) {
    crawledPages.push(site);
    console.log(`[${crawledPages.length}] Crawled: ${site}`);
  },

  onattribute(name: string, value: string) {
    if (name === 'href' && value) {
      collectedHrefs.push(value);
    }
  },

  error(error: Error) {
    console.error('Crawl error:', error.message);
  },

  complete() {
    console.log('---');
    console.log(`Crawl complete. Pages visited : ${crawledPages.length}`);
    console.log(`Unique href values collected  : ${new Set(collectedHrefs).size}`);
  },
});

// Optional: stop after 60 seconds
setTimeout(() => {
  console.log('Timeout reached — unsubscribing.');
  subscription.unsubscribe();
}, 60_000);
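
The href values collected this way are raw attribute strings, so they can be relative, repeated, or non-HTTP (for example mailto: links). One possible post-processing step, sketched below with the standard URL class, splits them into internal and external absolute URLs. classifyHrefs is a hypothetical helper, not part of Spinney.

```typescript
// Splits raw href values into internal and external absolute URLs.
// Values that cannot be resolved, or that are not http(s), are skipped.
function classifyHrefs(hrefs: string[], base: string) {
  const origin = new URL(base).origin;
  const internal = new Set<string>();
  const external = new Set<string>();

  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, base); // resolves relative values against the base
    } catch {
      continue; // not a resolvable URL
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    url.hash = ''; // treat /page and /page#section as the same link
    (url.origin === origin ? internal : external).add(url.href);
  }

  return { internal, external };
}
```

Calling classifyHrefs(collectedHrefs, TARGET) inside complete() would give deduplicated counts for links that stay on the target site and links that leave it.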

What happens under the hood

When you call subscribe(), Spinney immediately fetches /robots.txt from the target origin and parses its Disallow directives into an internal forbidden Set. If a Sitemap: entry is present in robots.txt, that sitemap URL becomes the first item in the crawl queue; otherwise the bare origin (e.g. https://example.com) is used as the seed. Pages are then fetched in batches of four concurrent requests using Promise.all. For each HTML response, htmlparser2 streams the body and collects every href attribute value. Those values are resolved to absolute URLs, validated, checked against the forbidden set, and deduplicated against a seen set — only new, permitted URLs enter the next batch. subscriber.next(url) fires as soon as a page finishes parsing, so your next handler receives results progressively rather than all at once at the end.
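The pipeline described above can be approximated in a few small functions. This is a simplified sketch, not Spinney's source: parseDisallow, selectNext, and toBatches are hypothetical names, the robots.txt handling ignores User-agent groups, and the same-origin restriction is an assumption about what "validated" means.

```typescript
// Extracts Disallow paths from robots.txt text. Simplification (assumption):
// every Disallow line is honoured, regardless of its User-agent group.
function parseDisallow(robotsTxt: string): Set<string> {
  const forbidden = new Set<string>();
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = /^\s*disallow\s*:\s*(\S+)/i.exec(line);
    if (match) forbidden.add(match[1]);
  }
  return forbidden;
}

// Resolves raw hrefs against the page URL and keeps only same-origin,
// permitted, not-yet-seen URLs. Accepted URLs are added to `seen`.
function selectNext(
  hrefs: string[],
  pageUrl: string,
  forbidden: Set<string>,
  seen: Set<string>,
): string[] {
  const origin = new URL(pageUrl).origin;
  const accepted: string[] = [];
  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, pageUrl); // resolve to an absolute URL
    } catch {
      continue; // invalid URL
    }
    if (url.origin !== origin) continue; // stay on the target site (assumption)
    const pathname = url.pathname;
    const blocked = Array.from(forbidden).some((rule) => pathname.startsWith(rule));
    if (blocked || seen.has(url.href)) continue;
    seen.add(url.href);
    accepted.push(url.href);
  }
  return accepted;
}

// Splits the queue into batches; each batch would be fetched with Promise.all.
function toBatches<T>(items: T[], size = 4): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

In a crawl loop built on these helpers, each batch from toBatches would be passed to Promise.all, and the hrefs parsed from each response would go through selectNext to form the next queue.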
Pass { debug: true } as the second constructor argument to have Spinney log error messages to stderr via console.error during development. When options.debug is omitted, Spinney uses a no-op function instead, so logging adds negligible overhead in deployed code.
const spinney = new Spinney('https://example.com/', { debug: true });
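The no-op logger pattern mentioned here can be sketched as follows. This illustrates the general technique under the assumption that the flag is read once at construction time; createLogger, Logger, and noop are hypothetical names and do not reproduce Spinney's internals.

```typescript
type Logger = (...args: unknown[]) => void;

// Shared no-op, so disabled logging costs only an empty function call.
const noop: Logger = () => {};

// Picks the logger once, up front, instead of checking the flag
// at every call site.
function createLogger(debug?: boolean): Logger {
  return debug ? (...args) => console.error(...args) : noop;
}
```

Call sites then invoke the logger unconditionally, which keeps the crawl code free of `if (debug)` branches.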
Spinney respects robots.txt Disallow rules by default. To crawl disallowed paths (for example, when scraping your own site in a test environment), pass { overide: true } as the second constructor argument. The option is spelled overide, with a single r, exactly as in the source. With overide: true, the internal _isForbidden check always returns true (meaning allowed), regardless of the forbidden set.
const spinney = new Spinney('https://example.com/', { overide: true });
For a full reference of every constructor parameter, method, and option see the Spinney class API.