This guide walks you through building a working web scraper that streams every crawled URL to your console and collects all href attribute values found across a target site. By the end you will have a complete, runnable TypeScript file that you can point at any website and start collecting data immediately.
Install Spinney
Add Spinney to your project with a single command from your package manager. Spinney pulls in its four runtime dependencies (axios, htmlparser2, rxjs, and xml2js) automatically.
Import and instantiate
Create a new file (e.g. scraper.ts) and import the Spinney class. Instantiate it with the URL of the site you want to crawl. No network activity happens yet; the crawl only begins when you call .subscribe().
The constructor takes three arguments:
- site: the target URL to crawl (must be a valid absolute URL).
- options: optional { debug?: boolean; overide?: boolean } object.
- config: optional Axios request config forwarded to the internal Axios instance.
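As a rough illustration of the first two arguments, the sketch below mirrors the documented options shape in a local type and checks that a site string is an absolute URL before it would be handed to the constructor. The names `SpinneyOptionsSketch` and `isAbsoluteUrl` are hypothetical helpers for this guide, not part of Spinney's API, and restricting the check to http and https is an assumption.

```typescript
// Hypothetical local mirror of the documented options object; the real type
// lives inside Spinney and may differ.
interface SpinneyOptionsSketch {
  debug?: boolean;
  overide?: boolean; // spelled as in the Spinney source
}

// Returns true when `site` parses as an absolute http(s) URL.
// Limiting the check to http/https is an assumption of this sketch.
function isAbsoluteUrl(site: string): boolean {
  try {
    const parsed = new URL(site);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false; // relative paths and malformed strings make URL() throw
  }
}

const options: SpinneyOptionsSketch = { debug: true };
console.log(isAbsoluteUrl("https://example.com"), options);
```

A check like this can fail fast on a relative path such as `/about` before any crawl is set up.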
Subscribe to events
Call .subscribe() to start the crawl. Pass an object containing any combination of the handlers below; the crawl begins as soon as you subscribe.
| Handler | When it fires | Arguments |
|---|---|---|
| next(site) | After each page is fully fetched and parsed | The crawled URL string |
| onattribute(name, value, quote) | For every HTML attribute on every element of every page | Attribute name, value, and optional quote character |
| ontext(text) | For every text node encountered during HTML parsing | The raw text string |
| error(error) | On any unrecoverable crawl error | The Error object |
| complete() | When the crawl queue is exhausted | None |
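The handlers in the table can be sketched as a plain object. The example below is a minimal, network-free illustration: `CrawlHandlers` and `createCollector` are hypothetical names introduced here, and the handlers are invoked by hand to simulate what a crawl would deliver.

```typescript
// Hypothetical type describing the handler object from the table above.
interface CrawlHandlers {
  next?: (site: string) => void;
  onattribute?: (name: string, value: string, quote?: string | null) => void;
  ontext?: (text: string) => void;
  error?: (error: Error) => void;
  complete?: () => void;
}

// Builds handlers that record crawled pages and href values in memory.
function createCollector() {
  const pages: string[] = [];
  const hrefs = new Set<string>();
  const handlers: CrawlHandlers = {
    next: (site) => {
      pages.push(site);
    },
    onattribute: (name, value) => {
      if (name === "href") hrefs.add(value); // ignore every other attribute
    },
    error: (error) => console.error("crawl failed:", error.message),
    complete: () => console.log(`${pages.length} pages, ${hrefs.size} hrefs`),
  };
  return { pages, hrefs, handlers };
}

// Simulate the callbacks a crawl would trigger, without touching the network.
const collector = createCollector();
collector.handlers.onattribute?.("href", "/about", '"');
collector.handlers.onattribute?.("class", "nav", '"');
collector.handlers.next?.("https://example.com");
collector.handlers.complete?.();
```

In a real crawl, the `handlers` object would be what you pass to `.subscribe()`.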
The onattribute and ontext handlers are htmlparser2 handler callbacks and are called synchronously as the HTML stream is parsed. The next, error, and complete handlers follow standard RxJS Observable semantics.
Unsubscribe
Call subscription.unsubscribe() to stop the crawl early. This is the standard RxJS pattern for cancelling a subscription.
Call unsubscribe() when:
- You have collected the data you need before the crawl finishes naturally.
- You want to implement a timeout (e.g. stop after 30 seconds regardless of queue depth).
- Your process is shutting down and you want a clean teardown.
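The first two cases in the list can be sketched with small helpers. `SubscriptionLike`, `stopAfter`, and `stopAfterTimeout` are hypothetical names for this guide; the only assumption about the real object is that it exposes `unsubscribe()`, as an RxJS Subscription does. A stand-in subscription is used so the sketch runs without a crawl.

```typescript
// Minimal shape of an RxJS Subscription; only unsubscribe() is needed here.
interface SubscriptionLike {
  unsubscribe(): void;
}

// Hypothetical helper: returns a `next` handler that cancels the crawl once
// `limit` URLs have been received.
function stopAfter(limit: number, getSubscription: () => SubscriptionLike | undefined) {
  const seen: string[] = [];
  const next = (site: string): void => {
    if (seen.length >= limit) return; // ignore anything arriving after the cut-off
    seen.push(site);
    if (seen.length === limit) getSubscription()?.unsubscribe();
  };
  return { seen, next };
}

// Hypothetical helper: cancels the crawl after `ms` milliseconds and returns
// a function that clears the timer if the crawl ends first.
function stopAfterTimeout(subscription: SubscriptionLike, ms: number): () => void {
  const timer = setTimeout(() => subscription.unsubscribe(), ms);
  return () => clearTimeout(timer);
}

// Stand-in subscription so the sketch runs without a real crawl.
let cancelled = 0;
const fakeSubscription: SubscriptionLike = {
  unsubscribe: () => {
    cancelled += 1;
  },
};
const limited = stopAfter(2, () => fakeSubscription);
["https://example.com/a", "https://example.com/b", "https://example.com/c"].forEach(limited.next);

const clearTimer = stopAfterTimeout(fakeSubscription, 30000);
clearTimer(); // the crawl "finished" early, so the timeout is no longer needed
```

`stopAfter` takes a getter rather than the subscription itself because the subscription only exists once `.subscribe()` has returned, while the `next` handler has to be created beforehand.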
If you do not call unsubscribe() manually, the Observable cleans up automatically when complete() or error() fires.
Full example
The following self-contained TypeScript file collects every crawled URL into an array, tracks all href attribute values seen during parsing, and logs a summary on completion. Copy it, update the URL, and run it with ts-node scraper.ts.
scraper.ts
What happens under the hood
When you call subscribe(), Spinney immediately fetches /robots.txt from the target origin and parses its Disallow directives into an internal forbidden Set. If a Sitemap: entry is present in robots.txt, that sitemap URL becomes the first item in the crawl queue; otherwise the bare origin (e.g. https://example.com) is used as the seed. Pages are then fetched in batches of four concurrent requests using Promise.all. For each HTML response, htmlparser2 streams the body and collects every href attribute value. Those values are resolved to absolute URLs, validated, checked against the forbidden set, and deduplicated against a seen set, so only new, permitted URLs enter the next batch. subscriber.next(url) fires as soon as a page finishes parsing, so your next handler receives results progressively rather than all at once at the end.
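The flow described above can be approximated in a few functions. This is a simplified sketch, not Spinney's actual source: `parseRobots`, `selectNew`, `crawl`, and the `fetchHrefs` callback are hypothetical names, sitemap XML parsing is omitted, and both prefix matching on Disallow paths and the same-origin filter are assumptions. An in-memory fake site replaces real HTTP requests.

```typescript
// Extracts Disallow paths and the first Sitemap URL from robots.txt text.
// Simplified: ignores User-agent grouping, Allow rules, and wildcards.
function parseRobots(text: string): { forbidden: Set<string>; sitemap?: string } {
  const forbidden = new Set<string>();
  let sitemap: string | undefined;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split("#")[0].trim(); // drop trailing comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "disallow" && value) forbidden.add(value);
    if (field === "sitemap" && value && !sitemap) sitemap = value;
  }
  return { forbidden, sitemap };
}

// Resolves hrefs against the page URL and keeps only new, permitted URLs.
// Assumptions: Disallow entries match as path prefixes; other origins are skipped.
function selectNew(hrefs: string[], pageUrl: string, forbidden: Set<string>, seen: Set<string>): string[] {
  const origin = new URL(pageUrl).origin;
  const fresh: string[] = [];
  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, pageUrl); // resolve relative links
    } catch {
      continue; // skip values that cannot be parsed as URLs
    }
    url.hash = ""; // fragments point at the same page
    const path = url.pathname;
    const blocked = Array.from(forbidden).some((prefix) => path.startsWith(prefix));
    if (url.origin !== origin || blocked || seen.has(url.href)) continue;
    seen.add(url.href);
    fresh.push(url.href);
  }
  return fresh;
}

// Crawls in batches of four. `fetchHrefs` stands in for the HTTP request
// plus HTML parsing step; `onPage` plays the role of the next handler.
async function crawl(
  seed: string,
  forbidden: Set<string>,
  fetchHrefs: (url: string) => Promise<string[]>,
  onPage: (url: string) => void,
): Promise<void> {
  const start = new URL(seed).href; // normalise so the seed is deduplicated too
  const seen = new Set<string>([start]);
  let queue = [start];
  while (queue.length > 0) {
    const batch = queue.slice(0, 4);
    const rest = queue.slice(4);
    const found = await Promise.all(
      batch.map(async (url) => {
        const hrefs = await fetchHrefs(url);
        onPage(url); // reported as soon as this page is done
        return selectNew(hrefs, url, forbidden, seen);
      }),
    );
    queue = rest.concat(...found);
  }
}

// In-memory stand-in for a website: page URL -> href values found on it.
const fakeSite: Record<string, string[]> = {
  "https://example.com/": ["/about", "/private/admin", "/about#team"],
  "https://example.com/about": ["/"],
};
const robots = parseRobots("User-agent: *\nDisallow: /private");
const seed = robots.sitemap ?? "https://example.com"; // no Sitemap entry, so use the origin
const visited: string[] = [];
const done = crawl(seed, robots.forbidden, async (url) => fakeSite[url] ?? [], (url) => visited.push(url));
done.then(() => console.log(visited));
```

Because `selectNew` runs synchronously and records each URL in `seen` before returning, two pages in the same batch cannot enqueue the same link twice.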
Spinney respects robots.txt Disallow rules by default. If you need to crawl paths that are disallowed (for example when scraping your own site in a test environment), pass { overide: true } as the second constructor argument; note that the spelling overide matches the source exactly. With overide: true, the _isForbidden check always returns true (allowed) regardless of the forbidden set.
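To make the effect of the flag concrete, here is a minimal sketch of a permission check with an override switch. `isPermitted` is a hypothetical function written for this guide and is not Spinney's `_isForbidden`; matching Disallow entries as path prefixes is an assumption.

```typescript
// Hypothetical permission check: true means the path may be crawled.
// Assumes Disallow entries are matched as path prefixes.
function isPermitted(pathname: string, forbidden: Set<string>, overide = false): boolean {
  if (overide) return true; // robots.txt rules are bypassed entirely
  return !Array.from(forbidden).some((prefix) => pathname.startsWith(prefix));
}

const disallowed = new Set<string>(["/private"]);
console.log(isPermitted("/private/admin", disallowed));       // default: rules respected
console.log(isPermitted("/private/admin", disallowed, true)); // override: rules ignored
```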