Spinney is the main, and only, export of the spinney package. It extends Observable<any> from RxJS, so any RxJS operator (take, filter, tap, and so on) can be applied through pipe(), and it can be subscribed to directly.

Internally, the constructor does three things. It creates the Axios HTTP instance with responseType: 'stream' set, initialises a seen Set for deduplication and a forbidden Set for robots.txt paths, and defines the Observable source function that calls setUp() on first subscribe. setUp() fetches robots.txt, populates forbidden, then calls resume(), which sets isProcessing to true and kicks off the recursive batch crawl.

Import

// ES Module / TypeScript
import Spinney from 'spinney';

// CommonJS
const Spinney = require('spinney');

Constructor signature

new Spinney(
  site: string,
  options?: Options,
  config?: AxiosRequestConfig
): Spinney

Parameters

site
string
required
The target URL to crawl. Must be a valid URL that new URL(site) can parse; an invalid or empty string throws a TypeError at construction time.
Example: 'https://example.com/'

options
Options
Scraping behaviour flags. See the Options type reference for full details. Each field defaults to false when it, or the whole object, is omitted.
Default: { overide: false, debug: false }

config
AxiosRequestConfig
Passed to axios.create() via Object.assign({ responseType: 'stream' }, config ?? {}). Spinney always requests responseType: 'stream', so leave that key out of config: with the argument order shown, Object.assign would let a responseType in config replace it. All other Axios config (timeouts, headers, proxy, etc.) is respected as-is.
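
The merge expression quoted for config can be checked with plain objects, without Axios. The sketch below is illustrative only: RequestConfig and mergeConfig are hypothetical stand-ins, not part of Spinney or Axios, and the sketch does not show whether Spinney re-applies 'stream' after merging.

```typescript
// Hypothetical plain-object stand-in for AxiosRequestConfig (illustration only).
type RequestConfig = {
  responseType?: string;
  timeout?: number;
  headers?: Record<string, string>;
};

// Mirrors the quoted expression: the default object first, caller config second.
function mergeConfig(config?: RequestConfig): RequestConfig {
  const base: RequestConfig = { responseType: 'stream' };
  return Object.assign(base, config ?? {});
}

const mergedDefaults = mergeConfig();
const mergedTimeout = mergeConfig({ timeout: 15000 });
const mergedResponseType = mergeConfig({ responseType: 'json' });

console.log(mergedDefaults.responseType);     // 'stream'
console.log(mergedTimeout.responseType);      // 'stream'
console.log(mergedTimeout.timeout);           // 15000
console.log(mergedResponseType.responseType); // 'json': later sources win in Object.assign
```

In practice this means timeouts, headers, and similar keys pass through untouched, while a responseType key is best left out of config.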

Throws

The constructor calls new URL(site) synchronously. If site is not a valid URL (an empty string included), the URL constructor throws a TypeError, which propagates out of new Spinney(...) before any network activity begins.

// From the test suite:
expect(() => new Spinney('')).toThrow();
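
Because the failure comes from the standard URL constructor, a site string can be checked before it reaches new Spinney(...). The helper below is a minimal sketch; isParsableUrl is a hypothetical name and is not exported by spinney.

```typescript
// Hypothetical helper: reports whether `new URL(site)` would accept the string.
function isParsableUrl(site: string): boolean {
  try {
    new URL(site); // throws a TypeError on invalid input
    return true;
  } catch {
    return false;
  }
}

console.log(isParsableUrl('https://example.com/')); // true
console.log(isParsableUrl(''));                     // false
console.log(isParsableUrl('example.com'));          // false: no scheme and no base URL
```

A guard like this lets calling code report a bad URL in its own terms instead of catching the TypeError from the constructor.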

Internal state initialised

When the constructor runs, it sets up the following private fields that drive the crawl lifecycle:
  • cbs — an object registry that stores htmlparser2 handler callbacks (e.g. onattribute, ontext) passed in through subscribe(), kept separate from the RxJS observer callbacks.
  • debug — either a no-op function (when options.debug is false or omitted) or console.error(error?.message) (when options.debug is true). Called on non-fatal internal errors.
  • isOveride — boolean derived from options?.overide ?? false. When true, the forbidden Set is bypassed and all URLs are treated as crawlable.
  • isProcessing — starts as false. Flipped to true by resume() once robots.txt has been fetched and parsed inside the httpText('/robots.txt').then(...) callback in setUp(), and back to false by pause() when the crawl ends or a fatal error occurs.
  • forbidden — a Set<string> of disallowed path patterns parsed from the site’s robots.txt. Initially empty; populated by setForbidden() once robots.txt has been fetched.
  • seen — a Set<string> of URLs already visited or queued. Used to deduplicate the crawl queue so the same page is never fetched twice.
  • site — the raw site string passed to the constructor, stored for use in getURL().
  • decodeURL — a URL object constructed from site, used to extract hostname and origin for URL matching.
  • axiosInstance — the Axios instance created with axios.create(Object.assign({ responseType: 'stream' }, config ?? {})).
  • subscriber — the RxJS subscriber object captured inside the Observable source function when subscribe() is first called. Used internally to emit next, error, and complete notifications.
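
How seen, forbidden, and isOveride could interact can be sketched with a small stand-alone class. This is not Spinney's code: CrawlFilter, disallow, and accept are hypothetical names, and two behaviours are assumptions made for the illustration, namely that forbidden entries are matched as plain path prefixes and that only same-hostname URLs are accepted.

```typescript
// Hypothetical model of the crawl-filter state; not Spinney's actual implementation.
class CrawlFilter {
  private readonly seen = new Set<string>();
  private readonly forbidden = new Set<string>();
  private readonly base: URL;
  private readonly isOveride: boolean; // spelling follows the `overide` option

  constructor(site: string, isOveride = false) {
    this.base = new URL(site);
    this.isOveride = isOveride;
  }

  // Assumption: robots.txt rules are stored as plain path prefixes.
  disallow(pathPrefix: string): void {
    this.forbidden.add(pathPrefix);
  }

  // Returns true only the first time an allowed, same-host URL is offered.
  accept(href: string): boolean {
    const url = new URL(href, this.base);
    if (url.hostname !== this.base.hostname) return false;
    if (this.seen.has(url.href)) return false;
    const blocked = Array.from(this.forbidden).some((prefix) =>
      url.pathname.startsWith(prefix)
    );
    if (blocked && !this.isOveride) return false;
    this.seen.add(url.href);
    return true;
  }
}

const crawlFilter = new CrawlFilter('https://example.com/');
crawlFilter.disallow('/admin');
console.log(crawlFilter.accept('/about'));                 // true
console.log(crawlFilter.accept('/about'));                 // false: already seen
console.log(crawlFilter.accept('/admin/users'));           // false: forbidden prefix
console.log(crawlFilter.accept('https://other.example/')); // false: different hostname

const overrideFilter = new CrawlFilter('https://example.com/', true);
overrideFilter.disallow('/admin');
console.log(overrideFilter.accept('/admin/users'));        // true: forbidden set bypassed
```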

Example

import Spinney from 'spinney';
import { AxiosRequestConfig } from 'axios';

const config: AxiosRequestConfig = {
  timeout: 15000,
  headers: { 'User-Agent': 'MyBot/1.0 (+https://mysite.com/bot)' },
};

const spinney = new Spinney(
  'https://example.com/',
  { debug: true, overide: false },
  config
);

Note: isProcessing starts as false at construction time. resume() sets it to true only after the robots.txt response has been received and parsed inside setUp(). The internal batch loop (_setUp) checks isProcessing before processing each URL batch, so no crawling occurs until resume() has been called.
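
The gating role of isProcessing can be pictured with a small model. The class below is a sketch under assumptions, not Spinney's internals: BatchGate, runBatch, and processed are hypothetical names, and only isProcessing, resume(), and pause() come from the description above.

```typescript
// Hypothetical stand-in for the isProcessing gate; not Spinney's actual code.
class BatchGate {
  isProcessing = false;
  readonly processed: string[] = [];

  resume(): void {
    this.isProcessing = true;
  }

  pause(): void {
    this.isProcessing = false;
  }

  // Processes the batch only while the gate is open; returns how many URLs ran.
  runBatch(urls: string[]): number {
    if (!this.isProcessing) return 0;
    for (const url of urls) {
      this.processed.push(url);
    }
    return urls.length;
  }
}

const gate = new BatchGate();
console.log(gate.runBatch(['/a', '/b'])); // 0: resume() has not been called yet
gate.resume();
console.log(gate.runBatch(['/a', '/b'])); // 2
gate.pause();
console.log(gate.runBatch(['/c']));       // 0: paused again
```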
