Beyond subscribe(), the Spinney class exposes several public utility methods. Some — getURL, isApproved, getApproved, and isMatch — are used internally during the crawl but are part of the public interface and can be called directly for testing, custom filtering logic, or introspection. Others — pause and resume — directly control the crawl’s state machine by toggling the isProcessing flag that gates the internal batch loop.

pause()

spinney.pause(): void
Sets isProcessing to false, which prevents the internal _setUp batch loop from scheduling any further URL processing. Called automatically when the crawl completes (subscriber.complete() has been called) or when a fatal error causes the Observable to terminate. You can also call it manually to temporarily halt crawling.
// Pause the scraper temporarily
spinney.pause();

resume()

spinney.resume(): void
Sets isProcessing to true, allowing the batch loop to begin or continue processing URL batches. This is called automatically by setUp() immediately after the robots.txt response is received and parsed. Calling resume() after a fatal error will set the flag but will not restart a terminated Observable — you must construct a new Spinney instance to begin a fresh crawl.
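The effect of the flag can be pictured with a small model. This is a sketch, not Spinney's source: the CrawlGate class and its processBatches method are hypothetical, and only the isProcessing flag and the pause/resume names come from the descriptions above.

```typescript
// Hypothetical model of a flag-gated batch loop; not Spinney's implementation.
class CrawlGate {
  isProcessing = false;
  readonly processed: string[] = [];

  pause(): void {
    this.isProcessing = false;
  }

  resume(): void {
    this.isProcessing = true;
  }

  // Consumes batches until the queue is empty or the gate is closed,
  // then returns whatever was left unprocessed.
  processBatches(batches: string[][]): string[][] {
    const pending = [...batches];
    while (this.isProcessing && pending.length > 0) {
      const batch = pending.shift()!;
      this.processed.push(...batch);
    }
    return pending;
  }
}

const gate = new CrawlGate();
const queue = [['/a', '/b'], ['/c']];

const untouched = gate.processBatches(queue); // gate closed: nothing is processed
gate.resume();
const remaining = gate.processBatches(untouched); // gate open: the queue drains
```

In this model a paused gate leaves every batch in place, which is the behaviour the text attributes to isProcessing being false.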

getURL()

spinney.getURL(pathname: string): string
Resolves a path or URL string to an absolute URL relative to the scraper’s base site. The resolution rules are:
  • If pathname starts with // (two forward slashes), the first / is stripped with pathname.slice(1) and the result is set as the pathname of a new URL built from the base site.
  • If pathname starts with / (single forward slash), it is set directly as the pathname of a new URL built from the base site.
  • If pathname does not start with / (e.g. an absolute URL like https://other.com), it is returned unchanged.
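The three rules above can be sketched as a standalone function. The name resolvePath and the explicit base parameter are illustrative assumptions; in Spinney the base site comes from the instance.

```typescript
// Sketch of the three resolution rules; resolvePath and base are illustrative names.
function resolvePath(pathname: string, base: string): string {
  if (pathname.startsWith('//')) {
    const url = new URL(base);
    url.pathname = pathname.slice(1); // '//collections' becomes '/collections'
    return url.toString();
  }
  if (pathname.startsWith('/')) {
    const url = new URL(base);
    url.pathname = pathname; // root-relative path replaces the base pathname
    return url.toString();
  }
  return pathname; // anything else is returned unchanged
}

const fromSingleSlash = resolvePath('/path', 'https://www.example.com/');
const fromDoubleSlash = resolvePath('//collections', 'https://www.example.com/');
const passedThrough = resolvePath('https://other.com', 'https://www.example.com/');
```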

Parameters

pathname
string
required
A root-relative path (e.g. /about), double-slash path (e.g. //collections), or absolute URL (e.g. https://other.com).

Returns

string: a fully qualified URL string when pathname starts with /; otherwise pathname is returned unchanged.

Throws

Throws a TypeError with the message 'pathname is not type string' if pathname is not a string.

Examples

const spinney = new Spinney('https://www.example.com/');

spinney.getURL('/path');              // => 'https://www.example.com/path'
spinney.getURL('https://other.com'); // => 'https://other.com'

isMatch()

spinney.isMatch(testPathname: string, basePathname: string): boolean
Tests whether basePathname (a full URL string) matches the pattern defined by testPathname (a disallow path entry from robots.txt). The method proceeds as follows:
  1. It runs RegExps.ForwardSlashWord.test(testPathname). If that returns false, isMatch returns false immediately.
  2. Otherwise it finds the index of '/' in testPathname.
  3. If the index is found (i.e. Not(index === -1) is true), it slices testPathname from that index to get the pathname portion and builds a RegExp via RegExps.getHostnameAndPathname(hostname, pathname). Otherwise it builds the RegExp from testPathname directly.
  4. The RegExp is then tested against basePathname.
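The flow described above might look roughly like the following. Everything here is an assumption made for illustration: forwardSlashWord and hostnameAndPathname are hypothetical stand-ins for RegExps.ForwardSlashWord and RegExps.getHostnameAndPathname, whose real patterns are not shown on this page, and the wildcard handling and the use of indexOf are guesses.

```typescript
// Hypothetical stand-in for RegExps.ForwardSlashWord; the real pattern may differ.
const forwardSlashWord = /\/\w+/;

// Escapes characters that have a special meaning inside a RegExp.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical stand-in for RegExps.getHostnameAndPathname.
// Assumption: '*' in a robots.txt path matches any run of characters.
function hostnameAndPathname(hostname: string, pathname: string): RegExp {
  const pathPattern = pathname.split('*').map(escapeRegExp).join('.*');
  return new RegExp(escapeRegExp(hostname) + pathPattern);
}

function matchesDisallow(testPathname: string, fullUrl: string, hostname: string): boolean {
  if (!forwardSlashWord.test(testPathname)) {
    return false; // no "/word" segment, so there is nothing to match
  }
  const index = testPathname.indexOf('/');
  // The fallback branch mirrors the description; with this stand-in pattern
  // a '/' is always present once the first check has passed.
  const pathname = index !== -1 ? testPathname.slice(index) : testPathname;
  return hostnameAndPathname(hostname, pathname).test(fullUrl);
}

const blocked = matchesDisallow(
  '/*/collections/name',
  'https://www.example.com/dontdoit/collections/name',
  'www.example.com'
);
const allowed = matchesDisallow(
  '/*/collections/name',
  'https://www.example.com/dontdoit/collections',
  'www.example.com'
);
```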

Parameters

testPathname
string
required
A robots.txt-style disallow path, such as /private or /*/collections/name. Patterns are taken directly from the robots.txt Disallow: lines.
basePathname
string
required
A full absolute URL to test against the pattern, e.g. 'https://www.example.com/dontdoit/collections/name'.

Returns

boolean: true if basePathname matches the pattern built from testPathname and the scraper’s hostname. Returns false if testPathname contains no path segment matching RegExps.ForwardSlashWord.

Examples

const spinney = new Spinney('https://www.example.com/');

spinney.isMatch(
  '/*/collections/name',
  'https://www.example.com/dontdoit/collections/name'
); // => true

spinney.isMatch(
  '/*/collections/name',
  'https://www.example.com/dontdoit/collections'
); // => false

isApproved()

spinney.isApproved(site: string): boolean
Returns true if site passes all three approval checks:
  1. It is a syntactically valid URL (tested against RegExps.getURL()).
  2. Its hostname or origin starts with the base site’s hostname or origin — i.e. it belongs to the same domain.
  3. It has not been visited before (not in seen) and is not forbidden by any robots.txt Disallow rule (checked via isForbidden()). As a side effect, an approved URL is immediately added to the seen Set to prevent re-queuing.
Used internally to filter candidate URLs before adding them to the crawl batch queue.
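A compact model of the three checks and the seen side effect is sketched below. The Approver class is hypothetical, and the checks are simplified stand-ins: URL parsing replaces RegExps.getURL(), and a plain prefix test replaces isForbidden().

```typescript
// Illustrative model only; the checks are simplified stand-ins for Spinney's internals.
class Approver {
  readonly seen = new Set<string>();
  private readonly base: URL;
  private readonly forbidden: Set<string>;

  constructor(base: string, forbidden: Set<string>) {
    this.base = new URL(base);
    this.forbidden = forbidden;
  }

  isApproved(site: string): boolean {
    let url: URL;
    try {
      url = new URL(site); // check 1: syntactically valid URL
    } catch {
      return false;
    }
    const { hostname, pathname } = url;
    if (!hostname.startsWith(this.base.hostname)) {
      return false; // check 2: same hostname as the base site
    }
    // Assumption: a disallow entry forbids every path that starts with it.
    const isForbidden = Array.from(this.forbidden).some((path) => pathname.startsWith(path));
    if (this.seen.has(site) || isForbidden) {
      return false; // check 3: unseen and not disallowed
    }
    this.seen.add(site); // side effect: an approved URL is marked as seen
    return true;
  }
}

const approver = new Approver('https://www.example.com/', new Set(['/private']));
const firstVisit = approver.isApproved('https://www.example.com/about');
const secondVisit = approver.isApproved('https://www.example.com/about');
```

Because approval records the URL in seen, asking about the same URL twice yields different answers, which is worth keeping in mind when calling isApproved() directly in tests.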

Parameters

site
string
required
An absolute URL string to evaluate, e.g. 'https://www.example.com/about'.

Returns

boolean: true if the URL is valid, on the same domain, unseen, and not forbidden.

getApproved()

spinney.getApproved(hrefs: string[]): string[]
Accepts an array of raw href attribute values collected from a single HTML page, runs each one through getURL() to resolve relative paths to absolute URLs, then filters the results through isApproved(). The returned array contains only the URLs that are valid, same-domain, unseen, and not robots.txt-forbidden. This is the primary mechanism by which Spinney builds the next batch of URLs to crawl.
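The resolve-then-filter shape of this method can be sketched in a few lines. The approveHrefs function and both callbacks are assumptions standing in for getURL() and isApproved(); they only approximate the real checks.

```typescript
// Sketch of the resolve-then-filter pipeline; both callbacks are stand-ins
// for getURL() and isApproved().
function approveHrefs(
  hrefs: string[],
  resolve: (href: string) => string,
  approve: (url: string) => boolean
): string[] {
  return hrefs.map(resolve).filter(approve);
}

const base = 'https://www.example.com';
const seen = new Set<string>();

const approved = approveHrefs(
  ['/about', '/about', '#anchor', 'https://external.com'],
  // Simplified resolution: only root-relative paths are joined to the base.
  (href) => (href.startsWith('/') ? base + href : href),
  // Simplified approval: same-site prefix and not seen before.
  (url) => {
    if (!url.startsWith(base) || seen.has(url)) return false;
    seen.add(url);
    return true;
  }
);
```

In this model the duplicate path, the fragment, and the external URL are all dropped, leaving a single approved URL.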

Parameters

hrefs
string[]
required
Raw href attribute values collected from an HTML page, e.g. ['/about', 'https://example.com/blog', '#anchor', 'https://external.com'].

Returns

string[] — filtered array of approved, fully-qualified absolute URLs ready to be added to the crawl queue.

toArray()

spinney.toArray(data: any): any[]
Returns data as an array. If data is already an array it is returned as-is; otherwise it is wrapped in [data]. Used internally to normalise values before array operations.
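A minimal sketch of that behaviour, written as a free function with unknown in place of any (an assumption made here for type safety):

```typescript
// Arrays pass through untouched; every other value is wrapped in a new array.
function toArray(data: unknown): unknown[] {
  return Array.isArray(data) ? data : [data];
}

const original = ['/a', '/b'];
const passedThrough = toArray(original); // same array instance
const wrapped = toArray('/a'); // new one-element array
```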

Parameters

data
any
required
Any value. Arrays are passed through; all other values are wrapped.

Returns

any[]

isArrayEmpty()

spinney.isArrayEmpty(data: any): boolean
Returns true if data is not an array, or if it is an array with a length of 0. Implemented as Not(Array.isArray(data)) || data.length === 0, where Not(condition) returns true when condition is strictly false. Used internally as the base case for the recursive _setUp batch loop — when the pending sites array is empty, the crawl completes and subscriber.complete() is called.
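The expression and its role as a recursion base case can be sketched as follows. Not and isArrayEmpty follow the description above; the drain function is a hypothetical stand-in for the batch loop and does no real crawling.

```typescript
// Not mirrors the helper described above: true only for a strict false.
function Not(condition: unknown): boolean {
  return condition === false;
}

function isArrayEmpty(data: unknown): boolean {
  return Not(Array.isArray(data)) || (data as unknown[]).length === 0;
}

// Hypothetical recursive loop that stops once no pending sites remain.
function drain(pending: string[], visited: string[] = []): string[] {
  if (isArrayEmpty(pending)) {
    return visited; // base case: this is where the crawl would complete
  }
  const [next, ...rest] = pending;
  return drain(rest, [...visited, next]);
}

const visitedSites = drain(['/a', '/b']);
```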

Parameters

data
any
required
The value to check. Typically the current batch of pending URLs.

Returns

boolean: true if data is not an array or is an empty array.

setForbidden()

spinney.setForbidden({ forbidden }: { forbidden: Set<string> }): void
Assigns a new Set<string> of disallowed path patterns to the instance’s forbidden field, replacing any previously stored set. Called automatically by setUp() with the result of parsing the site’s robots.txt via ParseText. Can also be called directly if you need to inject a custom forbidden set before or between crawls.
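When injecting a custom set, the paths have to come from somewhere. The sketch below builds one from raw robots.txt text; parseDisallow is a hypothetical, much simplified stand-in for ParseText that ignores User-agent grouping and Allow rules.

```typescript
// Simplified stand-in for ParseText: collects every non-empty Disallow path.
function parseDisallow(robotsTxt: string): Set<string> {
  const forbidden = new Set<string>();
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = /^\s*disallow\s*:\s*(\S+)/i.exec(line);
    if (match) {
      forbidden.add(match[1]);
    }
  }
  return forbidden;
}

const robotsTxt = [
  'User-agent: *',
  'Disallow: /private',
  'Disallow: /admin',
  'Disallow:', // an empty rule disallows nothing and is skipped
].join('\n');

const forbidden = parseDisallow(robotsTxt);
// With a Spinney instance in scope, the set could then be injected via
// spinney.setForbidden({ forbidden });
```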

Parameters

forbidden
Set<string>
required
A Set of path strings matching robots.txt Disallow: entries, e.g. new Set(['/private', '/admin']).

Returns

void
