Spinney ships its type definitions in lib/index.d.ts, generated from the TypeScript source at build time. Only the Spinney class itself is part of the public API — module.exports = Spinney is the sole export. The Options type, internal helper classes (ParseText, ParseXML, StringWritable), and the constants module are not re-exported, but they are documented here for completeness and for developers who want to understand how the library works under the hood.

Options

Options is the type of the second argument to the Spinney constructor. It is defined in src/types.ts but is not re-exported from the package entry point, so import type { Options } from 'spinney' does not work in application code. To type an options object, declare the same shape locally or annotate the object inline.
type Options = {
  overide?: boolean; // default: false
  debug?: boolean;   // default: false
};
overide
boolean
When true, bypasses all robots.txt Disallow checks. Every URL on the target domain is treated as crawlable regardless of what robots.txt says. Defaults to false.
The property name is spelled overide (single r) in the source and type definition. This is an intentional quirk of the library — using override (double r) will silently have no effect.
debug
boolean
When true, non-fatal internal errors are written to stderr via console.error(error?.message). This includes HTTP errors that are retried and URL parsing failures that are swallowed. Defaults to false.

Usage

import Spinney from 'spinney';

const spinney = new Spinney('https://example.com/', {
  overide: false,
  debug: true,
});
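
Because Options is not re-exported, application code has to describe the shape itself. The sketch below shows two ways this might be done: declaring the shape locally, or deriving it from the constructor signature with TypeScript's built-in ConstructorParameters utility. SpinneyLike is a hypothetical stand-in declared only so the snippet runs on its own. In a real project the class would come from import Spinney from 'spinney', and the derived form assumes the shipped declaration types the second constructor argument as described above.

```typescript
// Hypothetical stand-in for the real Spinney class, declared locally so this
// snippet is self-contained. It mirrors only the constructor signature
// described above and performs no crawling.
class SpinneyLike {
  href: string;
  options: { overide?: boolean; debug?: boolean };

  constructor(href: string, options: { overide?: boolean; debug?: boolean } = {}) {
    this.href = href;
    // Both flags default to false, as documented above.
    this.options = { overide: false, debug: false, ...options };
  }
}

// Option 1: declare the shape locally (note the single-r spelling).
type SpinneyOptions = { overide?: boolean; debug?: boolean };

// Option 2: derive the shape from the constructor's second parameter.
type DerivedOptions = NonNullable<ConstructorParameters<typeof SpinneyLike>[1]>;

const options: SpinneyOptions = { overide: false, debug: true };
const derived: DerivedOptions = options; // both types describe the same shape

const crawler = new SpinneyLike('https://example.com/', derived);
console.log(crawler.options);
```

Annotating the object literal has a side benefit: with a type attached, the compiler reports the double-r spelling override as an unknown property instead of letting it pass silently.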

Constants

These values are defined in src/constants.ts and used internally throughout the library.

MAX_RETRIES

const MAX_RETRIES = 5;
The maximum number of times Spinney will retry a failing HTTP request before giving up. On each retry the timeout is stepped up by (retries * 1000) / 4 milliseconds. When the retry count reaches MAX_RETRIES, a fatal error is thrown and the Observable’s error callback is invoked. HTTP 404 responses are treated as permanent and do not consume a retry slot — they resolve immediately.
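
To make the step formula concrete, the sketch below tabulates (retries * 1000) / 4 for each retry count. retryStepMs is a hypothetical helper, not a function in the library, and the loop assumes retries counts from 1 up to MAX_RETRIES. How the step combines with the base timeout is not specified here, so only the step itself is shown.

```typescript
const MAX_RETRIES = 5;

// Hypothetical helper: the timeout increment for a given retry count,
// using the formula quoted above.
function retryStepMs(retries: number): number {
  return (retries * 1000) / 4;
}

const steps: number[] = [];
for (let retries = 1; retries <= MAX_RETRIES; retries++) {
  steps.push(retryStepMs(retries));
}

console.log(steps); // [ 250, 500, 750, 1000, 1250 ]
```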

RegExps

A collection of pre-compiled regular expressions and factory functions used internally for robots.txt parsing and URL matching.
| Key | Pattern | Description |
| --- | --- | --- |
| Allow | /^([Aa]llow:) (\/.+)$/g | Matches Allow: lines in robots.txt |
| Disallow | /^([Dd]isallow:) (\/.+)$/g | Matches Disallow: /path lines in robots.txt |
| Host | /^([Hh]ost:) (.+)$/g | Matches Host: lines in robots.txt |
| NewLine | /[^\r\n]+/g | Splits robots.txt byte chunks into individual lines |
| SiteMap | /^([Ss]itemap:) (.+)$/ | Matches Sitemap: https://... lines in robots.txt |
| SpecialCharacter | /[^a-zA-Z0-9 ]/g | Matches non-alphanumeric characters |
| UserAgent | /^([Uu]ser-[Aa]gent:) (.+)$/g | Matches User-agent: lines to detect * (all bots) blocks |
| ForwardSlashWord | /\/(\w+)/gi | Matches path segments beginning with /; used in isMatch() to validate that a test path has at least one segment |
| HttpOrHttps | /[-a-zA-Z0-9@:%._+~#=]{1,256}... | Matches HTTP or HTTPS URLs within text |
| getURL() | Factory function | Returns a new RegExp that validates whether a string is a syntactically correct absolute URL. Called by isApproved() on every candidate URL. |
| getHostnameAndPathname(hostname, pathname) | Factory function | Returns new RegExp('(.*\\.)?<hostname>.*(<pathname>)'). Built dynamically in isMatch() using the scraper’s hostname and the disallow path being tested. |
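
As a rough illustration of how the getHostnameAndPathname pattern behaves, the sketch below rebuilds it from the documented template and tests two URLs. This is a hypothetical re-creation, not the library's source: the real factory may escape or normalise its inputs differently, and the hostname and paths used are example values.

```typescript
// Hypothetical re-creation of the factory described above, built from the
// documented template '(.*\\.)?<hostname>.*(<pathname>)'.
function getHostnameAndPathname(hostname: string, pathname: string): RegExp {
  return new RegExp(`(.*\\.)?${hostname}.*(${pathname})`);
}

// Example values: the scraper's hostname and one disallowed path.
const pattern = getHostnameAndPathname('example.com', '/private');

console.log(pattern.test('https://www.example.com/private/report')); // true
console.log(pattern.test('https://www.example.com/public/report')); // false
```

Because the hostname is interpolated as-is in this sketch, the dot in example.com matches any character. A stricter version would escape regex metacharacters before building the pattern.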

Internal classes

These classes are not exported from lib/index.js; module.exports = Spinney is the only export. They are instantiated privately inside Spinney and documented here for developers reading the source.

ParseText

Located at src/ParseText.ts. Processes a streaming robots.txt response. Its write(chunk: Buffer) method is registered on the Axios data stream’s 'data' event. Each chunk is converted to a string, split on newlines using RegExps.NewLine, and each line is passed through three handlers:
  • onSiteMap(line) — if the line matches RegExps.SiteMap, extracts the sitemap URL and sets isSiteMap = true.
  • onUserAgent(line) — if the line matches RegExps.UserAgent, toggles isParsing based on whether the agent value is * (all crawlers).
  • onDisallow(line) — if isParsing is true and the line matches RegExps.Disallow, extracts the disallowed path and adds it to the forbidden Set.
Calling end() returns { forbidden, site, isSiteMap } and resets the instance. The forbidden Set is passed to Spinney.setForbidden() to populate the instance-level forbidden Set used by _isForbidden(). Instance fields:
| Field | Type | Description |
| --- | --- | --- |
| site | string | The sitemap URL extracted from the Sitemap: line, if present. |
| forbidden | Set<string> | Disallowed paths collected from Disallow: lines for the * user-agent block. |
| isParsing | boolean | true while inside a User-agent: * block; controls whether Disallow: lines are collected. |
| isSiteMap | boolean | true if a Sitemap: line was found and a URL was successfully extracted. |
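
The write/end flow described above can be sketched as follows. RobotsTextSketch is a simplified, hypothetical stand-in rather than the library's ParseText: it reuses the documented patterns without the g flag (they are applied to one line at a time here) and assumes each chunk ends on a line boundary.

```typescript
// Patterns taken from the RegExps table; the g flag is dropped on the
// per-line patterns because each one is applied to a single line.
const NewLine = /[^\r\n]+/g;
const SiteMap = /^([Ss]itemap:) (.+)$/;
const UserAgent = /^([Uu]ser-[Aa]gent:) (.+)$/;
const Disallow = /^([Dd]isallow:) (\/.+)$/;

// Hypothetical stand-in for ParseText, written from the prose description.
class RobotsTextSketch {
  site = '';
  forbidden = new Set<string>();
  isParsing = false;
  isSiteMap = false;

  // Split the chunk into lines and run every line through the three handlers.
  write(chunk: Buffer | string): void {
    const lines = chunk.toString().match(NewLine) ?? [];
    for (const line of lines) {
      this.onSiteMap(line);
      this.onUserAgent(line);
      this.onDisallow(line);
    }
  }

  onSiteMap(line: string): void {
    const match = SiteMap.exec(line);
    if (match) {
      this.site = match[2].trim();
      this.isSiteMap = true;
    }
  }

  // Collect Disallow lines only while inside a "User-agent: *" block.
  onUserAgent(line: string): void {
    const match = UserAgent.exec(line);
    if (match) this.isParsing = match[2].trim() === '*';
  }

  onDisallow(line: string): void {
    const match = Disallow.exec(line);
    if (this.isParsing && match) this.forbidden.add(match[2].trim());
  }

  // Return the collected state and reset the instance for reuse.
  end(): { forbidden: Set<string>; site: string; isSiteMap: boolean } {
    const result = { forbidden: this.forbidden, site: this.site, isSiteMap: this.isSiteMap };
    this.site = '';
    this.forbidden = new Set<string>();
    this.isParsing = false;
    this.isSiteMap = false;
    return result;
  }
}

const parser = new RobotsTextSketch();
parser.write(Buffer.from('User-agent: Googlebot\nDisallow: /only-google\n'));
parser.write(Buffer.from('User-agent: *\nDisallow: /private\nDisallow: /tmp\n'));
parser.write(Buffer.from('Sitemap: https://example.com/sitemap.xml\n'));
const result = parser.end();
console.log([...result.forbidden], result.site, result.isSiteMap);
```

In this sketch only the two paths under User-agent: * end up in forbidden; the Googlebot-specific rule is skipped because isParsing is false while that block is being read.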

ParseXML

Located at src/ParseXML.ts. Wraps xml2js.parseStringPromise to extract URLs from sitemap XML documents. Its promise(data: string) method accepts the full buffered XML string and returns a Promise that resolves to { sites: string[] }. Supports two XML formats:
  • <sitemapindex> — iterates raw.sitemapindex.sitemap and collects each <loc> value. Used for sitemap index files that reference child sitemaps.
  • <urlset> — iterates raw.urlset.url and collects each <loc> value. Used for standard sitemap files listing page URLs.
The returned sites array is fed back into _setUp() as the next URL batch to crawl. Instance fields:
| Field | Type | Description |
| --- | --- | --- |
| context | { sites: string[] } | Accumulates the list of URL strings extracted from the parsed XML sitemap. |
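
The two branches described above can be sketched against the object shape that xml2js produces with its default options, where every child element is an array. collectSites and the type names are hypothetical and written for illustration. The real class calls xml2js.parseStringPromise first; this sketch starts from the already-parsed object so that it runs without dependencies.

```typescript
// Shape of a parsed sitemap under xml2js defaults: child elements are arrays.
type LocEntry = { loc?: string[] };
type ParsedSitemap = {
  sitemapindex?: { sitemap?: LocEntry[] };
  urlset?: { url?: LocEntry[] };
};

// Hypothetical helper mirroring the two branches: sitemap index files list
// child sitemaps, standard sitemap files list page URLs.
function collectSites(raw: ParsedSitemap): { sites: string[] } {
  const entries = raw.sitemapindex?.sitemap ?? raw.urlset?.url ?? [];
  const sites: string[] = [];
  for (const entry of entries) {
    const loc = entry.loc?.[0];
    if (typeof loc === 'string') sites.push(loc.trim());
  }
  return { sites };
}

const index = collectSites({
  sitemapindex: { sitemap: [{ loc: ['https://example.com/sitemap-posts.xml'] }] },
});
const pages = collectSites({
  urlset: { url: [{ loc: ['https://example.com/'] }, { loc: ['https://example.com/about'] }] },
});

console.log(index.sites); // [ 'https://example.com/sitemap-posts.xml' ]
console.log(pages.sites); // [ 'https://example.com/', 'https://example.com/about' ]
```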

StringWritable

Located at src/StringWritable.ts. A node:stream.Writable subclass that accumulates streamed response chunks into a single string. Uses a StringDecoder from node:string_decoder to correctly handle multi-byte UTF-8 characters that may span chunk boundaries. The accumulated string is available on .string after the stream finishes. Used to buffer the full response body of XML sitemap requests before passing the complete string to ParseXML.promise(). Instance fields:
| Field | Type | Description |
| --- | --- | --- |
| string | any | The accumulated response body string, built up as chunks arrive. |
| decode | StringDecoder | Node.js StringDecoder instance used to handle multi-byte character boundaries between chunks. |
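
A minimal sketch of the same technique, a Writable that decodes chunks through a StringDecoder, is shown below. StringSink is a hypothetical class written for illustration and may differ from the library's StringWritable in naming and details. The example splits the three-byte character € across two chunks to show why the decoder is needed.

```typescript
import { Writable } from 'node:stream';
import { StringDecoder } from 'node:string_decoder';

// Hypothetical illustration of the accumulate-and-decode pattern.
class StringSink extends Writable {
  string = '';
  private decoder = new StringDecoder('utf8');

  _write(
    chunk: Buffer | string,
    encoding: BufferEncoding,
    callback: (error?: Error | null) => void,
  ): void {
    const buffer = typeof chunk === 'string' ? Buffer.from(chunk, encoding) : chunk;
    // The decoder holds back incomplete multi-byte sequences until the
    // remaining bytes arrive in a later chunk.
    this.string += this.decoder.write(buffer);
    callback();
  }

  _final(callback: (error?: Error | null) => void): void {
    // Flush anything the decoder is still holding.
    this.string += this.decoder.end();
    callback();
  }
}

const sink = new StringSink();
sink.on('finish', () => console.log(sink.string));

const euro = Buffer.from('€'); // three bytes: e2 82 ac
sink.write(euro.subarray(0, 2)); // incomplete character, nothing emitted yet
sink.write(euro.subarray(2)); // final byte completes the character
sink.end();
```

Concatenating chunk.toString() directly would instead produce replacement characters whenever a chunk boundary falls inside a multi-byte sequence.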

Not(condition)

Located at src/utils/Not.ts.
function Not(condition: boolean): boolean {
  return condition === false;
}
A small readability utility that returns true when condition is strictly false (using === equality, not logical negation). Used throughout the Spinney source as a readable alternative to the ! operator — for example, if (Not(index === -1)) reads as “if the index was found”.
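
The difference between strict comparison and logical negation only shows up for values that are not real booleans. The short sketch below repeats the definition from above and compares the two. The undefined value is forced through a cast purely for illustration, since the declared parameter type would normally reject it.

```typescript
function Not(condition: boolean): boolean {
  return condition === false;
}

// Typical use: "if the index was found".
const index = ['a', 'b'].indexOf('b');
console.log(Not(index === -1)); // true

// For real booleans, Not(x) and !x agree.
console.log(Not(true) === !true, Not(false) === !false); // true true

// For a non-boolean that slips through at runtime, they differ:
// undefined === false is false, while !undefined is true.
const missing = undefined as unknown as boolean;
console.log(Not(missing), !missing); // false true
```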
ParseText, ParseXML, StringWritable, and Not are internal implementation details. They are not exported from lib/index.js and should not be imported directly in application code. Their interfaces may change in any release, including minor versions, without being treated as a breaking change under semver.
