Beyond subscribe(), the Spinney class exposes several public utility methods. Some (getURL, isApproved, getApproved, and isMatch) are used internally during the crawl but are part of the public interface and can be called directly for testing, custom filtering logic, or introspection. Others (pause and resume) directly control the crawl’s state machine by toggling the isProcessing flag that gates the internal batch loop.
pause()
Sets isProcessing to false, which prevents the internal _setUp batch loop from scheduling any further URL processing. It is called automatically when the crawl completes (after subscriber.complete() has been called) or when a fatal error causes the Observable to terminate. You can also call it manually to temporarily halt crawling.
resume()
Sets isProcessing to true, allowing the batch loop to begin or continue processing URL batches. It is called automatically by setUp() immediately after the robots.txt response is received and parsed. Calling resume() after a fatal error sets the flag but does not restart a terminated Observable; you must construct a new Spinney instance to begin a fresh crawl.
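The gating behaviour described for pause() and resume() can be pictured with a minimal sketch. This is not Spinney's source: GatedCrawler, its pending batches, and its run() method are hypothetical names used only to show how a boolean flag can gate a batch loop.

```typescript
// Hypothetical stand-in for a flag-gated batch loop; not Spinney's source.
class GatedCrawler {
  isProcessing = false;
  processed: string[] = [];
  pending: string[][];

  constructor(pending: string[][]) {
    this.pending = pending;
  }

  pause(): void {
    this.isProcessing = false;
  }

  resume(): void {
    this.isProcessing = true;
  }

  // Processes batches only while the flag is set; returns how many ran.
  run(): number {
    let count = 0;
    while (this.isProcessing && this.pending.length > 0) {
      const batch = this.pending.shift()!;
      this.processed.push(...batch);
      count += 1;
      // Assumed behaviour: pause once nothing is left to process.
      if (this.pending.length === 0) this.pause();
    }
    return count;
  }
}
```

In this sketch, calling run() before resume() does nothing, which mirrors how the flag keeps the loop idle until the robots.txt step has finished.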
getURL()
Converts pathname into a fully qualified URL string according to these rules:
- If pathname starts with // (two forward slashes), the first / is stripped with pathname.slice(1) and the result is set as the pathname of a new URL built from the base site.
- If pathname starts with / (single forward slash), it is set directly as the pathname of a new URL built from the base site.
- If pathname does not start with / (e.g. an absolute URL like https://other.com), it is returned unchanged.
Parameters
A root-relative path (e.g. /about), double-slash path (e.g. //collections), or absolute URL (e.g. https://other.com).
Returns
string — a fully qualified URL string.
Throws
Throws a TypeError with the message 'pathname is not type string' if pathname is not a string.
Examples
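The resolution rules for getURL() can be sketched as a standalone function. This is an illustrative reimplementation under stated assumptions, not Spinney's source: resolvePathname and its base argument are hypothetical, with base standing in for the crawler's base site.

```typescript
// Sketch of the documented resolution rules; `base` stands in for the base site.
function resolvePathname(base: string, pathname: unknown): string {
  if (typeof pathname !== "string") {
    throw new TypeError("pathname is not type string");
  }
  if (pathname.startsWith("//")) {
    // Double slash: drop the first "/" and use the rest as the path.
    const url = new URL(base);
    url.pathname = pathname.slice(1);
    return url.toString();
  }
  if (pathname.startsWith("/")) {
    // Root-relative path: use it directly as the path.
    const url = new URL(base);
    url.pathname = pathname;
    return url.toString();
  }
  // Anything else (for example an absolute URL) is returned unchanged.
  return pathname;
}

const site = "https://www.example.com";
console.log(resolvePathname(site, "/about"));            // https://www.example.com/about
console.log(resolvePathname(site, "//collections"));     // https://www.example.com/collections
console.log(resolvePathname(site, "https://other.com")); // https://other.com
```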
isMatch()
Checks whether basePathname (a full URL string) matches the pattern defined by testPathname (a disallow path entry from robots.txt). The method first evaluates RegExps.ForwardSlashWord.test(testPathname); if that returns false, isMatch returns false immediately. Otherwise it finds the index of '/' in testPathname. If the index is found (i.e. Not(index === -1) is true), it slices testPathname from that index to get the pathname portion and builds a RegExp via RegExps.getHostnameAndPathname(hostname, pathname). If the index is not found, it builds the RegExp from testPathname directly. The RegExp is then tested against basePathname.
Parameters
testPathname: A robots.txt-style disallow path, such as /private or /*/collections/name. Patterns are taken directly from the robots.txt Disallow: lines.
basePathname: A full absolute URL to test against the pattern, e.g. 'https://www.example.com/dontdoit/collections/name'.
Returns
boolean — true if basePathname matches the pattern built from testPathname and the scraper’s hostname. Returns false if testPathname contains no path segment matching RegExps.ForwardSlashWord.
Examples
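The control flow of isMatch() can be sketched as follows. The actual RegExps.ForwardSlashWord and RegExps.getHostnameAndPathname patterns are not shown on this page, so forwardSlashWord and hostnameAndPathname below are assumed stand-ins (a slash followed by a word character, and a pattern in which * is treated as a wildcard). Only the branching mirrors the description; the regular expressions themselves are guesses.

```typescript
// Assumed stand-in for RegExps.ForwardSlashWord: "/" followed by a word character.
const forwardSlashWord = /\/\w/;

// Escapes regex metacharacters except "*", which is handled as a wildcard below.
function escapeExceptStar(text: string): string {
  return text.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed stand-in for RegExps.getHostnameAndPathname(hostname, pathname).
function hostnameAndPathname(hostname: string, pathname: string): RegExp {
  const path = escapeExceptStar(pathname).replace(/\*/g, ".*");
  return new RegExp(escapeExceptStar(hostname) + path);
}

function isMatchSketch(hostname: string, testPathname: string, basePathname: string): boolean {
  // Early exit when the pattern contains no path segment.
  if (!forwardSlashWord.test(testPathname)) return false;
  const index = testPathname.indexOf("/");
  const pattern =
    index !== -1
      ? hostnameAndPathname(hostname, testPathname.slice(index))
      : new RegExp(testPathname); // fallback branch, kept to mirror the description
  return pattern.test(basePathname);
}

const host = "www.example.com";
console.log(isMatchSketch(host, "/*/collections/name", "https://www.example.com/dontdoit/collections/name")); // true
console.log(isMatchSketch(host, "/private", "https://www.example.com/about")); // false
```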
isApproved()
Returns true if site passes all three approval checks:
- It is a syntactically valid URL (tested against RegExps.getURL()).
- Its hostname or origin starts with the base site’s hostname or origin, i.e. it belongs to the same domain.
- It has not been visited before (not in seen) and is not forbidden by any robots.txt Disallow rule (checked via isForbidden()).
As a side effect, an approved URL is immediately added to the seen Set to prevent re-queuing.
Parameters
An absolute URL string to evaluate, e.g. 'https://www.example.com/about'.
Returns
boolean — true if the URL is valid, on the same domain, unseen, and not forbidden.
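The three checks and the seen side effect can be sketched with a small class. This is a simplified illustration, not Spinney's source: ApprovalSketch is a hypothetical name, the validity test uses the URL constructor instead of RegExps.getURL(), and isForbidden is assumed to be a plain path-prefix check.

```typescript
// Illustrative stand-in, not Spinney's source. The validity and forbidden
// checks below are simplified assumptions.
class ApprovalSketch {
  seen = new Set<string>();
  base: URL;
  forbidden: string[];

  constructor(base: string, forbidden: string[] = []) {
    this.base = new URL(base);
    this.forbidden = forbidden;
  }

  // Assumed rule: a URL is forbidden when its path starts with a listed prefix.
  isForbidden(url: URL): boolean {
    return this.forbidden.some((prefix) => url.pathname.startsWith(prefix));
  }

  isApproved(site: string): boolean {
    let url: URL;
    try {
      url = new URL(site); // stand-in for the RegExps.getURL() test
    } catch {
      return false;
    }
    const sameSite =
      url.hostname.startsWith(this.base.hostname) ||
      url.origin.startsWith(this.base.origin);
    if (!sameSite) return false;
    if (this.seen.has(site) || this.isForbidden(url)) return false;
    this.seen.add(site); // side effect: mark as seen so it is not re-queued
    return true;
  }
}
```

Because approval records the URL in seen, calling isApproved twice with the same URL returns true the first time and false the second, which is worth keeping in mind when calling the method directly in tests.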
getApproved()
Takes the href attribute values collected from a single HTML page, runs each one through getURL() to resolve relative paths to absolute URLs, then filters the results through isApproved(). The returned array contains only the URLs that are valid, same-domain, unseen, and not robots.txt-forbidden. This is the primary mechanism by which Spinney builds the next batch of URLs to crawl.
Parameters
Raw href attribute values collected from an HTML page, e.g. ['/about', 'https://example.com/blog', '#anchor', 'https://external.com'].
Returns
string[] — filtered array of approved, fully-qualified absolute URLs ready to be added to the crawl queue.
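The resolve-then-filter pipeline can be sketched as a map followed by a filter. The names collectApproved, toyResolve, and toyApprove are hypothetical; the two toy callbacks are deliberately simplified stand-ins for getURL() and isApproved(), not their real logic.

```typescript
// Resolve every href first, then keep only the ones that pass approval.
function collectApproved(
  hrefs: string[],
  resolve: (href: string) => string,
  approve: (url: string) => boolean
): string[] {
  return hrefs.map(resolve).filter(approve);
}

// Toy stand-ins for getURL() and isApproved(), simplified for illustration.
const baseSite = "https://example.com";
const seenUrls = new Set<string>();
const toyResolve = (href: string): string =>
  href.startsWith("/") ? baseSite + href : href;
const toyApprove = (url: string): boolean => {
  if (!url.startsWith(baseSite) || seenUrls.has(url)) return false;
  seenUrls.add(url);
  return true;
};

const nextBatch = collectApproved(
  ["/about", "https://example.com/blog", "#anchor", "https://external.com", "/about"],
  toyResolve,
  toyApprove
);
console.log(nextBatch); // ["https://example.com/about", "https://example.com/blog"]
```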
toArray()
Returns data wrapped in an array. If data is already an array it is returned as-is; otherwise it is wrapped in [data]. Used internally to normalise values before array operations.
Parameters
Any value. Arrays are passed through; all other values are wrapped.
Returns
any[]
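A minimal sketch of the wrapping rule, written as a standalone function with the hypothetical name toArraySketch and typed with unknown rather than any:

```typescript
// Arrays pass through untouched; every other value is wrapped in a new array.
function toArraySketch(data: unknown): unknown[] {
  return Array.isArray(data) ? data : [data];
}

console.log(toArraySketch("a"));    // ["a"]
console.log(toArraySketch([1, 2])); // [1, 2]
```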
isArrayEmpty()
Returns true if data is not an array, or if it is an array with a length of 0. Implemented as Not(Array.isArray(data)) || data.length === 0, where Not(condition) returns true when condition is strictly false. Used internally as the base case for the recursive _setUp batch loop: when the pending sites array is empty, the crawl completes and subscriber.complete() is called.
Parameters
The value to check. Typically the current batch of pending URLs.
Returns
boolean — true if data is not an array or is an empty array.
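The expression above, and its role as a recursion base case, can be sketched as follows. Not and the check itself follow the description on this page; drain is a hypothetical recursive loop added only to show where the base case sits, and does not reproduce _setUp.

```typescript
// Not(condition) is true only when condition is strictly false.
const Not = (condition: unknown): boolean => condition === false;

function isArrayEmptySketch(data: unknown): boolean {
  // Short-circuits for non-arrays, so .length is only read on real arrays.
  return Not(Array.isArray(data)) || (data as unknown[]).length === 0;
}

// Hypothetical recursive loop: stops as soon as no batches remain.
function drain(batches: string[][], visited: string[] = []): string[] {
  if (isArrayEmptySketch(batches)) return visited; // base case
  const [head, ...rest] = batches;
  return drain(rest, [...visited, ...head]);
}

console.log(isArrayEmptySketch([]));     // true
console.log(isArrayEmptySketch(["a"]));  // false
console.log(isArrayEmptySketch("abc"));  // true
console.log(drain([["a"], ["b", "c"]])); // ["a", "b", "c"]
```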
setForbidden()
Assigns a Set<string> of disallowed path patterns to the instance’s forbidden field, replacing any previously stored set. Called automatically by setUp() with the result of parsing the site’s robots.txt via ParseText. Can also be called directly if you need to inject a custom forbidden set before or between crawls.
Parameters
A Set of path strings matching robots.txt Disallow: entries, e.g. new Set(['/private', '/admin']).
Returns
void