InternalView

Returned by crawl.internal. Backed by the internal page model and yields InternalPage objects.
for page in crawl.internal.filter(status_code=404):
    print(page.address, page.status_code)

pages = crawl.internal.filter(status_code=200).collect()
count = crawl.internal.count()
On Derby-backed crawls, crawl.internal materializes computed mapped fields such as Indexability and Indexability Status. DuckDB-backed crawls read these fields from the cached internal relation.

Methods

.filter(**kwargs) -> InternalView

Narrow results by column value. Keys are column names or their snake_case equivalents.
crawl.internal.filter(status_code=404)
crawl.internal.filter(indexability="Non-Indexable")
**kwargs (Any): Column name / value pairs. Values are matched by equality.
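
The key handling can be pictured with a small stand-alone sketch. It is not the library's implementation: to_snake_case and row_matches are hypothetical helpers, and the normalization rule (lowercase, with runs of non-alphanumeric characters collapsed to underscores) is an assumption about how status_code lines up with a Status Code column.

```python
import re
from typing import Any, Mapping


def to_snake_case(column: str) -> str:
    """Lowercase a column name and collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", column.lower()).strip("_")


def row_matches(row: Mapping[str, Any], **kwargs: Any) -> bool:
    """Return True when every keyword equals the value of the matching column."""
    by_snake = {to_snake_case(name): value for name, value in row.items()}
    for key, expected in kwargs.items():
        # Prefer an exact column name, then fall back to the snake_case form.
        actual = row[key] if key in row else by_snake.get(to_snake_case(key))
        if actual != expected:
            return False
    return True


row = {"Address": "https://example.com/missing", "Status Code": 404}
print(row_matches(row, status_code=404))  # True
print(row_matches(row, status_code=200))  # False
```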

.search(term, *, fields, case_sensitive) -> SearchInternalView

Search string fields across internal pages.
term (str, required): Search string.
fields (Sequence[str] | None, default: None): Column names to search. All string fields are searched when None.
case_sensitive (bool, default: False): Whether matching is case-sensitive.
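
The three parameters interact roughly as in the sketch below. It assumes plain substring matching, which this page does not confirm, and row_contains is a hypothetical helper rather than part of the library.

```python
from typing import Any, Mapping, Optional, Sequence


def row_contains(
    row: Mapping[str, Any],
    term: str,
    *,
    fields: Optional[Sequence[str]] = None,
    case_sensitive: bool = False,
) -> bool:
    """Return True when `term` occurs in any searched string field of `row`."""
    names = list(row.keys()) if fields is None else fields
    needle = term if case_sensitive else term.lower()
    for name in names:
        value = row.get(name)
        if not isinstance(value, str):
            continue  # only string fields are searched
        haystack = value if case_sensitive else value.lower()
        if needle in haystack:
            return True
    return False


page = {"Address": "https://example.com/Pricing", "Title 1": "Pricing", "Status Code": 200}
print(row_contains(page, "pricing"))                       # True
print(row_contains(page, "pricing", case_sensitive=True))  # False
print(row_contains(page, "example", fields=["Title 1"]))   # False
```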

.count() -> int

Return the number of matching pages.

.collect() -> list[InternalPage]

Materialize all matching pages into a list.

.first() -> InternalPage | None

Return the first matching page, or None if the view is empty.

.to_pandas() / .to_polars()

Return a pandas or Polars DataFrame. Requires the respective library to be installed.
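
Because both DataFrame exports depend on an optional library, a caller may want to check availability first. The sketch below is one way to do that with the standard library; to_frame is a hypothetical helper, and falling back to .collect() is a design choice made here, not documented library behavior.

```python
from importlib.util import find_spec
from typing import Any


def to_frame(view: Any, prefer: str = "pandas") -> Any:
    """Export a view with whichever DataFrame library is installed.

    Falls back to view.collect() when neither pandas nor Polars is available.
    """
    if prefer not in ("pandas", "polars"):
        raise ValueError("prefer must be 'pandas' or 'polars'")
    order = [prefer] + [name for name in ("pandas", "polars") if name != prefer]
    for name in order:
        if find_spec(name) is not None:
            # Resolves to view.to_pandas() or view.to_polars().
            return getattr(view, f"to_{name}")()
    return view.collect()
```

With a real crawl this would be called as to_frame(crawl.internal.filter(status_code=404)).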

TabView

Returned by crawl.tab(name). Yields rows as dict[str, Any].
for row in crawl.tab("response_codes_all"):
    print(row["Address"], row["Status Code"])

# GUI filter shortcut
for row in crawl.tab("page_titles").filter(gui="Missing"):
    print(row["Address"])

Methods

.filter(**kwargs) -> TabView

Filter by column value. Supports a special gui="..." keyword for applying named GUI filters.
gui (str): Named GUI filter to apply (e.g. "Missing", "Duplicate"). Use crawl.tab_filters(name) to list the filter names available for a tab.
gui_filters (list[str]): Apply multiple GUI filters at once.
**kwargs (Any): Additional column name / value pairs for equality filtering.
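
A GUI filter name can be validated against crawl.tab_filters(name) before it is applied. The helper below is a hedged sketch: rows_for_gui_filter is a hypothetical name, and it assumes tab_filters returns an iterable of filter-name strings, which this page implies but does not spell out.

```python
from typing import Any, Dict, List


def rows_for_gui_filter(crawl: Any, tab: str, gui: str, **columns: Any) -> List[Dict[str, Any]]:
    """Collect rows from `tab` under one named GUI filter, checking the name first."""
    available = list(crawl.tab_filters(tab))  # assumed: iterable of filter names
    if gui not in available:
        raise ValueError(f"{gui!r} is not a filter of {tab!r}; choose from {available}")
    # Extra keyword arguments are passed through as equality filters.
    return crawl.tab(tab).filter(gui=gui, **columns).collect()
```

For example, rows_for_gui_filter(crawl, "page_titles", "Missing") would return the rows that the page_titles example above iterates over.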

.search(term, *, fields, case_sensitive) -> SearchRowView

term (str, required): Search string.
fields (Sequence[str] | None, default: None): Column names to search.
case_sensitive (bool, default: False): Whether matching is case-sensitive.

.count() -> int

Return the total number of matching rows.

.collect() -> list[dict[str, Any]]

Materialize all matching rows into a list.

.first() -> dict[str, Any] | None

Return the first matching row, or None.

.to_pandas() / .to_polars()

Return a pandas or Polars DataFrame.

PageView

Returned by crawl.pages(). Backed by the internal page model and yields rows as dict[str, Any]. Use .select() to project a narrow field subset.
pages = crawl.pages().filter(status_code=404).collect()

Methods

.filter(**kwargs) -> PageView

Narrow pages by column value.
**kwargs (Any): Column name / value pairs.

.select(*fields) -> ProjectedPageView

Project a subset of fields. Avoids materializing the full internal page model when only a few columns are needed.
lightweight = crawl.pages().select("Address", "Status Code", "Title 1").collect()
*fields (str, required): One or more field names to include. At least one field is required.
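
Conceptually, projection keeps only the named keys of each row. The sketch below shows that idea on plain dictionaries; project is a hypothetical helper, and returning None for a field that a row lacks is an assumption, not documented behavior.

```python
from typing import Any, Dict, Iterable, List, Mapping


def project(rows: Iterable[Mapping[str, Any]], *fields: str) -> List[Dict[str, Any]]:
    """Reduce each row to `fields`; a field missing from a row becomes None."""
    if not fields:
        raise ValueError("at least one field is required")
    return [{name: row.get(name) for name in fields} for row in rows]


rows = [{"Address": "https://example.com/", "Status Code": 200, "Title 1": "Home", "Word Count": 512}]
print(project(rows, "Address", "Status Code"))
# [{'Address': 'https://example.com/', 'Status Code': 200}]
```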

.search(term, *, fields, case_sensitive) -> SearchRowView

term (str, required): Search string.
fields (Sequence[str] | None, default: None): Column names to search.
case_sensitive (bool, default: False): Whether matching is case-sensitive.

.count() -> int

Return the number of matching pages.

.collect() -> list[dict[str, Any]]

Materialize all matching rows.

.first() -> dict[str, Any] | None

Return the first matching row.

.to_pandas() / .to_polars()

Return a pandas or Polars DataFrame.

ProjectedPageView

Returned by crawl.pages().select(...). Behaves like PageView but only returns the selected fields. DuckDB-backed crawls use a narrow helper relation to avoid full internal_all materialization.
result = crawl.pages().select("Address", "Status Code", "Title 1").filter(status_code=404).collect()
Supports the same methods as PageView: .filter(), .search(), .count(), .collect(), .first(), .to_pandas(), .to_polars().

LinkView

Returned by crawl.links(direction). Yields link rows as dict[str, Any].
inlinks = crawl.links("in").filter(status_code=404).collect()
nofollow = crawl.links("in").search("nofollow", fields=["Follow"]).collect()

Methods

.filter(**kwargs) -> LinkView

**kwargs (Any): Column name / value pairs for equality filtering.

.select(*fields) -> ProjectedLinkView

Project a field subset. Avoids materializing wide inlink/outlink tabs on lean DuckDB caches.
broken_inlinks = crawl.links("in").select("Source", "Address", "Status Code").filter(status_code=404).collect()
*fields (str, required): One or more field names. At least one is required.
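
A narrow projection is usually enough for link audits. As an illustration, the sketch below groups broken inlinks by target using only the calls shown above; broken_inlink_sources is a hypothetical helper, and it assumes that on inlink rows Address holds the link target and Source the linking page.

```python
from collections import defaultdict
from typing import Any, Dict, List


def broken_inlink_sources(crawl: Any) -> Dict[str, List[str]]:
    """Map each 404 target to the list of pages that link to it."""
    rows = (
        crawl.links("in")
        .select("Source", "Address", "Status Code")
        .filter(status_code=404)
        .collect()
    )
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        # Assumed column roles: Address = target URL, Source = linking page.
        grouped[row["Address"]].append(row["Source"])
    return dict(grouped)
```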

.search(term, *, fields, case_sensitive) -> SearchRowView

term (str, required): Search string.
fields (Sequence[str] | None, default: None): Column names to search.
case_sensitive (bool, default: False): Whether matching is case-sensitive.

.count() -> int

.collect() -> list[dict[str, Any]]

.first() -> dict[str, Any] | None

.to_pandas() / .to_polars()


ProjectedLinkView

Returned by crawl.links(...).select(...). Behaves like LinkView but only returns selected fields. Supports the same methods as LinkView: .filter(), .search(), .count(), .collect(), .first(), .to_pandas(), .to_polars().

CrawlSection

Returned by crawl.section(prefix). Scopes page, link, and tab views to a URL prefix.
blog = crawl.section("/blog")
blog_pages = blog.pages().collect()
blog_outlinks = blog.links("out").collect()
blog_inlinks_tab = blog.tab("all_inlinks").collect()

Methods

.pages() -> ScopedRowView

Return a scoped page view matched by Address, URL Encoded Address, or Encoded URL.

.links(direction) -> ScopedRowView

Return a scoped link view. For inlinks, the scope matches on Address, Destination, or To. For outlinks, the scope matches on Source, From, or Address.
direction (str, default: "out"): "in" or "out".

.tab(name, fields=None) -> ScopedRowView

Return a scoped view of any tab. By default, the scope matches against common URL-bearing columns (Address, Source, Destination, URL, From, To, etc.).
name (str, required): Tab name.
fields (Sequence[str] | None, default: None): Override the URL fields used for prefix matching.
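
Prefix scoping over URL-bearing columns can be sketched as follows. This is an illustration only: in_section and DEFAULT_URL_FIELDS are hypothetical, the field list copies just the columns named above (the documented list ends in "etc."), and comparing the prefix against both the full URL and its path is an assumption.

```python
from typing import Any, Mapping, Optional, Sequence
from urllib.parse import urlsplit

# Partial copy of the URL-bearing columns named in the text above.
DEFAULT_URL_FIELDS = ("Address", "Source", "Destination", "URL", "From", "To")


def in_section(
    row: Mapping[str, Any],
    prefix: str,
    fields: Optional[Sequence[str]] = None,
) -> bool:
    """Return True when any checked URL column starts with `prefix`."""
    for name in fields or DEFAULT_URL_FIELDS:
        value = row.get(name)
        if not isinstance(value, str):
            continue
        path = urlsplit(value).path or "/"
        # Accept either a full-URL prefix or a path prefix such as "/blog".
        if value.startswith(prefix) or path.startswith(prefix):
            return True
    return False


row = {"Source": "https://example.com/blog/post", "Destination": "https://example.com/shop"}
print(in_section(row, "/blog"))                          # True
print(in_section(row, "/blog", fields=["Destination"]))  # False
```

Passing fields narrows the check to specific columns, which mirrors the purpose of the fields override on .tab().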

QueryView

Returned by crawl.query(schema, table). Provides a chainable SQL builder against raw backend tables. DB-backed crawls only.
rows = (
    crawl.query("APP", "URLS")
    .select("ENCODED_URL", "RESPONSE_CODE", "TITLE_1")
    .where("RESPONSE_CODE >= ?", 400)
    .order_by("RESPONSE_CODE DESC", "ENCODED_URL ASC")
    .limit(100)
    .collect()
)

Methods

.select(*columns) -> QueryView

Set the columns to select. A query that never calls .select() selects *.
*columns (str, required): One or more column names or SQL expressions.

.where(sql_fragment, *params) -> QueryView

Add a WHERE clause. Multiple calls are AND-combined.
sql_fragment (str, required): SQL fragment. Use ? for parameterized values.
*params (Any): Positional parameters for ? placeholders.
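
To make the AND-combination and placeholder handling concrete, here is a tiny stand-in builder. SketchQuery is hypothetical and is not the library's QueryView; wrapping each fragment in parentheses is a choice made in this sketch, and SQLite stands in for the real backend only to show that the generated statement runs.

```python
import sqlite3
from typing import Any, List, Tuple


class SketchQuery:
    """Minimal chainable builder used only to illustrate WHERE handling."""

    def __init__(self, table: str) -> None:
        self._table = table
        self._where: List[str] = []
        self._params: List[Any] = []

    def where(self, sql_fragment: str, *params: Any) -> "SketchQuery":
        # Each call adds one fragment; parameters accumulate in call order.
        self._where.append(f"({sql_fragment})")
        self._params.extend(params)
        return self

    def to_sql(self) -> Tuple[str, List[Any]]:
        sql = f"SELECT * FROM {self._table}"
        if self._where:
            sql += " WHERE " + " AND ".join(self._where)
        return sql, list(self._params)


sql, params = (
    SketchQuery("URLS")
    .where("RESPONSE_CODE >= ?", 400)
    .where("RESPONSE_CODE < ?", 500)
    .to_sql()
)
print(sql)     # SELECT * FROM URLS WHERE (RESPONSE_CODE >= ?) AND (RESPONSE_CODE < ?)
print(params)  # [400, 500]

# Run the statement against an in-memory SQLite table as a sanity check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE URLS (ENCODED_URL TEXT, RESPONSE_CODE INTEGER)")
conn.executemany(
    "INSERT INTO URLS VALUES (?, ?)",
    [
        ("https://example.com/", 200),
        ("https://example.com/gone", 404),
        ("https://example.com/err", 500),
    ],
)
rows = conn.execute(sql, params).fetchall()
print(rows)  # [('https://example.com/gone', 404)]
```

Keeping values in the parameter list rather than formatting them into the fragment is what lets the backend bind them safely.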

.group_by(*columns) -> QueryView

Add a GROUP BY clause.
*columns (str, required): One or more column names.

.having(sql_fragment, *params) -> QueryView

Add a HAVING clause. Multiple calls are AND-combined.
sql_fragment (str, required): SQL fragment.
*params (Any): Positional parameters.
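
An aggregate query might chain .group_by() and .having() as in the sketch below. error_status_counts and the URL_COUNT alias are hypothetical; the APP.URLS table and RESPONSE_CODE column are taken from the example at the top of this section, and whether a given backend accepts this exact SQL is not confirmed here.

```python
from typing import Any, Dict, List


def error_status_counts(crawl: Any, min_urls: int = 1) -> List[Dict[str, Any]]:
    """Count URLs per error status code, keeping codes with at least `min_urls` URLs."""
    return (
        crawl.query("APP", "URLS")
        .select("RESPONSE_CODE", "COUNT(*) AS URL_COUNT")
        .where("RESPONSE_CODE >= ?", 400)
        .group_by("RESPONSE_CODE")
        .having("COUNT(*) >= ?", min_urls)
        .order_by("RESPONSE_CODE ASC")
        .collect()
    )
```

Calling .to_sql() in place of .collect() would show the generated statement and parameter list without running it.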

.order_by(*clauses) -> QueryView

Add an ORDER BY clause.
*clauses (str, required): Column names or COLUMN ASC|DESC expressions.

.limit(n) -> QueryView

Limit the number of rows returned. Pass None to remove an existing limit.
n (int | None, required): Maximum number of rows. Must be a positive integer when not None.

.to_sql() -> tuple[str, list[Any]]

Return the generated SQL string and parameter list without executing the query.
sql, params = crawl.query("APP", "URLS").where("RESPONSE_CODE = ?", 404).to_sql()
print(sql)

.collect() -> list[dict[str, Any]]

Execute the query and return all rows.

.first() -> dict[str, Any] | None

Execute the query with LIMIT 1 and return the first row, or None when no row matches.

.to_pandas() / .to_polars()

Return a pandas or Polars DataFrame.

Data models

InternalPage

Represents a single row from the Internal tab. Returned by InternalView.
address (str): The page URL.
status_code (int | None): HTTP status code.
id (int | None): Internal row ID (available on DB-backed crawls).
data (dict[str, Any]): All raw fields from the Internal tab, keyed by column name.
Class methods:
  • InternalPage.from_csv_row(row: Mapping[str, Any]) -> InternalPage
  • InternalPage.from_db_row(columns: list[str], values: tuple[Any, ...]) -> InternalPage
  • InternalPage.from_data(data: Mapping[str, Any], *, copy_data: bool = True) -> InternalPage

Link

Represents a single inlink or outlink row. Returned by crawl.inlinks() and crawl.outlinks().
source (str | None): The source URL of the link.
destination (str | None): The destination URL of the link.
anchor_text (str | None): The link anchor text.
data (dict[str, Any]): All raw fields from the link row.
Class method:
  • Link.from_row(row: Mapping[str, Any]) -> Link

CrawlInfo

Metadata for a DB-mode crawl in the ProjectInstanceData directory. Returned by list_crawls().
db_id (str): The crawl UUID folder name.
url (str): The crawl start URL.
urls_crawled (int): Number of crawled URLs.
percent_complete (float): Crawl completion percentage.
modified (datetime): Last modified timestamp (UTC).
path (Path): Absolute path to the crawl folder.
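
These fields are enough to pick a crawl programmatically. The sketch below selects the most recently modified finished crawl; latest_finished_crawl is a hypothetical helper, and treating percent_complete as a 0 to 100 scale is an assumption.

```python
from typing import Any, Iterable, Optional


def latest_finished_crawl(crawls: Iterable[Any], min_percent: float = 100.0) -> Optional[Any]:
    """Return the most recently modified crawl at or above `min_percent`, or None."""
    finished = [info for info in crawls if info.percent_complete >= min_percent]
    return max(finished, key=lambda info: info.modified, default=None)
```

The helper takes the CrawlInfo objects returned by list_crawls() and reads only the percent_complete and modified attributes.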