Pages and links

The Crawl object exposes first-class views so you can work with pages and links at a high level. When a DuckDB cache exists, these views use its fast paths; otherwise they fall back automatically to the Derby source backend.

Page view

crawl.pages() returns a PageView, a sitewide mapped view of pages backed by the internal page model.

from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

# Iterate all pages
for row in crawl.pages():
    print(row["Address"], row["Status Code"])

# Filter to 404s and collect as a list
pages_404 = crawl.pages().filter(status_code=404).collect()

crawl.pages() reads from the internal model directly. It does not force internal_all tab materialisation on a cold DuckDB cache, so lightweight page workflows stay fast.
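
As an illustration of a lightweight page workflow, the sketch below tallies pages by status code. It assumes each row behaves like a dict with a "Status Code" key, as in the loop above. `count_status_codes` is a hypothetical helper (not part of the library), and `sample_rows` is made-up stand-in data so the snippet runs without a crawl file.

```python
from collections import Counter


def count_status_codes(rows):
    """Tally dict-like rows by their "Status Code" value."""
    return Counter(row["Status Code"] for row in rows)


# Stand-in rows (assumption); in practice you would pass crawl.pages() instead.
sample_rows = [
    {"Address": "https://example.com/", "Status Code": 200},
    {"Address": "https://example.com/old", "Status Code": 404},
    {"Address": "https://example.com/about", "Status Code": 200},
]

status_counts = count_status_codes(sample_rows)
for code, total in status_counts.most_common():
    print(code, total)
```

Because the helper only iterates once and reads a single key, it should pair naturally with a narrow projection that includes "Status Code".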

Narrow projections with `.select()`

Pass explicit field names to avoid pulling every column. This projects through a shared helper relation on DuckDB caches:

lightweight = (
    crawl.pages()
    .select("Address", "Status Code", "Title 1")
    .collect()
)

projected = (
    crawl.pages()
    .select("Address", "Status Code", "Title 1")
    .filter(status_code=404)
    .collect()
)
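
A narrow projection is a convenient shape for export. The following sketch writes projected rows to CSV text using only the standard library. It assumes the collected rows are dicts keyed by the selected field names. `rows_to_csv` is a hypothetical helper and `sample_rows` is stand-in data, not real crawl output.

```python
import csv
import io


def rows_to_csv(rows, fields):
    """Serialise dict-like rows to CSV text, keeping only the named fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells rather than raising.
        writer.writerow({name: row.get(name, "") for name in fields})
    return buffer.getvalue()


# Stand-in for the result of .select(...).collect() (assumption).
sample_rows = [
    {"Address": "https://example.com/old", "Status Code": 404, "Title 1": "Old page"},
]

csv_text = rows_to_csv(sample_rows, ["Address", "Status Code", "Title 1"])
print(csv_text)
```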

`crawl.internal` — typed InternalView

crawl.internal is a property that returns an InternalView, yielding InternalPage objects instead of plain dicts:

for page in crawl.internal.filter(status_code=404):
    print(page.address)

On Derby-backed crawls, crawl.internal also materialises computed mapped fields such as Indexability and Indexability Status.

Link views

crawl.links(direction) returns a LinkView backed by the cached link tabs when available, or the source backend when the cache is lean.

# Inlinks (pages pointing in)
nofollow_inlinks = crawl.links("in").filter(rel="nofollow").collect()

# Outlinks (pages pointing out)
outlinks = crawl.links("out").collect()

Narrow link projections

broken_inlinks = (
    crawl.links("in")
    .select("Source", "Address", "Status Code")
    .filter(status_code=404)
    .collect()
)

Per-URL inlinks and outlinks

For Derby-backed crawls, use crawl.inlinks(url) and crawl.outlinks(url) to read links for a specific URL directly:

for link in crawl.inlinks("https://example.com/page"):
    if link.data.get("NoFollow"):
        print(link.source, "->", link.destination, link.data.get("Rel"))
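
To show one way the per-URL results might be aggregated, the sketch below groups nofollow links by destination. `StandInLink` is a hypothetical stand-in that mirrors only the attributes used in the loop above (`source`, `destination`, `data`); it is not the library's link class, and the sample links are invented.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class StandInLink:
    """Hypothetical stand-in exposing source, destination, and data."""
    source: str
    destination: str
    data: dict = field(default_factory=dict)


def nofollow_sources(links):
    """Map each destination to the sorted sources linking to it with NoFollow set."""
    grouped = defaultdict(set)
    for link in links:
        if link.data.get("NoFollow"):
            grouped[link.destination].add(link.source)
    return {dest: sorted(sources) for dest, sources in grouped.items()}


# Invented sample links; in practice you would pass crawl.inlinks(url).
sample_links = [
    StandInLink("https://example.com/a", "https://example.com/page", {"NoFollow": True}),
    StandInLink("https://example.com/b", "https://example.com/page", {"NoFollow": False}),
    StandInLink("https://example.com/c", "https://example.com/page", {"NoFollow": True}),
]

nofollow_by_destination = nofollow_sources(sample_links)
print(nofollow_by_destination)
```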

Section views

crawl.section(prefix) scopes any page or link query to a URL path prefix:

blog_pages = crawl.section("/blog").pages().collect()
blog_outlinks = crawl.section("/blog").links("out").collect()

# Access a specific tab scoped to the section
blog_inlinks = crawl.section("/blog").tab("all_inlinks").collect()

Pass a path prefix like /blog for broad matching, or a full URL prefix like https://example.com/blog for host-specific scoping.
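
The difference between the two prefix styles can be pictured with a small standalone sketch. This is not the library's implementation: `in_section` is a hypothetical function that uses plain string prefix matching, and the real matching rules (for example, how path boundaries or trailing slashes are treated) may differ.

```python
from urllib.parse import urlsplit


def in_section(url, prefix):
    """Return True if url falls under prefix (path prefix or full URL prefix)."""
    if prefix.startswith("/"):
        # Path-only prefix: compare against the URL path, regardless of host.
        return urlsplit(url).path.startswith(prefix)
    # Full URL prefix: compare against the whole URL, so the host must match too.
    return url.startswith(prefix)


print(in_section("https://example.com/blog/post", "/blog"))
print(in_section("https://shop.example.com/blog/post", "/blog"))
print(in_section("https://shop.example.com/blog/post", "https://example.com/blog"))
```

Under these assumptions the path prefix matches blog URLs on both hosts, while the full URL prefix matches only URLs on example.com.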

Search

crawl.search(term, fields) searches across the sitewide page view. Link views can be searched the same way with .search():

matching_pages = crawl.search("canonical", fields=["Address", "Title 1"]).collect()

# Search nofollow links
nofollow_links = crawl.links("in").search("nofollow", fields=["Follow"]).collect()

Omit fields to search all available columns. Pass case_sensitive=True to use exact case matching.
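
The effect of `fields` and `case_sensitive` can be pictured with a plain-Python sketch over dict rows. It assumes substring matching, which the text above does not confirm, and `search_rows` is a hypothetical function rather than the library's search code. The sample rows are invented.

```python
def search_rows(rows, term, fields=None, case_sensitive=False):
    """Return rows where term occurs as a substring in any of the chosen fields."""
    needle = term if case_sensitive else term.lower()
    matches = []
    for row in rows:
        # With no fields given, look at every column present on the row.
        names = fields if fields is not None else list(row.keys())
        for name in names:
            value = row.get(name)
            if value is None:
                continue
            text = str(value) if case_sensitive else str(value).lower()
            if needle in text:
                matches.append(row)
                break  # one hit per row is enough
    return matches


sample_rows = [
    {"Address": "https://example.com/a", "Title 1": "Canonical tags explained"},
    {"Address": "https://example.com/b", "Title 1": "Redirect chains"},
]

insensitive = search_rows(sample_rows, "canonical", fields=["Address", "Title 1"])
sensitive = search_rows(sample_rows, "canonical", case_sensitive=True)
print(len(insensitive), len(sensitive))
```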

View methods reference

All views share a consistent set of methods:

.filter(**kwargs)

Apply column filters. Returns the same view type, so calls are chainable.

.select(*fields)

Project a subset of fields. Available on PageView and LinkView.

.count()

Return the number of matching rows without collecting them.

.collect()

Materialise all matching rows as a Python list.

.first()

Return the first matching row, or None.

.to_pandas() / .to_polars()

Convert results to a pandas or polars DataFrame. Each method requires the corresponding optional dependency to be installed.

Crawl summary

crawl.summary() returns a dict of high-level crawl counts for monitoring and automation:

summary = crawl.summary()
print(summary)
# {
#   "pages": 4832,
#   "tabs": 52,
#   "broken_pages": 14,
#   ...
# }

Core counts (pages, tabs) are always populated. Issue-family and chain totals may be None on lean DuckDB caches until those tab families are materialised.
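
Since some totals may be None, monitoring code should avoid treating every value as a number. The sketch below shows one defensive way to render the dict. `describe_summary` is a hypothetical helper, and `sample_summary` is stand-in data that reuses the key names from the example above with an assumed None value.

```python
def describe_summary(summary):
    """Render each summary entry, marking None values as not yet materialised."""
    lines = []
    for key, value in summary.items():
        shown = "not materialised" if value is None else value
        lines.append(f"{key}: {shown}")
    return lines


# Stand-in for crawl.summary() on a lean cache (assumption).
sample_summary = {"pages": 4832, "tabs": 52, "broken_pages": None}

summary_lines = describe_summary(sample_summary)
for line in summary_lines:
    print(line)
```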

Full example: 404 pages and broken inlinks

from screamingfrog import Crawl

crawl = Crawl.load("./crawl.dbseospider")

# All 404 pages
pages_404 = crawl.pages().filter(status_code=404).collect()
for row in pages_404:
    print(row["Address"])

# Inlinks pointing to 404 pages
broken_inlinks = (
    crawl.links("in")
    .select("Source", "Address", "Status Code")
    .filter(status_code=404)
    .collect()
)

# Nofollow inlinks
nofollow_inlinks = crawl.links("in").filter(rel="nofollow").collect()

# Blog section pages
blog_pages = crawl.section("/blog").pages().collect()
