DuckDB is the default analysis engine for DB-backed crawl workflows. Derby remains the crawl source-of-truth — DuckDB is a fast analytics layer on top of it.
DuckDB executes analytical queries over wide page/link tables orders of magnitude faster than Derby’s OLTP engine.
Zero-dependency portability
A .duckdb file is a single portable file. Share it with teammates or load it on any machine without installing Java or Screaming Frog.
Lean cold-cache start
Default DB-backed loads create a tiny sidecar DuckDB first and keep Derby prewarmed as the lazy source backend. Heavier relations materialise only when you ask for them.
Namespace support
One .duckdb file can hold multiple crawls under separate namespaces — useful for portfolio reporting and crawl-over-crawl diffs.
The default if_exists="auto" rebuilds the DuckDB cache only when the Derby source has changed (detected via source fingerprint). Subsequent loads that find a fresh cache skip the export step entirely.
# Safe to call repeatedly: rebuilds only when the Derby source changed
derby_crawl.export_duckdb("./crawl.duckdb", if_exists="auto")
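How the source fingerprint is computed is not described here. As a rough illustration of the idea only, the sketch below hashes the relative path, size, and modification time of every file under a source directory and compares the result to a stored value. The names `source_fingerprint` and `needs_rebuild` are hypothetical and are not part of the library.

```python
import hashlib
from pathlib import Path
from typing import Optional


def source_fingerprint(source_dir: str) -> str:
    """Hash the relative path, size, and mtime of every file under source_dir."""
    root = Path(source_dir)
    digest = hashlib.sha256()
    # Sort so the fingerprint does not depend on directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        record = f"{path.relative_to(root).as_posix()}|{stat.st_size}|{stat.st_mtime_ns}"
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def needs_rebuild(source_dir: str, stored_fingerprint: Optional[str]) -> bool:
    """Rebuild when nothing was stored yet or the source no longer matches."""
    if stored_fingerprint is None:
        return True
    return source_fingerprint(source_dir) != stored_fingerprint
```

Hashing metadata rather than file contents keeps the check cheap on large crawl databases, at the cost of missing a change that preserves both size and modification time.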
A single .duckdb file can store multiple crawls in separate namespaces:
# Export two crawls into the same file
derby_crawl.export_duckdb("./portfolio.duckdb", namespace="client-a", if_exists="auto")
other_crawl.export_duckdb("./portfolio.duckdb", namespace="client-b", if_exists="auto")

# List available namespaces
namespaces = Crawl.duckdb_namespaces("./portfolio.duckdb")
print(namespaces)  # ['client-a', 'client-b']

# Load a specific crawl by namespace
client_a = Crawl.from_duckdb("./portfolio.duckdb", namespace="client-a")
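For crawl-over-crawl diffs, two namespaces holding successive crawls of the same site could be compared along these lines. This is a sketch that works on plain dictionaries. It assumes the collected rows can be treated as mappings keyed by column name, which is not confirmed here. `diff_crawls` is a hypothetical helper and the sample rows are illustrative, not real crawl data.

```python
from typing import Dict, Iterable, List, Mapping


def diff_crawls(
    before: Iterable[Mapping[str, object]],
    after: Iterable[Mapping[str, object]],
    key: str = "Address",
    column: str = "Status Code",
) -> Dict[str, List]:
    """Compare two sets of page rows by URL and report what changed."""
    old = {row[key]: row[column] for row in before}
    new = {row[key]: row[column] for row in after}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # (url, old value, new value) for URLs present in both crawls
        "changed": sorted(
            (url, old[url], new[url])
            for url in old.keys() & new.keys()
            if old[url] != new[url]
        ),
    }


# Illustrative rows only. In practice they might come from each namespace via
# something like pages().select("Address", "Status Code").collect().
earlier = [
    {"Address": "https://example.com/", "Status Code": 200},
    {"Address": "https://example.com/old", "Status Code": 200},
]
later = [
    {"Address": "https://example.com/", "Status Code": 200},
    {"Address": "https://example.com/old", "Status Code": 404},
    {"Address": "https://example.com/new", "Status Code": 200},
]

report = diff_crawls(earlier, later)
print(report["added"])
print(report["changed"])
```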
When a DuckDB cache is lean (freshly created or not yet fully materialised), the library falls back to progressively heavier paths, in this order:
1. Tiny sidecar DuckDB created
A lean sidecar DuckDB is created first. Derby remains prewarmed as the lazy source backend.
2. Source-backed projections on first use
crawl.pages().select(...) and crawl.links(...).select(...) read directly from the Derby source backend via one-shot projections, avoiding wide internal_all / all_inlinks materialisation on first use.
3. Narrow helper relations materialised on demand
When DuckDB does need cached subsets, it materialises narrow helper relations (internal_common, links_core) instead of exporting full wide tables.
4. Full tab materialisation on explicit request
Heavier relations (e.g. full internal_all, all_inlinks) are written to DuckDB only when explicitly requested via tabs="all" or a direct crawl.tab(...) call that triggers a cache miss.
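The on-demand behaviour described in the steps above can be pictured with a small, library-independent sketch. `LazyCache` and the loader functions are hypothetical names used for illustration only. They do not reflect the library's internals, and the relation names are borrowed from the text purely as labels.

```python
from typing import Callable, Dict, List


class LazyCache:
    """Materialise a relation only the first time it is requested."""

    def __init__(self, loaders: Dict[str, Callable[[], List[dict]]]) -> None:
        self._loaders = loaders  # relation name -> function that reads the source
        self._materialised: Dict[str, List[dict]] = {}

    def relation(self, name: str) -> List[dict]:
        if name not in self._materialised:  # cache miss: read the source once
            self._materialised[name] = self._loaders[name]()
        return self._materialised[name]

    def materialise_all(self) -> None:
        """Counterpart of an explicit full export: load every known relation."""
        for name in self._loaders:
            self.relation(name)

    @property
    def cached(self) -> List[str]:
        return sorted(self._materialised)


calls: List[str] = []  # records which loaders actually ran


def load_links_core() -> List[dict]:
    calls.append("links_core")
    return [{"Source": "https://example.com/", "Address": "https://example.com/a"}]


def load_all_inlinks() -> List[dict]:
    calls.append("all_inlinks")
    return [{"Source": "https://example.com/", "Address": "https://example.com/a"}]


cache = LazyCache({"links_core": load_links_core, "all_inlinks": load_all_inlinks})
cache.relation("links_core")
cache.relation("links_core")  # served from the cache, loader not called again
print(calls)
```

At this point only the narrow relation has been read. The wide one stays untouched until `materialise_all()` or a direct request for it.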
Export directly from a Derby path or DB crawl ID without loading the full Crawl object:
from screamingfrog.db.duckdb import (
    export_duckdb_from_derby,
    export_duckdb_from_db_id,
)

# Export from a .dbseospider path
export_duckdb_from_derby(
    "./crawl.dbseospider",
    "./crawl.duckdb",
    tabs="all",
    if_exists="auto",
)

# Export directly from an internal DB crawl ID
export_duckdb_from_db_id(
    "138edb21-61d0-41cd-9e9b-725b592a471c",
    "./crawl.duckdb",
    tabs="all",
    if_exists="auto",
)
from screamingfrog import Crawl

# 1. Load Derby crawl and export to DuckDB
derby_crawl = Crawl.load("./crawl.dbseospider", dbseospider_backend="derby", csv_fallback=False)
derby_crawl.export_duckdb("./crawl.duckdb", if_exists="auto")

# 2. Load fast DuckDB-backed crawl
fast = Crawl.load("./crawl.duckdb")

# 3. Run analytics
pages_404 = fast.pages().filter(status_code=404).collect()
lightweight = fast.pages().select("Address", "Status Code", "Title 1").collect()
broken_inlinks = fast.links("in").select("Source", "Address", "Status Code").filter(status_code=404).collect()
matching_pages = fast.search("canonical", fields=["Address", "Title 1"]).collect()

# 4. Portfolio: two crawls in one file
other_crawl = Crawl.load("./other.dbseospider", dbseospider_backend="derby", csv_fallback=False)
derby_crawl.export_duckdb("./portfolio.duckdb", namespace="client-a", if_exists="auto")
other_crawl.export_duckdb("./portfolio.duckdb", namespace="client-b", if_exists="auto")
client_a = Crawl.from_duckdb("./portfolio.duckdb", namespace="client-a")
client_b = Crawl.from_duckdb("./portfolio.duckdb", namespace="client-b")
DuckDB caches are derived from the Derby source-of-truth. If you re-crawl a site, re-export the DuckDB cache or use if_exists="auto" to trigger an automatic refresh when the Derby source fingerprint changes.