Browser and Web Search Engine Internals

Spy Search’s web retrieval layer is split into two distinct modes: a fast DuckDuckGo path optimised for sub-1.5-second round trips, and a deep Playwright-based crawl path that trades latency for thoroughness. The right mode is chosen by which agent is active in the pipeline — Quick_searcher for speed, Search_agent for depth.

DuckDuckGo Search (`DuckSearch`)

File: src/browser/duckduckgo.py DuckSearch wraps langchain_community.tools.DuckDuckGoSearchResults and adds an async content-extraction layer with aggressive timeout budgets so that the overall search-to-result latency stays at or under 1.5 seconds.

`search_result(query, k=6)`

The primary search method. It:

Calls DuckDuckGoSearchResults (text backend) to fetch the top-k results. Each result contains title, snippet, and link.
Spawns async content-extraction tasks for each URL using aiohttp.
Returns results enriched with a full_content field — a 300-character extract of the page’s first meaningful paragraph text.

results = duck.search_result("latest transformer architectures", k=6)
# Each result: {"title": "...", "snippet": "...", "link": "...", "full_content": "..."}

`today_new(category)`

Fetches the latest news headlines using the DuckDuckGo news backend. Accepts a category string ("technology", "finance", "entertainment", "sports", "world", "health") and returns up to 8 headlines.

headlines = duck.today_new("technology")

Ultra-fast content extraction

The _extract_content_fast(session, url) coroutine is the core of the speed strategy:

Reads only the first 20 KB of each response body — enough to capture the article lede without downloading the full page.
Extracts the first 15 <p> tags using selectolax, stopping early once 200 characters of usable text have been collected.
Falls back to a regex <p>…</p> scan if selectolax fails.
Trims the final output to 300 characters.

Timeout budget

All timeouts are tuned to keep the total wall-clock time under 1.5 seconds:

Budget	Value	Scope
Per-request connect	100 ms	TCP connection
Per-request read	300 ms	Response body
Per-request total	400 ms	Full single request
Gather timeout	1.2 s	All concurrent extractions
Hard ceiling	1.5 s	Entire `search_result()` call

If the 1.2-second asyncio.gather deadline is exceeded, content extraction is abandoned and results are returned with full_content: "". If the overall call exceeds 1.5 seconds, search_result returns an empty list to fail fast rather than block the pipeline.

Caching

DuckSearch maintains two in-process caches to avoid redundant network calls within a session:

_content_cache — a plain dict capped at 100 entries (oldest 20 evicted when full) mapping URLs to their extracted text.
_failed_urls — a set of URLs that previously returned non-200 responses or raised exceptions; these are skipped immediately on future calls.
_is_valid_url — an @lru_cache(maxsize=1000) decorator on the URL validation helper.

crawl4ai Deep Crawler (`Crawl`)

File: src/browser/crawl_ai.py Crawl wraps the crawl4ai library, which drives a Playwright headless Chromium browser. It is suited for JavaScript-rendered pages, paginated search results, and cases where a plain HTTP fetch would not yield meaningful content.

`get_url_llm(url, query)`

Navigates to url (typically a search-engine results page like https://google.com/search?q=…) and uses an LLMExtractionStrategy to instruct the configured model to identify the top 5 relevant links on that page relative to query. Returns a list of {url, title, description} objects.

urls = await crawl.get_url_llm(
    "https://google.com/search?q=transformer+architecture",
    "transformer architecture"
)

`get_summary(urls, query)`

Fetches each URL in the list (using arun_many for concurrency) and instructs the LLM to extract:

Field	Description
`title`	Main heading of the page
`summary`	300–400 word detailed summary relevant to `query`
`brief_summary`	2–3 sentence condensed summary
`keywords`	List of key phrases
`url`	Canonical page URL

PDF URLs are detected via Content-Type header (or the %PDF- magic bytes) and routed through get_pdf_summary(), which downloads the file, converts it to Markdown with markitdown, and summarises it with the LLM.

Choosing the Right Mode

Mode	Agent	Typical Speed	Content Depth
DuckDuckGo	`Quick_searcher`	~1.5 s	Snippet + 300-char page extract
Deep crawl (crawl4ai)	`Search_agent`	10–30 s	Full-page LLM summary (300–400 words)

Playwright requires browser binaries to be installed before the deep-crawl mode can function. Run playwright install chromium after installing Python dependencies. The provided Docker image and installation.sh setup script handle this step automatically.

Use the quick-searcher agent for interactive chat and real-time Q&A where latency matters. Switch to the searcher agent when generating in-depth research reports where content quality and completeness outweigh response time.

Getting Started

Configuration

Core Features

Architecture

Contributing

Browser and Web Search Engine Internals

DuckDuckGo Search (`DuckSearch`)

`search_result(query, k=6)`

`today_new(category)`

Ultra-fast content extraction

Timeout budget

Caching

crawl4ai Deep Crawler (`Crawl`)

`get_url_llm(url, query)`

`get_summary(urls, query)`

Choosing the Right Mode

Build docs developers (and LLMs) love

Getting Started

Configuration

Core Features

Architecture

Contributing

Documentation Index

​DuckDuckGo Search (DuckSearch)

​search_result(query, k=6)

​today_new(category)

​Ultra-fast content extraction

​Timeout budget

​Caching

​crawl4ai Deep Crawler (Crawl)

​get_url_llm(url, query)

​get_summary(urls, query)

​Choosing the Right Mode

Build docs developers (and LLMs) love

DuckDuckGo Search (`DuckSearch`)

`search_result(query, k=6)`

`today_new(category)`

Ultra-fast content extraction

Timeout budget

Caching

crawl4ai Deep Crawler (`Crawl`)

`get_url_llm(url, query)`

`get_summary(urls, query)`

Choosing the Right Mode