Spy Search’s web retrieval layer is split into two distinct modes: a fast DuckDuckGo path optimised for sub-1.5-second round trips, and a deep Playwright-based crawl path that trades latency for thoroughness. The right mode is chosen by which agent is active in the pipeline —Documentation Index
Fetch the complete documentation index at: https://mintlify.com/JasonHonKL/spy-search/llms.txt
Use this file to discover all available pages before exploring further.
Quick_searcher for speed, Search_agent for depth.
DuckDuckGo Search (DuckSearch)
File: src/browser/duckduckgo.py
DuckSearch wraps langchain_community.tools.DuckDuckGoSearchResults and adds an async content-extraction layer with aggressive timeout budgets so that the overall search-to-result latency stays at or under 1.5 seconds.
search_result(query, k=6)
The primary search method. It:
- Calls
DuckDuckGoSearchResults(text backend) to fetch the top-kresults. Each result containstitle,snippet, andlink. - Spawns async content-extraction tasks for each URL using
aiohttp. - Returns results enriched with a
full_contentfield — a 300-character extract of the page’s first meaningful paragraph text.
today_new(category)
Fetches the latest news headlines using the DuckDuckGo news backend. Accepts a category string ("technology", "finance", "entertainment", "sports", "world", "health") and returns up to 8 headlines.
Ultra-fast content extraction
The_extract_content_fast(session, url) coroutine is the core of the speed strategy:
- Reads only the first 20 KB of each response body — enough to capture the article lede without downloading the full page.
- Extracts the first 15
<p>tags using selectolax, stopping early once 200 characters of usable text have been collected. - Falls back to a regex
<p>…</p>scan if selectolax fails. - Trims the final output to 300 characters.
Timeout budget
All timeouts are tuned to keep the total wall-clock time under 1.5 seconds:| Budget | Value | Scope |
|---|---|---|
| Per-request connect | 100 ms | TCP connection |
| Per-request read | 300 ms | Response body |
| Per-request total | 400 ms | Full single request |
| Gather timeout | 1.2 s | All concurrent extractions |
| Hard ceiling | 1.5 s | Entire search_result() call |
asyncio.gather deadline is exceeded, content extraction is abandoned and results are returned with full_content: "". If the overall call exceeds 1.5 seconds, search_result returns an empty list to fail fast rather than block the pipeline.
Caching
DuckSearch maintains two in-process caches to avoid redundant network calls within a session:
_content_cache— a plaindictcapped at 100 entries (oldest 20 evicted when full) mapping URLs to their extracted text._failed_urls— asetof URLs that previously returned non-200 responses or raised exceptions; these are skipped immediately on future calls._is_valid_url— an@lru_cache(maxsize=1000)decorator on the URL validation helper.
crawl4ai Deep Crawler (Crawl)
File: src/browser/crawl_ai.py
Crawl wraps the crawl4ai library, which drives a Playwright headless Chromium browser. It is suited for JavaScript-rendered pages, paginated search results, and cases where a plain HTTP fetch would not yield meaningful content.
get_url_llm(url, query)
Navigates to url (typically a search-engine results page like https://google.com/search?q=…) and uses an LLMExtractionStrategy to instruct the configured model to identify the top 5 relevant links on that page relative to query. Returns a list of {url, title, description} objects.
get_summary(urls, query)
Fetches each URL in the list (using arun_many for concurrency) and instructs the LLM to extract:
| Field | Description |
|---|---|
title | Main heading of the page |
summary | 300–400 word detailed summary relevant to query |
brief_summary | 2–3 sentence condensed summary |
keywords | List of key phrases |
url | Canonical page URL |
Content-Type header (or the %PDF- magic bytes) and routed through get_pdf_summary(), which downloads the file, converts it to Markdown with markitdown, and summarises it with the LLM.
Choosing the Right Mode
| Mode | Agent | Typical Speed | Content Depth |
|---|---|---|---|
| DuckDuckGo | Quick_searcher | ~1.5 s | Snippet + 300-char page extract |
| Deep crawl (crawl4ai) | Search_agent | 10–30 s | Full-page LLM summary (300–400 words) |
Playwright requires browser binaries to be installed before the deep-crawl mode can function. Run
playwright install chromium after installing Python dependencies. The provided Docker image and installation.sh setup script handle this step automatically.