Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/JasonHonKL/spy-search/llms.txt

Use this file to discover all available pages before exploring further.

Spy Search’s web retrieval layer is split into two distinct modes: a fast DuckDuckGo path optimised for sub-1.5-second round trips, and a deep Playwright-based crawl path that trades latency for thoroughness. The right mode is chosen by which agent is active in the pipeline — Quick_searcher for speed, Search_agent for depth.

DuckDuckGo Search (DuckSearch)

File: src/browser/duckduckgo.py DuckSearch wraps langchain_community.tools.DuckDuckGoSearchResults and adds an async content-extraction layer with aggressive timeout budgets so that the overall search-to-result latency stays at or under 1.5 seconds.

search_result(query, k=6)

The primary search method. It:
  1. Calls DuckDuckGoSearchResults (text backend) to fetch the top-k results. Each result contains title, snippet, and link.
  2. Spawns async content-extraction tasks for each URL using aiohttp.
  3. Returns results enriched with a full_content field — a 300-character extract of the page’s first meaningful paragraph text.
results = duck.search_result("latest transformer architectures", k=6)
# Each result: {"title": "...", "snippet": "...", "link": "...", "full_content": "..."}

today_new(category)

Fetches the latest news headlines using the DuckDuckGo news backend. Accepts a category string ("technology", "finance", "entertainment", "sports", "world", "health") and returns up to 8 headlines.
headlines = duck.today_new("technology")

Ultra-fast content extraction

The _extract_content_fast(session, url) coroutine is the core of the speed strategy:
  • Reads only the first 20 KB of each response body — enough to capture the article lede without downloading the full page.
  • Extracts the first 15 <p> tags using selectolax, stopping early once 200 characters of usable text have been collected.
  • Falls back to a regex <p>…</p> scan if selectolax fails.
  • Trims the final output to 300 characters.

Timeout budget

All timeouts are tuned to keep the total wall-clock time under 1.5 seconds:
BudgetValueScope
Per-request connect100 msTCP connection
Per-request read300 msResponse body
Per-request total400 msFull single request
Gather timeout1.2 sAll concurrent extractions
Hard ceiling1.5 sEntire search_result() call
If the 1.2-second asyncio.gather deadline is exceeded, content extraction is abandoned and results are returned with full_content: "". If the overall call exceeds 1.5 seconds, search_result returns an empty list to fail fast rather than block the pipeline.

Caching

DuckSearch maintains two in-process caches to avoid redundant network calls within a session:
  • _content_cache — a plain dict capped at 100 entries (oldest 20 evicted when full) mapping URLs to their extracted text.
  • _failed_urls — a set of URLs that previously returned non-200 responses or raised exceptions; these are skipped immediately on future calls.
  • _is_valid_url — an @lru_cache(maxsize=1000) decorator on the URL validation helper.

crawl4ai Deep Crawler (Crawl)

File: src/browser/crawl_ai.py Crawl wraps the crawl4ai library, which drives a Playwright headless Chromium browser. It is suited for JavaScript-rendered pages, paginated search results, and cases where a plain HTTP fetch would not yield meaningful content.

get_url_llm(url, query)

Navigates to url (typically a search-engine results page like https://google.com/search?q=…) and uses an LLMExtractionStrategy to instruct the configured model to identify the top 5 relevant links on that page relative to query. Returns a list of {url, title, description} objects.
urls = await crawl.get_url_llm(
    "https://google.com/search?q=transformer+architecture",
    "transformer architecture"
)

get_summary(urls, query)

Fetches each URL in the list (using arun_many for concurrency) and instructs the LLM to extract:
FieldDescription
titleMain heading of the page
summary300–400 word detailed summary relevant to query
brief_summary2–3 sentence condensed summary
keywordsList of key phrases
urlCanonical page URL
PDF URLs are detected via Content-Type header (or the %PDF- magic bytes) and routed through get_pdf_summary(), which downloads the file, converts it to Markdown with markitdown, and summarises it with the LLM.

Choosing the Right Mode

ModeAgentTypical SpeedContent Depth
DuckDuckGoQuick_searcher~1.5 sSnippet + 300-char page extract
Deep crawl (crawl4ai)Search_agent10–30 sFull-page LLM summary (300–400 words)
Playwright requires browser binaries to be installed before the deep-crawl mode can function. Run playwright install chromium after installing Python dependencies. The provided Docker image and installation.sh setup script handle this step automatically.
Use the quick-searcher agent for interactive chat and real-time Q&A where latency matters. Switch to the searcher agent when generating in-depth research reports where content quality and completeness outweigh response time.

Build docs developers (and LLMs) love