
SearchAPI was built for one primary use case: collecting large volumes of relevant URLs across a domain to assemble LLM fine-tuning datasets. The workflow is straightforward — pick a topic, call the endpoints that match the content types you need, deduplicate the results, and feed the URLs into a downloader or content extractor. This guide walks through that workflow with concrete examples.

Strategy overview

A good fine-tuning dataset draws from multiple content types. Different endpoints serve different purposes:
| Endpoint | Content type | Best for |
| --- | --- | --- |
| GET /search/ | Web pages | General knowledge, tutorials, documentation |
| GET /search/paper | Academic papers (DOIs) | Technical depth, citations, research findings |
| GET /searchpdfs/ | PDF documents | Reports, whitepapers, books in PDF form |
| GET /books/ | Books | Long-form content, structured knowledge |
| GET /news/ | News articles | Current events, recent developments |
| GET /search/specific/ | Any filetype | Domain-specific documents (docx, pptx, xlsx) |
| GET /repositories/ | GitHub + GitLab repos | Code, READMEs, examples |
Use GET /search/ (the general endpoint) rather than GET /search/engine when collecting at scale. The general endpoint rotates search engines randomly, distributing load and reducing the chance of any single engine blocking your requests.
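
The collection script in the next section uses a simple fetch helper that gives up on the first failure. At larger scale you may want to retry with backoff before moving on; the sketch below shows one way to do that (the retry count and backoff timings are arbitrary choices, not part of the API):

import time

import requests

BASE_URL = "http://localhost:8000"


def fetch_with_retries(path: str, params: dict, retries: int = 3) -> list[str]:
    """Call a SearchAPI endpoint, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            response = requests.get(f"{BASE_URL}{path}", params=params, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            if attempt == retries - 1:
                print(f"Request to {path} failed after {retries} attempts: {e}")
                return []
            time.sleep(2 ** attempt)  # wait 1s, then 2s, between attempts
    return []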

Collect URLs with Python

The following script collects URLs across several content types for a given topic. It uses the requests library and writes deduplicated results to a file.
collect_dataset_urls.py
import requests
import time

BASE_URL = "http://localhost:8000"
TOPIC = "transformer architecture"
LIMIT = 20
DELAY = 2  # seconds between requests — be polite to upstream engines


def fetch(path: str, params: dict) -> list[str]:
    try:
        response = requests.get(f"{BASE_URL}{path}", params=params, timeout=30)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Request to {path} failed: {e}")
        return []


all_urls: list[str] = []

# General web pages
web = fetch("/search/", {"query": TOPIC, "limit": LIMIT})
print(f"Web: {len(web)} results")
all_urls.extend(web)
time.sleep(DELAY)

# Academic papers (returns DOIs — resolve to https://doi.org/<doi> for full URLs)
papers = fetch("/search/paper", {"query": TOPIC, "limit": LIMIT})
paper_urls = [f"https://doi.org/{doi}" for doi in papers if doi != "No URL"]
print(f"Papers: {len(paper_urls)} results")
all_urls.extend(paper_urls)
time.sleep(DELAY)

# PDFs
pdfs = fetch("/searchpdfs/", {"query": TOPIC, "limits": LIMIT})  # note: this endpoint takes "limits" (plural)
print(f"PDFs: {len(pdfs)} results")
all_urls.extend(pdfs)
time.sleep(DELAY)

# News articles
news = fetch("/news/", {"query": TOPIC, "limit": LIMIT})
print(f"News: {len(news)} results")
all_urls.extend(news)
time.sleep(DELAY)

# Repositories
repos = fetch("/repositories/", {"query": TOPIC, "limit": LIMIT})
print(f"Repositories: {len(repos)} results")
all_urls.extend(repos)
time.sleep(DELAY)

# Deduplicate while preserving order
seen: set[str] = set()
unique_urls: list[str] = []
for url in all_urls:
    if url not in seen and url != "No URL":
        seen.add(url)
        unique_urls.append(url)

print(f"\nTotal unique URLs: {len(unique_urls)}")

with open("dataset_urls.txt", "w") as f:
    for url in unique_urls:
        f.write(url + "\n")

print("Saved to dataset_urls.txt")

Improve diversity with multiple queries

A single query rarely covers a domain fully. Run several related queries and combine the results.
multi_query.py
import requests
import time

BASE_URL = "http://localhost:8000"
DELAY = 2

QUERIES = [
    "transformer architecture",
    "self-attention mechanism",
    "BERT pre-training",
    "GPT language model",
    "vision transformer ViT",
]

all_urls: list[str] = []

for query in QUERIES:
    response = requests.get(
        f"{BASE_URL}/search/",
        params={"query": query, "limit": 15},
        timeout=30,
    )
    if response.ok:
        results = response.json()
        all_urls.extend(results)
        print(f"'{query}': {len(results)} results")
    time.sleep(DELAY)

unique_urls = list(dict.fromkeys(u for u in all_urls if u != "No URL"))
print(f"\nTotal unique URLs across all queries: {len(unique_urls)}")

Collect domain-specific documents

For structured documents — slide decks, reports, spreadsheets — use the filetype endpoint.
# Collect PowerPoint presentations on transformer architecture
curl "http://localhost:8000/search/specific/?query=transformer+architecture&filetype=pptx&limit=20"

# Collect Word documents
curl "http://localhost:8000/search/specific/?query=transformer+architecture&filetype=docx&limit=20"
filetype_collection.py
import requests

BASE_URL = "http://localhost:8000"
TOPIC = "transformer architecture"
FILETYPES = ["pptx", "docx", "pdf"]

for filetype in FILETYPES:
    response = requests.get(
        f"{BASE_URL}/search/specific/",
        params={"query": TOPIC, "filetype": filetype, "limit": 20},
        timeout=30,
    )
    if response.ok:
        urls = response.json()
        print(f"{filetype}: {len(urls)} results")
        for url in urls:
            print(f"  {url}")

Combine results and deduplicate

After collecting from multiple endpoints and queries, deduplicate before downloading content. The order-preserving approach below keeps the first occurrence of each URL.
deduplicate.py
def deduplicate(urls: list[str]) -> list[str]:
    seen: set[str] = set()
    result: list[str] = []
    for url in urls:
        if url not in seen and url != "No URL":
            seen.add(url)
            result.append(url)
    return result
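
Exact string matching misses near-duplicates that differ only in scheme or host casing, a trailing slash, or a fragment. A light normalization pass before deduplication (a sketch using only the standard library, not part of SearchAPI) catches many of these:

from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host, strip any trailing slash, and drop the fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


# For example: unique_urls = deduplicate([normalize(u) for u in all_urls])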
The /search/paper endpoint returns DOIs, not direct URLs. Prepend https://doi.org/ to resolve them to full paper URLs, or use the DOI directly to fetch metadata from the Crossref API.
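
For example, a minimal metadata lookup against the public Crossref REST API (the https://api.crossref.org/works/<doi> route and the "message" field follow Crossref's documented response format) might look like this:

import requests


def crossref_metadata(doi: str) -> dict:
    """Fetch work metadata for a DOI from the Crossref REST API."""
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    response.raise_for_status()
    return response.json()["message"]


# Substitute a DOI returned by /search/paper:
# meta = crossref_metadata("10.xxxx/example")
# print(meta.get("title"), meta.get("URL"))

Note that not every DOI resolves through Crossref; DOIs registered elsewhere (for example with DataCite) will return a 404 from this API.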

Example curl commands

# Web pages
curl "http://localhost:8000/search/?query=transformer+architecture&limit=20"

# Academic papers (returns DOIs)
curl "http://localhost:8000/search/paper?query=transformer+architecture&limit=20"

# PDFs
curl "http://localhost:8000/searchpdfs/?query=transformer+architecture&limits=20"

# Books
curl "http://localhost:8000/books/?query=transformer+architecture&limit=20"

# News
curl "http://localhost:8000/news/?query=transformer+architecture&limit=20"

# Repositories
curl "http://localhost:8000/repositories/?query=transformer+architecture&limit=20"

# Wikipedia and Wikimedia
curl "http://localhost:8000/wiki/?query=transformer+architecture&limit=20"

What to do with the URLs

Once you have a list of URLs, the next step is downloading and extracting content with a downloader or content extractor of your choice. Feed the extracted text into your fine-tuning pipeline after cleaning and formatting it to your target schema.
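
As a starting point, a minimal downloader over dataset_urls.txt (a sketch — the raw/ output directory, file naming, and one-second delay are illustrative choices, not prescribed by SearchAPI) could look like:

import os
import time

import requests

os.makedirs("raw", exist_ok=True)

with open("dataset_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Save the raw response body; text extraction and cleaning happen downstream.
        with open(f"raw/{i:05d}.bin", "wb") as out:
            out.write(response.content)
    except requests.RequestException as e:
        print(f"Skipping {url}: {e}")
    time.sleep(1)  # be polite to the target servers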
