OpenGround uses hybrid search combining two complementary techniques: vector similarity (semantic search) and BM25 (keyword search). This approach finds relevant documentation even when queries use different terminology than the source material.

Hybrid Search Architecture

Why Hybrid?

Vector search (semantic) strengths:
  • Finds conceptually similar content
  • Works with synonyms and paraphrasing
  • Understands context and intent
Example:
Query: "how to install packages"
Matches:
- "Adding dependencies to your project"
- "Package management guide"
- "Setting up requirements.txt"
Vector search weaknesses:
  • May miss exact technical terms
  • Can retrieve overly broad matches
BM25 keyword search covers exactly these gaps: it rewards exact term matches (API names, flags, error strings) that embeddings can blur together. Combining the two gives better coverage than either alone.

Query Flow

Let’s trace a search query through OpenGround’s system.

1. Query Input

From the CLI or MCP server:
openground query "how to configure embeddings" -l openground -v latest
Or from an AI agent via MCP:
{
  "tool": "search_documentation",
  "arguments": {
    "query": "how to configure embeddings",
    "library_name": "openground",
    "version": "latest"
  }
}

2. Query Embedding

From query.py:94, the query is converted to a vector:
from openground.embeddings import generate_embeddings

query_vec = generate_embeddings([query], show_progress=show_progress)[0]
# Returns: [0.23, -0.15, 0.87, ...] (384 dimensions)
Query embedding uses the same model as document embedding, ensuring they’re in the same vector space.

3. Hybrid Search Execution

From query.py:96-104, LanceDB performs the hybrid search:
def search(
    query: str,
    version: str,
    db_path: Path = DEFAULT_DB_PATH,
    table_name: str = DEFAULT_TABLE_NAME,
    library_name: Optional[str] = None,
    top_k: int = 10,
    show_progress: bool = True,
) -> str:
    table = _get_table(db_path, table_name)
    query_vec = generate_embeddings([query], show_progress=show_progress)[0]
    
    # Build hybrid search
    search_builder = (
        table.search(query_type="hybrid")
        .text(query)         # BM25 component
        .vector(query_vec)   # Vector component
    )
    
    # Apply filters
    safe_version = _escape_sql_string(version)
    search_builder = search_builder.where(f"version = '{safe_version}'")
    
    if library_name:
        safe_name = _escape_sql_string(library_name)
        search_builder = search_builder.where(f"library_name = '{safe_name}'")
    
    # Execute and return top K
    results = search_builder.limit(top_k).to_list()
Breaking the builder down:

1. Query type: hybrid
   table.search(query_type="hybrid")
   Tells LanceDB to combine vector and BM25 search.

2. BM25 component
   .text(query)
   Performs keyword search using the full-text index on the content field.

3. Vector component
   .vector(query_vec)
   Performs cosine similarity search in the vector space.

4. Metadata filtering
   .where(f"version = '{safe_version}'")
   .where(f"library_name = '{safe_name}'")
   Filters results to a specific library/version before ranking.

5. Limit results
   .limit(top_k)
   Returns only the top K highest-scoring chunks.

4. Result Ranking

LanceDB internally combines scores from both search types:
# Simplified conceptual model (actual implementation is in LanceDB)
for chunk in chunks:
    # Vector similarity (cosine)
    vector_score = cosine_similarity(query_vec, chunk.vector)
    
    # BM25 score
    bm25_score = bm25(query_text, chunk.content)
    
    # Combine (LanceDB uses learned fusion)
    combined_score = merge_scores(vector_score, bm25_score)
    
    chunk.score = combined_score

# Sort by combined score and return top K
results = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
LanceDB’s hybrid search uses sophisticated score fusion techniques. The exact algorithm is internal to LanceDB.
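The fusion step is internal to LanceDB, but the general idea can be illustrated with Reciprocal Rank Fusion (RRF), a widely used score-fusion technique for hybrid search. This is a sketch of the technique, not LanceDB's actual implementation; the document IDs are made up:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each list contributes 1 / (k + rank) per document; k dampens the
    influence of any single list's top positions.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

vector_ranking = ["doc_b", "doc_a", "doc_c"]  # order from semantic search
bm25_ranking = ["doc_a", "doc_c", "doc_b"]    # order from keyword search

fused = rrf_fuse([vector_ranking, bm25_ranking])
# doc_a, ranked #2 and #1 across the two lists, ends up first overall
```

RRF only needs ranks, not raw scores, which is why it is popular for fusing lists whose scores live on incomparable scales (cosine similarity vs. BM25).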

5. Result Formatting

From query.py:110-134, results are formatted for the user:
if not results:
    return "Found 0 matches."

lines = [f"Found {len(results)} match{'es' if len(results) != 1 else ''}."]
for idx, item in enumerate(results, start=1):
    title = item.get("title") or "(no title)"
    snippet = (item.get("content") or "").strip()
    source = item.get("url") or "unknown"
    item_version = item.get("version") or version
    score = item.get("_distance") or item.get("_score")
    
    score_str = ""
    if isinstance(score, (int, float)):
        score_str = f", score={score:.4f}"
    
    # Embed tool call hint for fetching full content
    tool_hint = json.dumps(
        {"tool": "get_full_content", "url": source, "version": item_version}
    )
    
    lines.append(
        f'{idx}. **{title}**: "{snippet}" (Source: {source}, Version: {item_version}{score_str})\n'
        f"   To get full page content: {tool_hint}"
    )

return "\n".join(lines)
Example output:
Found 3 matches.
1. **Configuration**: "OpenGround's behavior is controlled through a hierarchical..." (Source: https://github.com/user/repo/docs/config.md, Version: latest, score=0.8234)
   To get full page content: {"tool": "get_full_content", "url": "https://...", "version": "latest"}
2. **Embedding Settings**: "You can configure the embedding model and backend..." (Source: https://github.com/user/repo/docs/embeddings.md, Version: latest, score=0.7891)
   To get full page content: {"tool": "get_full_content", "url": "https://...", "version": "latest"}
...
BM25 Keyword Search

BM25 (Best Matching 25) is a probabilistic ranking function for keyword search.

BM25 Index Creation

From ingest.py:223-226, the full-text index is created after ingestion:
table.add(all_records)  # Add chunks with embeddings

try:
    table.create_fts_index("content", replace=True)
except Exception as exc:
    print(f"FTS index creation skipped: {exc}")
The content field is indexed for full-text search. This enables BM25 scoring on chunk text.

How BM25 Works

BM25 ranks documents based on two main signals: term frequency (TF), how often a query term appears in the document, and inverse document frequency (IDF), how rare the term is across the corpus. TF is the simpler of the two:
# Simplified
tf = count(term, document) / len(document)

# Example
Query: "embeddings"
Doc A: "embeddings" appears 5 times in 100 words → high TF
Doc B: "embeddings" appears 1 time in 100 words → low TF
Saturation: BM25 applies diminishing returns, so 5 mentions isn't 5x better than 1.
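The saturation behavior can be made concrete with the standard per-term BM25 formula (a self-contained sketch; k1 = 1.5 and b = 0.75 are conventional defaults, and the idf value is made up for illustration):

```python
def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    idf: float, k1: float = 1.5, b: float = 0.75) -> float:
    """Per-term BM25 contribution with length normalization and TF saturation."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + length_norm)

# Diminishing returns: 5 mentions scores well under 5x one mention.
one_mention = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, idf=2.0)
five_mentions = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100, idf=2.0)
# five_mentions / one_mention ≈ 1.9, not 5
```

The `tf + length_norm` denominator is what caps the benefit of repetition: as tf grows, the term's contribution approaches `idf * (k1 + 1)` asymptotically.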

BM25 Example

# Query: "configure embedding model"

# Document A: "To configure the embedding model, use config.json..."
# - "configure": 1 occurrence, medium IDF
# - "embedding": 1 occurrence, low IDF (common in docs)
# - "model": 1 occurrence, low IDF (common)
# BM25 Score: 3.2

# Document B: "The embedding model configuration allows you to..."
# - "embedding": 1 occurrence
# - "model": 1 occurrence  
# - "configuration": 1 occurrence (synonym of "configure")
# BM25 Score: 2.1 (lower - missed "configure")

# Vector search might rank B higher due to semantic similarity,
# but BM25 ensures A gets boosted for exact term match.
Vector Similarity Search

Vector search finds chunks with embeddings close to the query embedding.

Cosine Similarity

From a mathematical perspective:
import numpy as np

def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between two vectors."""
    dot_product = np.dot(vec_a, vec_b)
    norm_a = np.linalg.norm(vec_a)
    norm_b = np.linalg.norm(vec_b)
    return dot_product / (norm_a * norm_b)

# Example
query_vec = [0.5, 0.3, 0.8, ...]  # 384 dimensions
chunk_vec = [0.6, 0.2, 0.7, ...]  # 384 dimensions

similarity = cosine_similarity(query_vec, chunk_vec)
# Returns: 0.92 (very similar)
Cosine similarity measures the angle between vectors, not their magnitude. Values range from -1 (opposite) to 1 (identical).

Normalized Embeddings

From embeddings.py:154, embeddings are normalized:
batch_embeddings = model.encode(
    sentences=batch,
    normalize_embeddings=True,  # L2 normalization
    ...
)
Normalization benefits:
  • Embeddings have unit length (magnitude = 1)
  • Cosine similarity simplifies to dot product
  • Faster computation: similarity = dot(a, b) instead of dot(a, b) / (norm(a) * norm(b))
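The equivalence is easy to check directly. A standalone illustration using only the standard library; the real embeddings come from the model, while these are random unit vectors of the same dimensionality:

```python
import math
import random

def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm = 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
a = l2_normalize([rng.gauss(0, 1) for _ in range(384)])
b = l2_normalize([rng.gauss(0, 1) for _ in range(384)])

# Full cosine similarity: dot product divided by the product of norms.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# For unit vectors the norms are 1, so the dot product alone suffices.
assert abs(dot(a, b) - cosine) < 1e-9
```

Skipping the two norm computations per comparison is a meaningful saving when a query is scored against thousands of chunk vectors.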

Approximate Nearest Neighbor

LanceDB uses ANN (Approximate Nearest Neighbor) indexes for fast vector search:
# Exact search (slow for large datasets)
for chunk in all_chunks:
    scores.append(cosine_similarity(query_vec, chunk.vector))
results = top_k(scores)
# O(n) where n = number of chunks

# ANN search (fast)
results = ann_index.search(query_vec, k=top_k)
# O(log n) with high accuracy
LanceDB automatically builds ANN indexes for the vector field.

Metadata Filtering

From query.py:98-103, filters are applied before ranking:
# SQL-style WHERE clauses
search_builder = search_builder.where(f"version = '{safe_version}'")

if library_name:
    search_builder = search_builder.where(f"library_name = '{safe_name}'")

Why Filter First?

# Bad: Search all, then filter
results = search_all_chunks(query)  # 1M chunks
filtered = [r for r in results if r.version == "v1.0.0"]  # 1K chunks
# Wastes time searching irrelevant chunks

# Good: Filter, then search
filtered_chunks = chunks.where("version = 'v1.0.0'")  # 1K chunks
results = search(filtered_chunks, query)
# Only searches relevant chunks

SQL Injection Prevention

From query.py:46-65, user input is escaped:
def _escape_sql_string(value: str) -> str:
    """Escape a string value for safe use in LanceDB SQL WHERE clauses."""
    # Remove null bytes
    value = value.replace("\x00", "")
    # Escape backslashes first
    value = value.replace("\\", "\\\\")
    # Escape single quotes (SQL standard: ' becomes '')
    value = value.replace("'", "''")
    return value

# Usage
safe_version = _escape_sql_string(version)  # "v'1.0.0" → "v''1.0.0"
search_builder.where(f"version = '{safe_version}'")
Always escape user input in SQL WHERE clauses to prevent injection attacks.
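As a quick illustration of what the escaping buys, the same helper (repeated here so the snippet is self-contained) neutralizes a classic injection payload:

```python
def _escape_sql_string(value: str) -> str:
    """Escape a string value for safe use in LanceDB SQL WHERE clauses."""
    value = value.replace("\x00", "")   # remove null bytes
    value = value.replace("\\", "\\\\")  # escape backslashes first
    value = value.replace("'", "''")    # SQL standard: ' becomes ''
    return value

malicious = "latest' OR '1'='1"
escaped = _escape_sql_string(malicious)
clause = f"version = '{escaped}'"
# Every quote in the payload is doubled, so the whole input stays a
# single string literal instead of terminating the clause early:
# version = 'latest'' OR ''1''=''1'
```

The filter then simply matches no rows, rather than matching everything.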

Retrieving Full Content

Search results contain chunk content (800 chars). To get the full page, use get_full_content (from query.py:211-251):
def get_full_content(
    url: str,
    version: str,
    db_path: Path = DEFAULT_DB_PATH,
    table_name: str = DEFAULT_TABLE_NAME,
) -> str:
    """Retrieve the full content of a document by its URL and version."""
    table = _get_table(db_path, table_name)
    
    # Query all chunks for this URL and version
    safe_url = _escape_sql_string(url)
    safe_version = _escape_sql_string(version)
    df = (
        table.search()
        .where(f"url = '{safe_url}' AND version = '{safe_version}'")
        .select(["title", "content", "chunk_index"])
        .to_pandas()
    )
    
    if df.empty:
        return f"No content found for URL: {url} (version: {version})"
    
    # Sort by chunk_index and concatenate content
    df = df.sort_values("chunk_index")
    full_content = "\n\n".join(df["content"].tolist())
    
    title = df.iloc[0].get("title", "(no title)")
    return f"# {title}\n\nSource: {url}\nVersion: {version}\n\n{full_content}"
The retrieval proceeds in four steps:

1. Query all chunks
   Find all chunks belonging to the same URL and version.

2. Sort by chunk index
   Ensure chunks are in original order (chunk_index: 0, 1, 2, …).

3. Concatenate content
   Join chunk content with double newlines to preserve formatting.

4. Format as Markdown
   Return the complete page with title, source, and full content.

Query Caching

From query.py:12-15, database connections are cached:
_db_cache: dict[str, Any] = {}
_table_cache: dict[tuple[str, str], Any] = {}
_metadata_cache: dict[tuple[str, str], dict[str, Any]] = {}

def _get_db(db_path: Path) -> "lancedb.DBConnection":
    """Get a cached database connection."""
    path_str = str(db_path)
    if path_str not in _db_cache:
        _db_cache[path_str] = lancedb.connect(path_str)
    return _db_cache[path_str]

def _get_table(db_path: Path, table_name: str) -> Optional["lancedb.table.Table"]:
    """Get a cached table handle."""
    cache_key = (str(db_path), table_name)
    if cache_key not in _table_cache:
        db = _get_db(db_path)
        if table_name not in db.table_names():
            return None
        _table_cache[cache_key] = db.open_table(table_name)
    return _table_cache[cache_key]
Caching avoids reconnecting to the database for every query. This is especially important for the MCP server, which handles many sequential requests over one process lifetime.
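As a design note, the module-level dicts above implement memoization by hand. The same pattern can be written with functools.lru_cache (an alternative sketch, not OpenGround's code; the returned dict stands in for a real connection object):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_db(path_str: str) -> dict:
    # Stand-in for lancedb.connect(path_str); runs once per distinct path.
    return {"path": path_str}

# Repeated calls with the same path return the exact same object.
assert get_db("/tmp/db") is get_db("/tmp/db")
```

The explicit-dict approach in query.py has one advantage over lru_cache here: the cache can be inspected or selectively invalidated, which matters if a table is dropped and recreated.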

Search Configuration

From config.py:69, default top K:
DEFAULT_TOP_K = 5
Configure with:
# Return more results by default
openground config set query.top_k 10

# Or specify per query
openground query "my query" --top-k 20
Choosing top K:
  • Small (3-5): Precise, focused results for AI agents
  • Medium (10-15): Good for exploratory queries
  • Large (20+): Comprehensive coverage, but may include noise

Performance Characteristics

# Typical latency breakdown
Query embedding:     50-200ms  (depends on model/hardware)
Vector search:       10-50ms   (ANN index)
BM25 search:         5-20ms    (full-text index)
Score fusion:        1-5ms     (LanceDB internal)
Metadata filtering:  <1ms      (indexed columns)
Total:               ~100-300ms
Factors:
  • Embedding model speed (GPU vs CPU)
  • Number of chunks in database
  • Complexity of filters
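To profile your own setup, each stage can be wrapped in a small timer. A sketch using time.perf_counter; the sleep calls stand in for the real generate_embeddings and LanceDB calls:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(label: str):
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    yield
    timings[label] = (time.perf_counter() - start) * 1000

with timed("query_embedding"):
    time.sleep(0.01)   # stand-in for generate_embeddings([query])

with timed("hybrid_search"):
    time.sleep(0.005)  # stand-in for search_builder.limit(top_k).to_list()
```

Comparing the recorded stages against the table above quickly shows whether the embedding model or the database is the bottleneck on your hardware.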

Search Quality Tips

Good queries:
  • “how to configure embeddings”
  • “FastAPI dependency injection”
  • “error handling best practices”
Poor queries:
  • “stuff” (too vague)
  • “asdfasdf” (gibberish)
  • Single words without context
BM25 rewards exact matches:
# Good: Uses specific API name
"openground config set embeddings.embedding_model"

# Less good: Vague
"change settings for vectors"
Always specify version for accurate results:
# Good: Specific version
openground query "new features" -l fastapi -v v0.100.0

# Risk: Might mix results from old versions
openground query "new features" -l fastapi -v latest
Search results are chunks (800 chars). For complete context:
# 1. Search to find relevant page
results = search("installation guide")

# 2. Get full content of the page
full_page = get_full_content(results[0].url, results[0].version)

Next Steps

Architecture

See how search fits into OpenGround’s architecture

Embeddings

Deep dive into the vector embeddings powering semantic search

Sources

Learn what documentation can be searched

CLI Reference

Complete reference for the query command
