OpenGround is an on-device RAG (Retrieval-Augmented Generation) system designed to give AI agents controlled access to documentation. Everything runs locally - no external APIs, no data leaves your machine.

System Overview

OpenGround follows a pipeline architecture with three main stages:
      ┌─────────────────────────────────────────────────────────────────────┐
      │                           OPENGROUND                                │
      ├─────────────────────────────────────────────────────────────────────┤
      │                                                                     │
      │       SOURCE                  PROCESS              STORAGE/CLIENT   │
      │                                                                     │
      │    ┌──────────┐      ┌───────────┐   ┌──────────┐   ┌──────────┐    │
      │    │ git repo ├─────>│  Extract  ├──>│  Chunk   ├──>│ LanceDB  │    │
      │    │   -or-   │      │ (raw_data)│   │   Text   │   │ (vector  │    │
      │    │ sitemap  │      └───────────┘   └──────────┘   │  +BM25)  │    │
      │    │   -or-   │                           │         └────┬─────┘    │
      │    │ local dir│                           │              │          │
      │    └──────────┘                           │              │          │
      │                                           ▼              │          │
      │                                    ┌───────────┐         │          │
      │                                    │   Local   │<────────┘          │
      │                                    │ Embedding │         │          │
      │                                    │   Model   │         ▼          │
      │                                    └───────────┘  ┌─────────────┐   │
      │                                                   │ CLI / MCP   │   │
      │                                                   │  (hybrid    │   │
      │                                                   │   search)   │   │
      │                                                   └─────────────┘   │
      │                                                                     │
      └─────────────────────────────────────────────────────────────────────┘

Architecture Stages

1. Source Layer

The source layer handles documentation ingestion from multiple source types. See the Sources page for detailed information.

Supported Sources:
  • Git Repositories: Clone and extract documentation from specific branches/tags
  • Sitemaps: Crawl and extract web documentation following sitemap.xml
  • Local Paths: Process documentation from local directories
Key Components:
  • extract/git.py: Handles git repository cloning with sparse checkout
  • extract/sitemap.py: Fetches and parses sitemaps, respects robots.txt
  • extract/local_path.py: Processes local file system paths
  • extract/common.py: Shared file processing logic

2. Processing Layer

The processing layer transforms raw documentation into searchable chunks.

Text Extraction

OpenGround supports multiple documentation formats:
# Handled in extract/common.py (simplified sketch)
def remove_front_matter(content: str) -> tuple[str, dict[str, str]]:
    """Parse YAML front matter and extract metadata"""
    if not content.startswith("---"):
        return content, {}
    # Split on the opening/closing "---" markers, then parse "key: value"
    # lines for title, description, etc.
    _, front_matter, body = content.split("---", 2)
    metadata = {
        key.strip(): value.strip()
        for key, sep, value in (ln.partition(":") for ln in front_matter.splitlines())
        if sep
    }
    return body.lstrip(), metadata
Supported file types: .md, .mdx, .rst, .txt, .ipynb, .html, .htm

Document Chunking

Documents are split into overlapping chunks for better retrieval (from ingest.py:52-76):
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(page: ParsedPage) -> list[dict]:
    config = get_effective_config()
    chunk_size = config["embeddings"]["chunk_size"]        # Default: 800
    chunk_overlap = config["embeddings"]["chunk_overlap"]  # Default: 200
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_text(page["content"])
    
    # Each chunk preserves metadata: url, title, version, library_name
    records = []
    for idx, chunk in enumerate(chunks):
        records.append({
            "url": page["url"],
            "library_name": page["library_name"],
            "version": page["version"],
            "title": page["title"],
            "content": chunk,
            "chunk_index": idx,
        })
    return records
Chunk overlap ensures that context isn’t lost at chunk boundaries, improving retrieval quality.
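The effect of overlap can be illustrated with a plain character-window sketch. This is not OpenGround's splitter: RecursiveCharacterTextSplitter additionally prefers paragraph, sentence, and word boundaries over fixed offsets, but the window arithmetic is the same.

```python
def sliding_chunks(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where each window repeats the
    last `overlap` characters of the previous one."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is what keeps boundary context retrievable.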

Embedding Generation

Each chunk is converted to a vector embedding using a local model. See Embeddings for details.

3. Storage Layer

OpenGround uses LanceDB for storing both vector embeddings and full-text search indices.

Why LanceDB?

  • Columnar storage: Efficient for vector operations
  • Built-in BM25: Full-text search without external dependencies
  • Local-first: No server setup required
  • PyArrow integration: Fast data serialization

Schema Structure

From ingest.py:163-177, the LanceDB table schema:
schema = pa.schema(
    [
        pa.field("url", pa.string()),
        pa.field("library_name", pa.string()),
        pa.field("version", pa.string()),
        pa.field("title", pa.string()),
        pa.field("description", pa.string()),
        pa.field("last_modified", pa.string()),
        pa.field("content", pa.string()),              # Text for BM25
        pa.field("chunk_index", pa.int64()),
        pa.field("vector", pa.list_(pa.float32(), 384)), # Embedding vector
    ],
    metadata={
        "embedding_backend": "fastembed",
        "embedding_model": "BAAI/bge-small-en-v1.5"
    }
)
The schema metadata tracks which embedding model was used, preventing incompatible searches.
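A hypothetical guard sketches how that metadata can be enforced at query time (check_embedding_compat is illustrative, not a function in the codebase):

```python
def check_embedding_compat(table_metadata: dict, configured_model: str) -> None:
    """Refuse to search a table built with a different embedding model,
    since vectors from different models are not comparable."""
    stored = table_metadata.get("embedding_model")
    if stored and stored != configured_model:
        raise ValueError(
            f"table was embedded with {stored!r}, but the configured model is "
            f"{configured_model!r}; re-ingest or switch models"
        )
```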

Full-Text Index

After ingesting chunks, OpenGround creates a BM25 full-text search index (from ingest.py:223-226):
table.add(all_records)
table.create_fts_index("content", replace=True)
This enables hybrid search combining semantic similarity and keyword matching.
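The BM25 index itself lives inside LanceDB, but the scoring idea is compact enough to sketch. This standalone version (toy tokenized docs, standard k1/b defaults) shows why exact keyword hits rank documents that pure vector similarity might miss:

```python
import math
from collections import Counter

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n            # average doc length
    df = Counter(t for d in docs for t in set(d))    # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```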

4. Query/Client Layer

The client layer exposes documentation through two interfaces:

CLI Commands

# Search documentation
openground query "how to configure embeddings" -l fastapi -v latest

# List available libraries
openground list

# Get library statistics
openground stats show

MCP Server

The Model Context Protocol (MCP) server exposes OpenGround to AI agents:
# From server.py
tools = [
    {"name": "search_documentation", ...},
    {"name": "list_libraries", ...},
    {"name": "get_full_content", ...}
]
AI agents can search documentation without polluting the main conversation context.
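MCP tools declare their inputs with JSON Schema. A hypothetical expansion of the search_documentation entry might look like the following (the parameter names and descriptions are illustrative of the MCP tool shape, not copied from server.py):

```json
{
  "name": "search_documentation",
  "description": "Hybrid search over locally ingested documentation",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "library": { "type": "string" },
      "version": { "type": "string" }
    },
    "required": ["query"]
  }
}
```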

Data Flow Example

Let’s trace a complete flow from adding documentation to searching it:
Step 1: Add Documentation

openground add fastapi \
  --source https://github.com/tiangolo/fastapi.git \
  --docs-path docs/ \
  --version v0.100.0 -y
  1. Git extractor clones repo with sparse checkout
  2. Filters for .md, .mdx files in docs/
  3. Extracts content and metadata
  4. Saves to ~/.local/share/openground/raw_data/fastapi/v0.100.0/
Step 2: Chunk & Embed

  1. Load parsed pages from raw_data directory
  2. Split each page into 800-character chunks with 200-char overlap
  3. Generate embeddings for all chunks (batch size: 32)
  4. Store in LanceDB with metadata
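The batching in step 3 can be sketched as a simple generator (batch size 32 matches the default config; `batched` here is an illustrative helper, not OpenGround's code):

```python
from typing import Iterator

def batched(items: list, size: int = 32) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items, so the embedding
    model processes a bounded amount of text per call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```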
Step 3: Search

# User query
query = "how to add dependencies"

# Generate query embedding
query_vec = generate_embeddings([query])[0]

# Hybrid search (vector + BM25)
results = (
    table.search(query_type="hybrid")
    .text(query)
    .vector(query_vec)
    .where("version = 'v0.100.0'")
    .limit(5)
    .to_list()
)
Returns ranked results combining semantic similarity and keyword relevance.
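Hybrid search has to merge two differently scored result lists. LanceDB's default hybrid reranking is based on reciprocal rank fusion (RRF), which can be sketched in a few lines; this is a standalone illustration, not LanceDB's implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each appearance contributes 1 / (k + rank),
    so documents ranked well by both lists float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that vector distances and BM25 scores live on incomparable scales.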

Configuration

OpenGround’s behavior is controlled through a hierarchical configuration system (from config.py):
# ~/.config/openground/config.json
{
  "db_path": "~/.local/share/openground/lancedb",
  "table_name": "documents",
  "raw_data_dir": "~/.local/share/openground/raw_data",
  "extraction": {
    "concurrency_limit": 50
  },
  "embeddings": {
    "batch_size": 32,
    "chunk_size": 800,
    "chunk_overlap": 200,
    "embedding_model": "BAAI/bge-small-en-v1.5",
    "embedding_dimensions": 384,
    "embedding_backend": "fastembed"
  },
  "query": {
    "top_k": 5
  },
  "sources": {
    "auto_add_local": true
  }
}
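User settings in this file override built-in defaults key by key rather than wholesale. A sketch of that kind of merge (get_effective_config's exact behavior may differ):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on `defaults` without mutating either,
    so setting one embeddings key keeps the other defaults intact."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```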

XDG Compliance

OpenGround follows the XDG Base Directory Specification (from config.py:10-24):
  • Config: $XDG_CONFIG_HOME/openground or ~/.config/openground
  • Data: $XDG_DATA_HOME/openground or ~/.local/share/openground
  • Windows: Uses AppData/Local/openground
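The lookup order above can be sketched as follows (a simplified stand-in for the config.py logic, not a copy of it):

```python
import os
from pathlib import Path

def config_dir(app: str = "openground") -> Path:
    """Resolve the per-user config directory following the XDG spec,
    with a Windows fallback under AppData/Local."""
    if os.name == "nt":
        base = Path(os.environ.get("LOCALAPPDATA", str(Path.home() / "AppData" / "Local")))
        return base / app
    xdg = os.environ.get("XDG_CONFIG_HOME")
    return (Path(xdg) if xdg else Path.home() / ".config") / app
```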

Component Isolation

Each component is designed for independence:
  • Extractors output standardized ParsedPage objects
  • Ingestion works with any ParsedPage source
  • Query operates on LanceDB tables regardless of source
  • Embedding backends are swappable (sentence-transformers ↔ fastembed)
This modularity enables:
  • Adding new source types without changing ingestion
  • Swapping embedding models without changing extraction
  • Independent testing of each component
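The ParsedPage contract that makes this isolation work can be inferred from the fields chunk_document and the LanceDB schema read. A hypothetical TypedDict capturing that shape:

```python
from typing import TypedDict

class ParsedPage(TypedDict):
    """Standardized extractor output (fields inferred from usage elsewhere
    in this page; the real definition may differ)."""
    url: str
    library_name: str
    version: str
    title: str
    description: str
    last_modified: str
    content: str
```

Any new extractor that emits this shape plugs into ingestion unchanged.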

Next Steps

Sources

Learn how OpenGround extracts documentation from git, sitemaps, and local paths

Embeddings

Understand embedding backends, models, and dimensions

Search

Deep dive into hybrid search with vector similarity and BM25

Configuration

Customize OpenGround’s behavior with config options
