OpenGround is an on-device RAG (Retrieval-Augmented Generation) system designed to give AI agents controlled access to documentation. Everything runs locally - no external APIs, no data leaves your machine.

System Overview

OpenGround follows a pipeline architecture with three main stages:
      ┌─────────────────────────────────────────────────────────────────────┐
      │                           OPENGROUND                                │
      ├─────────────────────────────────────────────────────────────────────┤
      │                                                                     │
      │       SOURCE                  PROCESS              STORAGE/CLIENT   │
      │                                                                     │
      │    ┌──────────┐      ┌───────────┐   ┌──────────┐   ┌──────────┐    │
      │    │ git repo ├─────>│  Extract  ├──>│  Chunk   ├──>│ LanceDB  │    │
      │    │   -or-   │      │ (raw_data)│   │   Text   │   │ (vector  │    │
      │    │ sitemap  │      └───────────┘   └──────────┘   │  +BM25)  │    │
      │    │   -or-   │                           │         └────┬─────┘    │
      │    │ local dir│                           │              │          │
      │    └──────────┘                           │              │          │
      │                                           ▼              │          │
      │                                    ┌───────────┐         │          │
      │                                    │   Local   │<────────┘          │
      │                                    │ Embedding │         │          │
      │                                    │   Model   │         ▼          │
      │                                    └───────────┘  ┌─────────────┐   │
      │                                                   │ CLI / MCP   │   │
      │                                                   │  (hybrid    │   │
      │                                                   │   search)   │   │
      │                                                   └─────────────┘   │
      │                                                                     │
      └─────────────────────────────────────────────────────────────────────┘

Architecture Stages

1. Source Layer

The source layer handles documentation ingestion from multiple source types. See the Sources page for detailed information.

Supported Sources:
  • Git Repositories: Clone and extract documentation from specific branches/tags
  • Sitemaps: Crawl and extract web documentation following sitemap.xml
  • Local Paths: Process documentation from local directories
Key Components:
  • extract/git.py: Handles git repository cloning with sparse checkout
  • extract/sitemap.py: Fetches and parses sitemaps, respects robots.txt
  • extract/local_path.py: Processes local file system paths
  • extract/common.py: Shared file processing logic

2. Processing Layer

The processing layer transforms raw documentation into searchable chunks.

Text Extraction

OpenGround supports multiple documentation formats:
# Handled in extract/common.py (simplified sketch)
def remove_front_matter(content: str) -> tuple[str, dict[str, str]]:
    """Parse YAML front matter and extract metadata"""
    if not content.startswith("---"):
        return content, {}
    # Split on the opening/closing "---" markers, then parse "key: value"
    # lines for title, description, etc.
    _, front_matter, body = content.split("---", 2)
    metadata = {
        key.strip(): value.strip()
        for key, sep, value in (ln.partition(":") for ln in front_matter.splitlines())
        if sep
    }
    return body.lstrip(), metadata
Supported file types: .md, .mdx, .rst, .txt, .ipynb, .html, .htm

Document Chunking

Documents are split into overlapping chunks for better retrieval (from ingest.py:52-76):
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(page: ParsedPage) -> list[dict]:
    config = get_effective_config()
    chunk_size = config["embeddings"]["chunk_size"]        # Default: 800
    chunk_overlap = config["embeddings"]["chunk_overlap"]  # Default: 200
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_text(page["content"])
    
    # Each chunk preserves metadata: url, title, version, library_name
    records = []
    for idx, chunk in enumerate(chunks):
        records.append({
            "url": page["url"],
            "library_name": page["library_name"],
            "version": page["version"],
            "title": page["title"],
            "content": chunk,
            "chunk_index": idx,
        })
    return records
Chunk overlap ensures that context isn’t lost at chunk boundaries, improving retrieval quality.
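The effect of overlap can be illustrated with a plain character-window sketch. This is not OpenGround's splitter: RecursiveCharacterTextSplitter additionally prefers paragraph, sentence, and word boundaries over fixed offsets, but the window arithmetic is the same.

```python
def sliding_chunks(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows where each window repeats the
    last `overlap` characters of the previous one."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is what keeps boundary context retrievable.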

Embedding Generation

Each chunk is converted to a vector embedding using a local model. See Embeddings for details.

3. Storage Layer

OpenGround uses LanceDB for storing both vector embeddings and full-text search indices.

Why LanceDB?

  • Columnar storage: Efficient for vector operations
  • Built-in BM25: Full-text search without external dependencies
  • Local-first: No server setup required
  • PyArrow integration: Fast data serialization

Schema Structure

From ingest.py:163-177, the LanceDB table schema:
schema = pa.schema(
    [
        pa.field("url", pa.string()),
        pa.field("library_name", pa.string()),
        pa.field("version", pa.string()),
        pa.field("title", pa.string()),
        pa.field("description", pa.string()),
        pa.field("last_modified", pa.string()),
        pa.field("content", pa.string()),              # Text for BM25
        pa.field("chunk_index", pa.int64()),
        pa.field("vector", pa.list_(pa.float32(), 384)), # Embedding vector
    ],
    metadata={
        "embedding_backend": "fastembed",
        "embedding_model": "BAAI/bge-small-en-v1.5"
    }
)
The schema metadata tracks which embedding model was used, preventing incompatible searches.
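A hypothetical guard sketches how that metadata can be enforced at query time (check_embedding_compat is illustrative, not a function in the codebase):

```python
def check_embedding_compat(table_metadata: dict, configured_model: str) -> None:
    """Refuse to search a table built with a different embedding model,
    since vectors from different models are not comparable."""
    stored = table_metadata.get("embedding_model")
    if stored and stored != configured_model:
        raise ValueError(
            f"table was embedded with {stored!r}, but the configured model is "
            f"{configured_model!r}; re-ingest or switch models"
        )
```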

Full-Text Index

After ingesting chunks, OpenGround creates a BM25 full-text search index (from ingest.py:223-226):
table.add(all_records)
table.create_fts_index("content", replace=True)
This enables hybrid search combining semantic similarity and keyword matching.
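The BM25 index itself lives inside LanceDB, but the scoring idea is compact enough to sketch. This standalone version (toy tokenized docs, standard k1/b defaults) shows why exact keyword hits rank documents that pure vector similarity might miss:

```python
import math
from collections import Counter

def bm25_scores(query_terms: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized doc against the query terms with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n            # average doc length
    df = Counter(t for d in docs for t in set(d))    # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```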

4. Query/Client Layer

The client layer exposes documentation through two interfaces:

CLI Commands

# Search documentation
openground query "how to configure embeddings" -l fastapi -v latest

# List available libraries
openground list

# Get library statistics
openground stats show

MCP Server

The Model Context Protocol (MCP) server exposes OpenGround to AI agents:
# From server.py
tools = [
    {"name": "search_documentation", ...},
    {"name": "list_libraries", ...},
    {"name": "get_full_content", ...}
]
AI agents can search documentation without polluting the main conversation context.
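MCP tools declare their inputs with JSON Schema. A hypothetical expansion of the search_documentation entry might look like the following (the parameter names and descriptions are illustrative of the MCP tool shape, not copied from server.py):

```json
{
  "name": "search_documentation",
  "description": "Hybrid search over locally ingested documentation",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "library": { "type": "string" },
      "version": { "type": "string" }
    },
    "required": ["query"]
  }
}
```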

Data Flow Example

Let’s trace a complete flow from adding documentation to searching it:
Step 1: Add Documentation

openground add fastapi \
  --source https://github.com/tiangolo/fastapi.git \
  --docs-path docs/ \
  --version v0.100.0 -y
  1. Git extractor clones repo with sparse checkout
  2. Filters for .md, .mdx files in docs/
  3. Extracts content and metadata
  4. Saves to ~/.local/share/openground/raw_data/fastapi/v0.100.0/
Step 2: Chunk & Embed

  1. Load parsed pages from raw_data directory
  2. Split each page into 800-character chunks with 200-char overlap
  3. Generate embeddings for all chunks (batch size: 32)
  4. Store in LanceDB with metadata
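The batching in step 3 can be sketched as a simple generator (batch size 32 matches the default config; `batched` here is an illustrative helper, not OpenGround's code):

```python
from typing import Iterator

def batched(items: list, size: int = 32) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items, so the embedding
    model processes a bounded amount of text per call."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```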
Step 3: Search

# User query
query = "how to add dependencies"

# Generate query embedding
query_vec = generate_embeddings([query])[0]

# Hybrid search (vector + BM25)
results = (
    table.search(query_type="hybrid")
    .text(query)
    .vector(query_vec)
    .where("version = 'v0.100.0'")
    .limit(5)
    .to_list()
)
Returns ranked results combining semantic similarity and keyword relevance.
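Hybrid search has to merge two differently scored result lists. LanceDB's default hybrid reranking is based on reciprocal rank fusion (RRF), which can be sketched in a few lines; this is a standalone illustration, not LanceDB's implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each appearance contributes 1 / (k + rank),
    so documents ranked well by both lists float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that vector distances and BM25 scores live on incomparable scales.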

Configuration

OpenGround’s behavior is controlled through a hierarchical configuration system (from config.py):
# ~/.config/openground/config.json
{
  "db_path": "~/.local/share/openground/lancedb",
  "table_name": "documents",
  "raw_data_dir": "~/.local/share/openground/raw_data",
  "extraction": {
    "concurrency_limit": 50
  },
  "embeddings": {
    "batch_size": 32,
    "chunk_size": 800,
    "chunk_overlap": 200,
    "embedding_model": "BAAI/bge-small-en-v1.5",
    "embedding_dimensions": 384,
    "embedding_backend": "fastembed"
  },
  "query": {
    "top_k": 5
  },
  "sources": {
    "auto_add_local": true
  }
}
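User settings in this file override built-in defaults key by key rather than wholesale. A sketch of that kind of merge (get_effective_config's exact behavior may differ):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on `defaults` without mutating either,
    so setting one embeddings key keeps the other defaults intact."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```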

XDG Compliance

OpenGround follows the XDG Base Directory Specification (from config.py:10-24):
  • Config: $XDG_CONFIG_HOME/openground or ~/.config/openground
  • Data: $XDG_DATA_HOME/openground or ~/.local/share/openground
  • Windows: Uses AppData/Local/openground
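The lookup order above can be sketched as follows (a simplified stand-in for the config.py logic, not a copy of it):

```python
import os
from pathlib import Path

def config_dir(app: str = "openground") -> Path:
    """Resolve the per-user config directory following the XDG spec,
    with a Windows fallback under AppData/Local."""
    if os.name == "nt":
        base = Path(os.environ.get("LOCALAPPDATA", str(Path.home() / "AppData" / "Local")))
        return base / app
    xdg = os.environ.get("XDG_CONFIG_HOME")
    return (Path(xdg) if xdg else Path.home() / ".config") / app
```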

Component Isolation

Each component is designed for independence:
  • Extractors output standardized ParsedPage objects
  • Ingestion works with any ParsedPage source
  • Query operates on LanceDB tables regardless of source
  • Embedding backends are swappable (sentence-transformers ↔ fastembed)
This modularity enables:
  • Adding new source types without changing ingestion
  • Swapping embedding models without changing extraction
  • Independent testing of each component
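The ParsedPage contract that makes this isolation work can be inferred from the fields chunk_document and the LanceDB schema read. A hypothetical TypedDict capturing that shape:

```python
from typing import TypedDict

class ParsedPage(TypedDict):
    """Standardized extractor output (fields inferred from usage elsewhere
    in this page; the real definition may differ)."""
    url: str
    library_name: str
    version: str
    title: str
    description: str
    last_modified: str
    content: str
```

Any new extractor that emits this shape plugs into ingestion unchanged.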

Next Steps

Sources

Learn how OpenGround extracts documentation from git, sitemaps, and local paths

Embeddings

Understand embedding backends, models, and dimensions

Search

Deep dive into hybrid search with vector similarity and BM25

Configuration

Customize OpenGround’s behavior with config options
