Sherpa architecture: stores, agents, and the ingest pipeline

Sherpa is built around a deliberate split: the three data stores run in Docker containers for reproducibility and isolation, while every part of the application core runs directly on the host. Understanding this boundary — and why it exists — is the key to understanding how Sherpa is deployed, extended, and operated.

Deployment model

Sherpa’s application core (FastAPI server, ingest pipeline, static analysis, Office→Markdown conversion, and AI agent execution) runs directly on the host and is never containerized. Only the three stores — PostgreSQL, Neo4j, and Elasticsearch — run in Docker.

The app core is intentionally host-native and this is a permanent architectural decision. Codex’s safety sandbox (systemd + Landlock + seccomp + bwrap) is host-native: wrapping the app in a standard Docker container causes the inner sandbox to collide with Docker’s namespace isolation, which either breaks Codex or requires --privileged mode — the opposite of the intended security posture. Additionally, the ingest pipeline scans network drives via WSL /mnt paths; containerization would break this access. If the native constraint were ever abandoned, both Codex safe execution and high-fidelity Office COM conversion would lose their justification.

Component overview

                          ┌─────────────────────────────────────────┐
                          │           HOST (Linux / WSL2)           │
                          │                                         │
  Browser ───────────────▶│  FastAPI  (port 8000)                   │
                          │     │                                   │
                          │     ├──▶  AI Agents                     │
                          │     │      ├── Codex (agentic grep)     │
                          │     │      ├── OpenAI  (function-call)  │
                          │     │      ├── Gemini  (function-call)  │
                          │     │      └── Ollama  (local LLM)      │
                          │     │                                   │
                          │     └──▶  Ingest pipeline               │
                          │           (scan → analyse → index)      │
                          └──────────────┬──────────────────────────┘
                                         │
                   ┌─────────────────────┼──────────────────────┐
                   │                     │                       │
                   ▼                     ▼                       ▼
          ┌────────────────┐  ┌──────────────────┐  ┌──────────────────┐
          │  PostgreSQL    │  │     Neo4j        │  │ Elasticsearch    │
          │  (Docker)      │  │     (Docker)     │  │   (Docker)       │
          │                │  │                  │  │                  │
          │ conversations  │  │ knowledge graph  │  │ BM25 full-text   │
          │ users/sessions │  │ per-world        │  │ per-world index  │
          │ audit log      │  │ partition        │  │ kuromoji         │
          │ world registry │  │                  │  │                  │
          │ file ledger    │  │                  │  │                  │
          └────────────────┘  └──────────────────┘  └──────────────────┘

Component roles

FastAPI (host)

The API server and static-file host for the web UI. Handles authentication, session management, conversation history, ingest triggering, and all routing between the browser, the stores, and the AI agents. Binds to 127.0.0.1:8000 by default; external access is expected to go through a reverse proxy.

PostgreSQL (Docker)

The single source of truth for relational data: conversations and messages, users and sessions, the audit log (hash-chain verified), the world registry (which folder is registered as which world), and the workspace file ledger. Data here is the primary backup target.

Neo4j (Docker)

The knowledge graph store. One graph partition per registered world. Holds structural and semantic edges (COPIES, INVOKES, CONTAINS, REALIZES, etc.) extracted during ingest. Powers impact-range analysis via reverse-neighbour traversal. Rebuilt atomically on every re-ingest.

Elasticsearch (Docker)

The full-text search index. One index per world, built with the analysis-kuromoji plugin for Japanese morphological analysis. Supports BM25 ranking and, optionally, vector embeddings. Each index carries a _meta.world_id tag so orphan indexes can be detected and cleaned up.

Codex agent (host)

An OpenAI Codex CLI instance run as an autonomous sub-agent on the host. Operates inside a workspace-write sandbox (read-only access to the KB filesystem, write access only to the user’s personal workspace). Used for agentic grep/search tasks.

OpenAI / Gemini / Ollama (host)

LLM providers used for function-calling agents (troubleshooting, specification queries). The agent calls grep, ES, and graph-neighbour tools iteratively to gather evidence, then synthesises an answer. Only plain text is sent externally; files are never uploaded to any AI provider.

Data directories

Location	Local development	Production	Purpose
`data/kb` / `SHERPA_KB_DIR`	`./data/kb`	`/srv/sherpa/kb`	Registered worlds — symlinked or direct references to source folders. Read-only for agents.
`data/derived` / `SHERPA_DERIVED_DIR`	`./data/derived/{world}/md/`	`/srv/sherpa/derived/`	Deterministic Markdown derivatives of Office files. Re-generatable; not a backup target.
`data/users` / `SHERPA_USERS_DIR`	`./data/users/{uid}/workspace`	`/srv/sherpa/users/{uid}/workspace`	Personal workspace files per user. Not indexed in ES or Neo4j. Only grep is applied here.

The data/ tree in local development is .gitignored. Derived files and ES/Neo4j indexes are fully regeneratable by re-running ingest. The only content that requires careful handling is the personal workspace (users/…/workspace) — it is the user’s own output, not a cache.

Ingest pipeline

Every registered world is processed by the same pipeline, triggered explicitly (via the admin UI, the /worlds/{id}/refresh endpoint, or SHERPA_POLL_SECONDS polling) rather than by continuous file-watching:

Registered folder
      │
      ▼
  1. Folder scan
     (build document ledger: doc_id = root-relative path)
      │
      ├──▶ 2. Static analysis
      │         structural edges from file content
      │         (COPIES / INVOKES / CONTAINS / …)
      │              │
      │              ▼
      │         3. LLM semantic extraction
      │              additional semantic edges
      │              (REALIZES / DOCUMENTS / RELATES_TO / …)
      │              │
      │              ▼
      │         Neo4j atomic rebuild
      │         (world partition wiped, then reloaded)
      │
      └──▶ 4. Office → Markdown conversion
                deterministic MD written to data/derived/{world}/md/
                (source citation always points to the original binary)
                     │
                     ▼
               5. ES index update
                  BM25 (+ optional vector) chunks indexed
                  original path preserved as doc_id

The graph rebuild is world-scoped: the existing partition for that world is dropped and recreated from the same scan snapshot so no stale edges survive a re-ingest.

Reconciliation and self-healing

Orphaned derived artifacts — ES indexes, Neo4j partitions, and data/derived/{world}/ directories — are automatically cleaned up whenever an ingest, delete, or startup event occurs. No manual reconciliation is needed. The cleanup is deliberately conservative: it does nothing if the world registry cannot be read reliably, if directory enumeration fails due to permission or I/O errors, or if the derived path is misconfigured to overlap the source folder. The implementation lives in sherpa/reconcile.py.

OS-layer multi-defence (production)

Production deployments run four dedicated Unix users, provisioned by scripts/setup-runtime-users.sh:

User	Role	KB filesystem access	Workspace access
`sherpa-ingest`	Ingest worker, cascade deletes	Read + write (sole writer)	—
`sherpa-agent`	Codex execution	Read-only (ownership blocks write)	Read + write
`sherpa-api`	FastAPI process	Read-only	Via agent
`sherpa-workspace`	Shared group	—	Write-enabled group

The sherpa-agent user is deliberately not a member of the sherpa-ingest group, making it physically impossible — not just policy-prohibited — for the agent process to modify the shared KB filesystem. This pairs with the Codex workspace-write sandbox (L3) and systemd hardening (ReadOnlyPaths=/srv/sherpa/kb, ProtectSystem=strict) to form a five-layer defence.

Getting Started

Using Sherpa

Administration

Deployment

Sherpa architecture: stores, agents, and the ingest pipeline

Deployment model

Component overview

Component roles

FastAPI (host)

PostgreSQL (Docker)

Neo4j (Docker)

Elasticsearch (Docker)

Codex agent (host)

OpenAI / Gemini / Ollama (host)

Data directories

Ingest pipeline

Reconciliation and self-healing

OS-layer multi-defence (production)

Build docs developers (and LLMs) love

Getting Started

Using Sherpa

Administration

Deployment

Documentation Index

​Deployment model

​Component overview

​Component roles

FastAPI (host)

PostgreSQL (Docker)

Neo4j (Docker)

Elasticsearch (Docker)

Codex agent (host)

OpenAI / Gemini / Ollama (host)

​Data directories

​Ingest pipeline

​Reconciliation and self-healing

​OS-layer multi-defence (production)

Build docs developers (and LLMs) love

Deployment model

Component overview

Component roles

Data directories

Ingest pipeline

Reconciliation and self-healing

OS-layer multi-defence (production)