Registering and ingesting document folders in Sherpa

The Ingest screen is where administrators register document folders and kick off the pipeline that transforms raw files into a searchable knowledge graph. Sherpa models its storage as a mirror of the filesystem: the structure you place on disk is the structure that becomes searchable — nothing is reorganised automatically, and nothing changes until you explicitly run an ingest. All ingest operations require the admin role when authentication is enabled.

What a world is

Every registered folder is called a world. A world is a one-to-one mapping between a directory on disk and a pair of data stores:

One Neo4j partition — the knowledge graph for that folder tree, holding all structural and semantic edges extracted from the files inside.
One Elasticsearch index — the full-text search index for the same scope.

The folder tree is the boundary. Files inside the registered root are in scope; everything outside is not. Subdirectories do not create separate worlds — they are sub-scopes of the same world and can be used as search or impact-analysis filters.
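The scope rule above can be sketched as a purely lexical path check. This is an illustration only, not Sherpa's implementation: the function names `in_world_scope` and `sub_scope` are hypothetical, and the sketch assumes already-normalised POSIX paths (it does not resolve symlinks or `..` segments).

```python
from pathlib import PurePosixPath


def in_world_scope(world_root: str, candidate: str) -> bool:
    """Return True if candidate is the world root itself or lies beneath it."""
    root = PurePosixPath(world_root)
    path = PurePosixPath(candidate)
    return path == root or root in path.parents


def sub_scope(world_root: str, candidate: str) -> str:
    """Return candidate relative to the world root, e.g. for use as a filter value."""
    return PurePosixPath(candidate).relative_to(PurePosixPath(world_root)).as_posix()
```

For example, with the root `/mnt/c/projects/specs`, the file `/mnt/c/projects/specs/batch/job.jcl` is in scope and its sub-scope path is `batch/job.jcl`, while anything under `/mnt/c/projects/other` is out of scope.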

Supported file types

Extension	Treatment
`.md`, `.txt` and plain-text source	Indexed as-is into Elasticsearch. No conversion.
`.cbl`, `.cob`, `.cobol`	Parsed by static analyser. Structural edges (COPIES, INVOKES, ACCESSES, etc.) extracted. Raw source is preserved.
`.cpy` (copybooks)	Parsed for `CONTAINS → DataItem` edges. Used as the backbone of COBOL copy-chain analysis.
`.jcl`	Parsed for `Batch`, `INVOKES → Module`, and `ACCESSES → Dataset` edges.
`.xlsx`, `.docx`, `.pptx`	Converted deterministically to Markdown and stored in `data/derived/{world}/md/`. The Markdown derivative is indexed; the original binary is preserved and returned when a user downloads the source.
PDF and legacy binary	Not supported by default. PDF text-layer extraction is available with an optional backend. Scanned images (OCR) are not supported.

Office file conversion produces a derived Markdown copy that lives in data/derived/ and is used only for indexing. Users always download the original binary — the Markdown copy is invisible to end users.
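As a rough illustration of the table above, the mapping from extension to treatment can be written as a lookup. The treatment labels and the `classify` function are hypothetical names chosen for this sketch, and case-insensitive matching is an assumption rather than documented behaviour.

```python
from pathlib import PurePosixPath

# Hypothetical treatment labels; the extensions come from the table above.
TREATMENT = {
    ".md": "index_as_is",
    ".txt": "index_as_is",
    ".cbl": "static_cobol",
    ".cob": "static_cobol",
    ".cobol": "static_cobol",
    ".cpy": "static_copybook",
    ".jcl": "static_jcl",
    ".xlsx": "convert_to_markdown",
    ".docx": "convert_to_markdown",
    ".pptx": "convert_to_markdown",
}


def classify(path: str) -> str:
    """Map a file path to its treatment; anything unlisted is 'unsupported'."""
    ext = PurePosixPath(path).suffix.lower()  # assumption: extensions match case-insensitively
    return TREATMENT.get(ext, "unsupported")
```

In this sketch a PDF falls through to `unsupported`, which mirrors the default behaviour described in the table when no optional PDF backend is configured.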

Registering a world and running your first ingest

Open the document-folder registration screen

In the admin navigation, click 資料フォルダ (Document Folders). This panel lists all currently registered worlds and provides the registration form.

Enter the folder path

Type the absolute path to the folder you want to register. Windows paths (e.g. C:\projects\specs) are accepted — Sherpa resolves them internally as WSL paths (e.g. /mnt/c/projects/specs). The folder must already exist and be readable by the Sherpa process.
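The Windows-to-WSL mapping mentioned above follows the usual `/mnt/<drive letter>/...` convention. A minimal sketch of that conversion is shown here; `windows_to_wsl` is a hypothetical helper, and Sherpa's internal resolution may differ in edge cases such as UNC paths, which this sketch simply rejects.

```python
from pathlib import PureWindowsPath


def windows_to_wsl(path: str) -> str:
    """Convert an absolute drive-letter path like C:\\a\\b to /mnt/c/a/b."""
    p = PureWindowsPath(path)
    drive = p.drive  # e.g. "C:"
    if not p.is_absolute() or len(drive) != 2 or drive[1] != ":":
        raise ValueError(f"not an absolute drive-letter path: {path!r}")
    # p.parts[0] is the anchor ("C:\\"); the remaining parts are the folders.
    return "/".join(["/mnt", drive[0].lower(), *p.parts[1:]])
```

Calling `windows_to_wsl(r"C:\projects\specs")` yields `/mnt/c/projects/specs`, matching the example in the text.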

Submit the registration

Click 登録 (Register) or call POST /worlds with {"path": "/mnt/c/projects/specs"}. Sherpa assigns a world ID derived from the folder name, creates the Neo4j partition and Elasticsearch index, and immediately starts the first ingest run.
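For scripted registration, the `POST /worlds` call can be prepared with the Python standard library. This is a sketch under stated assumptions: the base URL `http://localhost:8000` and the helper name `build_register_request` are placeholders, and any authentication headers your deployment requires are not shown because the source does not describe them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with your Sherpa address


def build_register_request(path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare POST /worlds with a JSON body of the form {"path": ...}."""
    body = json.dumps({"path": path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/worlds",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; because registration immediately starts the first ingest run, it is worth checking the Ingest Status screen afterwards.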

Monitor progress

Switch to the 取り込み状況 (Ingest Status) screen. Each document appears with a status indicator: 使えます (ready) or MD化 (converted from Office). Unsupported files are shown as an aggregate count, not as individual rows.

Verify the graph

Open ナレッジグラフ (Knowledge Graph) to confirm that nodes and edges have been created for your files. If the folder contains COBOL or JCL, structural edges should be visible immediately. Semantic edges appear after the LLM extraction pass completes.

What happens during ingest

Each ingest run executes the following pipeline in order:

File scan and ledger update — Sherpa walks the registered directory and compares file hashes against the document ledger. Each document’s stable identifier (doc_id) is its path relative to the world root.
Office conversion — .xlsx, .docx, and .pptx files are converted to Markdown and written to data/derived/{world}/md/. Existing derived files are overwritten deterministically.
Static analysis (COBOL / JCL) — Source files are parsed to produce structural edges: COPIES, INVOKES, CONTAINS, ACCESSES, PRODUCED_BY, and related edge types. These edges carry extraction_method = "static" and are considered confirmed (●).
LLM semantic layer — Design documents and converted Office Markdown are passed to the configured LLM (OpenAI / Gemini / Ollama) for entity and relationship extraction. This produces business-layer nodes (BusinessRule, Function, Parameter, etc.) and semantic edges. These carry extraction_method = "llm" and are marked as requiring verification (○).
Elasticsearch index rebuild — The full-text index for the world is rebuilt from the current ledger. kuromoji morphological analysis is used for Japanese tokenisation.
Neo4j graph rebuild — The world’s subgraph is deleted and reloaded in a single transaction.

The Neo4j rebuild is atomic per world: all nodes and edges for the world are deleted, then the new graph is inserted in the same transaction. Partially-failed ingests never leave a mixed-state graph. Other worlds are not affected.
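The file scan and ledger comparison at the start of the pipeline can be illustrated as follows. This is a simplified sketch, not Sherpa's code: the function names are hypothetical, SHA-256 is an assumed hash algorithm, and the ledger is modelled as a plain dictionary from `doc_id` to content hash.

```python
import hashlib
from pathlib import Path


def scan_world(root: Path) -> dict:
    """Walk the world root and map each doc_id (relative path) to a content hash."""
    ledger = {}
    for p in sorted(root.rglob("*")):
        if p.is_file():
            doc_id = p.relative_to(root).as_posix()
            ledger[doc_id] = hashlib.sha256(p.read_bytes()).hexdigest()
    return ledger


def diff_ledgers(old: dict, new: dict) -> dict:
    """Compare two ledgers and report added, changed, and removed doc_ids."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
        "removed": sorted(old.keys() - new.keys()),
    }
```

Because the `doc_id` is the path relative to the world root, renaming or moving a file shows up in this sketch as one removal plus one addition rather than as a change.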

Ingest history

To review past runs, call GET /ingest/runs (optionally with ?version={world_id} to filter by world). Each entry includes the world ID, timestamps, final status (auto_published, auto_published_with_flags, or failed), ledger document count, and any warning flags.

GET /ingest/runs?version=my-world
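When reviewing history from a script, it can help to build the query URL and pick out runs that did not publish cleanly. The sketch below assumes the response is a list of objects with a `status` field holding one of the three documented values; that field name, like the helper names, is an assumption and should be checked against the actual response.

```python
from urllib.parse import urlencode


def runs_url(base_url: str, world_id: str = "") -> str:
    """Build the GET /ingest/runs URL, optionally filtered with ?version=<world_id>."""
    query = "?" + urlencode({"version": world_id}) if world_id else ""
    return f"{base_url}/ingest/runs{query}"


def needs_attention(runs: list) -> list:
    """Keep only runs that failed or were published with warning flags."""
    flagged = {"failed", "auto_published_with_flags"}
    return [r for r in runs if r.get("status") in flagged]  # 'status' key is assumed
```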

Re-ingesting a world

Ingest is not automatic by default. If you update files in the registered folder — add, edit, or delete documents — those changes are not reflected until you explicitly trigger a new ingest.

On the 取り込み状況 screen, click 取り込み開始 (Start Ingest) to pick up changes, or やり直す (Re-ingest) to force a full clean rebuild from scratch regardless of change detection.

# Incremental refresh — picks up changes
POST /worlds/{wid}/refresh

# Full clean rebuild
POST /ingest/rerun
Content-Type: application/json

{ "version": "my-world" }
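The two endpoints above differ in shape: the incremental refresh takes the world ID in the URL and no body, while the full rebuild takes it in a JSON body under `version`. A small sketch that prepares either request is shown here; the base URL and the helper name `build_reingest_request` are assumptions for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # assumption: replace with your Sherpa address


def build_reingest_request(world_id: str, full_rebuild: bool = False,
                           base_url: str = BASE_URL) -> urllib.request.Request:
    """Prepare either POST /worlds/{wid}/refresh or POST /ingest/rerun."""
    if full_rebuild:
        body = json.dumps({"version": world_id}).encode("utf-8")
        return urllib.request.Request(
            f"{base_url}/ingest/rerun",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
    # Incremental refresh: world ID goes in the path, no request body.
    url = f"{base_url}/worlds/{quote(world_id, safe='')}/refresh"
    return urllib.request.Request(url, method="POST")
```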

Polling mode: set SHERPA_POLL_SECONDS to a positive integer to enable automatic change detection. When polling is active, Sherpa checks the registered folders at the configured interval and runs an incremental refresh if changes are detected.

# In sherpa.env — poll every 5 minutes
SHERPA_POLL_SECONDS=300

Polling is off by default (SHERPA_POLL_SECONDS=0). Intentional re-ingests via the UI or POST /worlds/{wid}/refresh are a separate mechanism and work regardless of the polling setting.
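The rule "a positive integer enables polling, 0 disables it" can be expressed as a small parsing function. This sketch is illustrative only: `poll_interval` is a hypothetical name, and treating unparsable or negative values as "polling off" is an assumption, since the source does not say how Sherpa handles them.

```python
import os
from typing import Mapping, Optional


def poll_interval(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the polling interval in seconds, or None when polling is off."""
    raw = env.get("SHERPA_POLL_SECONDS", "0").strip()
    try:
        seconds = int(raw)
    except ValueError:
        return None  # assumption: invalid values are treated as disabled
    return seconds if seconds > 0 else None
```

With `SHERPA_POLL_SECONDS=300` this returns `300`; with the variable unset or set to `0` it returns `None`.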

Rebinding a world to a different folder

If you need to point an existing world at a different directory (e.g. after a file-server migration), use POST /worlds/{wid}/rebind with the new path. This operation destroys the existing world’s graph, index, and derived files, then recreates everything from the new folder. The world ID is preserved.

POST /worlds/my-world/rebind
Content-Type: application/json

{ "path": "/mnt/d/new-location/specs" }

rebind is currently API-only. There is no UI button for this operation.

Deleting a world

Deleting a world removes all of its derived artefacts from Sherpa:

The world registration record
The Neo4j subgraph (all nodes and edges scoped to that world)
The Elasticsearch index
The derived Markdown files in data/derived/{world}/
The document ledger entries

The registered external folder itself is never deleted — Sherpa only reads from it.

On the 資料フォルダ screen, click the delete button next to the world you want to remove and confirm the prompt.

DELETE /worlds/{wid}

World deletion is permanent and immediate. The graph, index, and all derived files are wiped in one operation. There is no recycle bin or undo. If you delete a world by mistake, you must re-register the folder and run a full ingest to rebuild everything.

Deletion is also a fail-closed operation: Sherpa writes a pre-deletion audit entry before touching any data. If that audit write fails, the deletion is aborted.
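The fail-closed behaviour can be pictured as "audit first, destroy second". The following sketch shows the general pattern only; `delete_world`, `AuditWriteError`, the audit entry fields, and the two callables are hypothetical stand-ins, not Sherpa's actual interfaces.

```python
class AuditWriteError(Exception):
    """Raised when the pre-deletion audit entry could not be written."""


def delete_world(world_id, write_audit, destroy):
    """Write the audit entry first; only destroy data if that write succeeded."""
    try:
        # Hypothetical entry shape; the real audit format is not documented here.
        write_audit({"event": "world_delete_requested", "world": world_id})
    except Exception as exc:
        # Fail closed: abort before any data has been touched.
        raise AuditWriteError(f"deletion of {world_id!r} aborted") from exc
    destroy(world_id)
```

The key property is ordering: if `write_audit` raises, `destroy` is never called, so a failed audit write cannot result in an unrecorded deletion.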

Sherpa performs automatic orphan cleanup on ingest, deletion, and startup. Any derived files or index partitions that no longer correspond to a registered world are removed automatically — administrators do not need to perform manual housekeeping.
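Conceptually, orphan detection is a set difference between what exists and what is registered. The sketch below assumes, purely for illustration, that derived directories and indexes are named after the world ID; the function name `find_orphans` and that naming scheme are not taken from Sherpa.

```python
def find_orphans(registered, derived_dirs, index_names):
    """Return artefacts whose name does not match any registered world ID."""
    registered = set(registered)
    return {
        "derived_dirs": sorted(set(derived_dirs) - registered),
        "indexes": sorted(set(index_names) - registered),
    }
```

For example, with worlds `specs` and `batch` registered, a leftover `data/derived/old-world/` directory would be reported under `derived_dirs` as `old-world`.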
