The Ingest screen is where administrators register document folders and kick off the pipeline that transforms raw files into a searchable knowledge graph. Sherpa models its storage as a mirror of the filesystem: the structure you place on disk is the structure that becomes searchable. Nothing is reorganised automatically, and nothing changes until you explicitly run an ingest. All ingest operations require the admin role when authentication is enabled.
What a world is
Every registered folder is called a world. A world is a one-to-one mapping between a directory on disk and a pair of data stores:
- One Neo4j partition: the knowledge graph for that folder tree, holding all structural and semantic edges extracted from the files inside.
- One Elasticsearch index: the full-text search index for the same scope.
Supported file types
| Extension | Treatment |
|---|---|
| .md, .txt and plain-text source | Indexed as-is into Elasticsearch. No conversion. |
| .cbl, .cob, .cobol | Parsed by static analyser. Structural edges (COPIES, INVOKES, ACCESSES, etc.) extracted. Raw source is preserved. |
| .cpy (copybooks) | Parsed for CONTAINS → DataItem edges. Used as the backbone of COBOL copy-chain analysis. |
| .jcl | Parsed for Batch, INVOKES → Module, and ACCESSES → Dataset edges. |
| .xlsx, .docx, .pptx | Converted deterministically to Markdown and stored in data/derived/{world}/md/. The Markdown derivative is indexed; the original binary is preserved and returned when a user downloads the source. |
| PDF and legacy binary | Not supported by default. PDF text-layer extraction is available with an optional backend. Scanned images (OCR) are not supported. |
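As a rough illustration of the table above, the extension-to-treatment mapping can be written as a simple lookup. This is a sketch, not Sherpa's implementation: the function name `classify`, the treatment labels, and the case-insensitive matching are all assumptions made for the example.

```python
from pathlib import PurePath

# Hypothetical treatment labels; only the extension groups come from the table.
TREATMENTS = {
    ".md": "index_as_is",
    ".txt": "index_as_is",
    ".cbl": "cobol_static_analysis",
    ".cob": "cobol_static_analysis",
    ".cobol": "cobol_static_analysis",
    ".cpy": "copybook_analysis",
    ".jcl": "jcl_analysis",
    ".xlsx": "convert_to_markdown",
    ".docx": "convert_to_markdown",
    ".pptx": "convert_to_markdown",
}


def classify(filename: str) -> str:
    """Return the treatment label for a file, or 'unsupported'."""
    # Assumption: extensions are matched case-insensitively.
    suffix = PurePath(filename).suffix.lower()
    return TREATMENTS.get(suffix, "unsupported")
```

With no optional PDF backend configured, a file such as `scan.pdf` falls through to `"unsupported"` in this sketch, which matches the default described in the last table row.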
Office file conversion produces a derived Markdown copy that lives in data/derived/ and is used only for indexing. Users always download the original binary; the Markdown copy is invisible to end users.
Registering a world and running your first ingest
Open the document-folder registration screen
In the admin navigation, click 資料フォルダ (Document Folders). This panel lists all currently registered worlds and provides the registration form.
Enter the folder path
Type the absolute path to the folder you want to register. Windows paths (e.g. C:\projects\specs) are accepted; Sherpa resolves them internally as WSL paths (e.g. /mnt/c/projects/specs). The folder must already exist and be readable by the Sherpa process.
Submit the registration
Click 登録 (Register) or call POST /worlds with {"path": "/mnt/c/projects/specs"}. Sherpa assigns a world ID derived from the folder name, creates the Neo4j partition and Elasticsearch index, and immediately starts the first ingest run.
Monitor progress
Switch to the 取り込み状況 (Ingest Status) screen. Each document appears with a status indicator — 使えます (ready) or MD化 (converted from Office). Files that are unsupported are shown as an aggregate count, not individual rows.
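The path handling and registration call from the steps above can be sketched in a few lines. The drive-letter mapping mirrors the example given (C:\projects\specs to /mnt/c/projects/specs), but Sherpa's actual resolution rules are not documented here. The helper names, the base URL, and the absence of authentication headers are assumptions for illustration.

```python
import json
import re
import urllib.request


def to_wsl_path(path: str) -> str:
    """Map a Windows drive path such as C:\\projects\\specs to /mnt/c/projects/specs."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not match:
        return path  # not a drive-letter path; leave it untouched
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")


def build_register_request(base_url: str, folder: str) -> urllib.request.Request:
    """Build (but do not send) the POST /worlds registration request."""
    body = json.dumps({"path": to_wsl_path(folder)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/worlds",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Because Sherpa is described as accepting Windows paths directly, the client-side conversion is optional and is shown only to make the mapping concrete.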
What happens during ingest
Each ingest run executes the following pipeline in order:
- File scan and ledger update: Sherpa walks the registered directory and compares file hashes against the document ledger. Each document’s stable identifier (doc_id) is its path relative to the world root.
- Office conversion: .xlsx, .docx, and .pptx files are converted to Markdown and written to data/derived/{world}/md/. Existing derived files are overwritten deterministically.
- Static analysis (COBOL / JCL): Source files are parsed to produce structural edges (COPIES, INVOKES, CONTAINS, ACCESSES, PRODUCED_BY, and related edge types). These edges carry extraction_method = "static" and are considered confirmed (●).
- LLM semantic layer: Design documents and converted Office Markdown are passed to the configured LLM (OpenAI / Gemini / Ollama) for entity and relationship extraction. This produces business-layer nodes (BusinessRule, Function, Parameter, etc.) and semantic edges. These carry extraction_method = "llm" and are marked as requiring verification (○).
- Elasticsearch index rebuild: The full-text index for the world is rebuilt from the current ledger. kuromoji morphological analysis is used for Japanese tokenisation.
- Neo4j graph rebuild: The world’s subgraph is deleted and reloaded in a single transaction.
The Neo4j rebuild is atomic per world: all nodes and edges for the world are deleted, then the new graph is inserted in the same transaction. Partially-failed ingests never leave a mixed-state graph. Other worlds are not affected.
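The file scan and ledger update step of the pipeline above can be sketched as a directory walk that keys each file by its path relative to the world root and compares hashes between scans. The ledger is modelled here as a plain dictionary and the hash is SHA-256. Both are assumptions, since the source does not say how Sherpa stores the ledger or which hash it uses.

```python
import hashlib
from pathlib import Path


def scan_world(root: Path) -> dict[str, str]:
    """Map each file's doc_id (path relative to the world root) to a content hash."""
    ledger = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            doc_id = path.relative_to(root).as_posix()
            # Assumption: SHA-256 over the raw file bytes.
            ledger[doc_id] = hashlib.sha256(path.read_bytes()).hexdigest()
    return ledger


def diff_ledgers(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify doc_ids as added, changed, or removed between two scans."""
    return {
        "added": sorted(k for k in new if k not in old),
        "changed": sorted(k for k in new if k in old and old[k] != new[k]),
        "removed": sorted(k for k in old if k not in new),
    }
```

Because the doc_id is a relative path, moving or renaming a file shows up in this sketch as one removal plus one addition rather than as a change.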
Ingest history
To review past runs, call GET /ingest/runs (optionally with ?version={world_id} to filter by world). Each entry includes the world ID, timestamps, final status (auto_published, auto_published_with_flags, or failed), ledger document count, and any warning flags.
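A small helper can build the history URL and pick out failed runs from a decoded response. This is a sketch under stated assumptions: the base URL is a placeholder, and the response is assumed to be a JSON list of objects with a `status` key. The source lists the fields each entry carries but not their exact names.

```python
from typing import Optional
from urllib.parse import urlencode


def ingest_runs_url(base_url: str, world_id: Optional[str] = None) -> str:
    """Return the GET /ingest/runs URL, optionally filtered by world."""
    url = f"{base_url}/ingest/runs"
    if world_id is not None:
        # The filter parameter is named "version" in the documentation.
        url += "?" + urlencode({"version": world_id})
    return url


def failed_runs(runs: list[dict]) -> list[dict]:
    """Keep only entries whose (assumed) 'status' field is 'failed'."""
    return [run for run in runs if run.get("status") == "failed"]
```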
Re-ingesting a world
Ingest is not automatic by default. If you update files in the registered folder (add, edit, or delete documents), those changes are not reflected until you explicitly trigger a new ingest.
- UI: On the 取り込み状況 (Ingest Status) screen, click 取り込み開始 (Start Ingest) to pick up changes, or やり直す (Re-ingest) to force a full clean rebuild from scratch regardless of change detection.
- API: Call POST /worlds/{wid}/refresh.
Set SHERPA_POLL_SECONDS to a positive integer to enable automatic change detection. When polling is active, Sherpa checks the registered folders at the configured interval and runs an incremental refresh if changes are detected.
Polling is disabled by default (SHERPA_POLL_SECONDS=0). Intentional re-ingests via the UI or POST /worlds/{wid}/refresh are a separate mechanism and work regardless of the polling setting.
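The polling rule can be sketched as a function that reads the environment variable and returns an interval only when it is a positive integer. Treating an unset or unparsable value the same as 0 is an assumption, and the function names are hypothetical. Only SHERPA_POLL_SECONDS and the refresh path come from the documentation.

```python
import os
from typing import Mapping, Optional


def poll_interval(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the polling interval in seconds, or None when polling is off."""
    raw = env.get("SHERPA_POLL_SECONDS", "0")
    try:
        seconds = int(raw)
    except ValueError:
        return None  # assumption: unparsable values disable polling
    return seconds if seconds > 0 else None


def refresh_path(world_id: str) -> str:
    """Path of the manual re-ingest endpoint, independent of polling."""
    return f"/worlds/{world_id}/refresh"
```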
Rebinding a world to a different folder
If you need to point an existing world at a different directory (e.g. after a file-server migration), use POST /worlds/{wid}/rebind with the new path. This operation destroys the existing world’s graph, index, and derived files, then recreates everything from the new folder. The world ID is preserved.
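Since rebinding is destructive, a client wrapper might require an explicit confirmation before building the request. The sketch below assumes the request body has the same {"path": ...} shape as POST /worlds, which the source does not confirm. The `confirm` flag, the function name, and the base URL are hypothetical.

```python
import json
import urllib.request


def build_rebind_request(
    base_url: str, world_id: str, new_path: str, confirm: bool = False
) -> urllib.request.Request:
    """Build (but do not send) a POST /worlds/{wid}/rebind request."""
    if not confirm:
        # Client-side guard only; it is not part of the Sherpa API.
        raise ValueError("rebind destroys the graph, index, and derived files; pass confirm=True")
    body = json.dumps({"path": new_path}).encode("utf-8")  # assumed body shape
    return urllib.request.Request(
        f"{base_url}/worlds/{world_id}/rebind",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```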
rebind is currently API-only. There is no UI button for this operation.
Deleting a world
Deleting a world removes all of its derived artefacts from Sherpa:
- The world registration record
- The Neo4j subgraph (all nodes and edges scoped to that world)
- The Elasticsearch index
- The derived Markdown files in data/derived/{world}/
- The document ledger entries
To delete a world from the UI, open the 資料フォルダ (Document Folders) screen, click the delete button next to the world you want to remove, and confirm the prompt.