Neo4j: Knowledge Graph Storage for Medical Entities

Neo4j is the graph database that stores the entity and relationship knowledge graph built from medical documents. It powers the relational channel of the agentic pipeline, enabling SPO (Subject–Predicate–Object) triple-based retrieval over structured clinical knowledge. By persisting named entities and their pairwise connections, Neo4j gives the system a dedicated store for structured medical reasoning — separate from, but complementary to, the dense vector index used by the semantic channel.

Role in the System

The agentic pipeline interacts with Neo4j at two distinct stages of the data lifecycle: document indexing and multi-hop retrieval.

During indexing, an LLM processes each document chunk and extracts named entities together with their pairwise relationships. Those entities and relationships are written directly into Neo4j, building a structured knowledge graph that grows with every ingested document.

During retrieval, the relational channel issues SPO triple queries against Neo4j. Each query encodes a (Subject, Predicate, Object) pattern that the graph database resolves by traversing stored nodes and edges. The matching triples are then summarised and combined with the semantic channel's output before the final answer is synthesised.
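To make the retrieval step more concrete, here is a minimal sketch of how an SPO pattern could be turned into a parameterised Cypher query. It is an illustration only, not the pipeline's or LightRAG's actual query code: the `TriplePattern` class, the `build_cypher` function, the `name` node property, and the `predicate` relationship property are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TriplePattern:
    """A (Subject, Predicate, Object) pattern; None is a wildcard."""
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None


def build_cypher(pattern: TriplePattern, limit: int = 25) -> tuple[str, dict]:
    """Return a parameterised Cypher query and its parameters.

    Assumed (hypothetical) schema: nodes have a `name` property and
    relationships have a `predicate` property.
    """
    conditions = []
    params = {"limit": limit}
    if pattern.subject is not None:
        conditions.append("s.name = $subject")
        params["subject"] = pattern.subject
    if pattern.predicate is not None:
        conditions.append("r.predicate = $predicate")
        params["predicate"] = pattern.predicate
    if pattern.obj is not None:
        conditions.append("o.name = $object")
        params["object"] = pattern.obj

    # Only emit a WHERE clause when at least one position is bound.
    where = f"WHERE {' AND '.join(conditions)} " if conditions else ""
    query = (
        "MATCH (s)-[r]->(o) "
        f"{where}"
        "RETURN s.name AS subject, r.predicate AS predicate, o.name AS object "
        "LIMIT $limit"
    )
    return query, params
```

Passing values as query parameters rather than formatting them into the query string keeps LLM-extracted entity names from being interpreted as Cypher. For example, `build_cypher(TriplePattern(subject="metformin", predicate="treats"))` leaves the object position open, so the query would return every stored object linked to that subject by that predicate.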

Only LightRAG uses Neo4j as its graph storage backend. MiniRAG, PathRAG, and HyperGraphRAG store their entity and relationship graphs locally in the working directory and do not require a running Neo4j instance.

Setup

The quickest way to get Neo4j running locally is with Docker.

Start the Neo4j container

The command below pulls the image if needed and starts a Neo4j instance in the foreground, setting the initial credentials to neo4j/password through NEO4J_AUTH. It publishes the standard Neo4j ports: 7474 (HTTP, used by the Neo4j Browser) and 7687 (Bolt protocol):

docker run --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:latest

Verify the instance

Once the container is running, you can access the Neo4j Browser at:

http://localhost:7474

The Bolt protocol — used by the application driver — is exposed on:

bolt://localhost:7687
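If you prefer to check the Bolt endpoint from a script rather than the browser, a rough sketch using only the Python standard library could look like the following. The helper name `bolt_port_reachable` is hypothetical, and the check only confirms that something accepts TCP connections on the port; it does not speak the Bolt protocol or validate credentials.

```python
import socket
from urllib.parse import urlparse


def bolt_port_reachable(uri: str = "bolt://localhost:7687",
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URI's host and port succeeds.

    This is a reachability check only: it does not perform a Bolt
    handshake and does not test the username or password.
    """
    parsed = urlparse(uri)
    host = parsed.hostname or "localhost"
    port = parsed.port or 7687  # fall back to the standard Bolt port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

Calling `bolt_port_reachable()` with no arguments checks the address shown above. For a fuller check that also exercises authentication, the official Neo4j Python driver provides `verify_connectivity()` on its driver object.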

For production deployments, consider using the official Neo4j Helm chart for Kubernetes or the managed Neo4j AuraDB cloud service.

Configuration

The pipeline connects to Neo4j using environment variables for the Bolt URI, username, and password. The table below shows typical variable names. Set them in your shell or in a .env file that is excluded from version control.

Variable	Description	Example
`NEO4J_URI`	Bolt connection URI for the Neo4j instance	`bolt://localhost:7687`
`NEO4J_USERNAME`	Neo4j database username	`neo4j`
`NEO4J_PASSWORD`	Neo4j database password	`password`

Do not hardcode Neo4j credentials in your source code. Always supply them through environment variables or a dedicated secrets manager such as AWS Secrets Manager or HashiCorp Vault.
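As an illustration of reading these settings at startup, the sketch below loads the three variables from the table and fails early if any is missing. The `Neo4jSettings` class and the `load_neo4j_settings` function are hypothetical names chosen for this example; the pipeline's own configuration code may be organised differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


@dataclass(frozen=True)
class Neo4jSettings:
    uri: str
    username: str
    password: str


def load_neo4j_settings(env: Mapping[str, str] = os.environ) -> Neo4jSettings:
    """Build settings from environment variables, failing fast if any is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Report variable names only; never echo secret values.
        raise RuntimeError("Missing Neo4j settings: " + ", ".join(missing))
    return Neo4jSettings(
        uri=env["NEO4J_URI"],
        username=env["NEO4J_USERNAME"],
        password=env["NEO4J_PASSWORD"],
    )
```

Accepting the environment as a parameter makes the loader easy to exercise with a plain dictionary in tests, while the default still reads from the real process environment.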

Data Model

The Neo4j graph used by the agentic pipeline follows a straightforward entity–relationship schema derived directly from the LLM extraction step.

Nodes represent named entities that the LLM extracts from medical documents during indexing. Each node corresponds to one medical concept identified in the source text, such as a disease, drug, symptom, or biomarker.

Relationships represent pairwise connections between entities: the edges that make the graph navigable. The LLM extracts them directly from document text, and each one encodes the clinical or biological link between two named entities.

Vector index: alongside the graph structure, LightRAG stores dense embeddings in a vector index. These embeddings are kept together with the raw document chunks and are used by the semantic retrieval channel during hybrid search, so a single indexing pass makes both structured and unstructured knowledge available.
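The node and relationship parts of this model can be sketched as two small data classes plus a helper that produces an idempotent write statement. Everything named here is an assumption for illustration: the `Entity` label, the `RELATED_TO` relationship type, the `name`, `kind`, and `predicate` properties, and the `merge_statement` helper are not taken from LightRAG's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "disease", "drug", "symptom", "biomarker"


@dataclass(frozen=True)
class Relationship:
    source: Entity
    target: Entity
    predicate: str  # the clinical or biological link, e.g. "treats"


def merge_statement(rel: Relationship) -> tuple[str, dict]:
    """Return a parameterised Cypher statement that upserts one relationship.

    MERGE matches an existing node or edge, or creates it if absent, so
    re-ingesting the same document does not duplicate graph elements.
    Labels and property names are hypothetical.
    """
    query = (
        "MERGE (s:Entity {name: $source_name}) "
        "ON CREATE SET s.kind = $source_kind "
        "MERGE (o:Entity {name: $target_name}) "
        "ON CREATE SET o.kind = $target_kind "
        "MERGE (s)-[:RELATED_TO {predicate: $predicate}]->(o)"
    )
    params = {
        "source_name": rel.source.name,
        "source_kind": rel.source.kind,
        "target_name": rel.target.name,
        "target_kind": rel.target.kind,
        "predicate": rel.predicate,
    }
    return query, params
```

For instance, an extracted statement that metformin treats type 2 diabetes would become `Relationship(Entity("metformin", "drug"), Entity("type 2 diabetes", "disease"), "treats")`, and the resulting query and parameters could then be passed to a driver session for execution.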

If you are using a backend other than LightRAG (i.e., MiniRAG, PathRAG, or HyperGraphRAG), Neo4j is not required. Those backends persist their graphs locally and have no external graph database dependency.
