Ontology-Driven End-to-End GraphRAG Pipeline in Neo4j

Session 31 (Season 2, Episode 4 — December 2024) assembles all the pieces from the preceding sessions into a single, continuous pipeline: start with an OWL ontology, extract a knowledge graph from documents, build a vector index on top of it, and then answer natural-language questions using graph-augmented retrieval. This is the first session in the series to demonstrate the complete GraphRAG loop — from raw documents all the way to an LLM-generated answer informed by structured graph data.

Watch the Recording

Full live-stream replay on YouTube

Session Code

Python utilities and pipeline scripts

Pipeline Architecture

The end-to-end pipeline consists of four stages that flow from design-time artefacts (the ontology) through run-time retrieval (RAG):

[1] OWL Ontology Design
          │
          ▼
[2] KG Construction from Documents
    (ontology-guided LLM extraction → Neo4j)
          │
          ▼
[3] Vector Index Creation
    (embed node text properties → Neo4j vector index)
          │
          ▼
[4] GraphRAG Retrieval + LLM Answer Generation
    (vector search → graph traversal → LLM)

The `utils.py` Module

Session 31’s utils.py provides the schema translation utilities that connect the ontology layer to the neo4j-graphrag library. It converts an OWL Graph (loaded with RDFLib) into the SchemaConfig objects that neo4j-graphrag’s SimpleKGPipeline and retrieval components understand.

Helper: `getLocalPart()`

Extracts the local name from a full URI by stripping everything up to and including the last #. If the URI contains no #, it falls back to the last /, and then to the last : (as in a prefixed name):

from rdflib import Graph
from rdflib.namespace import RDF, OWL, RDFS
from neo4j_graphrag.experimental.components.schema import (
    SchemaBuilder, SchemaEntity, SchemaProperty, SchemaRelation, SchemaConfig
)

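def getLocalPart(uri):
    # Prefer the fragment separator, e.g. http://example.org/onto#Person
    pos = uri.rfind('#')
    if pos < 0:
        # No fragment: fall back to the last path segment
        pos = uri.rfind('/')
    if pos < 0:
        # Neither '#' nor '/': treat as prefix:name (rindex raises ValueError if ':' is absent)
        pos = uri.rindex(':')
    return uri[pos+1:]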
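
To make the fallback order concrete, the sketch below repeats the helper and runs it on a few made-up identifiers. The example.org URIs and the ex: prefix are assumptions for illustration and are not taken from the session's ontology.

```python
def getLocalPart(uri):
    # Same logic as the helper above, repeated so this snippet runs on its own.
    pos = uri.rfind('#')
    if pos < 0:
        pos = uri.rfind('/')
    if pos < 0:
        pos = uri.rindex(':')
    return uri[pos+1:]

# Hypothetical identifiers, one per separator style.
hash_uri = "http://example.org/onto#Person"
slash_uri = "http://example.org/onto/Person"
prefixed = "ex:Person"

local_names = [getLocalPart(u) for u in (hash_uri, slash_uri, prefixed)]
print(local_names)  # ['Person', 'Person', 'Person']

# A '#' anywhere in the URI takes precedence over a later '/'.
print(getLocalPart("http://example.org/onto#part/of"))  # part/of
```

A bare name with no #, /, or : makes rindex raise ValueError, so callers that may see such values would need to guard for that case.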

Helper: `getPropertiesForClass()`

Collects all owl:DatatypeProperty instances whose rdfs:domain is the given class and returns them as SchemaProperty objects ready for neo4j-graphrag. Every property is typed as STRING, regardless of its declared rdfs:range:

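def getPropertiesForClass(g, cat):
    props = []
    # Every subject with rdfs:domain == cat is a candidate property
    for dtp in g.subjects(RDFS.domain, cat):
        # Keep datatype properties only; object properties are handled as relations
        if (dtp, RDF.type, OWL.DatatypeProperty) in g:
            propName = getLocalPart(dtp)
            # First rdfs:comment, or an empty string when none is declared
            propDesc = next(g.objects(dtp, RDFS.comment), "")
            props.append(SchemaProperty(name=propName, type="STRING", description=propDesc))
    return props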

Core Function: `getSchemaFromOnto()`

This is the primary export. It collects every declared owl:Class, every class implied by an rdfs:domain or rdfs:range declaration, and every owl:ObjectProperty, then assembles them into a SchemaConfig that can be passed directly into SimpleKGPipeline:

def getSchemaFromOnto(g) -> SchemaConfig:
    schema_builder = SchemaBuilder()
    classes = {}
    entities = []
    rels = []
    triples = []

    # Collect explicitly declared OWL classes
    for cat in g.subjects(RDF.type, OWL.Class):
        classes[cat] = None
        label = getLocalPart(cat)
        props = getPropertiesForClass(g, cat)
        entities.append(SchemaEntity(
            label=label,
            description=next(g.objects(cat, RDFS.comment), ""),
            properties=props
        ))

    # Also include classes implied by rdfs:domain declarations
    for cat in g.objects(None, RDFS.domain):
        if cat not in classes:
            classes[cat] = None
            label = getLocalPart(cat)
            props = getPropertiesForClass(g, cat)
            entities.append(SchemaEntity(
                label=label,
                description=next(g.objects(cat, RDFS.comment), ""),
                properties=props
            ))

    # Include classes implied by rdfs:range declarations (excluding XSD types)
    for cat in g.objects(None, RDFS.range):
        if not (cat.startswith("http://www.w3.org/2001/XMLSchema#") or cat in classes):
            classes[cat] = None
            label = getLocalPart(cat)
            props = getPropertiesForClass(g, cat)
            entities.append(SchemaEntity(
                label=label,
                description=next(g.objects(cat, RDFS.comment), ""),
                properties=props
            ))

    # Collect object properties as SchemaRelation objects
    for op in g.subjects(RDF.type, OWL.ObjectProperty):
        relname = getLocalPart(op)
        rels.append(SchemaRelation(
            label=relname,
            properties=[],
            description=next(g.objects(op, RDFS.comment), "")
        ))

    # Build the potential_schema triples: (domain_label, rel_label, range_label)
    for op in g.subjects(RDF.type, OWL.ObjectProperty):
        relname = getLocalPart(op)
        doms = [getLocalPart(dom) for dom in g.objects(op, RDFS.domain) if dom in classes]
        rans = [getLocalPart(ran) for ran in g.objects(op, RDFS.range) if ran in classes]
        for d in doms:
            for r in rans:
                triples.append((d, relname, r))

    return schema_builder.create_schema_model(
        entities=entities,
        relations=rels,
        potential_schema=triples
    )
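
The final loop emits one triple for every (domain, range) combination of each object property. A minimal, library-free sketch of that cross product follows; the worksFor and locatedIn properties and their class labels are invented for illustration.

```python
from itertools import product

# Hypothetical object properties mapped to their domain and range labels.
object_properties = {
    "worksFor": {"domains": ["Person", "Contractor"], "ranges": ["Organization"]},
    "locatedIn": {"domains": ["Organization"], "ranges": ["City", "Country"]},
}

triples = []
for relname, spec in object_properties.items():
    # One (domain, relation, range) triple per combination, as in getSchemaFromOnto().
    for d, r in product(spec["domains"], spec["ranges"]):
        triples.append((d, relname, r))

for t in triples:
    print(t)
```

A property with no declared domain or no declared range contributes no triples, because one side of the product is empty.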

Helper: `getPKs()`

Returns the local names of all properties declared as owl:InverseFunctionalProperty — these are the natural primary keys that can be used as merge keys during ingestion:

def getPKs(g):
    keys = []
    for k in g.subjects(RDF.type, OWL.InverseFunctionalProperty):
        keys.append(getLocalPart(k))
    return keys

owl:InverseFunctionalProperty is the OWL idiom for “this property value uniquely identifies the subject” — directly analogous to a primary key. Declaring identifier properties this way in your ontology gives the pipeline a principled basis for MERGE keys.
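
One way the keys could be used is to build a parameterised MERGE statement per label. The helper below, build_merge_query, is hypothetical and not part of the session's utils.py; it only sketches the idea of merging on the key property and setting the remaining properties afterwards.

```python
def build_merge_query(label, key):
    # Hypothetical helper: merge on the key property, then set all other properties.
    # Backticks quote the label and property name in the generated Cypher.
    return (
        f"MERGE (n:`{label}` {{`{key}`: $key_value}}) "
        "SET n += $props"
    )

# Assumed example: 'email' was returned by getPKs() and applies to Person nodes.
query = build_merge_query("Person", "email")
print(query)
# MERGE (n:`Person` {`email`: $key_value}) SET n += $props
```

getPKs() returns property names only, so pairing a key with a label (for example through the property's rdfs:domain) is left to the caller.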

Using the Schema in `SimpleKGPipeline`

Once getSchemaFromOnto() has produced a SchemaConfig, it feeds directly into neo4j-graphrag’s SimpleKGPipeline for ontology-constrained KG construction:

from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.llm.openai_llm import OpenAILLM
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
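from rdflib import Graph
from neo4j import GraphDatabase

# Placeholder connection details (not from the session); substitute your own
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# Load the ontology and derive the neo4j-graphrag schema from it
g = Graph()
g.parse("ontologies/domain.ttl")
neo4j_schema = getSchemaFromOnto(g)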

splitter = FixedSizeSplitter(chunk_size=2500, chunk_overlap=10)
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "max_tokens": 3000,
        "response_format": {"type": "json_object"},
        "temperature": 0,
    },
)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    text_splitter=splitter,
    embedder=embedder,
    entities=list(neo4j_schema.entities.values()),
    relations=list(neo4j_schema.relations.values()),
    potential_schema=neo4j_schema.potential_schema,
    on_error="IGNORE",
    from_pdf=False,
)

End-to-End Flow Summary

Design the OWL ontology

Author an OWL/Turtle ontology covering your domain. Declare key properties as owl:InverseFunctionalProperty to enable deterministic MERGE keys.

Convert ontology to GraphRAG schema

Call getSchemaFromOnto(g) to convert the RDFLib graph into SchemaEntity, SchemaRelation, and potential_schema triples for neo4j-graphrag.

Run SimpleKGPipeline

Feed the schema into SimpleKGPipeline along with an LLM, embedder, and text splitter. The pipeline chunks your documents, extracts entities and relationships, and writes them to Neo4j.

Query with GraphRAG retrieval

Use the populated knowledge graph and its vector index to answer natural-language questions. The retriever walks the graph from semantically similar entry points, enriching the LLM context with structured facts.
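
The retrieval step can be pictured as two moves: rank nodes by embedding similarity to the question, then expand from the best matches along their relationships. The toy below does this over in-memory dictionaries with hand-written 2-dimensional vectors. The data, the cosine helper, and the retrieve function are all invented for illustration; the sketch does not use the neo4j-graphrag retriever classes or a Neo4j vector index.

```python
import math

# Hypothetical node embeddings (real ones have hundreds of dimensions).
embeddings = {
    "Policy-123": [1.0, 0.0],
    "Claim-9": [0.8, 0.6],
    "Office-Paris": [0.0, 1.0],
}

# Hypothetical relationships stored as (source, type, target).
edges = [
    ("Policy-123", "COVERS", "Customer-A"),
    ("Claim-9", "FILED_UNDER", "Policy-123"),
    ("Office-Paris", "LOCATED_IN", "France"),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(question_vector, top_k=1):
    # Step 1: vector search picks the entry points.
    ranked = sorted(
        embeddings,
        key=lambda n: cosine(embeddings[n], question_vector),
        reverse=True,
    )
    entry_points = ranked[:top_k]
    # Step 2: graph traversal collects facts touching those entry points.
    facts = [e for e in edges if e[0] in entry_points or e[2] in entry_points]
    return entry_points, facts

entry_points, facts = retrieve([0.9, 0.1])
# The joined facts are what would be placed in the LLM prompt as context.
context = "\n".join(f"{s} -{t}-> {o}" for s, t, o in facts)
print(context)
```

Here the question vector is closest to Policy-123, so the context contains the two relationships touching that node and nothing about the unrelated office.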

Key Design Decisions

Ontology-driven schema

Using getSchemaFromOnto() means the SimpleKGPipeline schema is always derived from the ontology — there is one source of truth and no manual schema transcription.

Chunk size tuning

FixedSizeSplitter(chunk_size=2500, chunk_overlap=10) balances context window usage against extraction completeness. Larger chunks capture more entity co-occurrences but cost more tokens.
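
To see what these two numbers mean in practice, here is a simplified sketch of fixed-size chunking with overlap. It assumes sizes are counted in characters and is an approximation of the idea, not the library's FixedSizeSplitter code; fixed_size_chunks is a hypothetical name.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    # Simplified sketch: each chunk starts chunk_size - chunk_overlap characters
    # after the previous one, so neighbours share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

document = "x" * 10_000  # stand-in for a 10,000-character document
chunks = fixed_size_chunks(document, chunk_size=2500, chunk_overlap=10)
print(len(chunks))                # 5
print([len(c) for c in chunks])   # [2500, 2500, 2500, 2500, 40]
```

Under these assumptions a 10,000-character document costs five extraction calls, and halving chunk_size would roughly double that.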

JSON response format

Setting "response_format": {"type": "json_object"} in model_params asks the model to return a JSON object, so SimpleKGPipeline can parse the reply directly instead of relying on brittle string manipulation.
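
The benefit shows up on the parsing side. The sketch below parses a hand-written reply with the standard json module; the nodes/relationships layout and the sample values are an assumed shape for illustration, not output captured from the session.

```python
import json

# Hypothetical LLM reply; the "nodes"/"relationships" layout is an assumed shape.
raw_reply = '''
{
  "nodes": [
    {"id": "0", "label": "Person", "properties": {"name": "Ada"}},
    {"id": "1", "label": "Organization", "properties": {"name": "Acme"}}
  ],
  "relationships": [
    {"type": "worksFor", "start_node_id": "0", "end_node_id": "1"}
  ]
}
'''

extracted = json.loads(raw_reply)  # raises json.JSONDecodeError on malformed output
labels = [node["label"] for node in extracted["nodes"]]
rel_types = [rel["type"] for rel in extracted["relationships"]]
print(labels, rel_types)  # ['Person', 'Organization'] ['worksFor']
```

Valid JSON is not the same as schema-conformant JSON: the labels and relationship types still have to be checked against the ontology-derived schema.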

`on_error="IGNORE"`

Extraction errors for individual chunks are swallowed rather than aborting the full pipeline — appropriate for large document collections where occasional failures are acceptable.
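
The trade-off can be sketched as a per-chunk try/except. The extract and run_extraction functions below are hypothetical stand-ins that illustrate the pattern; they are not the library's internal error handling.

```python
def extract(chunk):
    # Hypothetical stand-in for an LLM extraction call that fails on some input.
    if "corrupt" in chunk:
        raise ValueError(f"could not parse extraction for chunk: {chunk!r}")
    return {"chunk": chunk, "entities": chunk.split()}

def run_extraction(chunks, on_error="IGNORE"):
    results, failures = [], []
    for chunk in chunks:
        try:
            results.append(extract(chunk))
        except ValueError as err:
            if on_error != "IGNORE":
                raise  # abort the whole run on the first failure
            failures.append(str(err))  # record the failure and keep going
    return results, failures

results, failures = run_extraction(["Ada Acme", "corrupt bytes", "Bob Initech"])
print(len(results), len(failures))  # 2 1
```

Keeping a record of the skipped chunks, as the failures list does here, makes it possible to judge afterwards whether the number of ignored errors really was acceptable.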

Session 32 extends this pattern to heterogeneous data sources — PDFs, CSVs, and CRM exports — all unified under a single insurance/sales ontology.
