Session 31 (Season 2, Episode 4 — December 2024) assembles all the pieces from the preceding sessions into a single, continuous pipeline: start with an OWL ontology, extract a knowledge graph from documents, build a vector index on top of it, and then answer natural-language questions using graph-augmented retrieval. This is the first session in the series to demonstrate the complete GraphRAG loop — from raw documents all the way to an LLM-generated answer informed by structured graph data.

Watch the Recording

Full live-stream replay on YouTube

Session Code

Python utilities and pipeline scripts

Pipeline Architecture

The end-to-end pipeline consists of four stages that flow from design-time artefacts (the ontology) through run-time retrieval (RAG):
[1] OWL Ontology Design
        ↓
[2] KG Construction from Documents
    (ontology-guided LLM extraction → Neo4j)
        ↓
[3] Vector Index Creation
    (embed node text properties → Neo4j vector index)
        ↓
[4] GraphRAG Retrieval + LLM Answer Generation
    (vector search → graph traversal → LLM)

The utils.py Module

Session 31’s utils.py provides the schema translation utilities that connect the ontology layer to the neo4j-graphrag library. It converts an OWL Graph (loaded with RDFLib) into the SchemaConfig objects that neo4j-graphrag’s SimpleKGPipeline and retrieval components understand.

Helper: getLocalPart()

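Extracts the local name from a full URI by stripping everything up to and including a separator: the last #, or, if there is none, the last /, or, failing that, the last : character. A string containing none of these separators raises ValueError (from str.rindex):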
from rdflib import Graph
from rdflib.namespace import RDF, OWL, RDFS
from neo4j_graphrag.experimental.components.schema import (
    SchemaBuilder, SchemaEntity, SchemaProperty, SchemaRelation, SchemaConfig
)

def getLocalPart(uri):
    pos = uri.rfind('#')
    if pos < 0:
        pos = uri.rfind('/')
    if pos < 0:
        pos = uri.rindex(':')
    return uri[pos+1:]
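
A quick check of the fallback order, as a minimal sketch. The function is repeated so the snippet runs on its own, and the example URIs are made up for illustration.

```python
def getLocalPart(uri):
    # Same logic as utils.py: prefer '#', then '/', then ':'
    pos = uri.rfind('#')
    if pos < 0:
        pos = uri.rfind('/')
    if pos < 0:
        pos = uri.rindex(':')
    return uri[pos+1:]

# Hypothetical URIs, one per separator style
examples = [
    "http://example.org/onto#Person",   # hash namespace
    "http://example.org/onto/Policy",   # slash namespace
    "urn:example:Claim",                # no '#' or '/', falls back to ':'
]
local_names = [getLocalPart(u) for u in examples]
print(local_names)
```

Because '#' is checked first, a URI such as http://example.org/onto#a/b yields a/b rather than b.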

Helper: getPropertiesForClass()

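Collects all owl:DatatypeProperty instances whose rdfs:domain is the given class and returns them as SchemaProperty objects ready for neo4j-graphrag. Every property is typed as STRING, whatever its rdfs:range, and its rdfs:comment (if present) becomes the description: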
def getPropertiesForClass(g, cat):
    props = []
    for dtp in g.subjects(RDFS.domain, cat):
        if (dtp, RDF.type, OWL.DatatypeProperty) in g:
            propName = getLocalPart(dtp)
            propDesc = next(g.objects(dtp, RDFS.comment), "")
            props.append(SchemaProperty(name=propName, type="STRING", description=propDesc))
    return props

Core Function: getSchemaFromOnto()

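This is the primary export. It gathers every class that is declared as owl:Class or implied by an rdfs:domain or rdfs:range statement (XSD datatypes excluded), turns each owl:ObjectProperty into a relation, and assembles a SchemaConfig whose entities, relations, and potential_schema feed SimpleKGPipeline: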
def getSchemaFromOnto(g) -> SchemaConfig:
    schema_builder = SchemaBuilder()
    classes = {}
    entities = []
    rels = []
    triples = []

    # Collect explicitly declared OWL classes
    for cat in g.subjects(RDF.type, OWL.Class):
        classes[cat] = None
        label = getLocalPart(cat)
        props = getPropertiesForClass(g, cat)
        entities.append(SchemaEntity(
            label=label,
            description=next(g.objects(cat, RDFS.comment), ""),
            properties=props
        ))

    # Also include classes implied by rdfs:domain declarations
    for cat in g.objects(None, RDFS.domain):
        if cat not in classes:
            classes[cat] = None
            label = getLocalPart(cat)
            props = getPropertiesForClass(g, cat)
            entities.append(SchemaEntity(
                label=label,
                description=next(g.objects(cat, RDFS.comment), ""),
                properties=props
            ))

    # Include classes implied by rdfs:range declarations (excluding XSD types)
    for cat in g.objects(None, RDFS.range):
        if not (cat.startswith("http://www.w3.org/2001/XMLSchema#") or cat in classes):
            classes[cat] = None
            label = getLocalPart(cat)
            props = getPropertiesForClass(g, cat)
            entities.append(SchemaEntity(
                label=label,
                description=next(g.objects(cat, RDFS.comment), ""),
                properties=props
            ))

    # Collect object properties as SchemaRelation objects
    for op in g.subjects(RDF.type, OWL.ObjectProperty):
        relname = getLocalPart(op)
        rels.append(SchemaRelation(
            label=relname,
            properties=[],
            description=next(g.objects(op, RDFS.comment), "")
        ))

    # Build the potential_schema triples: (domain_label, rel_label, range_label)
    for op in g.subjects(RDF.type, OWL.ObjectProperty):
        relname = getLocalPart(op)
        doms = [getLocalPart(dom) for dom in g.objects(op, RDFS.domain) if dom in classes]
        rans = [getLocalPart(ran) for ran in g.objects(op, RDFS.range) if ran in classes]
        for d in doms:
            for r in rans:
                triples.append((d, relname, r))

    return schema_builder.create_schema_model(
        entities=entities,
        relations=rels,
        potential_schema=triples
    )
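
The triple-building step amounts to a cross product of each object property's known domains and ranges. Here is a minimal sketch using plain Python data in place of an RDFLib graph; the property and class names, and the helper build_potential_schema, are hypothetical.

```python
from itertools import product

# Hypothetical object properties with the local names of their
# rdfs:domain / rdfs:range classes
object_properties = {
    "hasPolicy": {"domains": ["Customer", "Company"], "ranges": ["Policy"]},
    "coversAsset": {"domains": ["Policy"], "ranges": ["Vehicle", "Property"]},
    "relatedTo": {"domains": ["Claim"], "ranges": []},  # no range declared
}

def build_potential_schema(props):
    """Return (domain, relation, range) triples, one per domain/range pair."""
    triples = []
    for relname, spec in props.items():
        for d, r in product(spec["domains"], spec["ranges"]):
            triples.append((d, relname, r))
    return triples

triples = build_potential_schema(object_properties)
for t in triples:
    print(t)
```

A property with no declared domain or no declared range produces no triples, so in getSchemaFromOnto() it still appears among the relations but not in potential_schema.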

Helper: getPKs()

Returns the local names of all properties declared as owl:InverseFunctionalProperty — these are the natural primary keys that can be used as merge keys during ingestion:
def getPKs(g):
    keys = []
    for k in g.subjects(RDF.type, OWL.InverseFunctionalProperty):
        keys.append(getLocalPart(k))
    return keys

owl:InverseFunctionalProperty is the OWL idiom for “this property value uniquely identifies the subject”, which makes it directly analogous to a primary key. Declaring identifier properties this way in your ontology gives the pipeline a principled basis for MERGE keys.
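
As an illustration of how such a key might be used, the sketch below builds a Cypher MERGE statement keyed on a single property. The helper build_merge_query, the label, and the property names are hypothetical; none of them come from utils.py or neo4j-graphrag.

```python
def build_merge_query(label, key):
    """Build a parameterised Cypher MERGE keyed on a single property.

    Labels and property names cannot be passed as Cypher parameters, so they
    are interpolated here; only use values taken from a trusted ontology.
    """
    return (
        f"MERGE (n:`{label}` {{`{key}`: $key_value}}) "
        "SET n += $props "
        "RETURN n"
    )

# Hypothetical output of getPKs(g) for an insurance ontology
pks = ["policyNumber"]

query = build_merge_query("Policy", pks[0])
# Values travel as parameters, never as part of the query string
params = {"key_value": "POL-001", "props": {"holder": "A. Smith"}}
print(query)
```

Running the same statement twice with the same key value would match the existing node instead of creating a duplicate, which is the point of merging on a key.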

Using the Schema in SimpleKGPipeline

Once getSchemaFromOnto() has produced a SchemaConfig, it feeds directly into neo4j-graphrag’s SimpleKGPipeline for ontology-constrained KG construction:
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.llm.openai_llm import OpenAILLM
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
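from rdflib import Graph
import neo4j

from utils import getSchemaFromOnto  # the session's utils.py, described above

# Placeholder connection details: replace with your own Neo4j URI and credentials
driver = neo4j.GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

# Load the ontology and derive the neo4j-graphrag schema from it
g = Graph()
g.parse("ontologies/domain.ttl")
neo4j_schema = getSchemaFromOnto(g)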

splitter = FixedSizeSplitter(chunk_size=2500, chunk_overlap=10)
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "max_tokens": 3000,
        "response_format": {"type": "json_object"},
        "temperature": 0,
    },
)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    text_splitter=splitter,
    embedder=embedder,
    entities=list(neo4j_schema.entities.values()),
    relations=list(neo4j_schema.relations.values()),
    potential_schema=neo4j_schema.potential_schema,
    on_error="IGNORE",
    from_pdf=False,
)

End-to-End Flow Summary

1. Design the OWL ontology

Author an OWL/Turtle ontology covering your domain. Declare key properties as owl:InverseFunctionalProperty to enable deterministic MERGE keys.

2. Convert ontology to GraphRAG schema

Call getSchemaFromOnto(g) to convert the RDFLib graph into SchemaEntity, SchemaRelation, and potential_schema triples for neo4j-graphrag.

3. Run SimpleKGPipeline

Feed the schema into SimpleKGPipeline along with an LLM, embedder, and text splitter. The pipeline chunks your documents, extracts entities and relationships, and writes them to Neo4j.

4. Query with GraphRAG retrieval

Use the populated knowledge graph and its vector index to answer natural-language questions. The retriever walks the graph from semantically similar entry points, enriching the LLM context with structured facts.
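
The retrieval step can be pictured with a toy in-memory version: rank chunks by cosine similarity to the question embedding, then follow links from the best chunks to the entities extracted from them and collect their facts as LLM context. Everything below (vectors, chunk ids, facts, function names) is made up for illustration; in the session this work is done by Neo4j's vector index and neo4j-graphrag's retrievers, not by this code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "vector index": chunk id -> embedding
chunk_vectors = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.0, 1.0, 0.0],
    "c3": [0.9, 0.1, 0.0],
}
# Toy graph: chunk -> entities extracted from it, entity -> known facts
chunk_entities = {"c1": ["Policy"], "c2": ["Vehicle"], "c3": ["Policy", "Customer"]}
entity_facts = {
    "Policy": ["Policy POL-001 covers Vehicle V-9"],
    "Customer": ["Customer A. Smith holds Policy POL-001"],
    "Vehicle": ["Vehicle V-9 is a van"],
}

def retrieve(query_vector, top_k=2):
    # 1. Vector search: pick the top_k most similar chunks
    ranked = sorted(
        chunk_vectors,
        key=lambda c: cosine(query_vector, chunk_vectors[c]),
        reverse=True,
    )
    top_chunks = ranked[:top_k]
    # 2. Graph traversal: chunk -> entities -> facts, without duplicates
    facts = []
    for chunk in top_chunks:
        for entity in chunk_entities[chunk]:
            for fact in entity_facts[entity]:
                if fact not in facts:
                    facts.append(fact)
    return top_chunks, facts

top_chunks, context = retrieve([1.0, 0.05, 0.0])
# 3. The collected facts become the context handed to the LLM
prompt = "Answer using these facts:\n" + "\n".join(context)
print(top_chunks)
print(prompt)
```

The structured facts reach the prompt only because the traversal started from chunks that were semantically close to the question; the unrelated chunk and its entity are never visited.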

Key Design Decisions

Ontology-driven schema

Using getSchemaFromOnto() means the SimpleKGPipeline schema is always derived from the ontology — there is one source of truth and no manual schema transcription.

Chunk size tuning

FixedSizeSplitter(chunk_size=2500, chunk_overlap=10) balances context window usage against extraction completeness. Larger chunks capture more entity co-occurrences but cost more tokens.
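
To make the two parameters concrete, here is a simplified character-based splitter. It is an illustrative sketch only: split_fixed is a hypothetical helper, and neo4j-graphrag's FixedSizeSplitter may differ in details such as how it treats boundaries.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks

# Synthetic 6000-character document made of repeating digits
text = "".join(str(i % 10) for i in range(6000))
chunks = split_fixed(text, chunk_size=2500, chunk_overlap=10)
print([len(c) for c in chunks])
```

With only 10 characters shared between neighbours, a longer entity mention that straddles a boundary can still be cut in two, which is one reason to experiment with a larger overlap.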

JSON response format

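Setting response_format to {"type": "json_object"} in the LLM's model_params makes the model return syntactically valid JSON, which SimpleKGPipeline can parse without brittle string manipulation. It does not by itself guarantee that the JSON matches the expected graph structure.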

on_error=IGNORE

Extraction errors for individual chunks are swallowed rather than aborting the full pipeline — appropriate for large document collections where occasional failures are acceptable.
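
The skip-on-failure behaviour can be sketched as a loop that parses each chunk's LLM output as JSON and moves on when parsing fails. This is an illustrative stand-in, not neo4j-graphrag's implementation; the function name and the sample outputs are hypothetical.

```python
import json

def parse_chunk_outputs(raw_outputs, on_error="IGNORE"):
    """Parse one JSON document per chunk; skip or re-raise on malformed output."""
    parsed, failed = [], []
    for index, raw in enumerate(raw_outputs):
        try:
            parsed.append(json.loads(raw))
        except json.JSONDecodeError:
            if on_error == "RAISE":
                raise
            failed.append(index)  # remember which chunk was dropped
    return parsed, failed

# Hypothetical LLM outputs for three chunks
raw_outputs = [
    '{"nodes": [{"id": "0", "label": "Policy"}], "relationships": []}',
    'Sure! Here is the graph: {"nodes": [',  # malformed: prose plus truncated JSON
    '{"nodes": [], "relationships": []}',
]
graphs, skipped = parse_chunk_outputs(raw_outputs)
print(len(graphs), skipped)
```

Keeping track of the skipped indices, as this sketch does, makes it possible to retry or inspect the failed chunks later instead of losing them silently.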

Session 32 extends this pattern to heterogeneous data sources (PDFs, CSVs, and CRM exports), all unified under a single insurance/sales ontology.
