Session 32 (Season 2, Episode 5 — January 2025) tackles the practical challenge that appears the moment you try to build a production knowledge graph: your data never comes from a single source or in a single format. This session demonstrates how one well-designed OWL ontology can act as the integration hub for PDFs, CSVs, and CRM data simultaneously — producing a unified Neo4j graph that is immediately ready for GraphRAG retrieval. Two Python utilities do the heavy lifting: DIModelBuilder generates Neo4j Data Importer models from the OWL, and RAGSchemaFromOnto converts the same ontology into the neo4j-graphrag schema format.

Watch the Recording

Full live-stream replay on YouTube

Session Code

Python: RAGSchemaFromOnto.py and DIModelBuilder.py

The Integration Challenge

When building a KG from a single domain (say, legal contracts), a single ontology and a single pipeline are enough. Real-world scenarios typically involve several kinds of source at once:
  • PDF documents (contracts, reports, product specs)
  • Structured tabular data (CSV exports from CRMs, policy databases)
  • Semi-structured API responses (sales opportunity data, customer records)
The session uses an insurance/sales domain ontology (insurance.ttl / sales-onto.ttl) as the unifying schema. Every data source — regardless of format — is loaded in a way that respects the ontology’s class and property definitions.

Project Dependencies

The session code is packaged with a pyproject.toml. The four runtime dependencies are:
dependencies = [
    "streamlit",
    "rdflib",
    "requests",
    "neo4j-graphrag"
]
streamlit powers an interactive demo UI; rdflib handles OWL parsing; requests fetches ontology files from a local HTTP server; and neo4j-graphrag provides the SimpleKGPipeline and schema objects.

RAGSchemaFromOnto.py — Ontology to GraphRAG Schema

RAGSchemaFromOnto.py provides the same getSchemaFromOnto() function as Session 31, but with a key extension: it accepts a file path rather than a pre-loaded Graph, making it convenient to call with different ontology files without managing the RDFLib graph lifecycle externally.

Core Conversion Function

from rdflib import Graph
from rdflib.namespace import RDF, OWL, RDFS
from neo4j_graphrag.experimental.components.schema import (
    SchemaBuilder, SchemaEntity, SchemaProperty, SchemaRelation, SchemaConfig
)

def getSchemaFromOnto(path) -> SchemaConfig:
    g = Graph()
    g.parse(path)
    schema_builder = SchemaBuilder()
    classes = {}
    entities = []
    rels = []
    triples = []

    # Explicitly declared OWL classes
    for cat in g.subjects(RDF.type, OWL.Class):
        classes[cat] = None
        label = getLocalPart(cat)
        props = getPropertiesForClass(g, cat)
        entities.append(SchemaEntity(
            label=label,
            description=next(g.objects(cat, RDFS.comment), ""),
            properties=props
        ))

    # Classes implied by rdfs:domain declarations
    for cat in g.objects(None, RDFS.domain):
        if cat not in classes:
            classes[cat] = None
            label = getLocalPart(cat)
            props = getPropertiesForClass(g, cat)
            entities.append(SchemaEntity(
                label=label,
                description=next(g.objects(cat, RDFS.comment), ""),
                properties=props
            ))

    # Classes implied by rdfs:range declarations (excluding XSD types)
    for cat in g.objects(None, RDFS.range):
        if not (cat.startswith("http://www.w3.org/2001/XMLSchema#") or cat in classes):
            classes[cat] = None
            label = getLocalPart(cat)
            props = getPropertiesForClass(g, cat)
            entities.append(SchemaEntity(
                label=label,
                description=next(g.objects(cat, RDFS.comment), ""),
                properties=props
            ))

    # Object properties become SchemaRelation objects
    for op in g.subjects(RDF.type, OWL.ObjectProperty):
        relname = getLocalPart(op)
        rels.append(SchemaRelation(
            label=relname,
            properties=[],
            description=next(g.objects(op, RDFS.comment), "")
        ))

    # Potential schema: (domain_label, rel_label, range_label) triples
    for op in g.subjects(RDF.type, OWL.ObjectProperty):
        relname = getLocalPart(op)
        doms = [getLocalPart(d) for d in g.objects(op, RDFS.domain) if d in classes]
        rans = [getLocalPart(r) for r in g.objects(op, RDFS.range) if r in classes]
        for d in doms:
            for r in rans:
                triples.append((d, relname, r))

    return schema_builder.create_schema_model(
        entities=entities,
        relations=rels,
        potential_schema=triples
    )
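The function above relies on two helpers, getLocalPart() and getPropertiesForClass(), that are not shown in this excerpt. As a minimal sketch of what the first one might look like (an assumption, not the session's actual code), the local name of a URI can be taken as the text after the last `#`, or failing that after the last `/`:

```python
def getLocalPart(uri) -> str:
    """Return the fragment after '#', else the last path segment of a URI.

    Hypothetical stand-in for the helper used by getSchemaFromOnto();
    the real implementation is not shown in this page.
    """
    text = str(uri)
    # Prefer the fragment separator, then fall back to the last path segment
    for sep in ("#", "/"):
        head, found, tail = text.rpartition(sep)
        if found and tail:
            return tail
    # No separator (or a trailing one): return the input unchanged
    return text


print(getLocalPart("http://example.org/onto#Policy"))    # Policy
print(getLocalPart("http://example.org/onto/Customer"))  # Customer
```

Because rdflib's URIRef is a str subclass, the same function works on the terms yielded by g.subjects() and g.objects().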

getPKs() — Identifying Natural Keys

Properties declared as owl:InverseFunctionalProperty in the ontology are treated as unique identifiers — the OWL equivalent of a primary key:
def getPKs(g):
    keys = []
    for k in g.subjects(RDF.type, OWL.InverseFunctionalProperty):
        keys.append(getLocalPart(k))
    return keys
Marking a property as owl:InverseFunctionalProperty in your ontology is a deliberate design signal: “this property value uniquely identifies the subject.” getPKs() surfaces these so the ingestion pipeline can use them as MERGE keys, preventing duplicate nodes when the same entity appears in multiple source documents.
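To illustrate the MERGE-key idea, here is a small sketch that turns a label and one key name into a parameterised Cypher statement. The function name build_merge_query, the parameter names $key_value and $props, and the policyNumber property are all hypothetical; the session code may wire the keys into the pipeline differently.

```python
def build_merge_query(label: str, key: str) -> str:
    """Build a parameterised Cypher MERGE keyed on a single property."""
    # Backticks quote the label and key; embedded backticks are doubled
    safe_label = label.replace("`", "``")
    safe_key = key.replace("`", "``")
    return (
        f"MERGE (n:`{safe_label}` {{`{safe_key}`: $key_value}}) "
        "SET n += $props"
    )


pk_names = ["policyNumber"]  # e.g. the list returned by getPKs(g)
query = build_merge_query("Policy", pk_names[0])
print(query)
# MERGE (n:`Policy` {`policyNumber`: $key_value}) SET n += $props
```

Running this statement twice with the same $key_value matches the existing node instead of creating a second one, which is the deduplication behaviour described above.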

DIModelBuilder.py — Ontology to Data Importer Model

DIModelBuilder goes in a complementary direction: it converts the OWL ontology into the JSON format consumed by the Neo4j Data Importer, enabling visual, no-code loading of structured (CSV/tabular) data that conforms to the ontology schema.

Class Overview

from rdflib import URIRef, Graph
from rdflib.namespace import RDF, RDFS, OWL, XSD
from collections import defaultdict

class DIModelBuilder:
    MAX_NUM_NODES = 25
    MAX_NUM_RELS = 250

    def __init__(self):
        self.model_def = {}
        self.g = Graph()
The MAX_NUM_NODES and MAX_NUM_RELS guards prevent building an unwieldy import model from very large ontologies — a practical safeguard when working with foundational ontologies like schema.org.
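The excerpt does not show how the two limits are enforced (the builder could truncate, warn, or raise). As one possible reading, a guard that rejects oversized models might look like the sketch below; check_model_size is a hypothetical name and raising ValueError is an assumption.

```python
MAX_NUM_NODES = 25
MAX_NUM_RELS = 250


def check_model_size(num_nodes: int, num_rels: int) -> None:
    """Raise ValueError when a candidate model exceeds either limit."""
    if num_nodes > MAX_NUM_NODES:
        raise ValueError(
            f"{num_nodes} node labels exceed the limit of {MAX_NUM_NODES}"
        )
    if num_rels > MAX_NUM_RELS:
        raise ValueError(
            f"{num_rels} relationship types exceed the limit of {MAX_NUM_RELS}"
        )


check_model_size(12, 40)  # within both limits, returns silently
```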

Building the Import Model

build_di_model() is the entry point. It accepts the raw RDF data as a string, its serialization format, and a props dictionary whose optional classList entry limits which ontology classes are included:
def build_di_model(self, rdf_data, rdf_format, props):
    uri_map = defaultdict(set)
    class_list = props.get("classList", [])
    urilist = self._format_uri_list(class_list) if class_list else None

    self.g.parse(data=rdf_data, format=rdf_format)
    all_classes = set()

    catsQuery = f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?explicit ?parent
        WHERE {{
            ?explicit rdfs:subClassOf* ?parent
            FILTER( ?explicit IN ( {urilist} ) && isIRI(?parent))
        }}"""

    allCatsQuery = """
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?explicit ?parent
        WHERE {
          ?explicit a ?classtype
            FILTER( (?classtype IN ( owl:Class, rdfs:Class ))
                    && NOT EXISTS { ?x rdfs:subClassOf ?explicit })
          ?explicit rdfs:subClassOf* ?parent
            FILTER( isIRI(?parent) )
        }"""

    for row in self.g.query(catsQuery if class_list else allCatsQuery):
        all_classes.add(row.parent)
        uri_map[row.parent].add(row.explicit)
    ...
The SPARQL queries walk the rdfs:subClassOf* hierarchy upwards from each starting class. Because the path matches zero or more steps, the import model includes the selected classes themselves plus every named superclass they inherit from.
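For readers less familiar with property paths, the following pure-Python sketch mimics what the `?explicit rdfs:subClassOf* ?parent` pattern computes when filling uri_map. The parents_of dictionary and the class names are made-up illustration data, and ancestors_map is a hypothetical helper rather than part of DIModelBuilder.

```python
from collections import defaultdict


def ancestors_map(parents_of: dict, selected: list) -> dict:
    """Map each class reachable via subClassOf* to the selected classes it came from."""
    uri_map = defaultdict(set)
    for explicit in selected:
        stack, seen = [explicit], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # guards against cycles in the hierarchy
            seen.add(current)
            # Zero steps counts too, so the starting class maps to itself
            uri_map[current].add(explicit)
            stack.extend(parents_of.get(current, ()))
    return dict(uri_map)


# Toy hierarchy: CarPolicy -> Policy -> Contract, plus a standalone Customer
parents_of = {"CarPolicy": ["Policy"], "Policy": ["Contract"], "Customer": []}
result = ancestors_map(parents_of, ["CarPolicy", "Customer"])
print(sorted(result))  # ['CarPolicy', 'Contract', 'Customer', 'Policy']
```

The keys of the result correspond to all_classes in build_di_model(), and each value records which explicitly requested class pulled that ancestor in.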

Exporting for Neo4j Data Importer v2

The get_model_as_serialisable_object_v2() method produces a JSON document that the Neo4j Data Importer v2 can open directly:
def get_model_as_serialisable_object_v2(self, use_labels=False, make_schema_query_friendly=False):
    self.assign_positions_to_nodes()
    nodes = []
    node_object_types = []
    node_pos = 0
    for node in self.model_def.values():
        nodes.append(node.get_graph_node_as_json_v2(node_pos))
        node_object_types.append(node.get_node_object_type_v2(node_pos))
        node_pos += 1

    node_schemas = []
    rel_schemas = []
    for k, v in self.model_def.items():
        node_schemas.append(v.get_node_schemas_as_json_v2(self.g, use_labels, make_schema_query_friendly))
        rel_schemas.extend(v.get_rel_schemas_v2(self.g, use_labels, make_schema_query_friendly))

    rel_object_types = []
    pos = 0
    for node in self.model_def.values():
        rel_object_types.extend(node.get_rel_object_type_v2(pos, node_object_types))
        pos = len(rel_object_types)

    return {
        "version": "2.2.0",
        "visualisation": {"nodes": nodes},
        "dataModel": {
            "version": "2.2.0",
            "graphSchemaRepresentation": {
                "version": "1.0.0",
                "graphSchema": {
                    "nodeLabels": node_schemas,
                    "relationshipTypes": rel_schemas,
                    "nodeObjectTypes": node_object_types,
                    "relationshipObjectTypes": rel_object_types,
                    "constraints": [],
                    "indexes": []
                }
            },
            ...
        }
    }
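A quick way to sanity-check the returned object before opening it in the Data Importer is to count what ended up in the graph schema. The sketch below relies only on the keys visible in the return value above; summarise_model is a hypothetical helper, and the sample entries are empty placeholders because the inner shape of each node label and relationship type is not shown here.

```python
def summarise_model(model: dict) -> dict:
    """Report the format version and the size of the graph schema."""
    schema = model["dataModel"]["graphSchemaRepresentation"]["graphSchema"]
    return {
        "version": model["version"],
        "node_labels": len(schema["nodeLabels"]),
        "relationship_types": len(schema["relationshipTypes"]),
    }


# Placeholder model mirroring only the top-level structure shown above
sample = {
    "version": "2.2.0",
    "visualisation": {"nodes": []},
    "dataModel": {
        "version": "2.2.0",
        "graphSchemaRepresentation": {
            "version": "1.0.0",
            "graphSchema": {
                "nodeLabels": [{}, {}],
                "relationshipTypes": [{}],
                "nodeObjectTypes": [],
                "relationshipObjectTypes": [],
                "constraints": [],
                "indexes": [],
            },
        },
    },
}

print(summarise_model(sample))
# {'version': '2.2.0', 'node_labels': 2, 'relationship_types': 1}
```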

Running the Builder

import requests
from DIModelBuilder import DIModelBuilder

link = "http://localhost:8000/ontos/sales-onto.ttl"
resp = requests.get(link)

mb = DIModelBuilder()
mb.build_di_model(resp.text, "ttl", {})

# Optionally filter to a specific set of classes:
# mb.build_di_model(resp.text, "ttl", {
#     "classList": [
#         "http://example.org/onto#Policy",
#         "http://example.org/onto#Customer"
#     ]
# })

mb.export_model_to_file("output.json", mb.get_model_as_serialisable_object_v2())
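The body of export_model_to_file() is not shown on this page. Since the v2 export returns a plain dictionary, a reasonable assumption is that it serialises that dictionary as JSON; the stand-in below (write_model_json, a hypothetical name) sketches that behaviour and reads the file back to confirm the round trip.

```python
import json
import os
import tempfile


def write_model_json(path: str, model: dict) -> None:
    """Write a serialisable model dictionary to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh, indent=2)


# Round trip in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output.json")
    write_model_json(target, {"version": "2.2.0"})
    with open(target, encoding="utf-8") as fh:
        reloaded = json.load(fh)

print(reloaded["version"])  # 2.2.0
```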

The Integration Architecture

PDFs via LLM extraction

Unstructured PDF documents are processed through SimpleKGPipeline using the schema produced by getSchemaFromOnto(), writing extracted entities directly to Neo4j.

CSVs via Data Importer

Structured tabular data is loaded through the Neo4j Data Importer using the model produced by DIModelBuilder, mapping CSV columns to ontology properties.

Single ontology as truth

Both ingestion paths share the same OWL ontology — the LLM extraction schema and the Data Importer model are both derived from it, guaranteeing a consistent node/relationship vocabulary.

GraphRAG-ready output

The resulting unified graph can immediately be queried with neo4j-graphrag retrieval components, since the schema objects are derived from the same ontology.
Session 33 benchmarks the different retrieval strategies you can apply to the unified knowledge graph built in this session — vector search, keyword search, graph traversal, and hybrids.
