Session 29 (Season 2, Episode 2 — October 2024) puts the concepts from Session 28 into working Python code. The full pipeline reads an unstructured text file, parses an OWL ontology with RDFLib, asks GPT-4o to extract entities and relationships as Cypher MERGE statements guided by that ontology, and finally executes those statements against a Neo4j database. Every component is visible, inspectable, and straightforward to adapt to a different domain ontology or text corpus.
The pipeline in kgbuilder-openai.py consists of exactly four steps: load text, parse ontology, prompt the LLM, and write to Neo4j. Each step is clearly delineated in the source file.
    Unstructured text
          │
          ▼
    [STEP 1] Read text file
          │
          ▼
    [STEP 2] Parse OWL ontology → Turtle or Natural Language representation
          │
          ▼
    [STEP 3] GPT-4o: extract entities as Cypher MERGE statements
          │
          ▼
    [STEP 4] Execute Cypher against Neo4j
The script reads a plain-text file from the content/ directory. In the demo, the source material is a text description of David Hockney’s painting Mr and Mrs Clark and Percy:
    with open('content/hockney-mr-and-mrs-clark-and-percy.txt', 'r') as file:
        content = file.read().replace('\n', '')
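The replace('\n', '') call flattens the file into a single string, so the LLM receives continuous prose instead of text broken at arbitrary line wraps. Because each newline is replaced with an empty string, the words on either side of a line break are joined together unless the line already ends with a space.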
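Where the source file wraps lines without trailing spaces, a whitespace-normalising variant keeps adjacent words apart. The following is a minimal sketch of that idea; the `flatten_text` helper and the sample string are hypothetical and not part of kgbuilder-openai.py.

```python
def flatten_text(raw: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, repeated spaces) into one space."""
    return " ".join(raw.split())


# Made-up sample with hard line wraps and a blank line.
sample = "Mr and Mrs Clark\nand Percy is a painting\n\nby David Hockney."

# The one-liner from the script: newlines vanish, so neighbouring words can fuse.
print(sample.replace("\n", ""))

# The normalising variant: words stay separated by a single space.
print(flatten_text(sample))
```

Either form yields a single-line string for the prompt; the difference only matters when line breaks are the sole separator between words.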
The ontology is loaded from a Turtle file using RDFLib’s Graph.parse(). There are two options for how to represent the ontology in the prompt:
    from rdflib import Graph
    from utils import getNLOntology

    g = Graph()
    g.parse("ontologies/art.ttl")

    # OPTION 1: Ontology in standard Turtle serialisation
    ontology = g.serialize(format="ttl")

    # OPTION 2: Natural language description of the ontology
    ontology = getNLOntology(g)
getNLOntology() (in utils.py) translates the RDF graph into a human-readable text block structured around three sections — categories, attributes, and relationships — making it easier for the LLM to follow:
    from rdflib import Graph
    from rdflib.namespace import RDF, OWL, RDFS

    def getLocalPart(uri):
        pos = uri.rfind('#')
        if pos < 0:
            pos = uri.rfind('/')
        if pos < 0:
            pos = uri.rindex(':')
        return uri[pos+1:]

    def getNLOntology(g):
        result = ''
        definedcats = []

        result += '\nCATEGORIES:\n'
        for cat in g.subjects(RDF.type, OWL.Class):
            result += getLocalPart(cat)
            definedcats.append(cat)
            for desc in g.objects(cat, RDFS.comment):
                result += ': ' + desc + '\n'

        result += '\nATTRIBUTES:\n'
        for att in g.subjects(RDF.type, OWL.DatatypeProperty):
            result += getLocalPart(att)
            for dom in g.objects(att, RDFS.domain):
                result += ': Attribute that applies to entities of type ' + getLocalPart(dom)
            for desc in g.objects(att, RDFS.comment):
                result += '. It represents ' + desc + '\n'

        result += '\nRELATIONSHIPS:\n'
        for att in g.subjects(RDF.type, OWL.ObjectProperty):
            result += getLocalPart(att)
            for dom in g.objects(att, RDFS.domain):
                result += ': Relationship that connects entities of type ' + getLocalPart(dom)
            for ran in g.objects(att, RDFS.range):
                result += ' to entities of type ' + getLocalPart(ran)
            for desc in g.objects(att, RDFS.comment):
                result += '. It represents ' + desc + '\n'

        return result
Option 2 (natural language) tends to produce better results for smaller ontologies because the LLM can read the categories and constraints as plain English rather than parsing Turtle syntax.
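The names that appear in the natural-language description come from getLocalPart(), which keeps only the part of a URI after its last '#', '/', or ':'. The sketch below repeats that helper on its own so its behaviour can be checked without RDFLib; the example URIs are made up for illustration and are not taken from art.ttl.

```python
def getLocalPart(uri):
    # Same logic as the helper in utils.py: cut after the last '#', else '/', else ':'.
    pos = uri.rfind('#')
    if pos < 0:
        pos = uri.rfind('/')
    if pos < 0:
        pos = uri.rindex(':')  # raises ValueError when no separator is present at all
    return uri[pos+1:]


# Hypothetical URIs covering the three separator cases.
print(getLocalPart("http://example.org/art#Painting"))   # Painting
print(getLocalPart("http://example.org/art/createdBy"))  # createdBy
print(getLocalPart("urn:art:Artist"))                    # Artist
```

Reading getNLOntology() with this in mind, a class with an rdfs:comment becomes a line of the form `Painting: <comment>`. The newline is only appended together with the comment, so terms without an rdfs:comment would run into the next entry.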
The system prompt establishes the LLM as an expert in structured information extraction. The user prompt injects the ontology and the content, then requests Cypher MERGE statements as output:
    from openai import OpenAI
    import os

    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    system = (
        "You are an expert in extracting structured information out of natural language text. "
        "You extract entities with their attributes and relationships between entities. "
        "You can produce the output as RDF triples or as Cypher write statements on request."
    )

    prompt = (
        "Given the ontology below run your best entity extraction over the content.\n"
        " The extracted entities and relationships must be described using exclusively the terms in the ontology\n"
        " and in the way they are defined. This means that for attributes and relationships you will respect "
        "the domain and range constraints.\n"
        " You will never use terms not defined in the ontology.\n"
        "Return the output as Cypher using merge to allow for linkage of nodes from multiple passes.\n"
        "Absolutely no comments on the output. Just the structured output. "
        + '\n\nONTOLOGY: \n ' + ontology
        + '\n\nCONTENT: \n ' + content
    )

    chat_completion = client.chat.completions.create(
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': prompt},
        ],
        model="gpt-4o",
    )

    # Strip the markdown code fences the LLM wraps around Cypher
    cypher_script = chat_completion.choices[0].message.content[3:-3]
    print(cypher_script)
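The slice [3:-3] drops the first and last three characters of the reply, which removes the triple-backtick fences (```) when the reply starts and ends with exactly those characters. It does not handle a language tag after the opening fence (for example ```cypher) or whitespace after the closing fence, so this strip logic may need updating if the model's output format differs.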
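A more tolerant way to remove the fences is to match them with a regular expression and fall back to the raw reply when none are present. This is a minimal sketch under that assumption; `strip_code_fence` is a hypothetical helper, not part of the original script, and the sample reply is invented.

```python
import re

FENCE = "`" * 3  # three backticks

# Opening fence with an optional language tag, the body, then a closing fence.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\r?\n(.*?)\r?\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_code_fence(reply: str) -> str:
    """Return the fenced body if the reply is one fenced block, else the reply stripped."""
    match = _FENCED.match(reply)
    return match.group(1).strip() if match else reply.strip()


# Invented reply that includes a language tag after the opening fence.
reply = FENCE + "cypher\nMERGE (a:Artist {name: 'David Hockney'})\n" + FENCE

print(repr(reply[3:-3]))        # the language tag and newlines are still there
print(strip_code_fence(reply))  # only the Cypher statement remains
```

The fallback branch means an unfenced reply passes through unchanged apart from surrounding whitespace, whereas the fixed slice would cut off six characters of Cypher.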
Because the LLM generates MERGE statements (not CREATE), running the pipeline over multiple text files links shared entities rather than creating duplicates, provided the model identifies each entity with the same label and property values on every pass.
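MERGE matches on the whole pattern it is given, so deduplication depends on the generated statements using the same identifying properties each time. The toy model below imitates that matching rule in plain Python to show both outcomes. It is an illustration only, not how Neo4j is implemented, and the label and property names are hypothetical.

```python
def merge_node(store, label, props):
    """Toy model of Cypher MERGE on a single node pattern.

    Reuses a stored node that has the label and every given property value;
    otherwise creates a new node.
    """
    for node in store:
        same_label = node["label"] == label
        same_props = all(node["props"].get(k) == v for k, v in props.items())
        if same_label and same_props:
            return node
    node = {"label": label, "props": dict(props)}
    store.append(node)
    return node


store = []

# Two passes merge on the same key: the second pass reuses the first node.
merge_node(store, "Artist", {"name": "David Hockney"})
merge_node(store, "Artist", {"name": "David Hockney"})
print(len(store))  # 1

# A later pass adds a property to the pattern; no stored node has it, so a new node appears.
merge_node(store, "Artist", {"name": "David Hockney", "nationality": "British"})
print(len(store))  # 2
```

In Cypher, a common way to avoid the second outcome is to MERGE on the identifying property alone and assign the remaining properties with SET or ON CREATE SET. Whether the model does so depends on the prompt.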
    import os
    from neo4jconnector import Neo4jConnection
    from utils import getNLOntology
    from openai import OpenAI
    from rdflib import Graph

    # STEP 1: Load the text
    with open('content/hockney-mr-and-mrs-clark-and-percy.txt', 'r') as file:
        content = file.read().replace('\n', '')

    # STEP 2: Parse the ontology
    g = Graph()
    g.parse("ontologies/art.ttl")
    ontology = getNLOntology(g)  # or g.serialize(format="ttl")

    # STEP 3: Prompt GPT-4o
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    system = (
        "You are an expert in extracting structured information out of natural language text. "
        "You extract entities with their attributes and relationships between entities. "
        "You can produce the output as RDF triples or as Cypher write statements on request."
    )

    prompt = (
        "Given the ontology below run your best entity extraction over the content.\n"
        " The extracted entities and relationships must be described using exclusively the terms in the ontology "
        "and in the way they are defined. This means that for attributes and relationships you will respect "
        "the domain and range constraints.\n"
        " You will never use terms not defined in the ontology.\n"
        "Return the output as Cypher using merge to allow for linkage of nodes from multiple passes.\n"
        "Absolutely no comments on the output. Just the structured output. "
        + '\n\nONTOLOGY: \n ' + ontology
        + '\n\nCONTENT: \n ' + content
    )

    chat_completion = client.chat.completions.create(
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': prompt},
        ],
        model="gpt-4o",
    )

    cypher_script = chat_completion.choices[0].message.content[3:-3]

    # STEP 4: Write to Neo4j
    conn = Neo4jConnection("bolt://localhost:7687", "neo4j", "neoneoneo")
    result = conn.run_cypher(cypher_script)
    conn.close()
Passing the raw Turtle serialization works for LLMs well-versed in OWL syntax. The getNLOntology() helper is more reliable for general-purpose models and smaller ontologies.
MERGE for deduplication
Using MERGE instead of CREATE makes the write step idempotent: executing the same statements again, or statements from an overlapping document that merge on the same properties, does not produce duplicate nodes.
Domain and range enforcement
The prompt explicitly instructs the LLM to respect domain and range constraints from the ontology. SHACL validation (Session 28) can catch any violations that slip through.
Extensible to any domain
Swap art.ttl for any OWL ontology and content/ for any text corpus — the four-step pipeline requires no other changes.
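For the domain and range point above, a lightweight pre-check in Python can flag obvious violations before anything is written to the database. The sketch below is not SHACL and ignores subclass reasoning. The constraint table, the `find_violations` helper, and the sample relationships are all hypothetical, and it assumes the extracted relationships are available as (source label, relationship type, target label) tuples.

```python
# Hypothetical constraints, as they might be read from rdfs:domain / rdfs:range in an ontology.
CONSTRAINTS = {
    "createdBy": ("Artwork", "Artist"),
    "depicts": ("Artwork", "Person"),
}


def find_violations(relationships, constraints):
    """Return one message per constraint broken by a (source, rel, target) tuple."""
    problems = []
    for source, rel, target in relationships:
        if rel not in constraints:
            problems.append(f"{rel}: not defined in the ontology")
            continue
        domain, range_ = constraints[rel]
        if source != domain:
            problems.append(f"{rel}: source {source} is not in domain {domain}")
        if target != range_:
            problems.append(f"{rel}: target {target} is not in range {range_}")
    return problems


extracted = [
    ("Artwork", "createdBy", "Artist"),  # conforms
    ("Artist", "depicts", "Person"),     # wrong domain
    ("Artwork", "ownedBy", "Museum"),    # term not in the ontology
]

for problem in find_violations(extracted, CONSTRAINTS):
    print(problem)
```

A check like this only covers exact label matches; SHACL validation remains the more complete option for constraints involving class hierarchies or cardinalities.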
See Session 30 for an evolution of this pattern that adds PDF ingestion and a parametric Cypher import template, making the pipeline more repeatable across many documents.