Session 29 (Season 2, Episode 2 — October 2024) puts the concepts from Session 28 into working Python code. The full pipeline reads an unstructured text file, parses an OWL ontology with RDFLib, asks GPT-4o to extract entities and relationships as Cypher MERGE statements guided by that ontology, and finally executes those statements against a Neo4j database. Every component is visible, inspectable, and straightforward to adapt to a different domain ontology or text corpus.

Watch the Recording

Full live-stream replay on YouTube

Session Code

Python scripts: kgbuilder-openai.py and utils.py

Pipeline Overview

The pipeline in kgbuilder-openai.py consists of exactly four steps: load text, parse ontology, prompt the LLM, and write to Neo4j. Each step is clearly delineated in the source file.
Unstructured text
      ↓
  [STEP 1] Read text file
      ↓
  [STEP 2] Parse OWL ontology → Turtle or Natural Language representation
      ↓
  [STEP 3] GPT-4o: extract entities as Cypher MERGE statements
      ↓
  [STEP 4] Execute Cypher against Neo4j

Step 1 — Load the Unstructured Text

The script reads a plain-text file from the content/ directory. In the demo, the source material is a text description of David Hockney’s painting Mr and Mrs Clark and Percy:
with open('content/hockney-mr-and-mrs-clark-and-percy.txt', 'r') as file:
    content = file.read().replace('\n', '')
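The replace('\n', '') call flattens the file into a single string so the LLM receives uninterrupted prose rather than line-broken text. Because newlines are removed rather than replaced with a space, the words on either side of each line break are joined together; for hard-wrapped files, replacing with ' ' is the safer choice.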
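As a minimal sketch of the flattening step, the snippet below compares the one-line replace() used in the script with a whitespace-collapsing alternative. The helper name flatten_text and the sample string are illustrative assumptions, not part of the session code.

```python
def flatten_text(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


# Hypothetical hard-wrapped input, standing in for the contents of a text file
sample = "Mr and Mrs Clark\nand Percy is a painting\n\nby David Hockney."

# The approach from the script: newlines are deleted, so neighbouring words fuse
print(sample.replace("\n", ""))

# The alternative: line breaks become single spaces, so word boundaries survive
print(flatten_text(sample))
```

Either form gives the LLM a single string; the second simply keeps word boundaries intact when the source file wraps lines mid-sentence.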

Step 2 — Parse the OWL Ontology

The ontology is loaded from a Turtle file using RDFLib’s Graph.parse(). There are two options for how to represent the ontology in the prompt:
from rdflib import Graph
from utils import getNLOntology

g = Graph()
g.parse("ontologies/art.ttl")

# OPTION 1: Ontology in standard Turtle serialisation
ontology = g.serialize(format="ttl")

# OPTION 2: Natural language description of the ontology
# (keep only one option; if both lines run, this assignment overwrites Option 1)
ontology = getNLOntology(g)

The getNLOntology() Function

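getNLOntology() (in utils.py) translates the RDF graph into a human-readable text block structured around three sections (categories, attributes, and relationships), making it easier for the LLM to follow. A line break is only emitted when an rdfs:comment is present, so the output reads cleanly only if every class and property in the ontology carries a comment: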
from rdflib import Graph
from rdflib.namespace import RDF, OWL, RDFS

def getLocalPart(uri):
    pos = uri.rfind('#')
    if pos < 0:
        pos = uri.rfind('/')
    if pos < 0:
        pos = uri.rindex(':')
    return uri[pos+1:]

def getNLOntology(g):
    result = ''
    definedcats = []

    result += '\nCATEGORIES:\n'
    for cat in g.subjects(RDF.type, OWL.Class):
        result += getLocalPart(cat)
        definedcats.append(cat)
        for desc in g.objects(cat, RDFS.comment):
            result += ': ' + desc + '\n'

    result += '\nATTRIBUTES:\n'
    for att in g.subjects(RDF.type, OWL.DatatypeProperty):
        result += getLocalPart(att)
        for dom in g.objects(att, RDFS.domain):
            result += ': Attribute that applies to entities of type ' + getLocalPart(dom)
        for desc in g.objects(att, RDFS.comment):
            result += '. It represents ' + desc + '\n'

    result += '\nRELATIONSHIPS:\n'
    for att in g.subjects(RDF.type, OWL.ObjectProperty):
        result += getLocalPart(att)
        for dom in g.objects(att, RDFS.domain):
            result += ': Relationship that connects entities of type ' + getLocalPart(dom)
        for ran in g.objects(att, RDFS.range):
            result += ' to entities of type ' + getLocalPart(ran)
        for desc in g.objects(att, RDFS.comment):
            result += '. It represents ' + desc + '\n'
    return result
Option 2 (natural language) tends to produce better results for smaller ontologies because the LLM can read the categories and constraints as plain English rather than parsing Turtle syntax.
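The names that appear in the natural language description all come from getLocalPart(). The sketch below repeats that helper so it runs on its own and applies it to three made-up identifiers; the example.org and urn values are illustrative assumptions, not terms taken from art.ttl.

```python
def getLocalPart(uri):
    # Same logic as in utils.py: cut after the last '#',
    # otherwise after the last '/', otherwise after the last ':'
    pos = uri.rfind('#')
    if pos < 0:
        pos = uri.rfind('/')
    if pos < 0:
        pos = uri.rindex(':')
    return uri[pos+1:]


# Hypothetical identifiers covering the three cases handled above
examples = [
    "http://example.org/art#Painting",   # hash namespace
    "http://example.org/art/Artist",     # slash namespace
    "urn:example:Gallery",               # no '#' or '/', falls back to ':'
]

for uri in examples:
    print(uri, "->", getLocalPart(uri))
```

Since only the local part reaches the prompt, two terms from different namespaces that share a local name would look identical to the LLM.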

Step 3 — Prompt GPT-4o for Cypher Extraction

The system prompt establishes the LLM as an expert in structured information extraction. The user prompt injects the ontology and the content, then requests Cypher MERGE statements as output:
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

system = (
    "You are an expert in extracting structured information out of natural language text. "
    "You extract entities with their attributes and relationships between entities. "
    "You can produce the output as RDF triples or as Cypher write statements on request."
)

prompt = (
    "Given the ontology below run your best entity extraction over the content.\n"
    " The extracted entities and relationships must be described using exclusively the terms in the ontology\n"
    " and in the way they are defined. This means that for attributes and relationships you will respect "
    "the domain and range constraints.\n"
    " You will never use terms not defined in the ontology.\n"
    "Return the output as Cypher using merge to allow for linkage of nodes from multiple passes.\n"
    "Absolutely no comments on the output. Just the structured output. "
    + '\n\nONTOLOGY: \n ' + ontology
    + '\n\nCONTENT: \n ' + content
)

chat_completion = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': system},
        {'role': 'user', 'content': prompt},
    ],
    model="gpt-4o",
)

# Strip the markdown code fences the LLM wraps around Cypher
cypher_script = chat_completion.choices[0].message.content[3:-3]
print(cypher_script)
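The slice [3:-3] drops the first and last three characters of the response, which removes the triple-backtick fences (```) only when the response starts and ends with exactly that fence. If the model adds a language tag after the opening fence (for example ```cypher), appends trailing whitespace, or returns unfenced text, the slice leaves stray characters or cuts into the Cypher, so this strip logic may need updating.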
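One possible way to make the fence stripping less sensitive to the response format is a regular expression that tolerates an optional language tag and falls back to the raw text when no fence is found. This is a sketch only: strip_code_fence is a hypothetical helper, not part of kgbuilder-openai.py, and the sample responses are invented.

```python
import re

# Optional leading whitespace, an opening fence with an optional language tag,
# the body, then a closing fence followed by optional trailing whitespace.
_FENCE = re.compile(r"^\s*```[A-Za-z0-9_+-]*[ \t]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the body of a fenced block, or the trimmed text if it is unfenced."""
    match = _FENCE.match(text)
    return match.group(1).strip() if match else text.strip()


# Hypothetical model responses in three shapes
tagged = "```cypher\nMERGE (a:Artist {name: 'David Hockney'})\n```"
plain = "```\nMERGE (a:Artist {name: 'David Hockney'})\n```"
unfenced = "MERGE (a:Artist {name: 'David Hockney'})"

for response in (tagged, plain, unfenced):
    print(repr(strip_code_fence(response)))

# For comparison, the fixed slice keeps the language tag on the tagged response
print(repr(tagged[3:-3]))
```

All three shapes reduce to the same Cypher statement, whereas the fixed slice would pass the word cypher to the database as part of the query.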

Step 4 — Write to Neo4j

The generated Cypher is executed against a local Neo4j instance via Neo4jConnection, a thin wrapper around the official Python driver:
from neo4jconnector import Neo4jConnection

uri = "bolt://localhost:7687"
user = "neo4j"
password = "neoneoneo"
conn = Neo4jConnection(uri, user, password)

result = conn.run_cypher(cypher_script)
conn.close()
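Because the LLM generates MERGE statements (not CREATE), running the pipeline over multiple text files can link shared entities rather than duplicate them. This holds only when each pass produces the same label and identifying properties for an entity, because MERGE matches on the full pattern it is given and creates a new node when any part differs.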
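Since the Cypher comes from an LLM, a light vocabulary check before calling run_cypher may be worthwhile. The sketch below pulls node labels and relationship types out of a script with regular expressions and reports any that are not in an allowed set. The function name, the allowed sets, and the sample Cypher are all hypothetical, and the check covers term names only: it does not verify domain and range, sees only the first label of a multi-label node, and can be confused by parentheses or brackets inside string values.

```python
import re


def find_unknown_terms(cypher: str, labels: set[str], rel_types: set[str]) -> set[str]:
    """Return labels and relationship types used in cypher but not allowed."""
    # Node patterns such as "(p:Painting" or "(:Painting"
    used_labels = set(re.findall(r"\(\s*\w*\s*:\s*`?(\w+)`?", cypher))
    # Relationship patterns such as "[r:PAINTED" or "[:PAINTED"
    used_rels = set(re.findall(r"\[\s*\w*\s*:\s*`?(\w+)`?", cypher))
    return (used_labels - labels) | (used_rels - rel_types)


# Hypothetical ontology vocabulary (in practice, derived from the parsed graph)
allowed_labels = {"Artist", "Painting"}
allowed_rels = {"PAINTED"}

# Hypothetical LLM output that strays outside the vocabulary
generated = (
    "MERGE (a:Artist {name: 'David Hockney'})\n"
    "MERGE (p:Painting {title: 'Mr and Mrs Clark and Percy'})\n"
    "MERGE (a)-[:PAINTED]->(p)\n"
    "MERGE (g:Gallery {name: 'Example Gallery'})\n"
    "MERGE (p)-[:EXHIBITED_AT]->(g)\n"
)

unknown = find_unknown_terms(generated, allowed_labels, allowed_rels)
if unknown:
    print("Terms not in the ontology:", sorted(unknown))
```

A script that fails the check could be rejected or sent back to the model for another attempt instead of being written to the database.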

Full Pipeline at a Glance

import os
from neo4jconnector import Neo4jConnection
from utils import getNLOntology
from openai import OpenAI
from rdflib import Graph

# STEP 1: Load the text
with open('content/hockney-mr-and-mrs-clark-and-percy.txt', 'r') as file:
    content = file.read().replace('\n', '')

# STEP 2: Parse the ontology
g = Graph()
g.parse("ontologies/art.ttl")
ontology = getNLOntology(g)   # or g.serialize(format="ttl")

# STEP 3: Prompt GPT-4o
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
system = (
    "You are an expert in extracting structured information out of natural language text. "
    "You extract entities with their attributes and relationships between entities. "
    "You can produce the output as RDF triples or as Cypher write statements on request."
)
prompt = (
    "Given the ontology below run your best entity extraction over the content.\n"
    " The extracted entities and relationships must be described using exclusively the terms in the ontology "
    "and in the way they are defined. This means that for attributes and relationships you will respect "
    "the domain and range constraints.\n"
    " You will never use terms not defined in the ontology.\n"
    "Return the output as Cypher using merge to allow for linkage of nodes from multiple passes.\n"
    "Absolutely no comments on the output. Just the structured output. "
    + '\n\nONTOLOGY: \n ' + ontology + '\n\nCONTENT: \n ' + content
)
chat_completion = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': system},
        {'role': 'user', 'content': prompt},
    ],
    model="gpt-4o",
)
cypher_script = chat_completion.choices[0].message.content[3:-3]

# STEP 4: Write to Neo4j
conn = Neo4jConnection("bolt://localhost:7687", "neo4j", "neoneoneo")
result = conn.run_cypher(cypher_script)
conn.close()

Design Choices and Trade-offs

Turtle vs Natural Language

Passing the raw Turtle serialization works for LLMs well-versed in OWL syntax. The getNLOntology() helper is more reliable for general-purpose models and smaller ontologies.

MERGE for deduplication

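Using MERGE instead of CREATE means that re-executing the same Cypher script does not produce duplicate nodes. Re-running the whole pipeline is only as repeatable as the LLM output: if a later pass names or keys an entity differently, MERGE creates a second node.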

Domain and range enforcement

The prompt explicitly instructs the LLM to respect domain and range constraints from the ontology. SHACL validation (Session 28) can catch any violations that slip through.

Extensible to any domain

Swap art.ttl for any OWL ontology and content/ for any text corpus — the four-step pipeline requires no other changes.
See Session 30 for an evolution of this pattern that adds PDF ingestion and a parametric Cypher import template, making the pipeline more repeatable across many documents.
