Blueprints for KG Construction from Unstructured Data

Session 30 (Season 2, Episode 3 — November 2024) builds on the four-step pipeline introduced in Session 29 and pushes it toward production-readiness. The key additions are reading source material from a PDF (using pypdf) and a parametric Cypher import template that decouples the LLM’s extraction output from the database write logic. Together, these two elements constitute a more rigorous blueprint — a template for KG construction that can be applied consistently across many documents in the same domain.

Watch the Recording

Full live-stream replay on YouTube

Session Code

Python scripts and Cypher import statements

What a Blueprint Adds

In Session 29, the LLM was asked to produce free-form Cypher and the result was executed directly. The blueprint approach introduces two additional layers of structure:

PDF ingestion — the input is a multi-page PDF document rather than a plain-text file. pypdf’s PdfReader iterates over all pages and concatenates the text into a single string.
Parametric Cypher import — instead of generating bespoke Cypher per document, a single parameterised Cypher template receives the structured JSON and writes to Neo4j. The LLM produces JSON; the Cypher is fixed and version-controlled.

The Domain: Contract Analysis

The demo in this session applies the blueprint to a publicly available PDF of the Simplicity Esports & Gaming Company contract. The domain ontology is contract.ttl — an OWL ontology covering Agreement, Organization, Country, ContractClause, ClauseType, and Excerpt.

`extract_cypher.py` — Step by Step

extract_cypher.py follows the same four-step structure as Session 29, extended for PDF input and a named database target:

from pypdf import PdfReader
from openai import OpenAI
from utils import getNLOntology
from rdflib import Graph
import os
from neo4jconnector import Neo4jConnection

# STEP 1: Extract text from PDF
filename = "data/SimplicityEsportsGamingCompany.pdf"
reader = PdfReader(filename)
text = ""
for page in reader.pages:
    text += page.extract_text()

# STEP 2: Parse the OWL ontology
g = Graph()
g.parse("ontos/contract.ttl")

# OPTION 1: Ontology in standard Turtle serialisation (used by default)
ontology = g.serialize(format="ttl")

# OPTION 2: Natural language description of the ontology
# ontology = getNLOntology(g)

# STEP 3: Prompt GPT-4o for Cypher extraction
client = OpenAI(api_key=os.environ.get("MY_OPENAI_KEY"))

system = (
    "You are an expert in extracting structured information out of natural language text. "
    "You extract entities with their attributes and relationships between entities. "
    "You can produce the output as RDF triples or as Cypher write statements on request."
)

prompt = (
    "Given the ontology below run your best entity extraction over the content.\n"
    " The extracted entities and relationships must be described using exclusively the terms in the ontology "
    "and in the way they are defined. This means that for attributes and relationships you will respect "
    "the domain and range constraints.\n"
    " You will never use terms not defined in the ontology.\n"
    "Return the output as Cypher using merge to allow for linkage of nodes from multiple passes.\n"
    "Absolutely no comments on the output. Just the structured output. "
    + '\n\nONTOLOGY: \n ' + ontology + '\n\nCONTENT: \n ' + text
)

chat_completion = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': system},
        {'role': 'user', 'content': prompt},
    ],
    model="gpt-4o",
)

cypher_script = chat_completion.choices[0].message.content[3:-3]

# STEP 4: Write to Neo4j (named database)
uri = "bolt://localhost:7687"
user = "neo4j"
password = "neoneoneo"
dbname = "gm3"
conn = Neo4jConnection(uri, user, password, dbname)
result = conn.run_cypher(cypher_script, {})
conn.close()

This session’s extract_cypher.py extends the Session 29 pipeline by targeting a named database (gm3) and reading from a PDF rather than a plain-text file. The pypdf library handles multi-page PDF extraction, concatenating all pages into a single text block. The Turtle serialisation is used as the default ontology representation (rather than the natural-language option from Session 29).

The Parametric Cypher Import Template

Rather than generating bespoke Cypher for every document, the blueprint uses a single parameterised import statement stored in import.cypher. The LLM is asked to produce JSON that matches the structure this template expects:

WITH $jsondata AS value

// Create the Agreement node
MERGE (agreement:Agreement {
    agreement_type: value.agreement.agreement_type,
    contract_id: value.agreement.contract_id,
    effective_date: value.agreement.effective_date,
    expiration_date: value.agreement.expiration_date,
    renewal_term: value.agreement.renewal_term,
    name: value.agreement.name
})

// Create the Country and Governed By Law relationship
MERGE (country:Country {name: value.governed_by_law.country.name})
MERGE (agreement)-[:GOVERNED_BY_LAW]->(country)

// Create Party nodes and relationships
WITH value, agreement
UNWIND value.parties AS party
MERGE (org:Organization {name: party.name, role: party.role})
MERGE (country:Country {name: party.incorporated_in.country.name})
MERGE (org)-[:INCORPORATED_IN]->(country)
MERGE (org)-[:IS_PARTY_TO]->(agreement)

// Create Clause nodes and relationships
WITH value, agreement
UNWIND value.clauses AS clause
MERGE (contractClause:ContractClause {name: clause.name})
MERGE (clauseType:ClauseType {name: clause.clause_type})
MERGE (contractClause)-[:HAS_TYPE]->(clauseType)
MERGE (agreement)-[:HAS_CLAUSE]->(contractClause)

// Create Excerpt nodes and relationships
WITH clause, contractClause
UNWIND clause.excerpts AS excerpt
MERGE (excerptNode:Excerpt {text: excerpt.text})
MERGE (contractClause)-[:HAS_EXCERPT]->(excerptNode);

This template is the heart of the blueprint pattern. The LLM’s job is no longer to write Cypher — it writes structured JSON. The Cypher import statement is deterministic and version-controlled.

Blueprint Patterns

Define the domain ontology

Author a Turtle OWL ontology (contract.ttl, art.ttl, etc.) that specifies the entity classes, datatype properties, and object properties for your domain. This ontology is both the LLM prompt ingredient and the source of truth for the graph schema.

Author the parametric import Cypher

Write a single import.cypher template that accepts $jsondata and uses MERGE + UNWIND to handle lists. This Cypher never changes between documents — only the JSON payload changes.

Prompt the LLM for structured JSON

Pass the serialized ontology (Turtle or natural language via getNLOntology()) plus the document text to the LLM. Instruct it to produce JSON conforming to the structure expected by your import template.

Validate and ingest

Optionally validate the returned JSON before passing it to Neo4j. Execute the parametric Cypher with the JSON as the $jsondata parameter.

Why This Is More Robust

Deterministic import logic

The Cypher import template is fixed — there are no LLM-generated database writes, which eliminates a whole class of malformed query errors.

Schema-aligned JSON

Asking the LLM for JSON (not Cypher) separates extraction from persistence. The JSON can be logged, inspected, and re-ingested without re-calling the LLM.

PDF and text support

Replacing open() with PdfReader makes the blueprint applicable to the vast majority of real enterprise documents without changing any other part of the pipeline.

Named database targeting

Passing a dbname argument to Neo4jConnection allows different ontology domains to land in separate Neo4j databases.

The next session, Session 31, extends the blueprint into a full GraphRAG pipeline by adding vector index creation and retrieval-augmented generation on top of the constructed knowledge graph.

Ontology-Guided KG Construction (S2)

Agents & Advanced Patterns (S2)

Season 3: LLMs, Agents & Quality

Blueprints for KG Construction from Unstructured Data

Watch the Recording

Session Code

What a Blueprint Adds

The Domain: Contract Analysis

`extract_cypher.py` — Step by Step

The Parametric Cypher Import Template

Blueprint Patterns

Why This Is More Robust

Deterministic import logic

Schema-aligned JSON

PDF and text support

Named database targeting

Build docs developers (and LLMs) love

Ontology-Guided KG Construction (S2)

Agents & Advanced Patterns (S2)

Season 3: LLMs, Agents & Quality

Documentation Index

Watch the Recording

Session Code

​What a Blueprint Adds

​The Domain: Contract Analysis

​extract_cypher.py — Step by Step

​The Parametric Cypher Import Template

​Blueprint Patterns

​Why This Is More Robust

Deterministic import logic

Schema-aligned JSON

PDF and text support

Named database targeting

Build docs developers (and LLMs) love

What a Blueprint Adds

The Domain: Contract Analysis

`extract_cypher.py` — Step by Step

The Parametric Cypher Import Template

Blueprint Patterns

Why This Is More Robust