Session 30 (Season 2, Episode 3, November 2024) builds on the four-step pipeline introduced in Session 29 and pushes it toward production-readiness. The key additions are reading source material from a PDF (using pypdf) and a parametric Cypher import template that decouples the LLM's extraction output from the database write logic. Together, these two elements constitute a more rigorous blueprint: a template for KG construction that can be applied consistently across many documents in the same domain.
Watch the Recording
Full live-stream replay on YouTube
Session Code
Python scripts and Cypher import statements
What a Blueprint Adds
In Session 29, the LLM was asked to produce free-form Cypher and the result was executed directly. The blueprint approach introduces two additional layers of structure:
- PDF ingestion: the input is a multi-page PDF document rather than a plain-text file. pypdf's PdfReader iterates over all pages and concatenates the text into a single string.
- Parametric Cypher import: instead of generating bespoke Cypher per document, a single parameterised Cypher template receives the structured JSON and writes to Neo4j. The LLM produces JSON; the Cypher is fixed and version-controlled.
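The PDF ingestion step can be sketched roughly as follows. This is a minimal illustration, not the session's actual script: the function names `pages_to_text` and `read_pdf_text` are hypothetical, and joining pages with a newline is an assumption about how the concatenation is done.

```python
from typing import Iterable


def pages_to_text(pages: Iterable) -> str:
    """Concatenate the text of every page into a single string.

    Each page only needs an extract_text() method, which is what
    pypdf page objects provide.
    """
    # Fall back to "" for pages with no extractable text (e.g. image-only pages).
    # Joining with a newline is an assumption made for this sketch.
    return "\n".join((page.extract_text() or "") for page in pages)


def read_pdf_text(path: str) -> str:
    """Read a multi-page PDF from disk and return all of its text."""
    # Imported lazily so pages_to_text can be used without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return pages_to_text(reader.pages)
```

Keeping the concatenation separate from the file handling makes it possible to exercise the logic with stand-in page objects, without needing a PDF on disk.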
The Domain: Contract Analysis
The demo in this session applies the blueprint to a publicly available PDF of the Simplicity Esports & Gaming Company contract. The domain ontology is contract.ttl, an OWL ontology covering Agreement, Organization, Country, ContractClause, ClauseType, and Excerpt.
extract_cypher.py — Step by Step
extract_cypher.py follows the same four-step structure as Session 29, extended for PDF input and a named database target.
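A minimal sketch of how such a pipeline might be wired together, assuming the four steps are: load the ontology, extract the document text, prompt the LLM, and write the result to the named database. Every function name here is hypothetical, the prompt wording is illustrative only, and the LLM and database calls are passed in as plain callables rather than tied to any particular client library.

```python
from typing import Callable


def build_prompt(ontology_ttl: str, document_text: str) -> str:
    """Combine the Turtle ontology and the document text into one prompt."""
    # Illustrative wording only; the session's real prompt may differ.
    return (
        "Extract entities and relationships from the document below.\n"
        "Use only the classes and properties defined in this ontology:\n"
        f"{ontology_ttl}\n\n"
        "Document:\n"
        f"{document_text}\n"
    )


def run_pipeline(
    load_ontology: Callable[[], str],
    load_document: Callable[[], str],
    call_llm: Callable[[str], str],
    write_to_db: Callable[[str, str], None],
    dbname: str = "gm3",
) -> str:
    """Run the four assumed steps in order and return the LLM output."""
    ontology_ttl = load_ontology()        # step 1: Turtle serialisation of the ontology
    document_text = load_document()       # step 2: text extracted from the PDF
    extraction = call_llm(build_prompt(ontology_ttl, document_text))  # step 3
    write_to_db(extraction, dbname)       # step 4: write to the named database
    return extraction
```

Passing the steps in as callables keeps the orchestration independent of the PDF reader, the LLM client, and the Neo4j connection, so each can be swapped or stubbed on its own.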
This session's extract_cypher.py extends the Session 29 pipeline by targeting a named database (gm3) and reading from a PDF rather than a plain-text file. The pypdf library handles multi-page PDF extraction, concatenating all pages into a single text block. The Turtle serialisation is used as the default ontology representation (rather than the natural-language option from Session 29).
The Parametric Cypher Import Template
Rather than generating bespoke Cypher for every document, the blueprint uses a single parameterised import statement stored in import.cypher. The LLM is asked to produce JSON that matches the structure this template expects.
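To make the idea concrete, here is a hedged sketch of what a template and its matching JSON contract could look like. It is not the session's actual import.cypher: the JSON keys (`agreement`, `parties`), the `name` property, and the `IS_PARTY_TO` relationship type are all hypothetical, and only the Agreement and Organization labels and the `$jsondata` parameter come from the description above. The Cypher is held in a Python string so the example stays in one language.

```python
import json

# Hypothetical stand-in for the contents of import.cypher.
IMPORT_CYPHER = """
MERGE (ag:Agreement {name: $jsondata.agreement.name})
WITH ag
UNWIND $jsondata.parties AS p
MERGE (o:Organization {name: p.name})
MERGE (o)-[:IS_PARTY_TO]->(ag)
"""


def parse_llm_json(raw: str) -> dict:
    """Parse the LLM output and check it has the shape the template expects."""
    payload = json.loads(raw)
    missing = [key for key in ("agreement", "parties") if key not in payload]
    if missing:
        raise ValueError(f"payload is missing keys: {missing}")
    if not isinstance(payload["parties"], list):
        # UNWIND iterates over a list, so a single object would be a contract violation.
        raise ValueError("'parties' must be a list")
    return payload
```

Validating the payload before it reaches the database means a malformed LLM response fails early with a readable error, instead of as a Cypher runtime error.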
Blueprint Patterns
Define the domain ontology
Author a Turtle OWL ontology (contract.ttl, art.ttl, etc.) that specifies the entity classes, datatype properties, and object properties for your domain. This ontology is both the LLM prompt ingredient and the source of truth for the graph schema.
Author the parametric import Cypher
Write a single import.cypher template that accepts $jsondata and uses MERGE + UNWIND to handle lists. This Cypher never changes between documents; only the JSON payload changes.
Prompt the LLM for structured JSON
Pass the serialized ontology (Turtle, or natural language via getNLOntology()) plus the document text to the LLM. Instruct it to produce JSON conforming to the structure expected by your import template.
Why This Is More Robust
Deterministic import logic
The Cypher import template is fixed — there are no LLM-generated database writes, which eliminates a whole class of malformed query errors.
Schema-aligned JSON
Asking the LLM for JSON (not Cypher) separates extraction from persistence. The JSON can be logged, inspected, and re-ingested without re-calling the LLM.
PDF and text support
Replacing open() with PdfReader makes the blueprint applicable to the vast majority of real enterprise documents without changing any other part of the pipeline.
Named database targeting
Passing a dbname argument to Neo4jConnection allows different ontology domains to land in separate Neo4j databases.
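The last two points, replaying logged JSON and targeting a named database, can be illustrated together. This sketch uses the official neo4j Python driver's `session(database=...)` directly rather than the session's own Neo4jConnection class, whose interface is not shown here; the helper names `save_payload`, `import_payload`, and `replay` are hypothetical.

```python
import json
from pathlib import Path


def save_payload(payload: dict, path: Path) -> None:
    """Log the LLM's JSON so it can be inspected or replayed later."""
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def import_payload(driver, dbname: str, import_cypher: str, payload: dict) -> None:
    """Run the fixed import template against one named database."""
    # `driver` is expected to behave like a neo4j.Driver, for example one
    # created with neo4j.GraphDatabase.driver(uri, auth=(user, password)).
    with driver.session(database=dbname) as session:
        # The payload is bound to the $jsondata parameter of the template.
        session.run(import_cypher, jsondata=payload).consume()


def replay(path: Path, driver, dbname: str, import_cypher: str) -> dict:
    """Re-ingest a previously logged payload without calling the LLM again."""
    payload = json.loads(path.read_text(encoding="utf-8"))
    import_payload(driver, dbname, import_cypher, payload)
    return payload
```

Because the database name is just an argument, the same logged payload could be replayed into a scratch database for inspection before being written to the real one.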