Session 25 of Going Meta, broadcast on February 5, 2024, tackles automated knowledge graph construction: given a flat CSV dataset, can an LLM design an entity-relationship model and generate the Cypher import scripts to populate a Neo4j graph, with no manual schema design? Jesus Barrasa walks through a Python notebook that does exactly this using OpenAI’s gpt-4 Completions API, Kaggle’s Croissant metadata format, and a code-generation pipeline that produces executable Cypher from the LLM’s JSON output.
What You’ll Learn
- How to use Kaggle’s Croissant JSON-LD metadata to describe datasets for LLM prompting
- How to prompt gpt-4 to extract an entity-relationship model from a dataset description
- How to generate Cypher MERGE and CREATE CONSTRAINT statements programmatically from the LLM output
- How to load the generated graph into Neo4j in batched mode using the Python driver
- How to visualise the generated data model with Graphviz
Datasets Used
NY Housing Dataset
A Kaggle dataset of New York housing listings. The LLM extracts entities like Property, Listing, and Neighborhood and maps CSV columns to graph properties.
Supply Chain Dataset
The DataCo Smart Supply Chain dataset. The LLM extracts entities like Order, Customer, Product, and Shipment from over 50 CSV features.
Step-by-Step Walkthrough
Load and explore Croissant metadata
Kaggle provides machine-readable dataset metadata in Croissant (JSON-LD) format. Load the metadata for the chosen dataset.
The Croissant file is JSON-LD and can be parsed into Neo4j with rdflib-neo4j.
Extract feature metadata from the graph
With the Croissant metadata in Neo4j, query the graph to get a structured description of the dataset features to use in the LLM prompt.
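As a rough illustration of what such a structured description might look like, the sketch below reads the field definitions straight from the Croissant JSON instead of querying Neo4j, which is a simplification of the approach used in the session. The inline sample metadata, the function names extract_features and describe_features, and the output layout are all assumptions made for this example. Only the recordSet and field keys come from the Croissant format itself.

```python
import json

# A tiny inline stand-in for a Croissant file; the values are made up for illustration.
CROISSANT = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example-housing",
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "listings.csv",
      "field": [
        {"@type": "cr:Field", "name": "PRICE", "dataType": "sc:Integer",
         "description": "Asking price in USD"},
        {"@type": "cr:Field", "name": "ADDRESS", "dataType": "sc:Text"}
      ]
    }
  ]
}
""")


def extract_features(croissant: dict) -> list:
    """Collect one record per field declared in the Croissant record sets."""
    features = []
    for record_set in croissant.get("recordSet", []):
        for field in record_set.get("field", []):
            features.append({
                "file": record_set.get("name", ""),
                "name": field["name"],
                "type": field.get("dataType", "unknown"),
                "description": field.get("description", ""),
            })
    return features


def describe_features(features: list) -> str:
    """Render the features as a bullet list suitable for pasting into a prompt."""
    lines = []
    for feature in features:
        line = f"- {feature['name']} ({feature['type']})"
        if feature["description"]:
            line += f": {feature['description']}"
        lines.append(line)
    return "\n".join(lines)


features = extract_features(CROISSANT)
print(describe_features(features))
```

The resulting text lists each CSV column with its declared type and description, which gives the LLM the column names it needs when mapping attributes to features.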
Prompt the LLM for an ER model
Build a structured prompt asking gpt-4 to extract entities, relationships, and attribute-to-feature mappings from the dataset description. Then call the Completions API.
Generate Cypher import scripts from the LLM output
The generateImportScript function transforms the LLM’s JSON entity-relationship model into executable Cypher.
Fingerprint-Based Entity Deduplication
A key design choice in the import script is using apoc.hashing.fingerprint to derive stable _id values for entities from their attribute values, ensuring that rows in the CSV that refer to the same real-world entity are merged rather than duplicated.
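To make the idea concrete, here is a minimal Python sketch of a generator that emits constraint and MERGE statements keyed on a fingerprint of each entity's attribute map. It is not the notebook's generateImportScript: the JSON model shape, the function names, and the column names are hypothetical, and the sketch assumes that apoc.hashing.fingerprint accepts a map of attribute values.

```python
import json

# Hypothetical shape for the LLM's JSON output; the session's actual schema may differ.
MODEL_JSON = """
{
  "entities": [
    {"label": "Customer",
     "attributes": {"name": "Customer Fname", "city": "Customer City"}},
    {"label": "Product",
     "attributes": {"name": "Product Name"}}
  ],
  "relationships": [
    {"type": "ORDERED", "from": "Customer", "to": "Product"}
  ]
}
"""


def props_map(attributes: dict) -> str:
    """Render a Cypher map literal that maps graph properties to CSV columns."""
    pairs = [f"`{prop}`: row.`{col}`" for prop, col in attributes.items()]
    return "{" + ", ".join(pairs) + "}"


def generate_import_script(model: dict) -> list:
    """Return constraint and MERGE statements for every entity and relationship."""
    statements = []
    by_label = {entity["label"]: entity for entity in model["entities"]}
    for label, entity in by_label.items():
        # A uniqueness constraint on _id keeps MERGE lookups fast and safe.
        statements.append(
            f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:`{label}`) REQUIRE n._id IS UNIQUE"
        )
        # Rows with identical attribute values hash to the same _id, so they merge.
        statements.append(
            "UNWIND $rows AS row\n"
            f"WITH {props_map(entity['attributes'])} AS props\n"
            f"MERGE (n:`{label}` {{_id: apoc.hashing.fingerprint(props)}})\n"
            "SET n += props"
        )
    for rel in model["relationships"]:
        # Recompute both endpoint fingerprints from the same row to link the nodes.
        src = props_map(by_label[rel["from"]]["attributes"])
        dst = props_map(by_label[rel["to"]]["attributes"])
        statements.append(
            "UNWIND $rows AS row\n"
            f"MATCH (a:`{rel['from']}` {{_id: apoc.hashing.fingerprint({src})}})\n"
            f"MATCH (b:`{rel['to']}` {{_id: apoc.hashing.fingerprint({dst})}})\n"
            f"MERGE (a)-[:`{rel['type']}`]->(b)"
        )
    return statements


script = generate_import_script(json.loads(MODEL_JSON))
print(";\n\n".join(script))
```

Each statement that references $rows expects a list of row maps, so CSV rows could be sent in chunks through the Neo4j Python driver, for example with session.run(statement, rows=chunk).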
Using apoc.hashing.fingerprint on all attribute values for an entity type is a pragmatic deduplication strategy when there is no natural primary key in the source CSV. For production use, consider using domain-specific identifiers if available.
Resources
Watch the Recording
Full session recording on YouTube — February 5, 2024.
Session Code
Python notebook and dataset resources on GitHub.