Automated Knowledge Graph Construction with LLMs

Session 25 of Going Meta, broadcast on February 5, 2024, tackles automated knowledge graph construction: given a flat CSV dataset, can an LLM design an entity-relationship model and generate the Cypher import scripts to populate a Neo4j graph, with no manual schema design? Jesus Barrasa walks through a Python notebook that does exactly this using gpt-4 through OpenAI’s Chat Completions API, the Croissant metadata that Kaggle publishes for its datasets, and a code-generation step that turns the LLM’s JSON output into executable Cypher.

What You’ll Learn

How to use Kaggle’s Croissant JSON-LD metadata to describe datasets for LLM prompting
How to prompt gpt-4 to extract an entity-relationship model from a dataset description
How to generate Cypher MERGE and CREATE CONSTRAINT statements programmatically from the LLM output
How to load the generated graph into Neo4j in batched mode using the Python driver
How to visualise the generated data model with Graphviz

Datasets Used

NY Housing Dataset

A Kaggle dataset of New York housing listings. The LLM extracts entities like Property, Listing, and Neighborhood and maps CSV columns to graph properties.

Supply Chain Dataset

The DataCo Smart Supply Chain dataset. The LLM extracts entities like Order, Customer, Product, and Shipment from over 50 CSV features.

Step-by-Step Walkthrough

Load and explore Croissant metadata

Kaggle provides machine-readable dataset metadata in Croissant (JSON-LD) format. Load the metadata for the chosen dataset:

# Supply Chain dataset
croassant_file_path = 'https://raw.githubusercontent.com/jbarrasa/goingmeta/main/session25/resources/croissants/metadata-supplychain.json'
data_file_path = 'https://raw.githubusercontent.com/jbarrasa/goingmeta/main/session25/resources/csvs/SCMS_Delivery_History_Dataset_20150929.csv'

The Croissant file is JSON-LD and can be parsed into Neo4j with rdflib-neo4j:

from rdflib_neo4j import Neo4jStoreConfig, Neo4jStore, HANDLE_VOCAB_URI_STRATEGY
from rdflib import Graph

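# auth_data is assumed to be a dict of connection details with the keys
# 'uri', 'database', 'user' and 'pwd'.
config = Neo4jStoreConfig(
    auth_data=auth_data,
    handle_vocab_uri_strategy=HANDLE_VOCAB_URI_STRATEGY.IGNORE,  # keep local names, drop namespaces
    batching=True
)
graph_store = Graph(store=Neo4jStore(config=config))

# Parse the JSON-LD into an in-memory graph first, then replay it as Turtle
# into the Neo4j-backed store.
temp = Graph()
temp.parse(croassant_file_path, format="json-ld")
graph_store.parse(data=temp.serialize(format="ttl"), format="ttl")
graph_store.close(True)  # flush pending batches and commit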

Extract feature metadata from the graph

With the Croissant metadata in Neo4j, query the graph to get a structured description of the dataset features to use in the LLM prompt:

features_query = """
MATCH (d:Dataset)-[:recordSet]->(rs:RecordSet)-[:field]->(f:Field)-[:dataType]->(dt),
      (f)-[:source]->()-[:extract]->(e)
RETURN d.name AS datasetname, d.description AS datasetdescription,
       collect({featurename: e.column, datatype: dt.uri}) AS datasetfeatures
"""
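# `driver` is assumed to be a neo4j.Driver connected to the database that holds the Croissant graph.
records, summary, keys = driver.execute_query(features_query)
result = records[0]  # the query returns one aggregated row per dataset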
ds_name = result['datasetname']
ds_description = result['datasetdescription']
ds_features = result['datasetfeatures']
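
To make the traversal in the query above easier to follow, here is a minimal sketch of the same path (dataset, record set, field, source, extract) walked directly over a Croissant document in plain Python. The `croissant` dict is a hypothetical, heavily trimmed example and `extract_features` is an illustrative helper, not part of the session notebook. Note that the graph query returns full datatype URIs, whereas this sketch keeps the compact form found in the JSON.

```python
# Hypothetical, heavily trimmed Croissant document; real Kaggle files carry
# an @context block and many more properties.
croissant = {
    "@type": "sc:Dataset",
    "name": "toy-housing",
    "description": "Toy example with two columns.",
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "field": [
                {
                    "@type": "cr:Field",
                    "name": "price",
                    "dataType": "sc:Integer",
                    "source": {"extract": {"column": "PRICE"}},
                },
                {
                    "@type": "cr:Field",
                    "name": "address",
                    "dataType": "sc:Text",
                    "source": {"extract": {"column": "ADDRESS"}},
                },
            ],
        }
    ],
}


def extract_features(doc):
    """Collect column/datatype pairs by following recordSet -> field -> source -> extract."""
    features = []
    for record_set in doc.get("recordSet", []):
        for field in record_set.get("field", []):
            column = field.get("source", {}).get("extract", {}).get("column")
            if column is not None:
                features.append(
                    {"featurename": column, "datatype": field.get("dataType")}
                )
    return features


ds_features_sketch = extract_features(croissant)
print(ds_features_sketch)
```

The resulting list has the same shape as `datasetfeatures` in the Cypher query, so it could be dropped into the prompt in the same way.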

Prompt the LLM for an ER model

Build a structured prompt asking gpt-4 to extract entities, relationships, and attribute-to-feature mappings from the dataset description:

system = "You are a data modelling expert capable of creating high quality entity-relationship models from flat datasets"

prompt = f"""
From the list of features in the following dataset create a list of entities and relationships with their
attributes in a simple json format and map them to the features in the dataset.

The attributes don't need to be named after the features in the dataset, but they should be mapped to the corresponding feature name.
No extra text or comments, only the json as output.

DATASET NAME: {ds_name}

DATASET DESCRIPTION: {ds_description}

DATASET FEATURES: {ds_features}
"""

Call the Chat Completions API:

from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]
)
genschema = completion.choices[0].message.content
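
The reply is plain text, so it has to be parsed before it can drive code generation. The sketch below is one hedged way to do that: it tolerates a Markdown code fence around the JSON (models sometimes add one despite the instruction) and checks for the two top-level keys that the import script reads, `entities` and `relationships`. The helper name `parse_model_json` is hypothetical and not part of the session notebook.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_json(text):
    """Parse the model's reply into a dict, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    schema = json.loads(cleaned)
    if not isinstance(schema, dict):
        raise ValueError("expected a JSON object at the top level")
    for key in ("entities", "relationships"):
        if not isinstance(schema.get(key), list):
            raise ValueError(f"expected a list under '{key}'")
    return schema


reply = FENCE + 'json\n{"entities": [], "relationships": []}\n' + FENCE
print(parse_model_json(reply))
```

Failing early with a `ValueError` keeps a malformed reply from surfacing later as a confusing `KeyError` inside the script generator.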

Generate Cypher import scripts from the LLM output

The generateImportScript function transforms the LLM’s JSON entity-relationship model into executable Cypher:

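# e_name, a_name, map_att, r_name, r_from and r_to are assumed to be string
# variables defined elsewhere in the notebook, holding the JSON key names
# used in the LLM output (they are not shown in this excerpt).
def generateImportScript(sch):
    constraints = {}
    cypher_import = {}

    # One uniqueness constraint and one MERGE statement per entity
    for e in sch["entities"]:
        constraints[e[e_name]] = (
            "CREATE CONSTRAINT IF NOT EXISTS FOR (n:`" + e[e_name] +
            "`) REQUIRE n._id IS UNIQUE; "
        )
        cypher = [
            "unwind $records AS record",
            "merge (n:`" + e[e_name] + "` { _id: " + getIdValExprFor(sch, e[e_name]) + "} )"
        ]
        # Copy each mapped CSV column onto the node
        for p in e["attributes"]:
            cypher.append("set n.`" + p[a_name] + "` = record.`" + p[map_att] + "`")
        cypher.append("return count(*) as total ;")
        cypher_import[e[e_name]] = ' \n'.join(cypher)

    # One statement per relationship: look up both endpoints by _id, then link them.
    # Labels are backtick-quoted here as well, so names with spaces still parse.
    # Note: relationships that share a name overwrite each other in cypher_import.
    for r in sch["relationships"]:
        cypher = [
            "unwind $records AS record",
            "match (source:`" + r[r_from] + "` { _id : " + getIdValExprFor(sch, r[r_from]) + "} )",
            "match (target:`" + r[r_to] + "` { _id : " + getIdValExprFor(sch, r[r_to]) + "} )",
            "merge (source)-[r:`" + r[r_name] + "`]->(target)",
            "return count(*) as total"
        ]
        cypher_import[r[r_name]] = ' \n'.join(cypher)

    return cypher_import, constraints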

import json

# json.loads parses the reply as data; unlike eval, it never executes it
cypher_import, constraints = generateImportScript(json.loads(genschema))
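
The same JSON model can also feed the Graphviz visualisation listed under What You’ll Learn. The following is a minimal sketch, not the notebook's code: it emits DOT text with plain string formatting, and the key names `"name"`, `"from"` and `"to"` are assumptions that need adjusting to whatever keys the LLM actually produced.

```python
# Hypothetical key names; adjust them to the keys present in the LLM output.
E_NAME, R_NAME, R_FROM, R_TO = "name", "name", "from", "to"


def schema_to_dot(sch):
    """Render entities as boxes and relationships as labelled edges in DOT syntax."""
    lines = ["digraph model {", "  node [shape=box];"]
    for e in sch["entities"]:
        lines.append(f'  "{e[E_NAME]}";')
    for r in sch["relationships"]:
        lines.append(f'  "{r[R_FROM]}" -> "{r[R_TO]}" [label="{r[R_NAME]}"];')
    lines.append("}")
    return "\n".join(lines)


# Toy model in the assumed shape
toy_schema = {
    "entities": [
        {"name": "Customer", "attributes": []},
        {"name": "Order", "attributes": []},
    ],
    "relationships": [{"name": "PLACED", "from": "Customer", "to": "Order"}],
}
dot_source = schema_to_dot(toy_schema)
print(dot_source)
```

The resulting text can be rendered with the `dot` command-line tool or, in a notebook, wrapped in `graphviz.Source(dot_source)` from the `graphviz` Python package.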

Load data into Neo4j in batches

Execute the generated constraints and import queries against the CSV source file using the Python Neo4j driver in batch mode:

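import pandas as pd
from neo4j import GraphDatabase

# url, username and password are the connection details for the target database;
# insert_data is assumed to be a batching helper defined earlier in the notebook.
df = pd.read_csv(data_file_path, encoding="ISO-8859-1")  # read the CSV once

with GraphDatabase.driver(url, auth=(username, password)) as driver:
    # Create the constraints first, so every MERGE on _id is backed by an index
    for constraint in constraints.values():
        driver.execute_query(constraint)

    # Entities were added to cypher_import before relationships, so nodes load first
    with driver.session(database="neo4j") as session:
        for name, query in cypher_import.items():
            print("importing " + name)
            result = insert_data(session, query, df, batch_size=1000)
            print(result)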
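
The `insert_data` helper is not shown in this walkthrough. As a rough idea of what batched loading involves, the sketch below slices the rows and runs the generated statement once per slice, passing the slice as the `$records` parameter. The function name `insert_in_batches` is hypothetical, and it takes a list of dicts (for example `df.to_dict("records")`) rather than a DataFrame. `FakeSession` is a stand-in so the example runs without a database; with a real `neo4j.Session`, `session.run(query, records=batch).data()` is the call doing the work.

```python
def insert_in_batches(session, query, rows, batch_size=1000):
    """Run `query` once per slice of `rows`, passing the slice as $records."""
    total = 0
    batches = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Each generated statement ends with `return count(*) as total`,
        # so the result holds a single row with a `total` column.
        result = session.run(query, records=batch).data()
        total += result[0]["total"]
        batches += 1
    return {"total": total, "batches": batches}


class _FakeResult:
    def __init__(self, n):
        self._n = n

    def data(self):
        return [{"total": self._n}]


class FakeSession:
    """Stand-in for neo4j.Session so the sketch runs without a database."""

    def run(self, query, **params):
        return _FakeResult(len(params["records"]))


rows = [{"id": i} for i in range(2500)]
summary = insert_in_batches(
    FakeSession(),
    "unwind $records as record return count(*) as total",
    rows,
)
print(summary)  # {'total': 2500, 'batches': 3}
```

Sending rows in slices keeps each transaction small, which is the usual reason for batching large CSV imports.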

Fingerprint-Based Entity Deduplication

A key design choice in the import script is using apoc.hashing.fingerprint to derive stable _id values for entities from their attribute values, so that CSV rows referring to the same real-world entity are merged rather than duplicated:

def getIdValExprFor(sch, ename):
    result = []
    for x in sch["entities"]:
        if x[e_name] == ename:
            for y in x["attributes"]:
                result.append("toString(record.`" + y[map_att] + "`)")
    return " apoc.hashing.fingerprint( " + " + ".join(result) + " )"

Fingerprinting all attribute values of an entity type with apoc.hashing.fingerprint is a pragmatic deduplication strategy when the source CSV has no natural primary key. It requires the APOC library to be installed on the Neo4j server. For production use, prefer domain-specific identifiers where they are available.
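
One detail worth checking before relying on this scheme: getIdValExprFor joins the stringified values with `+` and no separator, so two different rows can concatenate to the same string and therefore receive the same _id. The sketch below illustrates the effect in plain Python. It uses `hashlib.md5` over the joined string as a stand-in and does not claim to reproduce APOC's exact algorithm; the `fingerprint` helper and the sample values are hypothetical.

```python
import hashlib


def fingerprint(values, sep=""):
    """MD5 over the stringified values joined by `sep` (a stand-in, not APOC's algorithm)."""
    joined = sep.join(str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Without a separator, different rows can produce the same concatenated string.
a = fingerprint(["12", "3 Main St"])
b = fingerprint(["1", "23 Main St"])
print(a == b)  # True: both concatenate to "123 Main St"

# A separator that does not occur in the data keeps the rows apart.
c = fingerprint(["12", "3 Main St"], sep="\x1f")
d = fingerprint(["1", "23 Main St"], sep="\x1f")
print(c == d)  # False
```

In the generated Cypher, the equivalent change would be to join the `toString(...)` terms with a literal separator string rather than a bare `+`.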
