Session 26 of Going Meta, broadcast on March 5, 2024, takes a critical look at the data.world benchmark that claimed knowledge graphs improve LLM question-answering over relational data. Jesus Barrasa unpicks the methodology, reproduces the benchmark using Neo4j, and explores the role of R2RML mappings, OWL ontologies, and semantic layers in enabling the LLM to generate correct Cypher — while also highlighting what the benchmark does and doesn’t actually measure.

What You’ll Learn

  • How R2RML mappings translate relational schemas into graph structures for LLM consumption
  • How OWL ontologies provide a semantic layer that helps LLMs understand graph schema
  • How n10s.experimental.export.dimodel.fetch generates a data integration config from an OWL ontology
  • How Neo4j’s self-describing schema enables dynamic context injection for LLM Cypher generation
  • What the data.world benchmark actually measures — and its limitations

The Benchmark Dataset

The session uses the ACME Insurance dataset from the data.world benchmark repository, which includes:
  • Source CSV files with insurance policies, policyholders, and agents
  • An OWL ontology defining the domain model (insurance.ttl)
  • R2RML mappings from relational tables to RDF triples

Source CSVs

Available at https://github.com/datadotworld/cwd-benchmark-data/tree/main/ACME_Insurance/data — insurance policies, agents, and policyholders.

Domain Ontology

The OWL ontology at ACME_Insurance/ontology/insurance.ttl defines Policy, PolicyHolder, Agent, and their relationships.

Step-by-Step Walkthrough

1. Generate a data integration model from the OWL ontology

Use n10s.experimental.export.dimodel.fetch to generate a Neo4j Workspace-compatible import configuration from a subset of the OWL classes:
CALL n10s.experimental.export.dimodel.fetch(
  "https://raw.githubusercontent.com/datadotworld/cwd-benchmark-data/main/ACME_Insurance/ontology/insurance.ttl",
  "Turtle",
  {
    classList: [
      "http://data.world/schema/insurance/Policy",
      "http://data.world/schema/insurance/PolicyHolder",
      "http://data.world/schema/insurance/Agent"
    ]
  }
);
The procedure saves the config to your local drive for use with the Neo4j import tool.
2. Inspect the model inline

Alternatively, use n10s.experimental.stream.dimodel.fetch to see the data integration model as a query result directly in the Neo4j browser:
CALL n10s.experimental.stream.dimodel.fetch(
  "https://raw.githubusercontent.com/datadotworld/cwd-benchmark-data/main/ACME_Insurance/ontology/insurance.ttl",
  "Turtle",
  {
    classList: [
      "http://data.world/schema/insurance/Policy",
      "http://data.world/schema/insurance/PolicyHolder",
      "http://data.world/schema/insurance/Agent"
    ]
  }
)
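
The same call can also be prepared from Python. The sketch below only builds a parameterised Cypher statement and its parameter map for either procedure; it does not contact a database. The helper name `build_dimodel_call` and the `INSURANCE_NS` constant are illustrative assumptions (the namespace is taken from the class IRIs above), and running the statement would require a Neo4j instance with neosemantics (n10s) installed.

```python
# Hypothetical helper: builds the statement and parameters for the
# n10s dimodel procedures shown above. Nothing here contacts a database.

ONTOLOGY_URL = (
    "https://raw.githubusercontent.com/datadotworld/cwd-benchmark-data/"
    "main/ACME_Insurance/ontology/insurance.ttl"
)
# Namespace taken from the class IRIs used in the walkthrough.
INSURANCE_NS = "http://data.world/schema/insurance/"


def build_dimodel_call(class_names, mode="stream"):
    """Return (cypher, params) for n10s.experimental.<mode>.dimodel.fetch."""
    if mode not in ("stream", "export"):
        raise ValueError("mode must be 'stream' or 'export'")
    cypher = (
        f"CALL n10s.experimental.{mode}.dimodel.fetch("
        "$url, $format, {classList: $classList})"
    )
    params = {
        "url": ONTOLOGY_URL,
        "format": "Turtle",
        # Expand short class names into full IRIs.
        "classList": [INSURANCE_NS + name for name in class_names],
    }
    return cypher, params


cypher, params = build_dimodel_call(["Policy", "PolicyHolder", "Agent"])
print(cypher)
print(params["classList"])
```

With the official `neo4j` Python driver, the returned pair could be passed to `session.run(cypher, params)`; that step is left out here so the sketch runs without a database.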
3. Import CSV data using the generated config

Load the generated config into the Neo4j import tool (Neo4j Workspace), map the CSV columns to the ontology-derived node and relationship types, and run the import job to populate the graph.
4. Query the populated graph

With the data loaded according to the ontology-constrained schema, run Cypher queries to verify the data and explore the model. For example, aggregate policies per agent:
MATCH (p:Policy)-[:soldByAgent]->(a:Agent)
RETURN a.agentId AS AgentID, COUNT(p) AS PoliciesSold
5. Expose the schema to the LLM for Cypher generation

The benchmark’s central claim is that having a semantic layer helps the LLM generate correct queries. Test this by passing the schema dynamically at query time:
// All node types and their properties
CALL db.schema.nodeTypeProperties()

// All relationship types and their properties
CALL apoc.meta.relTypeProperties()
These outputs are included in the LLM prompt alongside the natural language question from the benchmark, so the LLM can generate Cypher that matches the actual graph structure.
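
As a rough illustration of that prompt assembly, the Python sketch below renders rows shaped like the output of the two schema procedures into a compact text block and places it ahead of the question. The sample rows, the `policyNumber` property, the function names, and the prompt wording are all assumptions made for the example; the prompt used in the session may look different.

```python
# Sample rows shaped like the output of db.schema.nodeTypeProperties()
# and apoc.meta.relTypeProperties(). The values are made up for illustration.
node_rows = [
    {"nodeLabels": ["Policy"], "propertyName": "policyNumber", "propertyTypes": ["String"]},
    {"nodeLabels": ["Agent"], "propertyName": "agentId", "propertyTypes": ["String"]},
]
rel_rows = [
    {"relType": ":`soldByAgent`", "sourceNodeLabels": ["Policy"],
     "targetNodeLabels": ["Agent"], "propertyName": None, "propertyTypes": None},
]


def describe_schema(node_rows, rel_rows):
    """Render schema rows as compact text lines for an LLM prompt."""
    lines = ["Node properties:"]
    for row in node_rows:
        if row["propertyName"] is None:
            continue  # label without properties
        labels = ":".join(row["nodeLabels"])
        types = "|".join(row["propertyTypes"] or [])
        lines.append(f"  (:{labels}).{row['propertyName']}: {types}")
    lines.append("Relationships:")
    seen = set()
    for row in rel_rows:
        rel = row["relType"].strip(":`")  # ":`soldByAgent`" -> "soldByAgent"
        src = ":".join(row["sourceNodeLabels"])
        tgt = ":".join(row["targetNodeLabels"])
        pattern = f"  (:{src})-[:{rel}]->(:{tgt})"
        if pattern not in seen:  # one line per pattern, not per property
            seen.add(pattern)
            lines.append(pattern)
    return "\n".join(lines)


def build_prompt(question, node_rows, rel_rows):
    """Combine the schema description with a natural language question."""
    return (
        "Generate a Cypher query for the question below. "
        "Use only the labels, relationship types and properties in this schema.\n\n"
        f"{describe_schema(node_rows, rel_rows)}\n\n"
        f"Question: {question}\nCypher:"
    )


print(build_prompt("How many policies has each agent sold?", node_rows, rel_rows))
```

Because the schema text is produced from the live database at query time, the prompt follows any change to the graph model without manual edits.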

What the Benchmark Measures (and Doesn’t)

What It Measures

Whether providing a structured, ontology-derived schema description improves the accuracy of LLM-generated queries compared to providing a raw relational schema — a legitimate and useful experiment.

What It Doesn't Measure

The benchmark does not isolate the contribution of the graph structure itself from the quality of the schema description. A well-described relational schema might perform equally well.

R2RML's Role

R2RML mappings describe how the rows and columns of relational tables are translated into RDF triples, which are then loaded into Neo4j. The semantic enrichment comes from the OWL ontology, not from the R2RML transformation alone.
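
To make the mapping idea concrete, here is a toy Python sketch of what an R2RML triples map does conceptually: a subject template and a class are applied to each row, and mapped columns become predicate-object pairs. It is not an R2RML processor, and the mapping dictionary layout, the column names, and the `example.org` IRIs are invented for illustration rather than taken from the benchmark's mapping files.

```python
# Toy illustration of the R2RML idea; not an R2RML engine.
# The mapping layout and the column names below are hypothetical.
mapping = {
    "subject_template": "http://example.org/policy/{policy_id}",
    "class": "http://data.world/schema/insurance/Policy",
    "columns": {  # column name -> predicate IRI
        "policy_number": "http://example.org/schema/policyNumber",
    },
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_triples(rows, mapping):
    """Yield (subject, predicate, object) tuples for each relational row."""
    for row in rows:
        subject = mapping["subject_template"].format(**row)
        yield (subject, RDF_TYPE, mapping["class"])
        for column, predicate in mapping["columns"].items():
            if row.get(column) is not None:  # NULL values produce no triple
                yield (subject, predicate, row[column])


rows = [{"policy_id": 1, "policy_number": "P-001"}]
for triple in rows_to_triples(rows, mapping):
    print(triple)
```

The mapping itself carries no domain semantics beyond the IRIs it points to, which is why the ontology behind those IRIs matters.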

Semantic Layer Value

The session argues that the real value is in the semantic layer — meaningful, consistent naming and ontological relationships — regardless of whether the underlying store is a graph or a relational database.
The Python notebook accompanying this session (session26/) demonstrates the full benchmark reproduction pipeline, including prompting the LLM with the benchmark questions and evaluating the generated Cypher against the expected results.
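
One simple way to evaluate a generated query is to compare its result rows with the expected rows while ignoring row order and column aliases. The Python sketch below illustrates that idea only; the function names and sample rows are hypothetical, and the scoring used in the session's notebook or in the original benchmark may differ.

```python
from collections import Counter


def normalise(rows):
    """Turn result rows (dicts) into a multiset of value tuples.

    Column names are ignored, since an LLM may pick different aliases;
    values within a row are sorted by their string form so column order
    does not matter either.
    """
    return Counter(tuple(sorted(map(str, row.values()))) for row in rows)


def results_match(generated_rows, expected_rows):
    """True when both result sets hold the same rows, in any order."""
    return normalise(generated_rows) == normalise(expected_rows)


def accuracy(outcomes):
    """Fraction of questions whose generated query matched."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


expected = [{"AgentID": "A1", "PoliciesSold": 3}, {"AgentID": "A2", "PoliciesSold": 1}]
generated = [{"agent": "A2", "n": 1}, {"agent": "A1", "n": 3}]
print(results_match(generated, expected))
print(accuracy([True, False, True, True]))
```

Comparing results rather than query text avoids penalising a generated query that is written differently but returns the same answer.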

Resources

Watch the Recording

Full session recording on YouTube — March 5, 2024.

Session Code

Cypher scripts and Python notebook on GitHub.