Reconcile Taxonomies from Multiple Sources in Neo4j

Session 14 of Going Meta (broadcast March 7, 2023) tackles one of the most common real-world knowledge-graph challenges: reconciling disease taxonomies that have been developed independently by different organisations. Using Wikidata, the Medical Subject Headings (MeSH), and the Disease Ontology (DO) as examples, the session shows how to load all three SKOS-style hierarchies into Neo4j with Neosemantics, align their cross-references, and use Cypher pattern matching to discover structural discrepancies and infer missing equivalence links.

Watch Recording

Full session recording on YouTube

Source Code

Cypher scripts for the full reconciliation workflow

Overview


Broadcast	March 7, 2023
Tags	`RDF` `SPARQL` `Cypher`
Taxonomies used	Wikidata · MeSH · Disease Ontology
Key procedures	`n10s.rdf.import.fetch`, `n10s.rdf.stream.fetch`, `n10s.rdf.import.inline`

What You Will Learn

Setting up Neosemantics for RDF import with URI mapping
Constructing SPARQL queries to pull disease hierarchies from Wikidata and MeSH
Loading an OWL ontology file selectively using n10s.rdf.stream.fetch
Converting cross-reference properties into explicit SAME_AS relationships
Detecting structural discrepancies: different granularities, generalisations, and missing links
Generating Wikidata enrichment triples from incomplete “triangles” found in the graph

Setup

// Create constraint (required to import RDF with n10s)
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE (r.uri) IS UNIQUE;

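// Graph config: MAP mode applies custom mappings (see n10s.mapping.add) and
// imports unmapped vocabulary terms under their local names, without namespaces
CALL n10s.graphconfig.init({ handleVocabUris: "MAP" });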
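
Before importing anything, it can help to confirm that both setup statements took effect. The check below is a minimal sketch, assuming a Neo4j version that supports `SHOW CONSTRAINTS ... YIELD` and a Neosemantics release that exposes `n10s.graphconfig.show`:

```cypher
// List the uniqueness constraint created above; it should cover :Resource(uri)
SHOW CONSTRAINTS YIELD name, type, labelsOrTypes, properties
WHERE name = "n10s_unique_uri"
RETURN name, type, labelsOrTypes, properties;

// Show the active Neosemantics configuration; handleVocabUris should read MAP
CALL n10s.graphconfig.show();
```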

Step-by-Step Walkthrough

Import the Wikidata disease taxonomy

A SPARQL CONSTRUCT query assembles the hierarchy of infectious diseases (wd:Q18123741) along with cross-references to MeSH and the Disease Ontology. The result is fetched as N-Triples and loaded with n10s.rdf.import.fetch.

WITH '
PREFIX neo: <neo://voc#>
CONSTRUCT {
  ?dis a neo:WD_Disease ;
     neo:label ?disName ;
     neo:HAS_PARENT ?parentDisease ;
     neo:SAME_AS ?meshUri ;
     neo:SAME_AS ?diseaseOntoUri .
}
WHERE {
  ?dis wdt:P31/wdt:P279* wd:Q18123741 ;
       rdfs:label ?disName . FILTER(lang(?disName) = "en")

  OPTIONAL { ?dis wdt:P279 ?parentDisease .
             ?parentDisease wdt:P31/wdt:P279* wd:Q18123741 }
  OPTIONAL { ?dis wdt:P486 ?meshCode .
             BIND(URI(CONCAT("http://id.nlm.nih.gov/mesh/", ?meshCode)) AS ?meshUri) }
  OPTIONAL { ?dis wdt:P699 ?diseaseOntoId .
             BIND(URI(CONCAT("http://purl.obolibrary.org/obo/",
                      REPLACE(?diseaseOntoId, ":", "_"))) AS ?diseaseOntoUri) }
}
' AS query
CALL n10s.rdf.import.fetch(
  "https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
  "N-Triples",
  { headerParams: { Accept: "text/plain" } }
)
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
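
The two BIND expressions in the CONSTRUCT query build cross-reference URIs from plain identifiers. The same string logic can be tried in Cypher on literal values. `DOID:0050117` is only a sample identifier chosen for illustration:

```cypher
// Mirror the SPARQL CONCAT/REPLACE logic on literal identifiers
WITH "D007239" AS meshCode, "DOID:0050117" AS diseaseOntoId
RETURN
  "http://id.nlm.nih.gov/mesh/" + meshCode AS meshUri,
  "http://purl.obolibrary.org/obo/" + replace(diseaseOntoId, ":", "_") AS diseaseOntoUri;
// meshUri        → "http://id.nlm.nih.gov/mesh/D007239"
// diseaseOntoUri → "http://purl.obolibrary.org/obo/DOID_0050117"
```

Because these URIs match the ones used by the MeSH and Disease Ontology imports, the SAME_AS relationships attach to the same Resource nodes once those taxonomies are loaded.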

Remove shortcut relationships from Wikidata

The Wikidata hierarchy sometimes contains “shortcuts” — direct HAS_PARENT links that skip intermediate nodes. These create noise in path-based analysis and should be removed.

// Identify shortcuts: a child has a direct HAS_PARENT link to a node that it
// also reaches through a path of two or more HAS_PARENT steps
MATCH shortcutPattern =
  (v:WD_Disease)<-[co:HAS_PARENT*2..]-(child)-[shortcut:HAS_PARENT]->(v)
RETURN shortcutPattern LIMIT 2;

// Remove them
MATCH (v:WD_Disease)<-[co:HAS_PARENT*2..]-(child)-[shortcut:HAS_PARENT]->(v)
DELETE shortcut;
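
A quick way to confirm the clean-up worked is to rerun the same pattern as a count. This is a sketch of such a check rather than part of the session's scripts. After the deletion it should report zero, and on a large hierarchy the variable-length match may take a while:

```cypher
// Count direct HAS_PARENT links that are still bypassed by a longer path
MATCH (v:WD_Disease)<-[:HAS_PARENT*2..]-(child)-[shortcut:HAS_PARENT]->(v)
RETURN count(DISTINCT shortcut) AS remainingShortcuts;
```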

Import the MeSH taxonomy

Pull the infectious disease branch (mesh:D007239) from the MeSH SPARQL endpoint using a CONSTRUCT query that maps predicates to the same HAS_PARENT / label vocabulary used for Wikidata.

WITH '
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
PREFIX mesh:  <http://id.nlm.nih.gov/mesh/>
PREFIX neo:   <neo://voc#>

CONSTRUCT {
  ?s a neo:Mesh_Disease ;
       neo:label ?name ;
       neo:HAS_PARENT ?parentDescriptor .
}
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  { ?s meshv:broaderDescriptor* mesh:D007239 }
  ?s rdfs:label ?name .
  OPTIONAL { ?s meshv:broaderDescriptor ?parentDescriptor . }
}
' AS query
CALL n10s.rdf.import.fetch(
  "https://id.nlm.nih.gov/mesh/sparql?format=TURTLE&query=" + apoc.text.urlencode(query),
  "Turtle"
)
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;

// Remove MeSH shortcuts as well
MATCH (v:Mesh_Disease)<-[co:HAS_PARENT*2..]-(child)-[shortcut:HAS_PARENT]->(v)
DELETE shortcut;
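
The OPTIONAL clause in the MeSH query does not restrict parents to the infectious disease branch. A descriptor with a parent outside the branch would therefore get a HAS_PARENT link to a plain Resource node that lacks the Mesh_Disease label. Whether any such nodes exist depends on the MeSH data at import time. The following sketch lists them:

```cypher
// Parents that were referenced but are not part of the imported branch
MATCH (m:Mesh_Disease)-[:HAS_PARENT]->(p)
WHERE NOT p:Mesh_Disease
RETURN p.uri AS outsideParent, count(m) AS children
ORDER BY children DESC
LIMIT 10;
```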

Load the Disease Ontology with selective streaming

The Disease Ontology is available as an OWL/RDF-XML file. Because it is large and contains non-disease content, n10s.rdf.stream.fetch is used to collect only owl:Class subjects first, then filter to the relevant predicates before importing.

// Set up vocabulary mappings so owl:Class → DO_Disease
// and rdfs:subClassOf → HAS_PARENT
CALL n10s.nsprefixes.add("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
CALL n10s.mapping.add("http://www.w3.org/2000/01/rdf-schema#subClassOf", "HAS_PARENT");
CALL n10s.nsprefixes.add("owl", "http://www.w3.org/2002/07/owl#");
CALL n10s.mapping.add("http://www.w3.org/2002/07/owl#Class", "DO_Disease");
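
The registered prefixes and mappings can be listed before running the import to check that nothing was mistyped. This assumes the installed Neosemantics version provides the `n10s.nsprefixes.list` and `n10s.mapping.list` procedures:

```cypher
// List registered namespace prefixes (expect rdfs and owl)
CALL n10s.nsprefixes.list();

// List element mappings (expect subClassOf to HAS_PARENT, Class to DO_Disease)
CALL n10s.mapping.list();
```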

// Stream the OWL file, collect owl:Class URIs, then import selectively
CALL n10s.rdf.stream.fetch(
  "http://purl.obolibrary.org/obo/doid.owl", "RDF/XML", { limit: 999999 }
)
YIELD subject, predicate, object
WHERE predicate = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
  AND object    = "http://www.w3.org/2002/07/owl#Class"
WITH collect(subject) AS class_uris

CALL n10s.rdf.stream.fetch(
  "http://purl.obolibrary.org/obo/doid.owl", "RDF/XML", { limit: 999999 }
)
YIELD subject, predicate, object, isLiteral, literalType, literalLang, subjectSPO
WHERE subject IN class_uris
  AND (
        predicate IN [
          "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
          "http://www.w3.org/2000/01/rdf-schema#label"
        ]
      OR predicate = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
         AND n10s.rdf.isIRI(object)
      OR predicate = "http://www.geneontology.org/formats/oboInOwl#hasDbXref"
         AND object STARTS WITH "MESH:"
      )
WITH n10s.rdf.collect.nt(
       subject, predicate, object, isLiteral, literalType, literalLang, subjectSPO
     ) AS taxonomy
CALL n10s.rdf.import.inline(taxonomy, "N-Triples")
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo, callParams
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo, callParams

The selective streaming approach collects only owl:Class members in a first pass, then imports only the predicates needed for the reconciliation exercise. This substantially reduces import time and keeps the graph free of irrelevant OWL axioms.
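
To decide which predicates are worth keeping, the same streaming procedure can be used to profile the file without writing anything to the graph. This is a sketch that reuses the URL and parameters from the import above. The resulting counts depend on the current release of the ontology:

```cypher
// Count triples per predicate in the OWL file; nothing is imported
CALL n10s.rdf.stream.fetch(
  "http://purl.obolibrary.org/obo/doid.owl", "RDF/XML", { limit: 999999 }
)
YIELD predicate
RETURN predicate, count(*) AS triples
ORDER BY triples DESC
LIMIT 20;
```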

Convert cross-reference properties to SAME_AS relationships

The Disease Ontology stores MeSH cross-references as properties (e.g. hasDbXref: "MESH:D007239"). Converting these to explicit SAME_AS relationships makes the graph consistent with the Wikidata taxonomy.

MATCH (doe:DO_Disease) WHERE doe.hasDbXref IS NOT NULL
MERGE (mesh:Resource {
  uri: "http://id.nlm.nih.gov/mesh/" + substring(doe.hasDbXref, 5)
})
MERGE (doe)-[:SAME_AS]->(mesh)
REMOVE doe.hasDbXref
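
The `substring` call strips the five-character `MESH:` prefix so that the remainder can be appended to the MeSH namespace. The first statement below shows this on a literal value. The second is a hedged follow-up check that counts cross-reference targets that MERGE had to create because no imported MeSH node carried that URI (for example, descriptors outside the infectious disease branch):

```cypher
// "MESH:" is five characters long, so substring(..., 5) keeps only the identifier
RETURN substring("MESH:D007239", 5) AS meshId;
// meshId → "D007239"

// SAME_AS targets without a matching imported MeSH node
MATCH (:DO_Disease)-[:SAME_AS]->(r:Resource)
WHERE NOT r:Mesh_Disease
RETURN count(DISTINCT r) AS unmatchedMeshRefs;
```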

Discover reconciliation patterns

With three aligned taxonomies sharing HAS_PARENT and SAME_AS, Cypher can surface structural differences at scale.

Pattern 1 — Different granularities (same concept mapped at different depths):

MATCH topLink    = (topDo:DO_Disease)-[:SAME_AS]-(topMesh:Mesh_Disease)
MATCH bottomLink = (bottomDo:DO_Disease)-[:SAME_AS]-(bottomMesh:Mesh_Disease)
MATCH txnDo      = (topDo)<-[:HAS_PARENT*]-(bottomDo)
MATCH txnMesh    = (topMesh)<-[:HAS_PARENT*]-(bottomMesh)
WHERE length(txnDo) <> length(txnMesh)
RETURN * LIMIT 1
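
For a tabular overview rather than a single visualised example, the same pattern can return the path lengths directly. This variant is a sketch that assumes both taxonomies expose a `label` property, as produced by the imports above:

```cypher
// Rank aligned pairs by how much the two hierarchies disagree on depth
MATCH (topDo:DO_Disease)-[:SAME_AS]-(topMesh:Mesh_Disease)
MATCH (bottomDo:DO_Disease)-[:SAME_AS]-(bottomMesh:Mesh_Disease)
MATCH txnDo   = (topDo)<-[:HAS_PARENT*]-(bottomDo)
MATCH txnMesh = (topMesh)<-[:HAS_PARENT*]-(bottomMesh)
WHERE length(txnDo) <> length(txnMesh)
RETURN topDo.label AS top, bottomDo.label AS bottom,
       length(txnDo) AS doSteps, length(txnMesh) AS meshSteps
ORDER BY abs(doSteps - meshSteps) DESC
LIMIT 10;
```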

Pattern 2 — Generalisations (multiple concepts in one taxonomy mapped to one in another):

MATCH multiXRef = (md1:DO_Disease)-[:SAME_AS]-(start:Mesh_Disease)-[:SAME_AS]-(md2:DO_Disease)
OPTIONAL MATCH link = (md1)-[r:HAS_PARENT*]->(md2)
RETURN multiXRef, link
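
An aggregated view of the same situation lists each MeSH concept together with the Disease Ontology concepts that point at it. This is a sketch on the graph built above, again assuming a `label` property on both sides:

```cypher
// MeSH concepts that are cross-referenced by more than one DO concept
MATCH (m:Mesh_Disease)-[:SAME_AS]-(d:DO_Disease)
WITH m, collect(DISTINCT d.label) AS doConcepts
WHERE size(doConcepts) > 1
RETURN m.label AS meshConcept, doConcepts
ORDER BY size(doConcepts) DESC
LIMIT 10;
```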

Pattern 3 — Perfect triangles (concept is aligned across all three taxonomies):

MATCH triangle =
  (wdid:WD_Disease)-[:SAME_AS]-(do:DO_Disease)-[:SAME_AS]-(md:Mesh_Disease)-[:SAME_AS]-(wdid)
WHERE size([path = (wdid)-[:SAME_AS]-() | path]) =
      size([path = (do)-[:SAME_AS]-()   | path]) =
      size([path = (md)-[:SAME_AS]-()   | path]) = 2
RETURN triangle LIMIT 50
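
The chained equality in the WHERE clause requires each of the three nodes to have exactly two SAME_AS links, so no node in the triangle is also linked to a concept outside it. An equivalent formulation, sketched here under the assumption of Neo4j 5 or later (for `COUNT { }` subqueries), spells the three conditions out. The variable `doDis` is used in place of `do` purely for readability:

```cypher
MATCH triangle =
  (wdid:WD_Disease)-[:SAME_AS]-(doDis:DO_Disease)-[:SAME_AS]-(md:Mesh_Disease)-[:SAME_AS]-(wdid)
// Each corner must have exactly two SAME_AS links: the two triangle legs
WHERE COUNT { (wdid)-[:SAME_AS]-() }  = 2
  AND COUNT { (doDis)-[:SAME_AS]-() } = 2
  AND COUNT { (md)-[:SAME_AS]-() }    = 2
RETURN triangle LIMIT 50;
```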

Infer and export missing links

When an incomplete triangle is found (two of three legs exist), the missing equivalence link can be inferred and exported as RDF triples to enrich the source vocabulary.

// Find incomplete triangles where the WD ↔ MeSH link is missing
MATCH incomplete =
  (wdid:WD_Disease)-[:SAME_AS]-(do:DO_Disease)-[:SAME_AS]-(md:Mesh_Disease)
WHERE NOT (md)-[:SAME_AS]-(wdid)
  AND size([path = (wdid)-[:SAME_AS]-() | path]) = 1
  AND size([path = (md)-[:SAME_AS]-()   | path]) = 1
  AND size([path = (do)-[:SAME_AS]-()   | path]) = 2
// Output as Wikidata P486 (MeSH descriptor ID) triples
RETURN
  wdid.uri       AS subject,
  "http://www.wikidata.org/prop/direct/P486" AS predicate,
  n10s.rdf.getIRILocalName(md.uri) AS object
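
If the inferred links are to be handed over as text, each row can be serialised as one N-Triples line with plain string concatenation. This is an illustrative sketch, not an export procedure from the session. It emits the MeSH identifier as a plain string literal:

```cypher
MATCH (wdid:WD_Disease)-[:SAME_AS]-(doDis:DO_Disease)-[:SAME_AS]-(md:Mesh_Disease)
WHERE NOT (md)-[:SAME_AS]-(wdid)
  AND size([path = (wdid)-[:SAME_AS]-()  | path]) = 1
  AND size([path = (md)-[:SAME_AS]-()    | path]) = 1
  AND size([path = (doDis)-[:SAME_AS]-() | path]) = 2
// Build "<subject> <predicate> "object" ." for each inferred link
RETURN '<' + wdid.uri + '> <http://www.wikidata.org/prop/direct/P486> "'
       + n10s.rdf.getIRILocalName(md.uri) + '" .' AS ntriple
LIMIT 10;
```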

Exploring a Single Disease Lineage

// Full ancestry path for "anaerobic cellulitis" in Wikidata
MATCH taxonomy = (v:WD_Disease)-[:HAS_PARENT*]->(root)
WHERE v.label = "anaerobic cellulitis"
  AND NOT (root)-[:HAS_PARENT]->()
RETURN taxonomy

// Side-by-side view of Wikidata and MeSH paths for the same concept
MATCH taxonomy = (v:WD_Disease)-[:HAS_PARENT*]->(root)
WHERE v.label = "anaerobic cellulitis"
  AND NOT (root)-[:HAS_PARENT]->()
UNWIND nodes(taxonomy) AS node
MATCH mesh_twin =
  (node)-[:SAME_AS*0..1]->(:Mesh_Disease)-[:HAS_PARENT*0..]->(mesh_root:Mesh_Disease)
WHERE NOT (mesh_root)-[:HAS_PARENT]->()
RETURN taxonomy, mesh_twin

In Neo4j Bloom you can turn the disease name into a search phrase parameter — replace the hard-coded "anaerobic cellulitis" string with $disease_name and Bloom will prompt for the value interactively.
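
Outside Bloom, the same parameterised query can be tried in Neo4j Browser or cypher-shell. The `:param` line is a client command rather than Cypher. Its exact syntax varies between client versions, so treat the form below as an assumption and run it on its own before the query:

```cypher
:param disease_name => "anaerobic cellulitis"

// Full ancestry path for the disease passed in as $disease_name
MATCH taxonomy = (v:WD_Disease)-[:HAS_PARENT*]->(root)
WHERE v.label = $disease_name
  AND NOT (root)-[:HAS_PARENT]->()
RETURN taxonomy;
```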

Key Concepts

Vocabulary mapping: Neosemantics n10s.mapping.add renames RDF predicates and types at import time, so that rdfs:subClassOf becomes HAS_PARENT and owl:Class becomes DO_Disease. Together with the CONSTRUCT queries used for Wikidata and MeSH, this gives all three taxonomies a shared vocabulary without renaming anything after import.

SAME_AS as the reconciliation edge: Storing cross-references as SAME_AS relationships (rather than string properties) turns reconciliation into a graph traversal problem, making Cypher pattern matching a natural fit for finding triangles, generalisations, and missing links.

Incomplete triangles as data quality signals: Any configuration WD_Disease – DO_Disease – Mesh_Disease where one leg of the triangle is absent is a candidate for enrichment, because one taxonomy knows something the others do not, and that knowledge can be exported as new RDF triples.

Both SPARQL endpoints (Wikidata and MeSH) are publicly accessible without authentication, as is the Disease Ontology OWL file. However, rate limits apply: if you experience timeouts, add LIMIT clauses to the CONSTRUCT queries and import in batches.

Resources

Neosemantics (n10s)

RDF and Linked Data integration for Neo4j

Wikidata SPARQL Endpoint

Interactive SPARQL query interface for Wikidata

MeSH SPARQL Endpoint

NLM Medical Subject Headings linked data service

Disease Ontology

OWL disease ontology maintained by the DO Consortium
