Graph-Based Semantic Similarity Metrics in Taxonomies

Session 16 of Going Meta (broadcast May 2, 2023) explores how the structural properties of a taxonomy graph can be used to quantify how semantically similar two concepts are. Rather than relying on external NLP toolkits, the session demonstrates Neosemantics’ built-in similarity functions (n10s.sim.pathsim, n10s.sim.lchsim, and n10s.sim.wupsim), which implement the classic path-based, Leacock-Chodorow, and Wu-Palmer metrics respectively, as functions callable directly from Cypher. The session starts with a small automobile taxonomy, then scales up to the Wikidata software concepts taxonomy used in earlier episodes.

Watch Recording

Full session recording on YouTube

Source Code

Cypher scripts and taxonomy files

Overview


Broadcast	May 2, 2023
Tags	`Python` `NLTK` `Semantics` `Taxonomy`
Similarity functions	`n10s.sim.pathsim`, `n10s.sim.lchsim`, `n10s.sim.wupsim`
Taxonomies used	VW automobile taxonomy · Wikidata software concepts

What You Will Learn

The theoretical basis for path-based, Wu-Palmer, and Leacock-Chodorow similarity metrics
How taxonomy depth and graph structure affect each metric differently
Computing all three metrics between two concept nodes in a single Cypher call
Visualising the shared-ancestor path between two concepts with n10s.sim.pathsim.path
How extending a taxonomy (adding deeper nodes) changes Leacock-Chodorow scores
Applying the same metrics to a large real-world SKOS taxonomy from Wikidata

The Three Metrics

Metric	Key idea	Sensitivity to depth
Path similarity	Inverse of the shortest path length between two nodes	Low: depends only on distance
Wu-Palmer (WUP)	Ratio of twice the depth of the lowest common ancestor to the sum of the two concepts' depths	Medium: considers ancestor position
Leacock-Chodorow (LCH)	Negative log of the path length normalised by the maximum depth of the taxonomy	High: adding deeper nodes changes all scores
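
For reference, the textbook forms of the three metrics (as used, for example, in NLTK's WordNet interface) can be sketched as below. The notation is an assumption made for this sketch: d(a, b) is the number of is-a edges on the shortest path between a and b, depth(x) counts nodes from the root (so the root has depth 1), lca(a, b) is the lowest common ancestor, and D is the maximum depth of the taxonomy. Conventions differ between implementations (edges versus nodes, root depth 0 versus 1), so the exact values returned by the n10s.sim.* functions may not match these formulas digit for digit.

```latex
\begin{align*}
\mathrm{sim}_{\mathrm{path}}(a,b) &= \frac{1}{1 + d(a,b)} \\[4pt]
\mathrm{sim}_{\mathrm{LCH}}(a,b)  &= -\log\frac{1 + d(a,b)}{2D} \\[4pt]
\mathrm{sim}_{\mathrm{WUP}}(a,b)  &= \frac{2\,\mathrm{depth}\bigl(\mathrm{lca}(a,b)\bigr)}{\mathrm{depth}(a) + \mathrm{depth}(b)}
\end{align*}
```

Only the LCH formula contains D, which is why it is the only one of the three that reacts to changes elsewhere in the taxonomy.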

Step-by-Step Walkthrough

Initialise the graph for RDF import

Set up the uniqueness constraint and Neosemantics graph configuration before importing any data.

// Initialise graph for RDF import
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;

CALL n10s.graphconfig.init({ handleVocabUris: "IGNORE" });

Import the automobile taxonomy

A small OWL taxonomy of Volkswagen vehicle types (generated in Protégé) serves as the initial worked example.

// Import a simple taxonomy (Ontology generated with Protégé)
CALL n10s.onto.import.fetch(
  "https://raw.githubusercontent.com/jbarrasa/goingmeta/main/session16/taxonomies/vw.ttl",
  "Turtle"
)

Compute all three metrics between two sibling concepts

n10s.sim.pathsim.value, n10s.sim.lchsim.value, and n10s.sim.wupsim.value each take two Class nodes and return a floating-point score. Calling all three in the same RETURN clause gives a quick comparison.

MATCH (a:Class { name: "Electric" }), (b:Class { name: "Tiguan" })
RETURN
  n10s.sim.pathsim.value(a, b) AS path,
  n10s.sim.lchsim.value(a, b)  AS lch,
  n10s.sim.wupsim.value(a, b)  AS wup

Higher scores indicate greater similarity. A score of 1.0 from pathsim means the two nodes are identical. Values decrease as the two concepts become more distant in the taxonomy.
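
To make the scores less abstract, here is a minimal Python sketch of the three metrics on a toy taxonomy. Everything in it is an assumption for illustration: the PARENT map is a hypothetical hierarchy (not the contents of vw.ttl), depths count nodes from the root, and the formulas follow the common textbook/NLTK conventions, which may differ from what n10s computes internally.

```python
import math

# Hypothetical toy taxonomy as a child -> parent map (NOT the contents of vw.ttl).
PARENT = {
    "Car": "Vehicle",
    "SUV": "Car",
    "Hatchback": "Car",
    "Tiguan": "SUV",
    "Golf": "Hatchback",
}

def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))

def lca(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)

def hops(a, b):
    """Number of edges on the path a -> LCA -> b."""
    c = lca(a, b)
    return (depth(a) - depth(c)) + (depth(b) - depth(c))

def max_depth():
    """Maximum depth over every node in the taxonomy."""
    nodes = set(PARENT) | set(PARENT.values())
    return max(depth(n) for n in nodes)

def path_sim(a, b):
    return 1 / (1 + hops(a, b))

def lch_sim(a, b):
    return -math.log((1 + hops(a, b)) / (2 * max_depth()))

def wup_sim(a, b):
    return 2 * depth(lca(a, b)) / (depth(a) + depth(b))

for pair in [("Golf", "Golf"), ("Golf", "Tiguan"), ("SUV", "Hatchback")]:
    print(pair, path_sim(*pair), lch_sim(*pair), wup_sim(*pair))
```

In this toy hierarchy, Golf and Tiguan meet at Car after four hops, so the path score is 1/5 and the Wu-Palmer score is 2·2/(4+4) = 0.5.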

Visualise the shared path between two concepts

The .path variant of pathsim returns the traversal path through the lowest common ancestor, making it easy to understand which ancestor connects the two concepts and how many hops separate them.

MATCH (a:Class { name: "Golf" }), (b:Class { name: "Tiguan" })
RETURN n10s.sim.pathsim.path(a, b)
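
Conceptually, the path through the lowest common ancestor can be reconstructed by walking both concepts up to the root and joining the two chains where they first meet. The Python sketch below illustrates that idea on a hypothetical child-to-parent map; it is not the n10s implementation, and the node names are invented for the example.

```python
# Hypothetical child -> parent map, used only for illustration.
PARENT = {
    "Car": "Vehicle",
    "SUV": "Car",
    "Hatchback": "Car",
    "Tiguan": "SUV",
    "Golf": "Hatchback",
}

def chain_to_root(node):
    """Return [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def path_via_lca(a, b):
    """Node sequence a -> ... -> LCA -> ... -> b."""
    up_a = chain_to_root(a)
    up_b = chain_to_root(b)
    common = set(up_a)
    # The first node on b's chain that also lies on a's chain is the LCA.
    lca = next(n for n in up_b if n in common)
    left = up_a[: up_a.index(lca) + 1]      # a up to and including the LCA
    right = up_b[: up_b.index(lca)][::-1]   # from just below the LCA down to b
    return left + right

path = path_via_lca("Golf", "Tiguan")
print(path, "hops:", len(path) - 1)
```

The number of hops is the length of the returned node list minus one, which is the distance the path-based metrics work from.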

Compare concepts at different levels of the hierarchy

Exploring pairs at different depths shows how each metric handles distance and specificity differently.

// Comparing two mid-level categories
MATCH (a:Class { name: "Convertible" }), (b:Class { name: "SUV" })
RETURN n10s.sim.lchsim.value(a, b)

Extend the taxonomy and observe LCH sensitivity

Leacock-Chodorow normalises by the maximum depth of the taxonomy. Adding a deeper node changes the maximum depth and therefore shifts all existing LCH scores — even for concept pairs that were not modified.

// Add a deeper subclass to the taxonomy
CALL n10s.onto.import.inline('
  @prefix : <http://localhost/ontologies/2019/1/10/automobile#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

  :TiguanSpecial rdf:type owl:Class ;
          rdfs:subClassOf :Tiguan ;
          rdfs:label "Tiguan Special" .
', "Turtle")

After adding :TiguanSpecial, re-run the Convertible vs SUV LCH query from the previous step. The score will change because the maximum taxonomy depth has increased, even though neither of those two classes was modified. Wu-Palmer and path similarity are unaffected.
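
The effect can be reproduced outside the database with a few lines of Python. This is a sketch under assumptions: the two parent maps are hypothetical stand-ins for the taxonomy before and after the insert, depth counts nodes from the root, and LCH uses the textbook form -log((1 + hops) / (2 · max depth)), which may not match the n10s values exactly. The helper assumes a single-rooted tree.

```python
import math

def depth(parent, node):
    """Depth counted in nodes, so the root has depth 1."""
    d = 1
    while node in parent:
        node = parent[node]
        d += 1
    return d

def max_depth(parent):
    nodes = set(parent) | set(parent.values())
    return max(depth(parent, n) for n in nodes)

def hops(parent, a, b):
    """Edges on the path a -> LCA -> b (assumes a single-rooted tree)."""
    dist_a, node, k = {}, a, 0
    while True:                      # record distance from a to each of its ancestors
        dist_a[node] = k
        if node not in parent:
            break
        node, k = parent[node], k + 1
    node, k = b, 0
    while node not in dist_a:        # climb from b until we hit a's chain
        node, k = parent[node], k + 1
    return dist_a[node] + k

def path_sim(parent, a, b):
    return 1 / (1 + hops(parent, a, b))

def lch_sim(parent, a, b):
    return -math.log((1 + hops(parent, a, b)) / (2 * max_depth(parent)))

# Hypothetical taxonomy before and after adding a deeper subclass.
before = {"Car": "Vehicle", "SUV": "Car", "Convertible": "Car", "Tiguan": "SUV"}
after = {**before, "TiguanSpecial": "Tiguan"}

for name, tax in [("before", before), ("after", after)]:
    print(name,
          "path:", path_sim(tax, "Convertible", "SUV"),
          "lch:", lch_sim(tax, "Convertible", "SUV"))
```

In this toy setup the maximum depth grows from 4 to 5, so the LCH score for Convertible vs SUV moves from log(8/3) to log(10/3) while the path score stays at 1/3.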

Scale up to a real-world SKOS taxonomy

The session concludes by applying the same metrics to the Wikidata software concepts taxonomy used in Session 2 — a much larger SKOS hierarchy that demonstrates how the metrics behave at scale.

// Clear the automobile taxonomy
MATCH (n:Resource) DETACH DELETE n;

// Load the Wikidata software concepts SKOS taxonomy
CALL n10s.skos.import.fetch(
  "https://github.com/jbarrasa/goingmeta/raw/main/session02/resources/goingmeta-skos.ttl",
  "Turtle"
);

// Look up Neo4j (Q1628290) and MongoDB (Q1165204) by Wikidata identifier
MATCH (a:Class { name: "Q1628290" }), (b:Class { name: "Q1165204" })
RETURN n10s.sim.pathsim.path(a, b) AS path;

// Or look up by human-friendly prefLabel
MATCH (neo:Class) WHERE neo.prefLabel CONTAINS "Neo4j"
MATCH (mdb:Class) WHERE mdb.prefLabel CONTAINS "Mongo"
RETURN
  n10s.sim.pathsim.value(neo, mdb) AS path,
  n10s.sim.lchsim.value(neo, mdb)  AS lch,
  n10s.sim.wupsim.value(neo, mdb)  AS wup;

// Compare Neo4j vs Java, a more distant pair
MATCH (neo:Class) WHERE neo.prefLabel CONTAINS "Neo4j"
MATCH (j:Class)   WHERE j.prefLabel = "Java"
RETURN
  n10s.sim.pathsim.value(neo, j) AS path,
  n10s.sim.lchsim.value(neo, j)  AS lch,
  n10s.sim.wupsim.value(neo, j)  AS wup;

Choosing a Metric

Path Similarity

Best when you only care about how many hops separate two concepts. Simple and fast; insensitive to where in the hierarchy the concepts sit.

Wu-Palmer (WUP)

Balances ancestor depth with path length. Rewards concepts that share a deep common ancestor, making it useful for fine-grained similarity tasks.

Leacock-Chodorow (LCH)

Normalises by taxonomy depth, making scores comparable across taxonomies of different depths. Sensitive to structural changes that alter overall depth.

NLTK comparison

The session compares these graph-native metrics against the equivalent NLTK WordNet functions, showing that graph-based approaches generalise to any domain taxonomy.

Key Concepts

Lowest common ancestor (LCA): All three metrics rely on finding the LCA, the deepest node that is an ancestor of both concepts in the hierarchy. The richer and deeper the taxonomy, the more informative the LCA tends to be.

Taxonomy depth sensitivity: Wu-Palmer and path similarity are local. They depend only on the two queried concepts and their LCA (for Wu-Palmer, including how deep those nodes sit below the root). Leacock-Chodorow is global: it accounts for the maximum depth of the entire taxonomy, so structural changes elsewhere in the graph affect existing scores.

SKOS via n10s.skos.import.fetch: SKOS concept schemes use skos:broader / skos:narrower rather than rdfs:subClassOf. Neosemantics’ n10s.skos.import.fetch maps SKOS predicates to the :SCO (subClassOf) relationship so that the same similarity functions work without modification.
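
The idea behind the SKOS mapping can be illustrated with a small Python sketch that normalises broader/narrower statements into a single child-to-parent direction, which is all the similarity metrics need. This is a conceptual illustration only, not how Neosemantics is implemented; the triples and concept names are hypothetical.

```python
# (subject, predicate, object) triples; the concept names are hypothetical.
TRIPLES = [
    ("GraphDatabase", "skos:broader", "Database"),
    ("Database", "skos:narrower", "DocumentDatabase"),
    ("Database", "skos:broader", "Software"),
]

def to_child_parent(triples):
    """Normalise hierarchy statements into (child, parent) pairs."""
    pairs = set()
    for s, p, o in triples:
        if p in ("skos:broader", "rdfs:subClassOf"):
            pairs.add((s, o))   # s is the more specific concept
        elif p == "skos:narrower":
            pairs.add((o, s))   # inverse direction: o is the more specific concept
    return pairs

print(sorted(to_child_parent(TRIPLES)))
```

Once every statement points from the narrower concept to the broader one, a SKOS scheme and an rdfs:subClassOf hierarchy look the same to a path-based metric.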

The n10s.sim.* functions are available in Neosemantics 4.x and later. See the Neosemantics similarity documentation for the full function reference including edge cases such as comparing a concept to itself or to a direct ancestor.

Resources

Neosemantics Similarity Functions

Full reference for n10s.sim.* Cypher functions

NLTK Similarity Metrics

Wu-Palmer and LCH in NLTK’s WordNet interface

SKOS Reference

W3C Simple Knowledge Organization System specification

Session 2 — Semantic Search

The Wikidata software taxonomy used in this session’s scale-up exercise
