Automatic Ontology Learning from Existing Graph Data

Session 6 of Going Meta, broadcast on July 5, 2022, explores a bottom-up approach to ontology engineering: instead of designing a schema upfront, you let the data speak for itself. By analyzing co-occurrence patterns between genre nodes in a book dataset, Jesus demonstrates how graph algorithms and simple overlap metrics can automatically surface equivalent categories, subgenres, and multi-level taxonomies — all stored directly in Neo4j as first-class relationships.

What You Will Learn

How to load a real-world CSV dataset (books + genres + authors) into Neo4j
How to profile a graph to understand connectivity before running algorithms
How to use the GDS Node Similarity algorithm to detect similar genre nodes
How to compute a directional co-occurrence score (the “Barrasa Algo”) to distinguish equivalent genres from subgenres
How to materialize inferred narrower_than relationships and prune redundant transitive shortcuts
How to explore the resulting taxonomy as a tree of depth 3 or more

Tags: Graph Algos · ML · Ontologies — Broadcast July 5, 2022

Loading the Dataset

The session begins by importing a 2,000-book CSV, building indexes, and linking each book to its genres and authors.

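// indexes (FOR ... ON syntax; the older CREATE INDEX ON :Label(prop) form was removed in Neo4j 5)
CREATE INDEX FOR (a:Author) ON (a.name);
CREATE INDEX FOR (b:Book) ON (b.id);
CREATE INDEX FOR (g:Genre) ON (g.name);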

// import from csv: one Book per row, keyed by its URL
LOAD CSV WITH HEADERS FROM "https://github.com/jbarrasa/goingmeta/raw/main/session06/data/books-2000.csv" AS row
MERGE (b:Book { id : row.itemUrl})
SET b.description = row.description, b.title = row.itemTitle
WITH b, row
// one row per genre; substring(genre,8) drops the first 8 characters of each raw value
UNWIND split(row.genres,';') AS genre
MERGE (g:Genre { name: substring(genre,8)})
MERGE (b)-[:HAS_GENRE]->(g)
// collapse back to one row per book so authors are not re-merged once per genre
WITH DISTINCT b, row
UNWIND split(row.author,';') AS author
MERGE (a:Author { name: author})
MERGE (b)-[:HAS_AUTHOR]->(a);
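
As a quick sanity check after the import, a count of nodes per label shows whether books, genres, and authors were all created. This is a minimal sketch that assumes every node carries exactly one label, which holds for the import above.

```cypher
// Count nodes per label to confirm the import populated all three node types
MATCH (n)
RETURN labels(n)[0] AS label, count(*) AS nodes
ORDER BY nodes DESC;
```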

Before running any algorithms, it’s worth profiling how genres are distributed across books:

MATCH (n:Book)
// pattern comprehension instead of size(pattern), which Neo4j 5 no longer accepts
WITH size([(n)-[:HAS_GENRE]->() | 1]) AS genreCount
RETURN avg(genreCount) AS avgNumGenres, max(genreCount) AS maxNumGenres, min(genreCount) AS minNumGenres
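
The same profiling can be done from the genre side. The sketch below lists the genres attached to the most books, which gives a feel for how skewed the distribution is; the limit of 20 is an arbitrary choice for illustration.

```cypher
// Rank genres by the number of books tagged with them
MATCH (g:Genre)<-[:HAS_GENRE]-(b:Book)
RETURN g.name AS genre, count(b) AS books
ORDER BY books DESC
LIMIT 20;
```

Genres that appear on only one or two books are worth noting here, because overlap-based scores computed from so few books can be misleading.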

Approach 1 — GDS Node Similarity

Create a named graph projection

Project Book and Genre nodes with reversed HAS_GENRE edges so the algorithm can compare genres by shared books.

CALL gds.graph.project(
    'genreSimGraph',
    ['Book', 'Genre'],
    {
        genre: {
            type: 'HAS_GENRE', orientation: 'REVERSE'
        }
    }
);
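
To confirm the projection was created and see its size, the graph catalog can be queried by name. This is a small sketch using the `gds.graph.list` procedure with the graph name from the projection above.

```cypher
// Inspect the in-memory projection: node and relationship counts
CALL gds.graph.list('genreSimGraph')
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;
```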

Stream similarity scores

Run the Node Similarity algorithm and return the most similar genre pairs ranked by score.

CALL gds.nodeSimilarity.stream('genreSimGraph')
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS Genre1, gds.util.asNode(node2).name AS Genre2, similarity
ORDER BY similarity DESCENDING, Genre1, Genre2
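
By default, Node Similarity scores a pair with the Jaccard index: the number of shared neighbours divided by the size of the union of both neighbour sets. As a cross-check, the sketch below computes the same ratio by hand for one pair of genres, using the two genre names that appear later in this session.

```cypher
// Manual Jaccard index for one pair of genres: |shared books| / |union of books|
MATCH (g1:Genre { name: "money" })<-[:HAS_GENRE]-(b:Book)
WITH g1, collect(DISTINCT b) AS books1
MATCH (g2:Genre { name: "entrepreneurship" })<-[:HAS_GENRE]-(b:Book)
WITH books1, collect(DISTINCT b) AS books2
WITH size([b IN books1 WHERE b IN books2]) AS shared,
     size(books1) AS n1, size(books2) AS n2
RETURN shared, n1 + n2 - shared AS total,
       toFloat(shared) / (n1 + n2 - shared) AS jaccard;
```

The streamed results only keep the top-scoring neighbours for each node, so a pair checked this way may not appear in the algorithm's output even though its score is non-zero.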

Materialize a similarity relationship

Once a pair of genres scores above the threshold you have chosen, record it as a similar_to relationship. The example below does this by hand for a single pair, money and entrepreneurship.

MATCH (g1:Genre { name: "money"}), (g2:Genre { name: "entrepreneurship"})
MERGE (g1)-[:similar_to]-(g2)
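
Rather than writing pairs one at a time, the stream can feed the MERGE directly. The following is a sketch, not the session's code: the cutoff of 0.8 and the `score` property are assumptions chosen for illustration, and `similarityCutoff` is the Node Similarity configuration option that drops pairs below that value.

```cypher
// Write similar_to for every genre pair at or above a (hypothetical) 0.8 cutoff
CALL gds.nodeSimilarity.stream('genreSimGraph', { similarityCutoff: 0.8 })
YIELD node1, node2, similarity
WITH gds.util.asNode(node1) AS g1, gds.util.asNode(node2) AS g2, similarity
// undirected MERGE: a pair reported in both orders still yields one relationship
MERGE (g1)-[r:similar_to]-(g2)
SET r.score = similarity;
```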

Approach 2 — Co-occurrence Scoring

The GDS approach treats similarity as symmetric. A custom co-occurrence score can reveal directional subsumption — i.e., which genre is a subgenre of another.

Compute Directional Co-occurrence

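// every pair of genres that share at least one book, each pair visited once
MATCH (g1:Genre)<-[:HAS_GENRE]-(:Book)-[:HAS_GENRE]->(g2:Genre)
WITH DISTINCT g1, g2 WHERE id(g1) < id(g2)
// pattern comprehensions instead of size(pattern), which Neo4j 5 no longer accepts
WITH g1, g2,
     size([(g1)<-[:HAS_GENRE]-() | 1]) AS degree1,
     size([(g2)<-[:HAS_GENRE]-() | 1]) AS degree2,
     size([(g1)<-[:HAS_GENRE]-()-[:HAS_GENRE]->(g2) | 1]) AS overlap
// score from g1 to g2 = share of g1's books that also carry g2, and vice versa
MERGE (g1)-[:COOC { score: overlap * 1.0 / degree1 }]->(g2)
MERGE (g2)-[:COOC { score: overlap * 1.0 / degree2 }]->(g1);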

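A COOC score of 1 in both directions means every book that has g1 also has g2 and vice versa, so the two genres are equivalent. A score of 1 from g1 to g2 combined with a score below 1 in the reverse direction means every g1 book is also a g2 book but not the other way round, so g1 is a subgenre of g2.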
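
A toy calculation makes the asymmetry concrete. The sketch below needs no data in the database; the two genre names and their book ids are made up for illustration. Three books are tagged with the first genre, and those same three plus three others are tagged with the second.

```cypher
// Hypothetical book ids per genre; every vampires book is also a paranormal book
WITH [1, 2, 3] AS vampires, [1, 2, 3, 4, 5, 6] AS paranormal
WITH vampires, paranormal,
     [b IN vampires WHERE b IN paranormal] AS shared
RETURN size(shared) * 1.0 / size(vampires)   AS vampiresToParanormal,  // 3 / 3 = 1.0
       size(shared) * 1.0 / size(paranormal) AS paranormalToVampires;  // 3 / 6 = 0.5
```

The score is 1.0 in one direction and 0.5 in the other, which is the pattern that marks the first genre as narrower than the second.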

Identify Equivalent and Narrower Genres

// Equivalent genres (score = 1 in both directions)
MATCH (g1:Genre)-[:COOC { score: 1 }]->(g2:Genre)-[:COOC { score: 1 }]->(g1)
RETURN g1.name, g2.name LIMIT 10;

// Subgenres (score = 1 from g1 to g2, below 1 in reverse)
MATCH (g1:Genre)-[:COOC { score: 1 }]->(g2:Genre)-[c:COOC]->(g1)
WHERE c.score < 1
RETURN g1.name, g2.name, c.score AS sc ORDER BY sc DESC LIMIT 100;
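
A genre attached to a single book trivially scores 1 against every other genre on that book, so some of the subgenre candidates may rest on very little evidence. One possible refinement, shown here as a sketch, is to require a minimum number of books for the narrower genre. The threshold of 5 is an arbitrary assumption and the filter is not part of the session's code.

```cypher
// Subgenre candidates, restricted to narrower genres with at least 5 books
MATCH (g1:Genre)-[:COOC { score: 1 }]->(g2:Genre)-[c:COOC]->(g1)
WHERE c.score < 1
WITH g1, g2, c, size([(g1)<-[:HAS_GENRE]-(b:Book) | b]) AS support
WHERE support >= 5
RETURN g1.name AS narrower, g2.name AS broader, support, c.score AS reverseScore
ORDER BY support DESC
LIMIT 100;
```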

Materialize the Taxonomy and Prune Shortcuts

// Create narrower_than relationships
MATCH (g1:Genre)-[:COOC { score: 1 }]->(g2:Genre)-[c:COOC]->(g1)
WHERE c.score < 1
MERGE (g1)-[:narrower_than]->(g2);

// Remove transitive shortcuts to keep the hierarchy clean:
// drop a direct edge whenever a longer path already connects the same two genres
MATCH (g1)-[:narrower_than*2..]->(g3),
      (g1)-[d:narrower_than]->(g3)
DELETE d;

// Explore the resulting taxonomy (depth ≥ 3)
MATCH taxonomy = (:Genre)-[:narrower_than*3..]->()
RETURN taxonomy LIMIT 3;

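After pruning shortcuts, each remaining narrower_than edge links a genre to an immediately broader genre, so every path represents a genuine taxonomic descent with no inherited or redundant edges. The result is a strict tree only if each genre ends up with a single parent; a genre that sits under several unrelated broader genres keeps all of those edges.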
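
Both properties of the pruned hierarchy can be checked with short queries. The sketch below first counts any direct edges that are still duplicated by a longer path, which should be zero once the pruning step has run, and then lists genres that have more than one broader genre.

```cypher
// 1. Any shortcut edges left? Expect 0 after pruning.
MATCH (g1)-[:narrower_than*2..]->(g3),
      (g1)-[d:narrower_than]->(g3)
RETURN count(DISTINCT d) AS remainingShortcuts;

// 2. Genres with several direct parents (these keep the hierarchy from being a strict tree)
MATCH (g:Genre)-[:narrower_than]->(parent:Genre)
WITH g, count(parent) AS parents
WHERE parents > 1
RETURN g.name AS genre, parents
ORDER BY parents DESC
LIMIT 10;
```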

Resources

Watch the Recording

Source Code on GitHub
