Session 6 of Going Meta, broadcast on July 5, 2022, explores a bottom-up approach to ontology engineering: instead of designing a schema upfront, you let the data speak for itself. By analyzing co-occurrence patterns between genre nodes in a book dataset, Jesus demonstrates how graph algorithms and simple overlap metrics can automatically surface equivalent categories, subgenres, and multi-level taxonomies — all stored directly in Neo4j as first-class relationships.

What You Will Learn

  • How to load a real-world CSV dataset (books + genres + authors) into Neo4j
  • How to profile a graph to understand connectivity before running algorithms
  • How to use the GDS Node Similarity algorithm to detect similar genre nodes
  • How to compute a directional co-occurrence score (the “Barrasa Algo”) to distinguish equivalent genres from subgenres
  • How to materialize inferred narrower_than relationships and prune redundant transitive shortcuts
  • How to explore the resulting taxonomy as a tree of depth 3 or more
Tags: Graph Algos · ML · Ontologies — Broadcast July 5, 2022

Loading the Dataset

The session begins by creating indexes, then importing a 2,000-book CSV and linking each book to its genres and authors.
// indexes to speed up the MERGE lookups during import
CREATE INDEX FOR (a:Author) ON (a.name);
CREATE INDEX FOR (b:Book) ON (b.id);
CREATE INDEX FOR (g:Genre) ON (g.name);

// import from csv: one Book node per row, keyed by its URL
LOAD CSV WITH HEADERS FROM "https://github.com/jbarrasa/goingmeta/raw/main/session06/data/books-2000.csv" AS row
MERGE (b:Book { id: row.itemUrl })
SET b.description = row.description, b.title = row.itemTitle
WITH b, row
// one Genre node per ';'-separated value; substring(genre, 8) drops the first 8 characters
UNWIND split(row.genres, ';') AS genre
MERGE (g:Genre { name: substring(genre, 8) })
MERGE (b)-[:HAS_GENRE]->(g)
// collapse the per-genre rows back to one row per book before unwinding authors
WITH DISTINCT b, row
UNWIND split(row.author, ';') AS author
MERGE (a:Author { name: author })
MERGE (b)-[:HAS_AUTHOR]->(a)
Before running any algorithms, it’s worth profiling how genres are distributed across books:
// count the genres of each book, then aggregate across all books
MATCH (n:Book)
WITH n, size([(n)-[:HAS_GENRE]->() | 1]) AS genreCount
RETURN avg(genreCount) AS avgNumGenres, max(genreCount) AS maxNumGenres, min(genreCount) AS minNumGenres
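To see what the import and the profiling query compute without a database at hand, here is a minimal Python sketch on a few made-up rows. The rows, the `/genres/` prefix (eight characters, matching `substring(genre, 8)`), and the helper names `parse_genres` and `profile` are assumptions for illustration, not taken from the session's CSV.

```python
from statistics import mean

# Hypothetical rows mimicking the CSV columns used by the import.
# The "/genres/" prefix is an assumption about the raw data.
rows = [
    {"itemUrl": "book-1", "genres": "/genres/fiction;/genres/fantasy"},
    {"itemUrl": "book-2", "genres": "/genres/fiction"},
    {"itemUrl": "book-3", "genres": "/genres/money;/genres/business;/genres/finance"},
]


def parse_genres(raw: str) -> list[str]:
    """Split on ';' and drop the first 8 characters of each value."""
    return [genre[8:] for genre in raw.split(";")]


def profile(rows: list[dict]) -> dict:
    """Average, maximum and minimum number of distinct genres per book."""
    counts = [len(set(parse_genres(row["genres"]))) for row in rows]
    return {
        "avgNumGenres": mean(counts),
        "maxNumGenres": max(counts),
        "minNumGenres": min(counts),
    }


print(profile(rows))
```

A wide gap between the minimum and maximum is worth noting before moving on: books with many genres contribute to many genre pairs, which affects both approaches that follow.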

Approach 1 — GDS Node Similarity

1. Create a named graph projection

Project Book and Genre nodes with reversed HAS_GENRE edges so the algorithm can compare genres by shared books.
CALL gds.graph.project(
    'genreSimGraph',
    ['Book', 'Genre'],
    {
        genre: {
            type: 'HAS_GENRE', orientation: 'REVERSE'
        }
    }
);
2. Stream similarity scores

Run the Node Similarity algorithm and return the most similar genre pairs ranked by score.
CALL gds.nodeSimilarity.stream('genreSimGraph')
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS Genre1, gds.util.asNode(node2).name AS Genre2, similarity
ORDER BY similarity DESCENDING, Genre1, Genre2
3. Materialize a similarity relationship

Once you have picked the pairs whose similarity score meets your threshold, write a similar_to relationship between them. The example here links one pair of genres by name.
MATCH (g1:Genre { name: "money"}), (g2:Genre { name: "entrepreneurship"})
MERGE (g1)-[:similar_to]-(g2)
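By default, GDS Node Similarity scores two nodes with the Jaccard coefficient of their neighbor sets, which here means the books two genres share divided by the books carrying either genre. The sketch below reproduces that arithmetic in plain Python on made-up data. The book ids and the helpers `jaccard` and `similar_pairs` are hypothetical, and the real algorithm adds options (top-K, degree cutoffs) that are not modeled here.

```python
from itertools import combinations

# Hypothetical toy data: genre -> set of book ids carrying that genre.
books_by_genre = {
    "money": {"b1", "b2", "b3"},
    "entrepreneurship": {"b2", "b3", "b4"},
    "fantasy": {"b5"},
}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(genres: dict, threshold: float = 0.0) -> list:
    """Genre pairs scoring above the threshold, highest score first."""
    scored = [
        (g1, g2, jaccard(genres[g1], genres[g2]))
        for g1, g2 in combinations(sorted(genres), 2)
    ]
    kept = [pair for pair in scored if pair[2] > threshold]
    return sorted(kept, key=lambda pair: (-pair[2], pair[0], pair[1]))


for g1, g2, score in similar_pairs(books_by_genre):
    print(g1, g2, round(score, 2))
```

Because the intersection and union do not depend on argument order, the score is the same in both directions, which is why this approach cannot tell a subgenre from its parent.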

Approach 2 — Co-occurrence Scoring

The GDS approach treats similarity as symmetric. A custom co-occurrence score can reveal directional subsumption — i.e., which genre is a subgenre of another.

Compute Directional Co-occurrence

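// every pair of genres that share at least one book, each pair visited once
MATCH (g1:Genre)<-[:HAS_GENRE]-(:Book)-[:HAS_GENRE]->(g2:Genre)
WITH DISTINCT g1, g2 WHERE id(g1) < id(g2)
WITH g1, g2,
     size([(g1)<-[:HAS_GENRE]-() | 1]) AS degree1,   // books tagged g1
     size([(g2)<-[:HAS_GENRE]-() | 1]) AS degree2,   // books tagged g2
     size([(g1)<-[:HAS_GENRE]-()-[:HAS_GENRE]->(g2) | 1]) AS overlap   // books tagged with both
// score = share of the source genre's books that also carry the target genre
MERGE (g1)-[:COOC { score: overlap * 1.0 / degree1 }]->(g2)
MERGE (g2)-[:COOC { score: overlap * 1.0 / degree2 }]->(g1)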
A COOC score of 1 in both directions means every book tagged g1 is also tagged g2 and vice versa, so the two genres are equivalent. A score of 1 in one direction only, say from g1 to g2, means every g1 book is also a g2 book but not the reverse, which marks g1 as a subgenre of g2.
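To make the score concrete, here is a minimal Python sketch of the same arithmetic outside the database. The genre names and book ids are made up, and `cooc_scores` is a hypothetical helper rather than part of the session code.

```python
# Hypothetical toy data: genre -> set of book ids carrying that genre.
books_by_genre = {
    "fantasy": {"b1", "b2", "b3", "b4"},
    "epic-fantasy": {"b1", "b2"},
    "high-fantasy": {"b1", "b2"},
    "romance": {"b4", "b5"},
}


def cooc_scores(genres: dict) -> dict:
    """Directed scores: (g1, g2) -> share of g1's books that also carry g2.

    Pairs with no shared book get no entry, just as the Cypher query
    only creates COOC relationships between genres that co-occur.
    """
    scores = {}
    for g1, books1 in genres.items():
        for g2, books2 in genres.items():
            overlap = len(books1 & books2)
            if g1 != g2 and overlap > 0:
                scores[(g1, g2)] = overlap / len(books1)
    return scores


for (g1, g2), score in sorted(cooc_scores(books_by_genre).items()):
    print(f"{g1} -> {g2}: {score}")
```

In this toy data every epic-fantasy book is also a fantasy book, but only half of the fantasy books are epic fantasy, so the two directions score differently. That asymmetry is the signal the symmetric similarity score cannot provide.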

Identify Equivalent and Narrower Genres

// Equivalent genres (score = 1 in both directions)
MATCH (g1:Genre)-[:COOC { score: 1 }]->(g2:Genre)-[:COOC { score: 1 }]->(g1)
RETURN g1.name, g2.name LIMIT 10;

// Subgenres: g1 always co-occurs with g2, but not the other way round
MATCH (g1:Genre)-[:COOC { score: 1 }]->(g2:Genre)-[c:COOC]->(g1)
WHERE c.score < 1
RETURN g1.name, g2.name, c.score AS sc ORDER BY sc DESC LIMIT 100;
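The two queries can be read as filters over a table of directed scores. The following Python sketch applies the same two filters to a hand-written score table. The scores and the helper names `equivalent_pairs` and `narrower_pairs` are illustrative assumptions.

```python
# Hypothetical directed co-occurrence scores: (source, target) -> score.
scores = {
    ("epic-fantasy", "fantasy"): 1.0,
    ("fantasy", "epic-fantasy"): 0.5,
    ("money", "finance"): 1.0,
    ("finance", "money"): 1.0,
    ("fantasy", "romance"): 0.25,
    ("romance", "fantasy"): 0.5,
}


def equivalent_pairs(scores: dict) -> list:
    """Pairs scoring 1 in both directions, each pair reported once."""
    pairs = {
        tuple(sorted((a, b)))
        for (a, b), score in scores.items()
        if score == 1.0 and scores.get((b, a)) == 1.0
    }
    return sorted(pairs)


def narrower_pairs(scores: dict) -> list:
    """(narrower, broader) pairs: score 1 outward, below 1 on the way back."""
    return sorted(
        (a, b)
        for (a, b), score in scores.items()
        if score == 1.0 and (b, a) in scores and scores[(b, a)] < 1.0
    )


print("equivalent:", equivalent_pairs(scores))
print("narrower:", narrower_pairs(scores))
```

Unlike the Cypher query, which returns an equivalent pair once per direction, this sketch deduplicates each pair. Genres that merely overlap, such as fantasy and romance here, fall into neither bucket.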

Materialize the Taxonomy and Prune Shortcuts

// Create narrower_than relationships
MATCH (g1:Genre)-[:COOC { score: 1 }]->(g2:Genre)-[c:COOC]->(g1)
WHERE c.score < 1
MERGE (g1)-[:narrower_than]->(g2);

// Remove transitive shortcuts to keep the hierarchy clean
MATCH (g1)-[:narrower_than*2..]->(g3),
      (g1)-[d:narrower_than]->(g3)
DELETE d;

// Explore the resulting taxonomy (paths of depth 3 or more)
MATCH taxonomy = (:Genre)-[:narrower_than*3..]->()
RETURN taxonomy LIMIT 3;
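After pruning shortcuts, the narrower_than graph keeps only direct links: every edge that could be inferred from a longer path is gone, so each remaining path represents a genuine taxonomic descent. Note that a genre can still have more than one broader genre, so the result is a clean hierarchy rather than necessarily a strict tree.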
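Shortcuts appear because "every book of A is also a book of B" is transitive: if epic-fantasy is narrower than fantasy and fantasy is narrower than fiction, the scoring step also links epic-fantasy directly to fiction. The pruning query removes exactly those edges. Below is a minimal Python sketch of the same idea on a made-up edge set. The genre names and the helpers `reachable` and `prune_shortcuts` are hypothetical, and the sketch assumes the edges form no cycles.

```python
# Hypothetical narrower_than edges as (narrower, broader) pairs.
narrower_than = {
    ("epic-fantasy", "fantasy"),
    ("fantasy", "fiction"),
    ("epic-fantasy", "fiction"),   # shortcut via fantasy
    ("ya-fantasy", "fantasy"),
    ("ya-fantasy", "young-adult"),
    ("ya-fantasy", "fiction"),     # shortcut via fantasy
}


def reachable(edges: set, start: str, skip: tuple) -> set:
    """Nodes reachable from start without traversing the edge `skip`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for source, target in edges:
            if source == node and (source, target) != skip and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def prune_shortcuts(edges: set) -> set:
    """Keep an edge only if its target cannot be reached by another path."""
    return {
        edge for edge in edges
        if edge[1] not in reachable(edges, edge[0], skip=edge)
    }


for narrower, broader in sorted(prune_shortcuts(narrower_than)):
    print(f"{narrower} -> {broader}")
```

In this example ya-fantasy keeps both of its direct parents, fantasy and young-adult, because neither can be derived from the other.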

Resources

Watch the Recording

Full live-stream on YouTube — Session 6, July 5 2022

Source Code on GitHub

Cypher queries, CSV dataset, and session materials
