Session 9 of Going Meta, broadcast on October 4, 2022, tackles the challenge of building a knowledge graph without any manual labeling. Jesus presents an end-to-end pipeline orchestrated with Prefect that scrapes web content, automatically tags entities using Wikidata lookups, and loads structured facts into Neo4j — all without a human annotator in the loop. The session also introduces graph observability: monitoring the growth, coverage, and health of your KG over time as new content flows in.

What You Will Learn

  • How to design a Prefect flow that coordinates scraping, entity recognition, and graph loading as discrete tasks
  • How Wikidata’s API can serve as a zero-shot entity linker, matching surface forms to canonical URIs
  • How to structure scraped content as Neo4j nodes and wire them together with typed relationships
  • What graph observability means in practice: tracking KG growth, entity coverage, and schema drift across pipeline runs
  • How unsupervised construction compares to supervised NLP pipelines in terms of precision, recall, and maintenance cost

Tags: Orchestration · Prefect · Wikidata
Broadcast: October 4, 2022

This session has no code files in the repository — only the session PDF is available. The code snippets below are illustrative examples based on the patterns described in the session slides. They are provided for reference and may require adaptation for your environment.

Core Concepts

Prefect Orchestration

A Prefect flow wraps the full pipeline as a Python function decorated with @flow. Individual steps — fetch URLs, extract text, tag entities, write to Neo4j — become @task functions that Prefect can schedule, retry, and monitor independently.
Flow
 ├── Task: scrape_article(url) → raw_text
 ├── Task: extract_entities(raw_text) → entity_mentions
 ├── Task: link_to_wikidata(entity_mentions) → uri_map
 └── Task: load_to_neo4j(uri_map, raw_text) → graph_delta

Wikidata Entity Linking

Instead of training a named-entity recogniser, the pipeline sends candidate strings to the Wikidata Entity Search API. Each mention that resolves to a Wikidata item receives a canonical URI (e.g., http://www.wikidata.org/entity/Q42) which is then used as the node identifier in Neo4j — giving every entity a globally unique, dereferenceable identity from day one.
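
The lookup step might be sketched as follows. This assumes the pipeline calls the `wbsearchentities` action of the Wikidata API and simply accepts the first hit as the link; the session slides may use a different strategy. The function names, the injectable `fetch` hook, the shape of the returned map, and the User-Agent string are all illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(mention: str, language: str = "en", limit: int = 1) -> str:
    """Build a wbsearchentities request URL for one candidate string."""
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urllib.parse.urlencode(params)}"


def top_hit(response: dict) -> Optional[dict]:
    """Return id, label and URI of the first search hit, or None if nothing matched."""
    hits = response.get("search", [])
    if not hits:
        return None
    first = hits[0]
    qid = first["id"]
    return {
        "qid": qid,
        "label": first.get("label", ""),
        # Fall back to the standard entity URI pattern if the field is absent.
        "uri": first.get("concepturi", f"http://www.wikidata.org/entity/{qid}"),
    }


def _http_get_json(url: str) -> dict:
    # Placeholder User-Agent; replace with one that identifies your client.
    request = urllib.request.Request(url, headers={"User-Agent": "kg-pipeline-sketch/0.1"})
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)


def link_mentions(mentions: list[str],
                  fetch: Callable[[str], dict] = _http_get_json) -> dict[str, dict]:
    """Map each mention that resolves to its top Wikidata hit.

    Mentions with no hit are left out, so the caller can compute coverage
    as len(result) / len(mentions).
    """
    uri_map = {}
    for mention in mentions:
        hit = top_hit(fetch(build_search_url(mention)))
        if hit is not None:
            uri_map[mention] = hit
    return uri_map
```

Taking the first hit keeps the sketch short but ignores ambiguity ("Paris" the city versus "Paris" the person), so a real pipeline would likely add some disambiguation or a confidence threshold.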

Graph Observability

After each pipeline run, a set of monitoring queries measures:

| Metric | What it tracks |
| --- | --- |
| Node count delta | How many new entities were added |
| Relationship density | Average degree growth per entity type |
| Coverage ratio | Fraction of article mentions successfully linked |
| Schema drift | New labels or relationship types introduced |
Graph observability is the knowledge-graph equivalent of data pipeline monitoring. Tracking coverage and density over time helps you detect when the source content changes shape before it silently degrades query quality.
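
One way to make these metrics concrete is to store a small snapshot after every run and compare consecutive snapshots. The sketch below is an assumption about how that could look, not the session's implementation: the `RunSnapshot` fields and the `compare_runs` helper are hypothetical, and density is simplified to relationships per entity rather than broken down per entity type.

```python
from dataclasses import dataclass, field


@dataclass
class RunSnapshot:
    """Counts collected after one pipeline run (field names are illustrative)."""
    entity_count: int
    relationship_count: int
    mentions_seen: int       # candidate mentions extracted from the articles
    mentions_linked: int     # mentions that resolved to a Wikidata item
    labels: set = field(default_factory=set)
    relationship_types: set = field(default_factory=set)


def _density(snapshot: RunSnapshot) -> float:
    # Simplified density: relationships per entity, 0.0 for an empty graph.
    if snapshot.entity_count == 0:
        return 0.0
    return snapshot.relationship_count / snapshot.entity_count


def compare_runs(previous: RunSnapshot, current: RunSnapshot) -> dict:
    """Derive the four observability metrics from two consecutive snapshots."""
    if current.mentions_seen:
        coverage = current.mentions_linked / current.mentions_seen
    else:
        coverage = 0.0
    return {
        "node_count_delta": current.entity_count - previous.entity_count,
        "density_delta": _density(current) - _density(previous),
        "coverage_ratio": coverage,
        # Anything present now but absent before counts as schema drift.
        "new_labels": current.labels - previous.labels,
        "new_relationship_types": current.relationship_types - previous.relationship_types,
    }
```

A falling `coverage_ratio` or a non-empty `new_labels` set across runs is the kind of signal that suggests the source content has changed shape.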

Pipeline Design Walkthrough

1. Define the Prefect flow skeleton

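Prefect logs each task run separately and can retry a failed task without rerunning the whole flow, which makes the pipeline robust to transient scraping or API failures. Retries are opt-in and set per task, for example with @task(retries=3).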
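from prefect import flow, task

# Network-bound tasks get retries; the counts and delays are example values.
@task(retries=3, retry_delay_seconds=10)
def scrape_article(url: str) -> str:
    # fetch page HTML, extract main text
    ...

@task
def extract_entities(text: str) -> list[str]:
    # pull candidate entity mentions (surface forms) out of the text
    ...

@task(retries=3, retry_delay_seconds=10)
def link_to_wikidata(mentions: list[str]) -> dict:
    # call Wikidata search API for each mention
    ...

@task
def load_to_neo4j(uri_map: dict, source_url: str) -> None:
    # MERGE nodes and relationships into Neo4j
    ...

@flow
def build_kg(urls: list[str]) -> None:
    for url in urls:
        text = scrape_article(url)
        mentions = extract_entities(text)
        uri_map = link_to_wikidata(mentions)
        load_to_neo4j(uri_map, url)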

2. Merge entities with canonical URIs

Using MERGE on the Wikidata URI ensures that the same real-world entity appearing in multiple articles is represented by a single node in Neo4j.
MERGE (e:Entity { uri: $wikidataUri })
SET e.label = $label, e.wikidataId = $qid
MERGE (a:Article { url: $sourceUrl })
MERGE (a)-[:MENTIONS]->(e)
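
From Python, the statement above could be fed with one parameter dictionary per linked entity. The sketch below assumes a hypothetical `uri_map` shape (mention text mapped to a dict with `qid`, `label`, and `uri` keys) and a `neo4j` driver object created by the caller; the helper names are illustrative and not from the session.

```python
MERGE_MENTION = """
MERGE (e:Entity { uri: $wikidataUri })
SET e.label = $label, e.wikidataId = $qid
MERGE (a:Article { url: $sourceUrl })
MERGE (a)-[:MENTIONS]->(e)
"""


def merge_params(uri_map: dict, source_url: str) -> list:
    """Build one Cypher parameter dict per linked entity."""
    rows = []
    for mention, hit in uri_map.items():
        rows.append({
            "wikidataUri": hit["uri"],
            # Prefer the canonical label; fall back to the surface form.
            "label": hit.get("label") or mention,
            "qid": hit["qid"],
            "sourceUrl": source_url,
        })
    return rows


def write_mentions(driver, uri_map: dict, source_url: str) -> int:
    """Run the MERGE once per entity and return how many rows were sent.

    `driver` is expected to be a neo4j.Driver, e.g. from
    neo4j.GraphDatabase.driver(uri, auth=(user, password)).
    """
    rows = merge_params(uri_map, source_url)
    with driver.session() as session:
        for row in rows:
            session.run(MERGE_MENTION, row).consume()
    return len(rows)
```

Because every clause is a `MERGE`, rerunning the same article is idempotent: no duplicate `Entity`, `Article`, or `MENTIONS` elements are created. For larger batches, passing all rows as a single list parameter and using `UNWIND` would reduce round trips.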

3. Monitor KG growth after each run

Run these observability queries after each Prefect flow execution to track pipeline health.
// Total entity and article counts (two separate statements)
MATCH (e:Entity) RETURN count(e) AS entityCount;
MATCH (a:Article) RETURN count(a) AS articleCount;

// Average mentions per article (coverage density)
// Count per article first so each article contributes exactly once to the average
MATCH (a:Article)
OPTIONAL MATCH (a)-[:MENTIONS]->(e:Entity)
WITH a, count(e) AS mentions
RETURN avg(mentions) AS avgMentionsPerArticle;

// Entities with the most cross-article coverage
MATCH (e:Entity)<-[:MENTIONS]-(a:Article)
RETURN e.label, count(a) AS articleCount
ORDER BY articleCount DESC LIMIT 10;
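
To track these numbers across runs, the counts can be collected from Python at the end of the flow. The following is a sketch under assumptions: `collect_metrics` and `neo4j_runner` are hypothetical helpers, each query is rewritten to return a single `value` column, and the query runner is passed in as a callable so the logic can be tried without a database.

```python
METRIC_QUERIES = {
    "entityCount": "MATCH (e:Entity) RETURN count(e) AS value",
    "articleCount": "MATCH (a:Article) RETURN count(a) AS value",
    "mentionCount": "MATCH (:Article)-[:MENTIONS]->(:Entity) RETURN count(*) AS value",
}


def collect_metrics(run_query) -> dict:
    """Run each count query and derive the average mentions per article.

    `run_query` takes a Cypher string and returns a list of row dicts.
    """
    metrics = {
        name: run_query(cypher)[0]["value"]
        for name, cypher in METRIC_QUERIES.items()
    }
    articles = metrics["articleCount"]
    # Guard against an empty graph on the very first run.
    metrics["avgMentionsPerArticle"] = (
        metrics["mentionCount"] / articles if articles else 0.0
    )
    return metrics


def neo4j_runner(driver):
    """Adapt a neo4j.Driver (created by the caller) to the run_query interface."""
    def run_query(cypher: str) -> list:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher)]
    return run_query
```

Calling `collect_metrics(neo4j_runner(driver))` after each flow run and appending the result to a log or table gives the time series that the observability metrics are compared against.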

Resources

Watch the Recording

Full live-stream on YouTube: Session 9, October 4, 2022

Session PDF on GitHub

Session slides in PDF format — no code files are present in the repository for this session