Unsupervised Knowledge Graph Construction with Prefect

Session 9 of Going Meta, broadcast on October 4, 2022, tackles the challenge of building a knowledge graph without any manual labeling. Jesus presents an end-to-end pipeline orchestrated with Prefect that scrapes web content, automatically tags entities using Wikidata lookups, and loads structured facts into Neo4j — all without a human annotator in the loop. The session also introduces graph observability: monitoring the growth, coverage, and health of your KG over time as new content flows in.

What You Will Learn

How to design a Prefect flow that coordinates scraping, entity recognition, and graph loading as discrete tasks
How Wikidata’s API can serve as a zero-shot entity linker, matching surface forms to canonical URIs
How to structure scraped content as Neo4j nodes and wire them together with typed relationships
What graph observability means in practice: tracking KG growth, entity coverage, and schema drift across pipeline runs
How unsupervised construction compares to supervised NLP pipelines in terms of precision, recall, and maintenance cost

Tags: Orchestration · Prefect · Wikidata — Broadcast October 4, 2022

This session has no code files in the repository — only the session PDF is available. The code snippets below are illustrative examples based on the patterns described in the session slides. They are provided for reference and may require adaptation for your environment.

Core Concepts

Prefect Orchestration

A Prefect flow wraps the full pipeline as a Python function decorated with @flow. Individual steps — fetch URLs, extract text, tag entities, write to Neo4j — become @task functions that Prefect can schedule, retry, and monitor independently.

Flow
 ├── Task: scrape_article(url) → raw_text
 ├── Task: extract_entities(raw_text) → entity_mentions
 ├── Task: link_to_wikidata(entity_mentions) → uri_map
 └── Task: load_to_neo4j(uri_map, raw_text) → graph_delta
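
The session material does not spell out how the extract_entities step produces its candidate strings. As one possible stand-in, the sketch below treats runs of capitalized words as candidate mentions and leaves disambiguation to the later Wikidata lookup. The function name extract_candidate_mentions and the regular expression are assumptions made for illustration, not the approach shown in the session.

```python
import re

# Assumption: a candidate mention is a run of one or more capitalized words.
_CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def extract_candidate_mentions(raw_text: str) -> list[str]:
    """Return distinct capitalized word runs in order of first appearance."""
    seen: set[str] = set()
    mentions: list[str] = []
    for match in _CANDIDATE.finditer(raw_text):
        # Collapse any inner whitespace (e.g. line breaks) to single spaces.
        mention = " ".join(match.group(0).split())
        if mention not in seen:
            seen.add(mention)
            mentions.append(mention)
    return mentions


if __name__ == "__main__":
    text = "Douglas Adams lived in London before moving to Santa Barbara."
    print(extract_candidate_mentions(text))
```

A heuristic like this also picks up ordinary sentence-initial words, so it tends to over-generate. Candidates that fail to resolve in the linking step can simply be dropped, and they show up as a lower coverage ratio in the observability metrics.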

Wikidata Entity Linking

Instead of training a named-entity recogniser, the pipeline sends candidate strings to the Wikidata Entity Search API. Each mention that resolves to a Wikidata item receives a canonical URI (e.g., http://www.wikidata.org/entity/Q42) which is then used as the node identifier in Neo4j — giving every entity a globally unique, dereferenceable identity from day one.
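
As a rough illustration of this lookup, the sketch below builds a request for the Wikidata wbsearchentities action and maps the top hit to a canonical URI. The helper names, the User-Agent string, and the choice to accept the first hit without further disambiguation are assumptions for the example, not details confirmed by the session.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(mention: str, language: str = "en", limit: int = 1) -> str:
    """Build a wbsearchentities request URL for one surface form."""
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def pick_top_match(payload: dict) -> Optional[dict]:
    """Map the first search hit to a QID, canonical URI, and label."""
    hits = payload.get("search", [])
    if not hits:
        return None  # the mention could not be linked
    top = hits[0]
    qid = top["id"]
    # Fall back to the standard entity URI pattern if concepturi is absent.
    uri = top.get("concepturi", f"http://www.wikidata.org/entity/{qid}")
    return {"qid": qid, "uri": uri, "label": top.get("label", "")}


def link_mention(mention: str) -> Optional[dict]:
    """Call the live API (needs network access) and return the top match."""
    request = urllib.request.Request(
        build_search_url(mention),
        headers={"User-Agent": "kg-pipeline-example/0.1"},  # hypothetical agent
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return pick_top_match(json.load(response))
```

Keeping the response parsing in pick_top_match separate from the network call makes the linking logic testable offline. Taking the first hit favours recall over precision, so a production pipeline would likely add a disambiguation step.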

Graph Observability

After each pipeline run, a set of monitoring queries measures:

Metric	What it tracks
Node count delta	How many new entities were added
Relationship density	Average degree growth per entity type
Coverage ratio	Fraction of article mentions successfully linked
Schema drift	New labels or relationship types introduced

Graph observability is the knowledge-graph equivalent of data pipeline monitoring. Tracking coverage and density over time helps you detect when the source content changes shape before it silently degrades query quality.
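
To make the metrics above concrete, here is a minimal sketch that compares two per-run snapshots and reports node count delta, coverage ratio, and schema drift. The RunSnapshot structure and its field names are hypothetical; in practice the values would come from Cypher queries executed after each flow run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSnapshot:
    """Hypothetical summary of the graph captured after one pipeline run."""
    entity_count: int
    mentions_seen: int      # candidate mentions found in the scraped articles
    mentions_linked: int    # mentions that resolved to a Wikidata item
    labels: frozenset = frozenset()
    rel_types: frozenset = frozenset()


def coverage_ratio(snapshot: RunSnapshot) -> float:
    """Fraction of mentions that were successfully linked."""
    if snapshot.mentions_seen == 0:
        return 0.0
    return snapshot.mentions_linked / snapshot.mentions_seen


def compare_runs(previous: RunSnapshot, current: RunSnapshot) -> dict:
    """Report growth, coverage, and schema drift between two runs."""
    return {
        "node_count_delta": current.entity_count - previous.entity_count,
        "coverage_ratio": coverage_ratio(current),
        "coverage_change": coverage_ratio(current) - coverage_ratio(previous),
        # Labels or relationship types present now but absent before.
        "new_labels": sorted(current.labels - previous.labels),
        "new_rel_types": sorted(current.rel_types - previous.rel_types),
    }


if __name__ == "__main__":
    before = RunSnapshot(100, 50, 40, frozenset({"Entity", "Article"}), frozenset({"MENTIONS"}))
    after = RunSnapshot(130, 80, 60, frozenset({"Entity", "Article", "Topic"}), frozenset({"MENTIONS"}))
    print(compare_runs(before, after))
```

A falling coverage_change or a non-empty new_labels list between runs is the kind of signal that suggests the source content has changed shape and is worth a closer look.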

Pipeline Design Walkthrough

Define the Prefect flow skeleton

Prefect logs each task run independently and can retry a task that is configured with retries, which makes the pipeline robust to transient scraping or API failures.

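from prefect import flow, task

# Retry settings are illustrative values for the network-bound tasks.
@task(retries=3, retry_delay_seconds=10)
def scrape_article(url: str) -> str:
    # fetch page HTML, extract main text
    ...

@task
def extract_entities(raw_text: str) -> list[str]:
    # pull candidate entity mentions out of the article text
    ...

@task(retries=3, retry_delay_seconds=10)
def link_to_wikidata(mentions: list[str]) -> dict:
    # call Wikidata search API for each mention
    ...

@task
def load_to_neo4j(uri_map: dict, source_url: str) -> None:
    # MERGE nodes and relationships into Neo4j
    ...

@flow
def build_kg(urls: list[str]) -> None:
    # process each article in sequence: scrape, extract, link, load
    for url in urls:
        text = scrape_article(url)
        mentions = extract_entities(text)
        uri_map = link_to_wikidata(mentions)
        load_to_neo4j(uri_map, url)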

Merge entities with canonical URIs

Using MERGE on the Wikidata URI ensures that the same real-world entity appearing in multiple articles is represented by a single node in Neo4j.

MERGE (e:Entity { uri: $wikidataUri })
SET e.label = $label, e.wikidataId = $qid
MERGE (a:Article { url: $sourceUrl })
MERGE (a)-[:MENTIONS]->(e)
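
The query above expects four parameters. The sketch below shows one way to assemble them in Python from a linked entity, deriving $qid from the trailing segment of the Wikidata URI. The helper names qid_from_uri and merge_params are hypothetical, and the validation rule (a "Q" followed by digits) assumes only Wikidata items are linked.

```python
MERGE_QUERY = """
MERGE (e:Entity { uri: $wikidataUri })
SET e.label = $label, e.wikidataId = $qid
MERGE (a:Article { url: $sourceUrl })
MERGE (a)-[:MENTIONS]->(e)
"""


def qid_from_uri(wikidata_uri: str) -> str:
    """Extract the item ID (e.g. Q42) from a Wikidata entity URI."""
    qid = wikidata_uri.rstrip("/").rsplit("/", 1)[-1]
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item URI: {wikidata_uri!r}")
    return qid


def merge_params(wikidata_uri: str, label: str, source_url: str) -> dict:
    """Build the parameter map expected by MERGE_QUERY."""
    return {
        "wikidataUri": wikidata_uri,
        "label": label,
        "qid": qid_from_uri(wikidata_uri),
        "sourceUrl": source_url,
    }


if __name__ == "__main__":
    params = merge_params(
        "http://www.wikidata.org/entity/Q42",
        "Douglas Adams",
        "https://example.com/article-1",  # placeholder article URL
    )
    print(params)
```

With the official Neo4j Python driver, a dictionary like this can be passed as the parameters argument to session.run(MERGE_QUERY, params). Passing values as parameters rather than formatting them into the query string keeps scraped text out of the Cypher itself.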

Monitor KG growth after each run

Run these observability queries after each Prefect flow execution to track pipeline health.

// Total entity and article counts (two separate queries)
MATCH (e:Entity) RETURN count(e) AS entityCount;
MATCH (a:Article) RETURN count(a) AS articleCount;

// Average mentions per article (coverage density)
// Count mentions per article first, so each article contributes one value,
// including articles with no linked entities.
MATCH (a:Article)
OPTIONAL MATCH (a)-[:MENTIONS]->(e:Entity)
WITH a, count(e) AS mentionCount
RETURN avg(mentionCount) AS avgMentionsPerArticle;

// Entities with the most cross-article coverage
MATCH (e:Entity)<-[:MENTIONS]-(a:Article)
RETURN e.label, count(a) AS articleCount
ORDER BY articleCount DESC LIMIT 10;

Resources

Watch the Recording

Full live-stream on YouTube: Session 9, October 4, 2022

Session PDF on GitHub

Session slides in PDF format — no code files are present in the repository for this session
