Session 11 of Going Meta, broadcast on December 6, 2022, closes Season 1 with a fresh take on graph data quality. Inspired by the Great Expectations framework for tabular data, Jesus introduces the graphexpectations Python library: a way to express data quality rules as readable Python method calls, execute them against a live Neo4j database, and automatically serialize the results as SHACL shapes for version control and cross-tool compatibility.

What You Will Learn

  • How to install and import the graphexpectations package
  • How to define expectations per node type: regex patterns, value ranges, cardinality constraints, and type-checking
  • How to target a subset of nodes using a Cypher pattern (query-based selection) rather than a label
  • How to combine multiple expectation sets into a reusable Suite
  • How to serialize a suite as SHACL (Turtle) for storage and interoperability with other tools
  • How to bind the suite to a running Neo4j database and execute all validations in one call
  • How to visualize the violation breakdown with Plotly
Tags: Python · Data Quality · SHACL — Broadcast December 6, 2022

The Five-Step Workflow

1. Install dependencies and import the library

# pip install neo4j rdflib
# pip install -i https://test.pypi.org/simple/ graphexpectations

import graphexpectations as ge
import pandas as pd

2. Define expectation sets per node type

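Each ge.Set targets either a node label or a Cypher graph pattern, and its methods state each rule as a readable, English-like call. The message argument tags the rule with an identifier (for example R001_INVALID_COUNTRY) that shows up again in the SHACL output and in the validation results.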
supplierExpectations = ge.Set(nodeType="Supplier")
supplierExpectations.expect_property_values_to_match_regex(
    property="country", regex="^[A-Za-z]+$", message="R001_INVALID_COUNTRY"
)
supplierExpectations.expect_number_of_incoming_relationship_to_be_between(
    relationship="supplied_by", min=2, message="R002_LOW_PRODUCT_OFFERING"
)

productExpectations = ge.Set("Product")
productExpectations.expect_number_of_property_values_to_be_between(
    property="unitPrice", min=1, max=1, message="R003_SINGLE_PRICE"
)
productExpectations.expect_property_values_to_be_between(
    property="unitPrice", minInclusive=10, maxExclusive=500, message="R004_PRICE_LIMIT"
)

customerExpectations = ge.Set("Customer")
customerExpectations.expect_outgoing_relationship_to_connect_to_nodes_of_type(
    relationship="places", targetType="Order", message="R005_CUST_BAD_SCHEMA"
)
customerExpectations.expect_number_of_outgoing_relationship_to_be_between(
    relationship="places", min=1, message="R006_CUST_NO_ORDERS"
)

# Query-based selection: only US-supplied products
americanProducts = ge.Set(query=" (focus:Product)-[:supplied_by]->(:Supplier { country: 'USA' }) ")
americanProducts.expect_property_values_to_be_between(
    property="productID", minExclusive=10, message="R007_US_PROD_ID"
)
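
To make the meaning of two of these rules concrete, here is a minimal sketch that applies the R001 regex and the R004 price range to a few values in plain Python, without touching Neo4j or graphexpectations. The sample values and helper names are hypothetical, and the sketch assumes that Python's re module treats this simple pattern the same way a SHACL engine would.

```python
import re

# Hypothetical sample values; they are not taken from the Northwind data.
countries = ["USA", "Germany", "New Zealand", "UK2"]
unit_prices = [9.99, 10, 250.0, 500]

# Same pattern as rule R001_INVALID_COUNTRY: letters only, no spaces or digits.
COUNTRY_PATTERN = re.compile(r"^[A-Za-z]+$")


def violates_country_rule(value: str) -> bool:
    """True when the value does not match the R001 pattern."""
    return COUNTRY_PATTERN.fullmatch(value) is None


def violates_price_rule(value: float) -> bool:
    """True when the value falls outside R004: minInclusive=10, maxExclusive=500."""
    return not (10 <= value < 500)


bad_countries = [c for c in countries if violates_country_rule(c)]
bad_prices = [p for p in unit_prices if violates_price_rule(p)]

print(bad_countries)  # ['New Zealand', 'UK2']
print(bad_prices)     # [9.99, 500]
```

Note that the lower bound is inclusive (10 passes) while the upper bound is exclusive (500 fails), and that a multi-word country name such as "New Zealand" would be flagged by this pattern.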

3. Combine expectation sets into a Suite

s = ge.Suite(desc="suite of expectations for my Neo4j Northwind KG")
s.add_expectations([supplierExpectations, productExpectations, customerExpectations, americanProducts])

4. Serialize to SHACL for version control

Call s.serialise() to get a Turtle-formatted SHACL document. Each expectation set becomes a node shape, and each expectation in it becomes a property constraint on that shape. The result is ready to commit to git or share with SHACL-native tools.
print(s.serialise())
Example output (abbreviated):
@prefix ns1: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a ns1:NodeShape ;
    ns1:property [ ns1:maxExclusive 500 ;
            ns1:message "R004_PRICE_LIMIT" ;
            ns1:minInclusive 10 ;
            ns1:path <neo4j://graph.schema#unitPrice> ],
        [ ns1:maxCount 1 ;
            ns1:message "R003_SINGLE_PRICE" ;
            ns1:minCount 1 ;
            ns1:path <neo4j://graph.schema#unitPrice> ] ;
    ns1:targetClass <neo4j://graph.schema#Product> .

[] a ns1:NodeShape ;
    ns1:property [ ns1:message "R007_US_PROD_ID" ;
            ns1:minExclusive 10 ;
            ns1:path <neo4j://graph.schema#productID> ] ;
    ns1:targetQuery " (focus:Product)-[:supplied_by]->(:Supplier { country: 'USA' }) " .
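
One way to put the serialized suite under version control is to write it to a .ttl file and only rewrite the file when the content changes. The sketch below assumes that s.serialise() returns a string (the source only shows it being printed) and uses a short stand-in string so it runs without the library. The helper name save_shapes and the file name are hypothetical.

```python
from pathlib import Path
import tempfile


def save_shapes(shacl_text: str, path: Path) -> bool:
    """Write the SHACL text to `path`; return True only if the file changed."""
    old = path.read_text(encoding="utf-8") if path.exists() else None
    if old == shacl_text:
        return False
    path.write_text(shacl_text, encoding="utf-8")
    return True


# Stand-in for s.serialise(), which is assumed (not confirmed) to return a str.
shacl_text = "@prefix sh: <http://www.w3.org/ns/shacl#> .\n"

# A temporary directory keeps the sketch self-contained; in practice this
# would be a path inside your git repository.
shapes_file = Path(tempfile.mkdtemp()) / "northwind_expectations.ttl"

first = save_shapes(shacl_text, shapes_file)   # file is created
second = save_shapes(shacl_text, shapes_file)  # nothing to update
print(first, second)  # True False
```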

5. Bind to a database, run, and visualize results

Bind the suite to a live Neo4j instance, run all validations, load the violations into a pandas DataFrame, and plot the breakdown.
context = s.bind_to_db("bolt://<host>:7687", "neo4j", "<password>")

df = pd.DataFrame(context.run())

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

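from IPython.display import display  # available by default in notebooks; explicit import for other environments
display(df[['node','nodeType','violationType','offendingValue','schemaElement','comment','msg']])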

# Visualize violation counts per rule
import plotly
pd.options.plotting.backend = "plotly"
aggregate = df[["msg","node"]].groupby(["msg"]).count().rename(columns={'node':'violation_count'})
fig = aggregate.plot.bar()
fig.show()
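
If you only need the per-rule counts, for example in a script that runs without pandas or Plotly, the same aggregation can be sketched with the standard library. This assumes that context.run() yields dict-like records, which the pd.DataFrame(context.run()) call suggests but the source does not confirm. The records below are hypothetical and use only key names that appear in the displayed columns.

```python
from collections import Counter

# Hypothetical violation records standing in for the output of context.run().
violations = [
    {"node": 1, "nodeType": "Product", "msg": "R004_PRICE_LIMIT"},
    {"node": 2, "nodeType": "Product", "msg": "R004_PRICE_LIMIT"},
    {"node": 7, "nodeType": "Customer", "msg": "R006_CUST_NO_ORDERS"},
]

# Count violations per rule message, mirroring the groupby(["msg"]).count() step.
counts = Counter(v["msg"] for v in violations)

for msg, n in counts.most_common():
    print(f"{msg}: {n}")
```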

graphexpectations vs. SHACL Directly

graphexpectations

Python-native API, readable method names, integrates with pandas and Plotly for reporting, generates SHACL automatically as an output artifact.

SHACL with n10s

Standards-based, portable across any SHACL-compliant engine, and directly supported in Neo4j through the n10s (neosemantics) procedure n10s.validation.shacl.validate.
The two approaches are complementary: use graphexpectations when you want a developer-friendly authoring experience and use the serialized SHACL output when you need to share rules with other systems or enforce them at the RDF layer.
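
As a rough sketch of the second route, the serialized shapes could be loaded into n10s and validated from Python. The helper below takes any session-like object with a run(query, **params) method, which is how a neo4j driver session is called. The FakeSession class, its canned record, and the helper name are hypothetical and exist only so the sketch runs without a database. It also assumes the n10s plugin is installed and configured on the target database, and that n10s accepts every construct in the generated shapes, which the source does not confirm.

```python
# Cypher calls to the n10s SHACL procedures; the shapes are passed as a parameter.
IMPORT_SHAPES = "CALL n10s.validation.shacl.import.inline($shacl, 'Turtle')"
VALIDATE = "CALL n10s.validation.shacl.validate()"


def validate_with_n10s(session, shacl_text):
    """Load the shapes into n10s, run validation, and return the rows as dicts."""
    # Exhaust the import result so the shapes are loaded before validating.
    list(session.run(IMPORT_SHAPES, shacl=shacl_text))
    return [dict(record) for record in session.run(VALIDATE)]


class FakeSession:
    """Hypothetical stand-in for a neo4j session, used only for this sketch."""

    def __init__(self):
        self.queries = []

    def run(self, query, **params):
        self.queries.append(query)
        if query == VALIDATE:
            # Canned record with illustrative keys, not real n10s output.
            return iter([{"focusNode": 42, "resultMessage": "R004_PRICE_LIMIT"}])
        return iter([])


session = FakeSession()
rows = validate_with_n10s(session, "@prefix sh: <http://www.w3.org/ns/shacl#> .")
print(rows)
```

With a real database, the session would come from the neo4j driver (GraphDatabase.driver(...).session()) instead of FakeSession.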

Resources

Watch the Recording

Full live-stream on YouTube — Session 11, December 6 2022

Source Code on GitHub

Colab notebook, graphexpectations examples, and SHACL output
