Neocarta ships two BigQuery connectors that complement each other. BigQuerySchemaConnector reads structural metadata — tables, columns, foreign keys, and sample values — from BigQuery Information Schema tables. BigQueryLogsConnector reads historical SQL queries from INFORMATION_SCHEMA.JOBS_BY_PROJECT, parses them, and maps the table and column usage patterns into the graph. Run both together for the most complete picture of your dataset.

BigQuerySchemaConnector

Extracts structural metadata for a BigQuery dataset and maps it to the Neocarta graph schema. REFERENCES edges are created only when primary and foreign key constraints are declared on the tables, so that they appear in the BigQuery Information Schema.

What it extracts:
  • Database node representing the GCP project
  • Schema nodes for each dataset
  • Table nodes with descriptions
  • Column nodes with types, nullability, primary/foreign key flags, and descriptions
  • (:Column)-[:REFERENCES]->(:Column) edges from foreign key definitions
  • Value nodes with sampled unique column values
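
One quick way to confirm that foreign keys were picked up is to count the REFERENCES edges after an ingest. The helper below is a minimal sketch, not part of Neocarta: count_reference_edges is a hypothetical name, and it relies only on the (:Column)-[:REFERENCES]->(:Column) pattern listed above and on Driver.execute_query from the Neo4j Python driver (5.x).

```python
from typing import Any

# Pattern taken from the list above; no node property names are assumed.
REFERENCES_COUNT_QUERY = (
    "MATCH (:Column)-[r:REFERENCES]->(:Column) RETURN count(r) AS n"
)


def count_reference_edges(driver: Any, database: str = "neo4j") -> int:
    """Return the number of REFERENCES edges in the target database.

    `driver` is expected to be a connected neo4j.Driver (5.x), whose
    execute_query returns a (records, summary, keys) result.
    """
    records, _summary, _keys = driver.execute_query(
        REFERENCES_COUNT_QUERY, database_=database
    )
    return records[0]["n"]
```

If this returns 0 for a dataset you expect to contain relationships, a likely cause is that the keys are not declared as constraints in BigQuery.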

Import

from neocarta.connectors.bigquery import BigQuerySchemaConnector

Parameters

client
bigquery.Client
required
An authenticated BigQuery client. The client’s project attribute is used as the project ID when project_id is omitted.
project_id
str
required
GCP project ID. Falls back to client.project when not supplied explicitly.
neo4j_driver
neo4j.Driver
required
Connected Neo4j driver instance. The caller owns the driver; the connector does not close it.
database_name
str
default:"neo4j"
Target Neo4j database name.
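
The project_id fallback described above can be pictured as follows. This is an illustrative sketch of the documented behavior, not the connector's actual code: resolve_project_id is a hypothetical helper, and a SimpleNamespace stands in for bigquery.Client, which exposes a project attribute.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_project_id(client: Any, project_id: Optional[str] = None) -> str:
    """Prefer an explicit project_id; otherwise fall back to client.project."""
    resolved = project_id or client.project
    if not resolved:
        raise ValueError("No project ID given and client.project is empty")
    return resolved


# Stand-in for bigquery.Client, used here only to keep the example offline.
fake_client = SimpleNamespace(project="my-proj")
print(resolve_project_id(fake_client))                # my-proj
print(resolve_project_id(fake_client, "other-proj"))  # other-proj
```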

ingest() Parameters

dataset_id
str
The BigQuery dataset to ingest. Pass it here rather than to the constructor.

Code Example

import os
from dotenv import load_dotenv
from google.cloud import bigquery
from neo4j import GraphDatabase
from neocarta.connectors.bigquery import BigQuerySchemaConnector

load_dotenv()

neo4j_driver = GraphDatabase.driver(
    uri=os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
)
neo4j_database = os.getenv("NEO4J_DATABASE", "neo4j")
bigquery_client = bigquery.Client(project=os.getenv("GCP_PROJECT_ID"))

connector = BigQuerySchemaConnector(
    client=bigquery_client,
    project_id=os.getenv("GCP_PROJECT_ID"),
    neo4j_driver=neo4j_driver,
    database_name=neo4j_database,
)

connector.ingest(dataset_id=os.getenv("BIGQUERY_DATASET_ID"))

neo4j_driver.close()

CLI

pip install "neocarta[cli]"

neocarta bigquery schema \
  --project-id my-proj \
  --dataset-id sales

# With optional embeddings generation:
neocarta bigquery schema \
  --project-id my-proj \
  --dataset-id sales \
  --embeddings

Required Environment Variables

Variable             Purpose
NEO4J_URI            Neo4j connection URI
NEO4J_USERNAME       Neo4j username
NEO4J_PASSWORD       Neo4j password
NEO4J_DATABASE       Target Neo4j database (default: neo4j)
GCP_PROJECT_ID       GCP project ID
BIGQUERY_DATASET_ID  BigQuery dataset to ingest
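
Because os.getenv silently returns None for unset variables, it can help to fail fast before constructing any clients. The check below is a small sketch under that assumption; missing_env_vars and REQUIRED_ENV_VARS are hypothetical names, and NEO4J_DATABASE is left out because it has a default.

```python
import os
from typing import List, Mapping

# Variables from the table above that have no default value.
REQUIRED_ENV_VARS = (
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "GCP_PROJECT_ID",
    "BIGQUERY_DATASET_ID",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Call missing_env_vars() after load_dotenv() and stop with a clear error message if the returned list is not empty.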

BigQueryLogsConnector

Reads SQL queries from INFORMATION_SCHEMA.JOBS_BY_PROJECT, parses them to discover table and column usage, and loads query patterns into the graph. This reveals how your data is actually used, rather than relying solely on the declared schema.

What it extracts:
  • SQL queries from BigQuery job history
  • Tables and columns referenced in each query (via SQL parsing)
  • Join relationships between tables (from SQL JOIN clauses)

Graph additions:

Node / Relationship                   Properties
Query                                 content (query text), query_id (hash of content)
(:Query)-[:USES_TABLE]->(:Table)      (none listed)
(:Query)-[:USES_COLUMN]->(:Column)    (none listed)
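
Once Query nodes are loaded, the USES_TABLE edges can be aggregated to see which tables are queried most. The following is a hedged sketch rather than a Neocarta feature: most_queried_tables is a hypothetical helper, it uses only the labels and relationship type listed above, and it returns whole Table nodes because this page does not document their property names.

```python
from typing import Any, List, Tuple

# Count how many distinct Query nodes touch each Table node.
TABLE_USAGE_QUERY = (
    "MATCH (q:Query)-[:USES_TABLE]->(t:Table) "
    "RETURN t AS tbl, count(DISTINCT q) AS uses "
    "ORDER BY uses DESC LIMIT $top_n"
)


def most_queried_tables(
    driver: Any, database: str = "neo4j", top_n: int = 10
) -> List[Tuple[Any, int]]:
    """Return (table_node, query_count) pairs, most used first.

    `driver` is expected to be a connected neo4j.Driver (5.x).
    """
    records, _summary, _keys = driver.execute_query(
        TABLE_USAGE_QUERY,
        parameters_={"top_n": top_n},
        database_=database,
    )
    return [(record["tbl"], record["uses"]) for record in records]
```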

Import

from neocarta.connectors.bigquery import BigQueryLogsConnector

Parameters

client
bigquery.Client
required
An authenticated BigQuery client.
project_id
str
required
GCP project ID.
neo4j_driver
neo4j.Driver
required
Connected Neo4j driver instance.
database_name
str
default:"neo4j"
Target Neo4j database name.

ingest() Parameters

dataset_id
str
required
The BigQuery dataset to filter queries by.
region
str
default:"region-us"
The BigQuery region for INFORMATION_SCHEMA.JOBS_BY_PROJECT.
start_timestamp
str
Optional ISO-8601 start of the query window, e.g. "2024-01-01 00:00:00".
end_timestamp
str
Optional ISO-8601 end of the query window, e.g. "2024-01-31 23:59:59".
limit
int
default:"100"
Maximum number of queries to extract.
drop_failed_queries
bool
default:"true"
Whether to exclude failed queries from the extract.
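
For rolling windows, start_timestamp and end_timestamp can be computed rather than hard-coded. The sketch below formats both values in the same "YYYY-MM-DD HH:MM:SS" shape as the examples above. query_window is a hypothetical helper, and treating the values as UTC is an assumption that this page does not confirm.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Same layout as the documented examples, e.g. "2024-01-31 23:59:59".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def query_window(days: int, now: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (start_timestamp, end_timestamp) covering the last `days` days."""
    end = now or datetime.now(timezone.utc)  # assumption: UTC timestamps
    start = end - timedelta(days=days)
    return start.strftime(TIMESTAMP_FORMAT), end.strftime(TIMESTAMP_FORMAT)


start_ts, end_ts = query_window(30)
# These could then be passed as:
# connector.ingest(dataset_id=..., start_timestamp=start_ts, end_timestamp=end_ts)
```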

Code Example

import os
from dotenv import load_dotenv
from google.cloud import bigquery
from neo4j import GraphDatabase
from neocarta.connectors.bigquery import BigQueryLogsConnector

load_dotenv()

neo4j_driver = GraphDatabase.driver(
    uri=os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
)
neo4j_database = os.getenv("NEO4J_DATABASE", "neo4j")
bigquery_client = bigquery.Client(project=os.getenv("GCP_PROJECT_ID"))

connector = BigQueryLogsConnector(
    client=bigquery_client,
    project_id=os.getenv("GCP_PROJECT_ID"),
    neo4j_driver=neo4j_driver,
    database_name=neo4j_database,
)

connector.ingest(
    dataset_id=os.getenv("BIGQUERY_DATASET_ID"),
    region="region-us",
    start_timestamp="2024-01-01 00:00:00",   # optional
    end_timestamp="2024-01-31 23:59:59",     # optional
    limit=500,
    drop_failed_queries=True,
)

neo4j_driver.close()

CLI

neocarta bigquery logs \
  --dataset-id sales \
  --limit 500

# Output as JSON:
neocarta bigquery logs \
  --dataset-id sales \
  --limit 500 \
  --json

Required Environment Variables

Variable             Purpose
NEO4J_URI            Neo4j connection URI
NEO4J_USERNAME       Neo4j username
NEO4J_PASSWORD       Neo4j password
NEO4J_DATABASE       Target Neo4j database (default: neo4j)
GCP_PROJECT_ID       GCP project ID
BIGQUERY_DATASET_ID  BigQuery dataset to filter queries by

Combining Both Connectors

Run BigQuerySchemaConnector first to populate the schema graph, then BigQueryLogsConnector to layer in query usage patterns. The logs connector MERGEs any schema nodes it discovers, so it also works standalone. With the schema loaded first, however, the USES_TABLE and USES_COLUMN edges connect to fully enriched nodes rather than skeleton stubs.

import os
from google.cloud import bigquery
from neo4j import GraphDatabase
from neocarta.connectors.bigquery import BigQuerySchemaConnector, BigQueryLogsConnector

driver = GraphDatabase.driver(
    uri=os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
)
client = bigquery.Client(project=os.getenv("GCP_PROJECT_ID"))
dataset_id = os.getenv("BIGQUERY_DATASET_ID")
project_id = os.getenv("GCP_PROJECT_ID")

# 1. Load schema metadata (tables, columns, foreign keys, sample values)
BigQuerySchemaConnector(
    client=client,
    project_id=project_id,
    neo4j_driver=driver,
).ingest(dataset_id=dataset_id)

# 2. Layer in historical query usage patterns
BigQueryLogsConnector(
    client=client,
    project_id=project_id,
    neo4j_driver=driver,
).ingest(dataset_id=dataset_id, limit=500)

driver.close()
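
Since the caller owns the driver, a failed ingest in the script above would skip driver.close(). One possible pattern, sketched here with a hypothetical run_then_close helper, is to wrap the ordered steps in try/finally so the driver is released either way.

```python
from typing import Any, Callable, Iterable


def run_then_close(driver: Any, steps: Iterable[Callable[[], None]]) -> None:
    """Run each step in order, then close the driver even if a step raises."""
    try:
        for step in steps:
            step()
    finally:
        driver.close()


# Usage sketch (schema_connector and logs_connector are hypothetical
# variables holding the two connectors constructed as shown above):
# run_then_close(driver, [
#     lambda: schema_connector.ingest(dataset_id=dataset_id),
#     lambda: logs_connector.ingest(dataset_id=dataset_id, limit=500),
# ])
```

The Neo4j Python driver can also be used as a context manager (with GraphDatabase.driver(...) as driver:), which closes it on exit in the same way.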