BigQuery Schema and Query Log Connectors

Neocarta ships two BigQuery connectors that complement each other. BigQuerySchemaConnector reads structural metadata (tables, columns, foreign keys, and sample values) from the BigQuery INFORMATION_SCHEMA views. BigQueryLogsConnector reads historical SQL queries from INFORMATION_SCHEMA.JOBS_BY_PROJECT, parses them, and maps the table and column usage patterns into the graph. Run both together for the most complete picture of your dataset.

BigQuerySchemaConnector

Extracts structural metadata for a BigQuery dataset and maps it to the Neocarta graph schema. REFERENCES edges are created only where primary and foreign keys are defined in the BigQuery Information Schema.

What it extracts:

Database node representing the GCP project
Schema nodes for each dataset
Table nodes with descriptions
Column nodes with types, nullability, primary/foreign key flags, and descriptions
(:Column)-[:REFERENCES]->(:Column) edges from foreign key definitions
Value nodes with sampled unique column values

Import

from neocarta.connectors.bigquery import BigQuerySchemaConnector

Parameters

client

bigquery.Client

required

An authenticated BigQuery client. The client’s project attribute is used as the project ID when project_id is omitted.

project_id

str

default:client.project

GCP project ID. Falls back to client.project when not supplied explicitly.

neo4j_driver

neo4j.Driver

required

Connected Neo4j driver instance. The caller owns the driver; the connector does not close it.

database_name

str

default:"neo4j"

Target Neo4j database name.
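
The project_id fallback described above can be pictured with a short sketch. This is not the connector's actual code: `resolve_project_id` is a hypothetical helper, and the stand-in client only mimics the `project` attribute of a real `bigquery.Client`.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_project_id(client, project_id: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit project_id, else client.project."""
    resolved = project_id or getattr(client, "project", None)
    if not resolved:
        raise ValueError("No project ID given and client.project is not set")
    return resolved


# Stand-in for an authenticated bigquery.Client (assumption, illustration only).
fake_client = SimpleNamespace(project="my-proj")

explicit = resolve_project_id(fake_client, "other-proj")  # explicit value wins
fallback = resolve_project_id(fake_client)                # falls back to client.project
```

Passing project_id explicitly remains the clearer choice when the client was created for a different billing project than the one you want to ingest.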

`ingest()` Parameters

dataset_id

str

required

The BigQuery dataset to ingest. Pass it here rather than to the constructor.

Code Example

import os
from dotenv import load_dotenv
from google.cloud import bigquery
from neo4j import GraphDatabase
from neocarta.connectors.bigquery import BigQuerySchemaConnector

load_dotenv()

neo4j_driver = GraphDatabase.driver(
    uri=os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
)
neo4j_database = os.getenv("NEO4J_DATABASE", "neo4j")
bigquery_client = bigquery.Client(project=os.getenv("GCP_PROJECT_ID"))

connector = BigQuerySchemaConnector(
    client=bigquery_client,
    project_id=os.getenv("GCP_PROJECT_ID"),
    neo4j_driver=neo4j_driver,
    database_name=neo4j_database,
)

connector.ingest(dataset_id=os.getenv("BIGQUERY_DATASET_ID"))

neo4j_driver.close()

CLI

pip install "neocarta[cli]"

neocarta bigquery schema \
  --project-id my-proj \
  --dataset-id sales

# With optional embeddings generation:
neocarta bigquery schema \
  --project-id my-proj \
  --dataset-id sales \
  --embeddings

Required Environment Variables

Variable	Purpose
`NEO4J_URI`	Neo4j connection URI
`NEO4J_USERNAME`	Neo4j username
`NEO4J_PASSWORD`	Neo4j password
`NEO4J_DATABASE`	Target Neo4j database (default: `neo4j`)
`GCP_PROJECT_ID`	GCP project ID
`BIGQUERY_DATASET_ID`	BigQuery dataset to ingest
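
Because the code example reads every setting with os.getenv, a missing variable shows up only later as a None argument. A minimal pre-flight check, assuming the variable names from the table above, might look like this. `missing_env_vars` is a hypothetical helper and not part of Neocarta. NEO4J_DATABASE is left out because it has a default.

```python
import os
from typing import Mapping, Sequence

# Variables from the table above that have no default value.
REQUIRED_VARS = (
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "GCP_PROJECT_ID",
    "BIGQUERY_DATASET_ID",
)


def missing_env_vars(
    names: Sequence[str] = REQUIRED_VARS,
    environ: Mapping[str, str] = os.environ,
) -> list[str]:
    """Hypothetical helper: return the names that are unset or empty."""
    return [name for name in names if not environ.get(name)]


# Demonstration against an explicit mapping instead of the real environment.
sample_env = {"NEO4J_URI": "bolt://localhost:7687", "GCP_PROJECT_ID": "my-proj"}
missing = missing_env_vars(environ=sample_env)
```

In a real script you would call missing_env_vars() after load_dotenv() and stop with a clear message if the returned list is not empty.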

BigQueryLogsConnector

Reads SQL queries from INFORMATION_SCHEMA.JOBS_BY_PROJECT, parses them to discover table and column usage, and loads query patterns into the graph. This reveals how your data is actually used, rather than relying solely on the declared schema.

What it extracts:

SQL queries from BigQuery job history
Tables and columns referenced in each query (via SQL parsing)
Join relationships between tables (from SQL JOIN clauses)
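
To give a feel for what discovering tables from query text involves, here is a deliberately naive sketch. It is an assumption for illustration only: the page does not say which SQL parser the connector uses, and a regular expression like this one misses many cases that a real parser handles (for example, it would misread `EXTRACT(YEAR FROM ts)` as a table reference).

```python
import re

# Matches the identifier that follows FROM or JOIN, optionally backtick-quoted.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+`?([A-Za-z_][\w.\-]*)`?", re.IGNORECASE)


def referenced_tables(sql: str) -> list[str]:
    """Toy extractor: table names after FROM/JOIN, de-duplicated in order."""
    seen: list[str] = []
    for name in _TABLE_REF.findall(sql):
        if name not in seen:
            seen.append(name)
    return seen


sample_sql = (
    "SELECT o.id, c.name "
    "FROM `my-proj.sales.orders` AS o "
    "JOIN sales.customers c ON o.customer_id = c.id"
)
tables = referenced_tables(sample_sql)
```

Both tables in the JOIN are found, which is the raw material for the join relationships listed above.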

Graph additions:

Node / Relationship	Properties
`Query`	`content` (query text), `query_id` (hash of content)
`(:Query)-[:USES_TABLE]->(:Table)`	—
`(:Query)-[:USES_COLUMN]->(:Column)`	—
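
Deriving query_id from a hash of the query text means that identical queries collapse onto a single Query node. The sketch below illustrates the idea. The choice of SHA-256 and the helper name `query_id_for` are assumptions; the page does not specify which hash function the connector uses.

```python
import hashlib


def query_id_for(content: str) -> str:
    """Hypothetical: derive a stable ID from query text (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


first = query_id_for("SELECT id FROM sales.orders")
repeat = query_id_for("SELECT id FROM sales.orders")
other = query_id_for("SELECT id FROM sales.customers")
```

Here `first` and `repeat` are equal while `other` differs, so re-ingesting the same job history would not create duplicate nodes under this scheme.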

Import

from neocarta.connectors.bigquery import BigQueryLogsConnector

Parameters

client

bigquery.Client

required

An authenticated BigQuery client.

project_id

str

required

GCP project ID.

neo4j_driver

neo4j.Driver

required

Connected Neo4j driver instance.

database_name

str

default:"neo4j"

Target Neo4j database name.

`ingest()` Parameters

dataset_id

str

required

The BigQuery dataset to filter queries by.

region

str

default:"region-us"

The BigQuery region for INFORMATION_SCHEMA.JOBS_BY_PROJECT.

start_timestamp

str

Optional ISO-8601 start of the query window, e.g. "2024-01-01 00:00:00".

end_timestamp

str

Optional ISO-8601 end of the query window, e.g. "2024-01-31 23:59:59".

limit

int

default:100

Maximum number of queries to extract.

drop_failed_queries

bool

default:True

Whether to exclude failed queries from the extract.
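
The parameters above map naturally onto filters over the JOBS_BY_PROJECT view. The following sketch shows one way such a query could be assembled. It is not the SQL the connector actually issues: `build_jobs_query` is a hypothetical helper, and the filter choices (for example, treating a non-null error_result as a failed query) are assumptions based on the view's documented columns.

```python
import re
from typing import Optional


def build_jobs_query(
    dataset_id: str,
    region: str = "region-us",
    start_timestamp: Optional[str] = None,
    end_timestamp: Optional[str] = None,
    limit: int = 100,
    drop_failed_queries: bool = True,
) -> tuple[str, dict]:
    """Hypothetical helper: return SQL text plus named query parameter values."""
    # The region qualifier is interpolated into the SQL, so validate it first.
    if not re.fullmatch(r"region-[a-z0-9-]+", region):
        raise ValueError(f"Unexpected region qualifier: {region!r}")

    filters = [
        "job_type = 'QUERY'",
        "EXISTS (SELECT 1 FROM UNNEST(referenced_tables) AS t "
        "WHERE t.dataset_id = @dataset_id)",
    ]
    params = {"dataset_id": dataset_id}
    if start_timestamp is not None:
        filters.append("creation_time >= TIMESTAMP(@start_timestamp)")
        params["start_timestamp"] = start_timestamp
    if end_timestamp is not None:
        filters.append("creation_time <= TIMESTAMP(@end_timestamp)")
        params["end_timestamp"] = end_timestamp
    if drop_failed_queries:
        filters.append("error_result IS NULL")

    lines = [
        f"SELECT query FROM `{region}`.INFORMATION_SCHEMA.JOBS_BY_PROJECT",
        "WHERE " + "\n  AND ".join(filters),
        "ORDER BY creation_time DESC",
        f"LIMIT {int(limit)}",
    ]
    return "\n".join(lines), params


sql, params = build_jobs_query(
    "sales", start_timestamp="2024-01-01 00:00:00", limit=500
)
```

The region must match where the dataset's jobs ran, because each regional INFORMATION_SCHEMA view only lists jobs from that region.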

Code Example

import os
from dotenv import load_dotenv
from google.cloud import bigquery
from neo4j import GraphDatabase
from neocarta.connectors.bigquery import BigQueryLogsConnector

load_dotenv()

neo4j_driver = GraphDatabase.driver(
    uri=os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
)
neo4j_database = os.getenv("NEO4J_DATABASE", "neo4j")
bigquery_client = bigquery.Client(project=os.getenv("GCP_PROJECT_ID"))

connector = BigQueryLogsConnector(
    client=bigquery_client,
    project_id=os.getenv("GCP_PROJECT_ID"),
    neo4j_driver=neo4j_driver,
    database_name=neo4j_database,
)

connector.ingest(
    dataset_id=os.getenv("BIGQUERY_DATASET_ID"),
    region="region-us",
    start_timestamp="2024-01-01 00:00:00",   # optional
    end_timestamp="2024-01-31 23:59:59",     # optional
    limit=500,
    drop_failed_queries=True,
)

neo4j_driver.close()

CLI

neocarta bigquery logs \
  --dataset-id sales \
  --limit 500

# Output as JSON:
neocarta bigquery logs \
  --dataset-id sales \
  --limit 500 \
  --json

Required Environment Variables

Variable	Purpose
`NEO4J_URI`	Neo4j connection URI
`NEO4J_USERNAME`	Neo4j username
`NEO4J_PASSWORD`	Neo4j password
`NEO4J_DATABASE`	Target Neo4j database (default: `neo4j`)
`GCP_PROJECT_ID`	GCP project ID
`BIGQUERY_DATASET_ID`	BigQuery dataset to filter queries by

Combining Both Connectors

Run BigQuerySchemaConnector first to populate the schema graph, then BigQueryLogsConnector to layer in query usage patterns. The logs connector MERGEs schema nodes it discovers, so it works standalone — but with schema loaded first, the USES_TABLE and USES_COLUMN edges connect to fully enriched nodes rather than skeleton stubs.

import os
from google.cloud import bigquery
from neo4j import GraphDatabase
from neocarta.connectors.bigquery import BigQuerySchemaConnector, BigQueryLogsConnector

driver = GraphDatabase.driver(
    uri=os.getenv("NEO4J_URI"),
    auth=(os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")),
)
client = bigquery.Client(project=os.getenv("GCP_PROJECT_ID"))
dataset_id = os.getenv("BIGQUERY_DATASET_ID")
project_id = os.getenv("GCP_PROJECT_ID")

# 1. Load schema metadata (tables, columns, foreign keys, sample values)
BigQuerySchemaConnector(
    client=client,
    project_id=project_id,
    neo4j_driver=driver,
).ingest(dataset_id=dataset_id)

# 2. Layer in historical query usage patterns
BigQueryLogsConnector(
    client=client,
    project_id=project_id,
    neo4j_driver=driver,
).ingest(dataset_id=dataset_id, limit=500)

driver.close()
