
Overview

Datasets are the fundamental organizing unit in Amp. A dataset defines:
  • Schema: Tables, columns, and data types
  • Source: Where data comes from (provider reference or SQL query)
  • Configuration: Extraction parameters, partitioning, sort order
  • Version: Semantic versioning and tags for lifecycle management
Datasets are defined in manifest files (JSON documents) that describe how to extract, transform, and organize blockchain data.

Dataset Types

Amp supports two types of datasets:

Raw Datasets

Raw datasets extract blockchain data directly from providers with minimal transformation. The schema is determined by the provider type. Characteristics:
  • Data extracted directly from blockchain sources
  • Schema defined by provider (evm-rpc, firehose, solana)
  • No custom SQL transformations
  • Tables written as-is to Parquet files
Example use cases:
  • Complete blockchain history (all blocks, transactions, logs)
  • Foundation for derived datasets
  • Archival storage of raw blockchain data
Supported provider types:
Provider   Tables                                               Description
evm-rpc    blocks, transactions, logs                           EVM chains via JSON-RPC
firehose   blocks, transactions, logs, calls                    EVM chains via Firehose gRPC
solana     block_headers, transactions, messages, instructions  Solana via RPC + Old Faithful

Derived Datasets

Derived datasets use SQL queries to transform data from other datasets. They enable data enrichment, filtering, and custom schemas. Characteristics:
  • Reference other datasets as data sources
  • Custom SQL transformations
  • Use built-in UDFs for blockchain operations
  • Support JavaScript UDFs for custom logic
Example use cases:
  • Extract specific smart contract events (ERC-20 transfers)
  • Decode transaction parameters using contract ABIs
  • Aggregate data across multiple blocks
  • Join data from multiple datasets
Example SQL transformation:
-- Extract USDC Transfer events
SELECT 
  block_num,
  tx_hash,
  evm_hex_decode(topics[1]) as from_address,
  evm_hex_decode(topics[2]) as to_address,
  evm_uint256_decode(data) as amount,
  timestamp
FROM 'ethereum/mainnet'.logs
WHERE address = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'
  AND topics[0] = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'
Derived datasets are particularly powerful for creating application-specific views of blockchain data without maintaining custom indexers.

Dataset Manifests

Manifests are JSON documents that fully describe a dataset. They are content-addressable (identified by SHA-256 hash) and immutable.

Manifest Structure

A typical manifest contains:
{
  "name": "mainnet",
  "namespace": "ethereum",
  "kind": "evm-rpc",
  "network": "mainnet",
  "start_block": 0,
  "finalized_blocks_only": true,
  "tables": [
    {
      "name": "blocks",
      "schema": [...],
      "sort_by": ["block_num"],
      "partition_by": ["block_num"]
    },
    {
      "name": "transactions",
      "schema": [...],
      "sort_by": ["block_num", "tx_index"],
      "partition_by": ["block_num"]
    },
    {
      "name": "logs",
      "schema": [...],
      "sort_by": ["block_num", "tx_index", "log_index"],
      "partition_by": ["block_num"]
    }
  ]
}

Manifest Fields

Dataset identification:
  • namespace: Organizational grouping (e.g., “ethereum”, “polygon”)
  • name: Dataset name (e.g., “mainnet”, “usdc-transfers”)
  • kind: Dataset type (“evm-rpc”, “firehose”, “solana”, “sql”)
Data source configuration:
  • network: Blockchain network identifier (for provider matching)
  • start_block: First block to extract (default: 0)
  • finalized_blocks_only: Only extract finalized blocks (default: false)
Table definitions:
  • tables: Array of table schemas
    • name: Table name
    • schema: Column definitions (name, type, nullable)
    • sort_by: Columns to sort by (affects query performance)
    • partition_by: Partitioning columns (typically block_num)
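The field rules above can be sketched as a small validation routine. This is an illustrative sketch, not Amp's actual validation code: the function name `validate_manifest` and the exact error messages are assumptions; the required fields, recognized kinds, and the raw-vs-SQL distinction come from the field descriptions in this section.

```python
REQUIRED = {"name", "namespace", "kind"}
KINDS = {"evm-rpc", "firehose", "solana", "sql"}

def validate_manifest(m: dict) -> list[str]:
    """Hypothetical structural checks implied by the manifest fields above."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - m.keys())]
    if m.get("kind") not in KINDS:
        errors.append(f"unknown kind: {m.get('kind')!r}")
    if m.get("kind") == "sql" and "query" not in m:
        # Derived datasets carry a SQL query instead of a provider reference.
        errors.append("sql datasets require a 'query'")
    elif m.get("kind") in KINDS - {"sql"} and "network" not in m:
        # Raw datasets need a network identifier for provider matching.
        errors.append("raw datasets require a 'network'")
    for t in m.get("tables", []):
        if "name" not in t or "schema" not in t:
            errors.append("each table needs 'name' and 'schema'")
    return errors

assert validate_manifest({"name": "mainnet", "namespace": "ethereum",
                          "kind": "evm-rpc", "network": "mainnet"}) == []
```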

Content-Addressable Storage

Manifests are stored using their content hash:
SHA-256(manifest_json) → 8a3f9c1d...
Benefits:
  • Deduplication: Identical manifests stored once
  • Integrity: Hash verifies content hasn’t changed
  • Versioning: Different versions have different hashes
  • Immutability: Cannot modify existing manifests
Manifests are stored in object storage (S3/GCS) with metadata tracked in PostgreSQL. The Dataset Registry manages manifest lifecycle and version tags.
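A minimal sketch of content addressing, assuming manifests are serialized canonically (sorted keys, no extra whitespace) before hashing; the canonicalization details are an assumption here, not Amp's documented behavior:

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    """Content-address a manifest: identical content always yields the same hash."""
    # Canonical serialization is assumed so that semantically identical
    # manifests hash identically regardless of key order.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

m1 = {"name": "mainnet", "namespace": "ethereum", "kind": "evm-rpc"}
m2 = {"namespace": "ethereum", "kind": "evm-rpc", "name": "mainnet"}  # same content
assert manifest_hash(m1) == manifest_hash(m2)  # deduplication follows for free
```

Because the hash is derived from content, storing the same manifest twice produces the same key, and any modification produces a new hash rather than mutating an existing entry.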

Dataset Versioning

Datasets support semantic versioning and special tags:

Version Tags

Three types of version tags are supported.

Semantic Versions (e.g., 1.0.0, 2.1.3)
  • Explicit version numbers set by users
  • Follow semver conventions
  • Immutable once created
latest Tag
  • Automatically points to highest semantic version
  • Updated when new semantic version is tagged
  • Recommended for production deployments
dev Tag
  • Points to most recently linked manifest
  • Updated on every registration (with or without semantic version)
  • Used for development and testing

Version Management Example

# Register manifest without semantic version (updates "dev" only)
ampctl dataset register ethereum/mainnet ./manifest.json
# → dev: 8a3f9c1d...

# Register with semantic version (updates "dev" and "1.0.0", sets "latest")
ampctl dataset register ethereum/mainnet ./manifest-v1.json --tag 1.0.0
# → dev: 5b2e7a3c...
# → 1.0.0: 5b2e7a3c...
# → latest: 5b2e7a3c... (points to 1.0.0)

# Register newer version
ampctl dataset register ethereum/mainnet ./manifest-v2.json --tag 2.0.0
# → dev: 9f4d8e1a...
# → 2.0.0: 9f4d8e1a...
# → latest: 9f4d8e1a... (now points to 2.0.0)
# → 1.0.0: 5b2e7a3c... (unchanged, still accessible)

Referencing Datasets

Datasets are referenced using the format:
namespace/name@version
Examples:
-- Reference latest version
SELECT * FROM 'ethereum/mainnet@latest'.blocks

-- Reference specific version
SELECT * FROM 'ethereum/[email protected]'.blocks

-- Reference dev version
SELECT * FROM 'ethereum/mainnet@dev'.blocks

-- Implicit latest (version omitted)
SELECT * FROM 'ethereum/mainnet'.blocks

Tables and Schemas

Table Structure

Each table in a dataset is defined by three properties.

Schema - Column definitions:
{
  "name": "block_num",
  "type": "UInt64",
  "nullable": false
}
Sort Order - Columns to sort by:
"sort_by": ["block_num", "tx_index"]
Sorting improves query performance by enabling efficient range scans.

Partitioning - How data is divided:
"partition_by": ["block_num"]
Partitioning by block_num enables file pruning based on block ranges.
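File pruning can be illustrated with a short sketch. This assumes each Parquet file carries min/max block_num statistics, which partitioning by block_num makes available; the function and field names are hypothetical:

```python
def prune_files(files: list[dict], start_block: int, end_block: int) -> list[dict]:
    """Keep only files whose block_num range overlaps the queried range."""
    return [
        f for f in files
        if f["max_block"] >= start_block and f["min_block"] <= end_block
    ]

files = [
    {"path": "part-0.parquet", "min_block": 0, "max_block": 999_999},
    {"path": "part-1.parquet", "min_block": 1_000_000, "max_block": 1_999_999},
]
# A query over blocks 1.5M-1.6M never touches part-0.parquet.
assert [f["path"] for f in prune_files(files, 1_500_000, 1_600_000)] == ["part-1.parquet"]
```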

Common Schema Patterns

EVM Blocks Table:
block_num: UInt64 (partition key, sort key)
hash: String
parent_hash: String
timestamp: Timestamp
miner: String
gas_used: UInt64
gas_limit: UInt64
base_fee_per_gas: UInt64
EVM Logs Table:
block_num: UInt64 (partition key, sort key)
tx_index: UInt32 (sort key)
log_index: UInt32 (sort key)
address: String
topics: List<String>
data: String
tx_hash: String
All tables should include block_num for efficient querying and partitioning. This is the primary dimension for filtering blockchain data.

Table Revisions

Tables use an immutable revision model for data storage:

Key Concepts

Revision - An immutable snapshot of a table’s data at a point in time:
  • Identified by UUIDv7 (temporally ordered)
  • Contains zero or more Parquet files
  • Lives at unique path in object storage
  • Never modified after creation
Active Revision - The single revision currently used for queries:
  • Only one revision is active per table
  • Queries always read from active revision
  • Previous revisions retained but not queried
Revision Path Structure:
<dataset_namespace>/<dataset_name>/<table_name>/<revision_uuid>/

Revision Lifecycle

Creation:
  1. Generate new UUIDv7
  2. Construct storage path: <dataset>/<table>/<uuid>/
  3. Register in metadata database
  4. Lock to writer job
Population:
  1. Worker writes Parquet files to revision path
  2. Registers each file’s metadata (size, stats, footer)
  3. Updates job progress
Activation:
  1. Update table’s active_revision_id pointer
  2. Previous active revision becomes retained
  3. Atomic switch (single database operation)
Benefits:
  • No read-write contention: Queries read old revision while writers populate new
  • Atomic updates: Table switches to new data in single operation
  • Point-in-time recovery: Can revert to previous revisions
  • Concurrent writes: Different tables can write independently
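The lifecycle above can be sketched in a few lines. The `Table` class, its method names, and the minimal UUIDv7 generator are all assumptions for illustration; what the sketch preserves from this section is the temporally ordered revision IDs, the per-revision storage path, and the single atomic pointer switch at activation.

```python
import os
import time

def uuid7() -> str:
    """Minimal UUIDv7: 48-bit millisecond timestamp + random bits (temporally ordered)."""
    ms = int(time.time() * 1000)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
    rand_b = int.from_bytes(os.urandom(8), "big") & 0x3FFF_FFFF_FFFF_FFFF
    value = (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    h = f"{value:032x}"
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

class Table:
    """Hypothetical sketch of the revision model."""
    def __init__(self, dataset: str, name: str):
        self.dataset, self.name = dataset, name
        self.active_revision_id = None   # queries always read this revision
        self.revisions = {}

    def create_revision(self, writer_job_id: str) -> str:
        rid = uuid7()
        self.revisions[rid] = {
            "path": f"{self.dataset}/{self.name}/{rid}/",  # unique, never reused
            "writer": writer_job_id,                       # lock owner
            "files": [],
        }
        return rid

    def activate(self, rid: str) -> None:
        # Atomic pointer switch: readers keep seeing the old revision until now.
        self.active_revision_id = rid

t = Table("ethereum/mainnet", "blocks")
r1 = t.create_revision("job-1")
time.sleep(0.002)                 # later millisecond, so the IDs sort
r2 = t.create_revision("job-2")   # populated while r1 still serves queries
t.activate(r2)
assert r1 < r2 and t.active_revision_id == r2
```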

Revision Metadata

Each revision tracks:
id: UUIDv7
path: dataset/table/uuid
writer: job_id (lock owner)
metadata: JSONB (informative data)
created_at: Timestamp
updated_at: Timestamp
The metadata field stores informative data for debugging:
  • Dataset associated at creation time
  • Schema versions
  • Custom annotations

Dataset Registry

The Dataset Registry manages dataset lifecycle:

Registry Operations

Register Manifest:
# Store manifest in content-addressable storage
ampctl manifest register ./manifest.json
# → Manifest hash: 8a3f9c1d2e5b7a9f...

# Link manifest to dataset with version tag
ampctl dataset register ethereum/mainnet ./manifest.json --tag 1.0.0
Deploy Dataset:
# Start extraction job for continuous sync
ampctl dataset deploy ethereum/[email protected]

# Extract specific block range
ampctl dataset deploy ethereum/[email protected] --end-block 5000000
List Datasets:
ampctl dataset list
Inspect Dataset:
# View dataset details and manifest
ampctl dataset inspect ethereum/[email protected]

# Get raw manifest JSON
ampctl dataset manifest ethereum/[email protected]

Storage Architecture

The registry uses two storage layers.

Metadata Database (PostgreSQL):
  • Stores manifest metadata and version tags
  • Tracks dataset-manifest links
  • Provides transactional consistency
  • Enables version resolution
Object Store (S3/GCS/local):
  • Stores actual manifest file content
  • Content-addressed by hash
  • Durable storage for JSON files

Dataset Configuration

Provider Reference (Raw Datasets)

Raw datasets specify provider matching criteria:
{
  "kind": "evm-rpc",
  "network": "mainnet"
}
At runtime, the provider registry finds matching providers:
  1. Filter by kind and network
  2. Shuffle for load balancing
  3. Try each until connection succeeds
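The matching steps above can be sketched as follows; `find_provider` and the `try_connect` stub are hypothetical names standing in for the registry's actual lookup and health check:

```python
import random

def try_connect(provider: dict) -> bool:
    # Stand-in for a real connection attempt (e.g., an RPC health check).
    return provider.get("healthy", True)

def find_provider(providers: list[dict], kind: str, network: str) -> dict:
    """Filter by kind and network, shuffle for load balancing, try each in turn."""
    candidates = [p for p in providers if p["kind"] == kind and p["network"] == network]
    random.shuffle(candidates)  # spread load across equivalent providers
    for provider in candidates:
        if try_connect(provider):
            return provider
    raise RuntimeError(f"no reachable {kind} provider for network {network!r}")

providers = [
    {"kind": "evm-rpc", "network": "mainnet", "url": "a", "healthy": False},
    {"kind": "evm-rpc", "network": "mainnet", "url": "b"},
    {"kind": "firehose", "network": "mainnet", "url": "c"},
]
assert find_provider(providers, "evm-rpc", "mainnet")["url"] == "b"
```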

SQL Source (Derived Datasets)

Derived datasets specify SQL transformations:
{
  "kind": "sql",
  "query": "SELECT ... FROM 'ethereum/mainnet'.logs WHERE ..."
}

Extraction Configuration

Block Range:
{
  "start_block": 0,
  "end_block": null  // null = continuous sync
}
Finality:
{
  "finalized_blocks_only": true  // wait for finality before extracting
}
Parallelism:
# Set via deploy command
ampctl dataset deploy ethereum/[email protected] --parallelism 4

Best Practices

Schema Design

Always partition by block_num - Enables efficient file pruning and range queries
Sort by query patterns - If you frequently filter by address, include it in sort_by
Use appropriate data types - UInt64 for block numbers, String for hashes and addresses

Versioning Strategy

Use semantic versions for production - Tag stable versions as 1.0.0, 2.0.0, etc.
Use latest tag for deployments - Reference @latest in production queries for automatic updates
Use dev tag for development - Test manifest changes with @dev before tagging releases

Performance Optimization

Derived datasets inherit partitioning from source tables. Filter on block_num in SQL to maintain efficient pruning.
Use UDFs for blockchain-specific operations instead of external processing. This keeps transformations close to data.

Next Steps

  • Architecture - Understand how datasets fit into Amp's architecture
  • Data Flow - See how datasets flow through the ETL pipeline
  • Providers - Configure data source connections for raw datasets
  • Creating Datasets - Step-by-step guide to creating and deploying datasets
