
Overview

Datasets are the fundamental organizing unit in Amp. A dataset defines:
  • Schema: Tables, columns, and data types
  • Source: Where data comes from (provider reference or SQL query)
  • Configuration: Extraction parameters, partitioning, sort order
  • Version: Semantic versioning and tags for lifecycle management
Datasets are defined in manifest files (JSON documents) that describe how to extract, transform, and organize blockchain data.

Dataset Types

Amp supports two types of datasets:

Raw Datasets

Raw datasets extract blockchain data directly from providers with minimal transformation. The schema is determined by the provider type. Characteristics:
  • Data extracted directly from blockchain sources
  • Schema defined by provider (evm-rpc, firehose, solana)
  • No custom SQL transformations
  • Tables written as-is to Parquet files
Example use cases:
  • Complete blockchain history (all blocks, transactions, logs)
  • Foundation for derived datasets
  • Archival storage of raw blockchain data
Supported provider types:
Provider   Tables                                               Description
evm-rpc    blocks, transactions, logs                           EVM chains via JSON-RPC
firehose   blocks, transactions, logs, calls                    EVM chains via Firehose gRPC
solana     block_headers, transactions, messages, instructions  Solana via RPC + Old Faithful

Derived Datasets

Derived datasets use SQL queries to transform data from other datasets. They enable data enrichment, filtering, and custom schemas. Characteristics:
  • Reference other datasets as data sources
  • Custom SQL transformations
  • Use built-in UDFs for blockchain operations
  • Support JavaScript UDFs for custom logic
Example use cases:
  • Extract specific smart contract events (ERC-20 transfers)
  • Decode transaction parameters using contract ABIs
  • Aggregate data across multiple blocks
  • Join data from multiple datasets
Example SQL transformation:
-- Extract USDC Transfer events
SELECT 
  block_num,
  tx_hash,
  evm_hex_decode(topics[1]) as from_address,
  evm_hex_decode(topics[2]) as to_address,
  evm_uint256_decode(data) as amount,
  timestamp
FROM 'ethereum/mainnet'.logs
WHERE address = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'
  AND topics[0] = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'
Derived datasets are particularly powerful for creating application-specific views of blockchain data without maintaining custom indexers.

Dataset Manifests

Manifests are JSON documents that fully describe a dataset. They are content-addressable (identified by SHA-256 hash) and immutable.

Manifest Structure

A typical manifest contains:
{
  "name": "mainnet",
  "namespace": "ethereum",
  "kind": "evm-rpc",
  "network": "mainnet",
  "start_block": 0,
  "finalized_blocks_only": true,
  "tables": [
    {
      "name": "blocks",
      "schema": [...],
      "sort_by": ["block_num"],
      "partition_by": ["block_num"]
    },
    {
      "name": "transactions",
      "schema": [...],
      "sort_by": ["block_num", "tx_index"],
      "partition_by": ["block_num"]
    },
    {
      "name": "logs",
      "schema": [...],
      "sort_by": ["block_num", "tx_index", "log_index"],
      "partition_by": ["block_num"]
    }
  ]
}

Manifest Fields

Dataset identification:
  • namespace: Organizational grouping (e.g., “ethereum”, “polygon”)
  • name: Dataset name (e.g., “mainnet”, “usdc-transfers”)
  • kind: Dataset type (“evm-rpc”, “firehose”, “solana”, “sql”)
Data source configuration:
  • network: Blockchain network identifier (for provider matching)
  • start_block: First block to extract (default: 0)
  • finalized_blocks_only: Only extract finalized blocks (default: false)
Table definitions:
  • tables: Array of table schemas
    • name: Table name
    • schema: Column definitions (name, type, nullable)
    • sort_by: Columns to sort by (affects query performance)
    • partition_by: Partitioning columns (typically block_num)
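The field rules above can be sketched as a small validation routine. This is an illustrative sketch, not Amp's actual validation code: the function name `validate_manifest` and the exact error messages are assumptions; the required fields, recognized kinds, and the raw-vs-SQL distinction come from the field descriptions in this section.

```python
REQUIRED = {"name", "namespace", "kind"}
KINDS = {"evm-rpc", "firehose", "solana", "sql"}

def validate_manifest(m: dict) -> list[str]:
    """Hypothetical structural checks implied by the manifest fields above."""
    errors = [f"missing field: {f}" for f in sorted(REQUIRED - m.keys())]
    if m.get("kind") not in KINDS:
        errors.append(f"unknown kind: {m.get('kind')!r}")
    if m.get("kind") == "sql" and "query" not in m:
        # Derived datasets carry a SQL query instead of a provider reference.
        errors.append("sql datasets require a 'query'")
    elif m.get("kind") in KINDS - {"sql"} and "network" not in m:
        # Raw datasets need a network identifier for provider matching.
        errors.append("raw datasets require a 'network'")
    for t in m.get("tables", []):
        if "name" not in t or "schema" not in t:
            errors.append("each table needs 'name' and 'schema'")
    return errors

assert validate_manifest({"name": "mainnet", "namespace": "ethereum",
                          "kind": "evm-rpc", "network": "mainnet"}) == []
```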

Content-Addressable Storage

Manifests are stored using their content hash:
SHA-256(manifest_json) → 8a3f9c1d...
Benefits:
  • Deduplication: Identical manifests stored once
  • Integrity: Hash verifies content hasn’t changed
  • Versioning: Different versions have different hashes
  • Immutability: Cannot modify existing manifests
Manifests are stored in object storage (S3/GCS) with metadata tracked in PostgreSQL. The Dataset Registry manages manifest lifecycle and version tags.
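A minimal sketch of content addressing, assuming manifests are serialized canonically (sorted keys, no extra whitespace) before hashing; the canonicalization details are an assumption here, not Amp's documented behavior:

```python
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    """Content-address a manifest: identical content always yields the same hash."""
    # Canonical serialization is assumed so that semantically identical
    # manifests hash identically regardless of key order.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

m1 = {"name": "mainnet", "namespace": "ethereum", "kind": "evm-rpc"}
m2 = {"namespace": "ethereum", "kind": "evm-rpc", "name": "mainnet"}  # same content
assert manifest_hash(m1) == manifest_hash(m2)  # deduplication follows for free
```

Because the hash is derived from content, storing the same manifest twice produces the same key, and any modification produces a new hash rather than mutating an existing entry.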

Dataset Versioning

Datasets support semantic versioning and special tags:

Version Tags

Three types of version tags are supported.

Semantic Versions (e.g., 1.0.0, 2.1.3)
  • Explicit version numbers set by users
  • Follow semver conventions
  • Immutable once created
latest Tag
  • Automatically points to highest semantic version
  • Updated when new semantic version is tagged
  • Recommended for production deployments
dev Tag
  • Points to most recently linked manifest
  • Updated on every registration (with or without semantic version)
  • Used for development and testing

Version Management Example

# Register manifest without semantic version (updates "dev" only)
ampctl dataset register ethereum/mainnet ./manifest.json
# → dev: 8a3f9c1d...

# Register with semantic version (updates "dev" and "1.0.0", sets "latest")
ampctl dataset register ethereum/mainnet ./manifest-v1.json --tag 1.0.0
# → dev: 5b2e7a3c...
# → 1.0.0: 5b2e7a3c...
# → latest: 5b2e7a3c... (points to 1.0.0)

# Register newer version
ampctl dataset register ethereum/mainnet ./manifest-v2.json --tag 2.0.0
# → dev: 9f4d8e1a...
# → 2.0.0: 9f4d8e1a...
# → latest: 9f4d8e1a... (now points to 2.0.0)
# → 1.0.0: 5b2e7a3c... (unchanged, still accessible)

Referencing Datasets

Datasets are referenced using the format:
namespace/name@version
Examples:
-- Reference latest version
SELECT * FROM 'ethereum/mainnet@latest'.blocks

-- Reference specific version
SELECT * FROM 'ethereum/[email protected]'.blocks

-- Reference dev version
SELECT * FROM 'ethereum/mainnet@dev'.blocks

-- Implicit latest (version omitted)
SELECT * FROM 'ethereum/mainnet'.blocks

Tables and Schemas

Table Structure

Each table in a dataset is defined by three properties.

Schema - Column definitions:
{
  "name": "block_num",
  "type": "UInt64",
  "nullable": false
}
Sort Order - Columns to sort by:
"sort_by": ["block_num", "tx_index"]
Sorting improves query performance by enabling efficient range scans.

Partitioning - How data is divided:
"partition_by": ["block_num"]
Partitioning by block_num enables file pruning based on block ranges.
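File pruning can be illustrated with a short sketch. This assumes each Parquet file carries min/max block_num statistics, which partitioning by block_num makes available; the function and field names are hypothetical:

```python
def prune_files(files: list[dict], start_block: int, end_block: int) -> list[dict]:
    """Keep only files whose block_num range overlaps the queried range."""
    return [
        f for f in files
        if f["max_block"] >= start_block and f["min_block"] <= end_block
    ]

files = [
    {"path": "part-0.parquet", "min_block": 0, "max_block": 999_999},
    {"path": "part-1.parquet", "min_block": 1_000_000, "max_block": 1_999_999},
]
# A query over blocks 1.5M-1.6M never touches part-0.parquet.
assert [f["path"] for f in prune_files(files, 1_500_000, 1_600_000)] == ["part-1.parquet"]
```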

Common Schema Patterns

EVM Blocks Table:
block_num: UInt64 (partition key, sort key)
hash: String
parent_hash: String
timestamp: Timestamp
miner: String
gas_used: UInt64
gas_limit: UInt64
base_fee_per_gas: UInt64
EVM Logs Table:
block_num: UInt64 (partition key, sort key)
tx_index: UInt32 (sort key)
log_index: UInt32 (sort key)
address: String
topics: List<String>
data: String
tx_hash: String
All tables should include block_num for efficient querying and partitioning. This is the primary dimension for filtering blockchain data.

Table Revisions

Tables use an immutable revision model for data storage:

Key Concepts

Revision - An immutable snapshot of a table’s data at a point in time:
  • Identified by UUIDv7 (temporally ordered)
  • Contains zero or more Parquet files
  • Lives at unique path in object storage
  • Never modified after creation
Active Revision - The single revision currently used for queries:
  • Only one revision is active per table
  • Queries always read from active revision
  • Previous revisions retained but not queried
Revision Path Structure:
<dataset_namespace>/<dataset_name>/<table_name>/<revision_uuid>/

Revision Lifecycle

Creation:
  1. Generate new UUIDv7
  2. Construct storage path: <dataset>/<table>/<uuid>/
  3. Register in metadata database
  4. Lock to writer job
Population:
  1. Worker writes Parquet files to revision path
  2. Registers each file’s metadata (size, stats, footer)
  3. Updates job progress
Activation:
  1. Update table’s active_revision_id pointer
  2. Previous active revision becomes retained
  3. Atomic switch (single database operation)
Benefits:
  • No read-write contention: Queries read old revision while writers populate new
  • Atomic updates: Table switches to new data in single operation
  • Point-in-time recovery: Can revert to previous revisions
  • Concurrent writes: Different tables can write independently
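The lifecycle above can be sketched in a few lines. The `Table` class, its method names, and the minimal UUIDv7 generator are all assumptions for illustration; what the sketch preserves from this section is the temporally ordered revision IDs, the per-revision storage path, and the single atomic pointer switch at activation.

```python
import os
import time

def uuid7() -> str:
    """Minimal UUIDv7: 48-bit millisecond timestamp + random bits (temporally ordered)."""
    ms = int(time.time() * 1000)
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF
    rand_b = int.from_bytes(os.urandom(8), "big") & 0x3FFF_FFFF_FFFF_FFFF
    value = (ms << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    h = f"{value:032x}"
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

class Table:
    """Hypothetical sketch of the revision model."""
    def __init__(self, dataset: str, name: str):
        self.dataset, self.name = dataset, name
        self.active_revision_id = None   # queries always read this revision
        self.revisions = {}

    def create_revision(self, writer_job_id: str) -> str:
        rid = uuid7()
        self.revisions[rid] = {
            "path": f"{self.dataset}/{self.name}/{rid}/",  # unique, never reused
            "writer": writer_job_id,                       # lock owner
            "files": [],
        }
        return rid

    def activate(self, rid: str) -> None:
        # Atomic pointer switch: readers keep seeing the old revision until now.
        self.active_revision_id = rid

t = Table("ethereum/mainnet", "blocks")
r1 = t.create_revision("job-1")
time.sleep(0.002)                 # later millisecond, so the IDs sort
r2 = t.create_revision("job-2")   # populated while r1 still serves queries
t.activate(r2)
assert r1 < r2 and t.active_revision_id == r2
```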

Revision Metadata

Each revision tracks:
id: UUIDv7
path: dataset/table/uuid
writer: job_id (lock owner)
metadata: JSONB (informative data)
created_at: Timestamp
updated_at: Timestamp
The metadata field stores informative data for debugging:
  • Dataset associated at creation time
  • Schema versions
  • Custom annotations

Dataset Registry

The Dataset Registry manages dataset lifecycle:

Registry Operations

Register Manifest:
# Store manifest in content-addressable storage
ampctl manifest register ./manifest.json
# → Manifest hash: 8a3f9c1d2e5b7a9f...

# Link manifest to dataset with version tag
ampctl dataset register ethereum/mainnet ./manifest.json --tag 1.0.0
Deploy Dataset:
# Start extraction job for continuous sync
ampctl dataset deploy ethereum/[email protected]

# Extract specific block range
ampctl dataset deploy ethereum/[email protected] --end-block 5000000
List Datasets:
ampctl dataset list
Inspect Dataset:
# View dataset details and manifest
ampctl dataset inspect ethereum/[email protected]

# Get raw manifest JSON
ampctl dataset manifest ethereum/[email protected]

Storage Architecture

The registry uses two storage layers.

Metadata Database (PostgreSQL):
  • Stores manifest metadata and version tags
  • Tracks dataset-manifest links
  • Provides transactional consistency
  • Enables version resolution
Object Store (S3/GCS/local):
  • Stores actual manifest file content
  • Content-addressed by hash
  • Durable storage for JSON files

Dataset Configuration

Provider Reference (Raw Datasets)

Raw datasets specify provider matching criteria:
{
  "kind": "evm-rpc",
  "network": "mainnet"
}
At runtime, the provider registry finds matching providers:
  1. Filter by kind and network
  2. Shuffle for load balancing
  3. Try each until connection succeeds
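The matching steps above can be sketched as follows; `find_provider` and the `try_connect` stub are hypothetical names standing in for the registry's actual lookup and health check:

```python
import random

def try_connect(provider: dict) -> bool:
    # Stand-in for a real connection attempt (e.g., an RPC health check).
    return provider.get("healthy", True)

def find_provider(providers: list[dict], kind: str, network: str) -> dict:
    """Filter by kind and network, shuffle for load balancing, try each in turn."""
    candidates = [p for p in providers if p["kind"] == kind and p["network"] == network]
    random.shuffle(candidates)  # spread load across equivalent providers
    for provider in candidates:
        if try_connect(provider):
            return provider
    raise RuntimeError(f"no reachable {kind} provider for network {network!r}")

providers = [
    {"kind": "evm-rpc", "network": "mainnet", "url": "a", "healthy": False},
    {"kind": "evm-rpc", "network": "mainnet", "url": "b"},
    {"kind": "firehose", "network": "mainnet", "url": "c"},
]
assert find_provider(providers, "evm-rpc", "mainnet")["url"] == "b"
```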

SQL Source (Derived Datasets)

Derived datasets specify SQL transformations:
{
  "kind": "sql",
  "query": "SELECT ... FROM 'ethereum/mainnet'.logs WHERE ..."
}

Extraction Configuration

Block Range:
{
  "start_block": 0,
  "end_block": null  // null = continuous sync
}
Finality:
{
  "finalized_blocks_only": true  // wait for finality before extracting
}
Parallelism:
# Set via deploy command
ampctl dataset deploy ethereum/[email protected] --parallelism 4

Best Practices

Schema Design

Always partition by block_num - Enables efficient file pruning and range queries
Sort by query patterns - If you frequently filter by address, include it in sort_by
Use appropriate data types - UInt64 for block numbers, String for hashes and addresses

Versioning Strategy

Use semantic versions for production - Tag stable versions as 1.0.0, 2.0.0, etc.
Use latest tag for deployments - Reference @latest in production queries for automatic updates
Use dev tag for development - Test manifest changes with @dev before tagging releases

Performance Optimization

Derived datasets inherit partitioning from source tables. Filter on block_num in SQL to maintain efficient pruning.
Use UDFs for blockchain-specific operations instead of external processing. This keeps transformations close to data.

Next Steps

  • Architecture - Understand how datasets fit into Amp's architecture
  • Data Flow - See how datasets flow through the ETL pipeline
  • Providers - Configure data source connections for raw datasets
  • Creating Datasets - Step-by-step guide to creating and deploying datasets
