Overview
Datasets are the fundamental organizing unit in Amp. A dataset defines:

- Schema: Tables, columns, and data types
- Source: Where data comes from (provider reference or SQL query)
- Configuration: Extraction parameters, partitioning, sort order
- Version: Semantic versioning and tags for lifecycle management
Dataset Types
Amp supports two types of datasets: raw and derived.

Raw Datasets
Raw datasets extract blockchain data directly from providers with minimal transformation. The schema is determined by the provider type. Characteristics:

- Data extracted directly from blockchain sources
- Schema defined by provider (evm-rpc, firehose, solana)
- No custom SQL transformations
- Tables written as-is to Parquet files
- Complete blockchain history (all blocks, transactions, logs)
- Foundation for derived datasets
- Archival storage of raw blockchain data
| Provider | Tables | Description |
|---|---|---|
| evm-rpc | blocks, transactions, logs | EVM chains via JSON-RPC |
| firehose | blocks, transactions, logs, calls | EVM chains via Firehose gRPC |
| solana | block_headers, transactions, messages, instructions | Solana via RPC + Old Faithful |
Derived Datasets
Derived datasets use SQL queries to transform data from other datasets. They enable data enrichment, filtering, and custom schemas. Characteristics:

- Reference other datasets as data sources
- Custom SQL transformations
- Use built-in UDFs for blockchain operations
- Support JavaScript UDFs for custom logic
- Extract specific smart contract events (ERC-20 transfers)
- Decode transaction parameters by ABI
- Aggregate data across multiple blocks
- Join data from multiple datasets
Derived datasets are particularly powerful for creating application-specific views of blockchain data without maintaining custom indexers.
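As an illustration, a derived dataset extracting ERC-20 Transfer events might look like the sketch below. The source table name and the `evm_decode_log` UDF are hypothetical, not Amp's actual API; only the Transfer event signature hash is a fixed constant of the ERC-20 standard:

```sql
-- Hypothetical sketch: extract ERC-20 Transfer events from a raw logs table.
-- Table reference syntax and the evm_decode_log UDF name are assumptions.
SELECT
  block_num,
  transaction_hash,
  address AS token,
  evm_decode_log(topics, data, 'Transfer(address,address,uint256)') AS transfer
FROM ethereum_mainnet.logs
WHERE topic0 = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'
```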
Dataset Manifests
Manifests are JSON documents that fully describe a dataset. They are content-addressable (identified by SHA-256 hash) and immutable.

Manifest Structure
A typical manifest contains the fields described below.

Manifest Fields
Dataset identification:

- namespace: Organizational grouping (e.g., "ethereum", "polygon")
- name: Dataset name (e.g., "mainnet", "usdc-transfers")
- kind: Dataset type ("evm-rpc", "firehose", "solana", "sql")
- network: Blockchain network identifier (for provider matching)
- start_block: First block to extract (default: 0)
- finalized_blocks_only: Only extract finalized blocks (default: false)
- tables: Array of table schemas, each with:
  - name: Table name
  - schema: Column definitions (name, type, nullable)
  - sort_by: Columns to sort by (affects query performance)
  - partition_by: Partitioning columns (typically block_num)
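Taken together, these fields might look like the following sketch. This is illustrative only, assembled from the field descriptions above, not an exact Amp manifest:

```json
{
  "namespace": "ethereum",
  "name": "mainnet",
  "kind": "evm-rpc",
  "network": "mainnet",
  "start_block": 0,
  "finalized_blocks_only": false,
  "tables": [
    {
      "name": "blocks",
      "schema": [
        { "name": "block_num", "type": "UInt64", "nullable": false },
        { "name": "hash", "type": "String", "nullable": false }
      ],
      "sort_by": ["block_num"],
      "partition_by": ["block_num"]
    }
  ]
}
```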
Content-Addressable Storage
Manifests are stored using their content hash:

- Deduplication: Identical manifests stored once
- Integrity: Hash verifies content hasn’t changed
- Versioning: Different versions have different hashes
- Immutability: Cannot modify existing manifests
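The content-addressing scheme can be sketched in a few lines. Canonical JSON serialization is an assumption here about how the hash is computed:

```python
import hashlib
import json

def manifest_id(manifest: dict) -> str:
    """Derive a content address: SHA-256 over a canonical JSON encoding."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

store: dict[str, str] = {}

def put(manifest: dict) -> str:
    """Store a manifest; identical content deduplicates to the same key."""
    digest = manifest_id(manifest)
    store.setdefault(digest, json.dumps(manifest, sort_keys=True, separators=(",", ":")))
    return digest
```

Because the encoding is canonical, two manifests with the same content always map to the same hash, giving deduplication and integrity checks for free.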
Dataset Versioning
Datasets support semantic versioning and special tags:

Version Tags
Three types of version tags:

Semantic Versions (e.g., 1.0.0, 2.1.3)
- Explicit version numbers set by users
- Follow semver conventions
- Immutable once created
latest Tag
- Automatically points to highest semantic version
- Updated when new semantic version is tagged
- Recommended for production deployments
dev Tag
- Points to most recently linked manifest
- Updated on every registration (with or without semantic version)
- Used for development and testing
Version Management Example
Referencing Datasets
Datasets are referenced by namespace, name, and version tag (a semantic version, latest, or dev).

Tables and Schemas
Table Structure
Each table in a dataset has:

- Schema: Column definitions
- Partitioning: block_num enables file pruning based on block ranges
Common Schema Patterns
EVM Blocks Table: typically sorted and partitioned by block_num.

All tables should include block_num for efficient querying and partitioning. This is the primary dimension for filtering blockchain data.

Table Revisions
Tables use an immutable revision model for data storage:

Key Concepts
Revision - An immutable snapshot of a table's data at a point in time:

- Identified by UUIDv7 (temporally ordered)
- Contains zero or more Parquet files
- Lives at unique path in object storage
- Never modified after creation
Active Revision - the revision queries currently read from:

- Only one revision is active per table
- Queries always read from active revision
- Previous revisions retained but not queried
Revision Lifecycle
Creation:

- Generate new UUIDv7
- Construct storage path: <dataset>/<table>/<uuid>/
- Register in metadata database
- Lock to writer job

Writing:

- Worker writes Parquet files to revision path
- Registers each file's metadata (size, stats, footer)
- Updates job progress

Activation:

- Update table's active_revision_id pointer
- Previous active revision becomes retained
- Atomic switch (single database operation)

Benefits:

- No read-write contention: Queries read old revision while writers populate new
- Atomic updates: Table switches to new data in single operation
- Point-in-time recovery: Can revert to previous revisions
- Concurrent writes: Different tables can write independently
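The lifecycle above can be sketched with an in-memory model. The id generator is a simplified time-ordered stand-in for UUIDv7, and the class is illustrative, not Amp's implementation:

```python
import os
import time

def uuid7ish() -> str:
    """Simplified time-ordered id: millisecond timestamp prefix + random suffix."""
    return f"{int(time.time() * 1000):012x}{os.urandom(10).hex()}"

class Table:
    def __init__(self, dataset: str, name: str) -> None:
        self.dataset, self.name = dataset, name
        self.active_revision_id: str | None = None
        self.revisions: dict[str, list[str]] = {}  # revision id -> Parquet file paths

    def create_revision(self) -> str:
        rev = uuid7ish()
        self.revisions[rev] = []                   # lives at <dataset>/<table>/<uuid>/
        return rev

    def write_file(self, rev: str, filename: str) -> None:
        # Writers populate the new revision while readers keep using the old one.
        self.revisions[rev].append(f"{self.dataset}/{self.name}/{rev}/{filename}")

    def activate(self, rev: str) -> None:
        # Atomic pointer switch; the previous revision is retained, not deleted.
        self.active_revision_id = rev

    def read(self) -> list[str]:
        # Queries always read from the active revision.
        return self.revisions.get(self.active_revision_id, [])
```

Note that nothing is ever modified in place: a new revision is populated, then a single pointer update makes it visible.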
Revision Metadata
Each revision tracks a metadata field storing informative data for debugging:
- Dataset associated at creation time
- Schema versions
- Custom annotations
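A metadata field carrying the entries above might look like this sketch (the key names are assumptions):

```json
{
  "dataset": "ethereum/mainnet",
  "schema_version": "1.0.0",
  "annotations": { "note": "backfill after provider switch" }
}
```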
Dataset Registry
The Dataset Registry manages dataset lifecycle:

Registry Operations
Register Manifest: store the manifest content by hash, link it to the dataset, and update version tags.

Storage Architecture
The registry uses two storage layers:

Metadata Database (PostgreSQL):

- Stores manifest metadata and version tags
- Tracks dataset-manifest links
- Provides transactional consistency
- Enables version resolution
Object Storage:

- Stores actual manifest file content
- Content-addressed by hash
- Durable storage for JSON files
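Registration and version resolution across these two layers can be sketched as follows. This is a toy model with in-memory dicts standing in for PostgreSQL and object storage; Amp's actual APIs differ:

```python
import hashlib
import json

object_store: dict[str, str] = {}         # hash -> manifest JSON (durable layer)
metadata = {"links": {}, "tags": {}}      # transactional metadata layer stand-in

def register(dataset: str, manifest: dict, tag: str = "dev") -> str:
    """Write content to object storage, then record the link and tag in metadata."""
    content = json.dumps(manifest, sort_keys=True)
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    object_store[digest] = content                        # content-addressed by hash
    metadata["links"].setdefault(dataset, []).append(digest)
    metadata["tags"][(dataset, tag)] = digest             # version tag -> hash
    return digest

def resolve(dataset: str, tag: str) -> dict:
    """Version resolution: tag -> hash via metadata, then fetch content by hash."""
    digest = metadata["tags"][(dataset, tag)]
    return json.loads(object_store[digest])
```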
Dataset Configuration
Provider Reference (Raw Datasets)
Raw datasets specify provider matching criteria:

- Filter by kind and network
- Shuffle for load balancing
- Try each until connection succeeds
SQL Source (Derived Datasets)
Derived datasets specify SQL transformations as their data source.

Extraction Configuration
Block Range: controlled by start_block and finalized_blocks_only in the manifest.

Best Practices
Schema Design
- Always partition by block_num - Enables efficient file pruning and range queries
- Sort by query patterns - If you frequently filter by address, include it in sort_by
- Use appropriate data types - UInt64 for block numbers, String for hashes and addresses

Versioning Strategy

- Use semantic versions for production - Tag stable versions as 1.0.0, 2.0.0, etc.
- Use latest tag for deployments - Reference @latest in production queries for automatic updates
- Use dev tag for development - Test manifest changes with @dev before tagging releases

Performance Optimization
Related Documentation
Architecture
Understand how datasets fit into Amp’s architecture
Data Flow
See how datasets flow through the ETL pipeline
Providers
Configure data source connections for raw datasets
Creating Datasets
Step-by-step guide to creating and deploying datasets