Overview
Amp implements a complete ETL (Extract, Transform, Load) pipeline for blockchain data:

- Extract: Pull data from blockchain sources (EVM RPC, Firehose, Solana)
- Transform: Process data using SQL queries with custom UDFs
- Store: Save as Parquet files in columnar format optimized for analytics
- Serve: Provide query interfaces (Arrow Flight gRPC, JSON Lines HTTP)
Extract Phase
Data Sources
Amp extracts blockchain data from multiple source types:

EVM RPC - Ethereum-compatible JSON-RPC endpoints

- Connects to standard JSON-RPC APIs (HTTP/WebSocket/IPC)
- Batched requests for efficiency
- Configurable rate limiting and concurrency
- Tables: blocks, transactions, logs
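For illustration, batched JSON-RPC extraction might look like the following sketch. The function names, batch shape, and endpoint handling are illustrative assumptions, not Amp's internal API; rate limiting and concurrency controls are omitted.

```python
import json
import urllib.request

def build_batch(start: int, count: int) -> list[dict]:
    """Build one batched JSON-RPC payload covering `count` blocks."""
    return [
        {
            "jsonrpc": "2.0",
            "id": i,
            "method": "eth_getBlockByNumber",
            # Block numbers are hex-encoded; True requests full transactions.
            "params": [hex(start + i), True],
        }
        for i in range(count)
    ]

def fetch_blocks(rpc_url: str, start: int, count: int) -> list[dict]:
    """POST the batch to an EVM JSON-RPC endpoint and return the responses."""
    req = urllib.request.Request(
        rpc_url,
        data=json.dumps(build_batch(start, count)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Batching amortizes HTTP round-trip overhead: one request fetches many blocks.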
Firehose - Real-time blockchain data streaming via gRPC

- Protocol buffer-based binary format
- Includes full call traces
- Tables: blocks, transactions, logs, calls
Solana

- Historical data from Old Faithful CAR files (~745GB per epoch)
- Real-time data via JSON-RPC endpoints
- Slot-based architecture with gap handling
- Tables: block_headers, transactions, messages, instructions
Provider Resolution
Providers decouple dataset definitions from concrete data sources.

Provider matching criteria:

- kind (e.g., "evm-rpc", "firehose", "solana")
- network (e.g., "mainnet", "base", "polygon")

Benefits:
- Multiple datasets can share the same provider
- Switch endpoints without modifying datasets
- Random selection among matching providers for load balancing
- Credentials isolated from dataset definitions
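The matching rules above can be sketched in a few lines. The `Provider` shape and function name are illustrative assumptions; the behavior shown (match on kind and network, random choice among matches) is the documented resolution logic.

```python
import random
from dataclasses import dataclass

@dataclass
class Provider:
    kind: str      # e.g. "evm-rpc", "firehose", "solana"
    network: str   # e.g. "mainnet", "base", "polygon"
    url: str

def resolve_provider(providers: list[Provider], kind: str, network: str) -> Provider:
    """Match on (kind, network); pick randomly among matches for load balancing."""
    matches = [p for p in providers if p.kind == kind and p.network == network]
    if not matches:
        raise LookupError(f"no provider for kind={kind!r} network={network!r}")
    return random.choice(matches)
```

Because datasets reference only (kind, network), swapping an endpoint URL or credential never touches a dataset definition.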
Data Extraction
Workers pull data from blockchain sources using configured providers:

1. Job Assignment - Worker receives a job from the controller via PostgreSQL NOTIFY
2. Provider Setup - Resolve and connect to the appropriate data provider
3. Block Range Processing - Stream blocks within the assigned range
4. Data Normalization - Convert blockchain-specific formats to Arrow schemas
5. Batching - Accumulate records for efficient Parquet writing
Workers send a heartbeat every second and support graceful restart. Jobs automatically resume from the last committed block.
Transform Phase
Raw Datasets
Raw datasets extract blockchain data directly from providers with minimal transformation:

- Schema defined by provider type (EVM, Firehose, Solana)
- Data written as-is to Parquet files
- Block range-based partitioning
- No custom SQL transformations
Derived Datasets
Derived datasets use SQL queries to transform raw data:

- Reference other datasets as data sources
- Apply custom SQL transformations
- Use built-in UDFs for blockchain-specific operations
- Support JavaScript UDFs for custom logic
Built-in User-Defined Functions (UDFs)
Amp provides EVM-specific UDFs for common blockchain operations:

Hex Encoding/Decoding

- evm_hex_encode() - Convert bytes to a hex string
- evm_hex_decode() - Convert a hex string to bytes

Integer and Address Decoding

- evm_uint256_decode() - Decode 256-bit unsigned integers
- evm_int256_decode() - Decode 256-bit signed integers
- evm_address_decode() - Extract Ethereum addresses

Unit Conversion

- evm_wei_to_eth() - Convert wei to ETH
- evm_eth_to_wei() - Convert ETH to wei
- evm_gwei_to_wei() - Convert gwei to wei

Log Processing

- evm_log_topics() - Extract the log topics array
- evm_log_data() - Extract the log data field

ABI Decoding

- evm_decode_params() - Decode function parameters by ABI
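To illustrate the semantics of a few of these UDFs, here are Python equivalents. These are assumptions about the exact behavior (prefix handling, word widths) inferred from standard EVM conventions, not the UDFs' actual implementations:

```python
def evm_hex_decode(s: str) -> bytes:
    """Hex string (with or without a 0x prefix) to bytes."""
    return bytes.fromhex(s.removeprefix("0x"))

def evm_uint256_decode(word: bytes) -> int:
    """Big-endian 256-bit unsigned integer from a 32-byte word."""
    return int.from_bytes(word, "big")

def evm_address_decode(word: bytes) -> str:
    """Last 20 bytes of a 32-byte ABI word, as a 0x-prefixed address."""
    return "0x" + word[-20:].hex()

def evm_wei_to_eth(wei: int) -> float:
    """Convert wei to ETH (1 ETH = 10^18 wei)."""
    return wei / 10**18
```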
Load Phase
Parquet File Storage
Transformed data is written to Parquet files with an optimized layout.

File Naming Convention: {block:09}-{suffix:016x}.parquet

- block_num: Starting block number (9 digits, zero-padded)
- suffix: Random 16-character hex value for uniqueness

File properties:
- Columnar storage format (Parquet)
- Compression (typically Snappy or ZSTD)
- Embedded statistics (min/max/null count per column)
- Page indices for efficient filtering
- Immutable (write-once, never modified)
Storage Hierarchy
Parquet files are organized in object storage:

| Level | Identifier | Cardinality | Mutability |
|---|---|---|---|
| Base | Object store URL | One per deployment | Configuration |
| Dataset | namespace/name | Many per deployment | Immutable identifier |
| Table | table_name | Many per dataset | Defined in manifest |
| Revision | UUIDv7 | Many per table, one active | Immutable snapshot |
| File | {block:09}-{suffix:016x}.parquet | Many per revision | Write-once |
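Putting the hierarchy and the filename convention together, key construction might look like this. The exact path layout (separator order, namespace placement) is an assumption pieced together from the table above; the filename pattern is the documented convention.

```python
import secrets

def parquet_object_key(namespace: str, dataset: str, table: str,
                       revision: str, start_block: int) -> str:
    """Build an object-store key: dataset -> table -> revision -> file,
    with the {block:09}-{suffix:016x}.parquet naming convention."""
    suffix = secrets.randbits(64)  # random 16-hex-char value for uniqueness
    return (f"{namespace}/{dataset}/{table}/{revision}/"
            f"{start_block:09d}-{suffix:016x}.parquet")
```

The zero-padded block prefix makes keys sort in block order, so a listing of a revision's files is already range-ordered.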
Metadata Registration
As workers write Parquet files, they register metadata in PostgreSQL.

Registered metadata:

- File name and full URL in object storage
- Object metadata: size, etag, version
- Parquet footer bytes (including page indices)
- Computed statistics for query optimization
This enables:

- Fast query planning without object store access
- Predicate pushdown using file statistics
- In-memory caching of Parquet metadata
- Efficient file pruning during queries
Table Revisions
Tables use an immutable revision model:

- New data creates new revisions (never modifies existing)
- UUIDv7 provides temporal ordering (newer = lexicographically greater)
- Single active revision per table serves queries
- Retained revisions enable point-in-time recovery
- Atomic switches between revisions (single metadata update)
Queries always read from the active revision. Writers create new revisions while readers access existing ones, eliminating read-write contention.
Serve Phase
Query Planning
When a client submits a SQL query:

1. Parse SQL - DataFusion parses and validates the query
2. Resolve Datasets - Identify referenced datasets and their tables
3. Get File Metadata - Retrieve the file list and statistics from the metadata DB
4. Prune Files - Use predicate pushdown to skip irrelevant files
5. Create Execution Plan - Generate an optimized physical plan
6. Execute - Stream data from Parquet files via the object store
Predicate Pushdown
DataFusion automatically optimizes queries using Parquet statistics. For an example query filtering on a block range (1M-2M) and an address:

- Block range filter - Only read files covering blocks 1M-2M, using filename patterns
- Address filter - Skip files where the address column's min/max stats don't match
- Row group pruning - Within files, skip row groups that don't match
- Page-level filtering - Use page indices for fine-grained filtering
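The first two file-level pruning steps can be sketched as follows. The metadata dict shape is a hypothetical stand-in for the registered metadata, and the string comparison assumes addresses are stored as normalized fixed-width lowercase hex (as min/max stats on a string column would be compared):

```python
def prune_files(files: list[dict], block_lo: int, block_hi: int,
                address: str) -> list[dict]:
    """Keep only files whose block range and address min/max stats can match."""
    kept = []
    for f in files:
        # Block range filter, from the filename convention / registered range.
        if f["end_block"] < block_lo or f["start_block"] > block_hi:
            continue
        # Column statistics filter: address must fall within [min, max].
        stats = f["stats"]["address"]
        if address < stats["min"] or address > stats["max"]:
            continue
        kept.append(f)
    return kept
```

Row-group and page-level pruning apply the same min/max logic at finer granularity inside each surviving file.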
Query Execution Modes
Batch Queries (default)

- One-shot execution against current data
- Returns complete result set
- Supports all SQL operations (aggregations, ORDER BY, LIMIT)
- Best for ad-hoc analysis and reports
Streaming Queries

- Continuous execution as new blocks arrive
- Emits incremental results (microbatches)
- Limited to incrementalizable operations (filters, projections, simple joins)
- Best for real-time monitoring and alerting
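To illustrate why streaming mode is limited to incrementalizable operations: a filter or projection can be applied to each microbatch independently and emitted immediately, whereas a global ORDER BY or full aggregation cannot finish until all input is seen. An illustrative sketch (not Amp's engine):

```python
from typing import Callable, Iterator

def stream_filter(predicate: Callable[[dict], bool],
                  microbatches: Iterator[list[dict]]) -> Iterator[list[dict]]:
    """Incrementalizable operator: filter each microbatch as it arrives."""
    for batch in microbatches:
        out = [row for row in batch if predicate(row)]
        if out:
            yield out  # emit incremental results without waiting for more input
```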
Transport Formats
Arrow Flight (port 1602)

- Binary protocol over gRPC
- Zero-copy data transfer
- Streaming support built-in
- Best for high-throughput workloads
JSON Lines (HTTP)

- HTTP POST with NDJSON response
- Newline-delimited JSON records
- Compression support (gzip, brotli, deflate)
- Best for debugging and simple integrations
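Consuming an NDJSON response is simple enough to show in full; this sketch only parses the body (endpoint URL and request shape are not specified here, so they are omitted):

```python
import json
from typing import Iterable, Iterator

def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Parse a newline-delimited JSON response, one record per line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate trailing blank lines
            yield json.loads(line)
```

One record per line means a client can process results as they stream in, without buffering the whole response.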
End-to-End Example
A complete flow from blockchain source to query result:

1. Dataset Deployment
2. Data Extraction (Worker)
3. Query Execution (Server)
4. Client Receives Results
Performance Characteristics
Extraction Throughput:

- EVM RPC: 100-1,000 blocks/sec (depends on RPC endpoint)
- Firehose: 10,000+ blocks/sec (gRPC streaming)
- Solana: Varies by archive vs RPC mode
Query Performance:

- Predicate pushdown reduces I/O by 90%+ in typical cases
- Columnar format enables efficient projection
- Parquet compression reduces storage costs 5-10x vs raw JSON
- In-memory metadata caching eliminates cold-start latency
Scalability:

- Horizontal scaling: Add more workers for extraction
- Vertical scaling: More CPU/memory for query servers
- Storage: Unlimited via object stores (S3/GCS)
- Metadata DB: PostgreSQL handles millions of files
Related Documentation
Architecture
Understand system components and technology stack
Datasets
Learn about dataset manifests, tables, and schemas
Providers
Configure blockchain data source connections
Querying Data
Write SQL queries with UDFs and optimization tips