## Overview

Amp stores dataset parquet files in object storage, supporting local filesystems, S3-compatible stores, Google Cloud Storage (GCS), and Azure Blob Storage. All directory configurations (`data_dir`, `manifests_dir`, `providers_dir`) support both filesystem paths and object store URLs.
## Storage Backends

Amp supports the following storage backends:

- Local Filesystem: For development and single-node deployments
- Amazon S3: AWS S3 and S3-compatible stores (MinIO, DigitalOcean Spaces, etc.)
- Google Cloud Storage (GCS): Google Cloud Storage buckets
- Azure Blob Storage: Azure Blob Storage containers
## Configuration

### Local Filesystem

Use absolute or relative filesystem paths. This backend is suited to:

- Local development and testing
- Single-node deployments with local storage
- Fast iteration during development

Limitations:

- Not suitable for distributed deployments
- No built-in replication or durability guarantees
- Limited by local disk capacity
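As a sketch, a configuration pointing all three directories at the local filesystem might look like this (the paths are placeholders; the key names are the ones listed in the overview):

```toml
# Illustrative local-filesystem configuration; path values are placeholders
data_dir = "/var/lib/amp/data"
manifests_dir = "/var/lib/amp/manifests"
providers_dir = "./providers"
```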
### S3-Compatible Storage

Use S3 URLs for AWS S3 or S3-compatible object stores: `s3://<bucket>/<optional-prefix>`

Authentication: Configure via environment variables.

- AWS S3: Use default AWS endpoints
- MinIO: Set `AWS_ENDPOINT` to your MinIO server
- DigitalOcean Spaces: Set `AWS_ENDPOINT` to `https://<region>.digitaloceanspaces.com`
- Cloudflare R2: Set `AWS_ENDPOINT` to your R2 endpoint
- Wasabi: Set `AWS_ENDPOINT` to `https://s3.<region>.wasabisys.com`
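For example, pointing at a local MinIO server might use credentials like these (all values are placeholders; the standard AWS variable names are assumed here, and they match the troubleshooting section below):

```shell
# Placeholder credentials for a local MinIO server
export AWS_ACCESS_KEY_ID="minioadmin"
export AWS_SECRET_ACCESS_KEY="minioadmin"
export AWS_DEFAULT_REGION="us-east-1"
export AWS_ENDPOINT="http://localhost:9000"
```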
### Google Cloud Storage (GCS)

Use GCS URLs for Google Cloud Storage: `gs://<bucket>/<optional-prefix>`

Authentication: Configure via environment variables. The service account needs the following permissions:

- `storage.objects.create`
- `storage.objects.delete`
- `storage.objects.get`
- `storage.objects.list`
### Azure Blob Storage

Use Azure URLs for Azure Blob Storage: `az://<container>/<optional-prefix>`

Authentication: Configure via environment variables (the storage account name and key).
## Data Directory Structure

The `data_dir` contains extracted parquet files organized by dataset, table, and revision.

### Path Components

- Dataset directory: `namespace_name` (e.g., `my_org_eth_mainnet` for `my_org/eth_mainnet`)
- Table directory: Table name (e.g., `blocks`, `transactions`, `logs`)
- Revision directory: UUIDv7 revision ID (e.g., `01930eaf-67e5-7b5e-80b0-8d3f2a5c4e1b`)
- Parquet files: Named `{block_num:09}-{suffix:016x}.parquet`
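Read as format specifiers, the file-name pattern zero-pads the starting block number to nine digits and renders the random suffix as sixteen lowercase hex digits. A quick sketch in Python (the helper name is hypothetical):

```python
def parquet_file_name(block_num: int, suffix: int) -> str:
    # {block_num:09}-{suffix:016x}.parquet: 9-digit zero-padded block number,
    # followed by a 16-hex-digit random suffix
    return f"{block_num:09d}-{suffix:016x}.parquet"

print(parquet_file_name(1024, 0x1A2B))  # → 000001024-0000000000001a2b.parquet
```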
## Revision Management

Each revision is immutable and contains a snapshot of table data:

- Active revision: Currently queryable revision (tracked in metadata database)
- Inactive revisions: Previous snapshots (retained for potential recovery)
- Revision IDs: UUIDv7 format provides temporal ordering
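UUIDv7 values begin with a 48-bit millisecond timestamp, which is why they sort in creation order. A minimal sketch of that layout (not Amp's actual generator):

```python
import os
import time
import uuid

def uuid7() -> uuid.UUID:
    """Sketch of RFC 9562 UUIDv7: 48-bit ms timestamp, version/variant bits, random tail."""
    ts = time.time_ns() // 1_000_000                                  # unix ms, fits in 48 bits
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF            # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)   # 62 random bits
    value = (ts << 80) | (0x7 << 76) | (rand_a << 64) | (0b10 << 62) | rand_b
    return uuid.UUID(int=value)

a = uuid7()
time.sleep(0.005)
b = uuid7()
assert str(a) < str(b)  # later revisions sort lexicographically after earlier ones
```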
## Parquet File Management

Amp writes data as parquet files with the following characteristics.

### File Naming

Files are named with the starting block number and a random suffix.

### File Size Targets

Configure target file sizes in the `[writer]` section:
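The exact keys are not shown on this page; as an assumption, a `[writer]` excerpt might look like this (both the key name and value are hypothetical):

```toml
[writer]
# Hypothetical key: target size of each parquet file before rolling to a new one
target_file_size = "512MB"
```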
### Compression

Parquet files support multiple compression algorithms:

- `zstd(level)`: Zstandard (levels 1-22, default 1)
- `lz4`: LZ4 (fast compression)
- `gzip`: Gzip (slower, better compression)
- `brotli(level)`: Brotli (levels 1-11)
- `snappy`: Snappy (fast compression)
- `uncompressed`: No compression

Guidance:

- `zstd(1)`: Best balance of speed and compression ratio (recommended)
- `lz4`: Fastest compression, lower ratio
- `zstd(3)` or `gzip`: Better compression, slower writes
- `uncompressed`: Fastest writes, largest files
### Bloom Filters

Enable bloom filters for faster point lookups. Bloom filters speed up equality predicates (e.g., `WHERE column = value`) but increase file size.
### Metadata Cache

Configure parquet metadata caching.

### File Compaction

The compactor merges small files into larger ones to improve query performance:

- Scans for small files below target size
- Groups files by block range
- Merges files into larger parquet files
- Marks old files for deletion
- Updates the metadata database
### Garbage Collection

The garbage collector removes obsolete files:

- Identifies files marked for deletion (expired files)
- Waits for the deletion lock duration to ensure no queries are still using them
- Deletes the files from object storage
- Removes their metadata entries from the database
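The grace-period step above can be sketched as follows (the type, field names, and lock duration are hypothetical, not Amp's):

```python
from dataclasses import dataclass
from typing import Optional

DELETION_LOCK_SECS = 3600.0  # assumed grace period; Amp's actual value is not stated here

@dataclass
class FileEntry:
    path: str
    marked_at: Optional[float] = None  # epoch seconds when marked for deletion, if any

def collectable(entries: list[FileEntry], now: float) -> list[str]:
    """Paths that are marked for deletion AND past the deletion lock window."""
    return [
        e.path
        for e in entries
        if e.marked_at is not None and now - e.marked_at >= DELETION_LOCK_SECS
    ]

files = [
    FileEntry("a.parquet", marked_at=0.0),     # marked long ago: safe to delete
    FileEntry("b.parquet", marked_at=7000.0),  # still inside the lock window
    FileEntry("c.parquet"),                    # active: never collected
]
print(collectable(files, now=7200.0))  # → ['a.parquet']
```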
## Mixed Storage Backends

You can use different storage backends for different directories:

- Cost optimization: Store large datasets in cheaper object storage
- Performance: Keep frequently accessed configs on local storage
- Security: Separate sensitive configs from public datasets
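For instance, a deployment might combine backends like this (bucket and path values are placeholders; the key names are the ones from the overview):

```toml
data_dir = "s3://amp-datasets/prod"       # large parquet data in object storage
manifests_dir = "/var/lib/amp/manifests"  # frequently accessed, kept local
providers_dir = "/etc/amp/providers"      # sensitive configs, kept local
```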
## Best Practices

### Production Deployments
- Use object storage: S3, GCS, or Azure for distributed deployments
- Enable versioning: Object store versioning for disaster recovery
- Configure lifecycle policies: Automatically archive or delete old revisions
- Monitor storage costs: Track storage growth and optimize compression
- Test disaster recovery: Verify backup and restore procedures
### Performance Optimization
- Choose appropriate compression: Balance speed and storage costs
- Enable compaction: Reduce small file overhead
- Configure metadata cache: Size cache based on dataset count and memory
- Use regional endpoints: Minimize latency to object storage
- Monitor file counts: Too many small files slow down queries
### Security
- Use IAM roles: Avoid hardcoded credentials (use instance profiles, workload identity)
- Encrypt at rest: Enable object storage encryption
- Encrypt in transit: Use HTTPS endpoints
- Restrict bucket access: Use bucket policies and VPC endpoints
- Rotate credentials: Regular credential rotation for access keys
## Troubleshooting

### Connection Errors

- S3 errors: Check `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION`
- GCS errors: Verify service account credentials and permissions
- Azure errors: Confirm storage account name and key

### Permission Errors

Ensure the service account or IAM role has:

- Read: List objects, get objects
- Write: Put objects, delete objects
- List: List buckets/containers
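For S3, a minimal IAM policy covering those operations might look like this (the bucket name is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::amp-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::amp-datasets/*"
    }
  ]
}
```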
### Performance Issues
- Slow queries: Check metadata cache size, enable compaction
- High latency: Use regional endpoints closer to compute
- Many small files: Enable compaction to merge files
## Next Steps

- Metadata Database: Configure PostgreSQL for metadata
- Telemetry: Set up metrics and monitoring