
Overview

Amp stores dataset parquet files in object storage, supporting local filesystems, S3-compatible stores, Google Cloud Storage (GCS), and Azure Blob Storage. All directory configurations (data_dir, manifests_dir, providers_dir) support both filesystem paths and object store URLs.

Storage Backends

Amp supports the following storage backends:
  • Local Filesystem: For development and single-node deployments
  • Amazon S3: AWS S3 and S3-compatible stores (MinIO, DigitalOcean Spaces, etc.)
  • Google Cloud Storage (GCS): Google Cloud Storage buckets
  • Azure Blob Storage: Azure Blob Storage containers

Configuration

Local Filesystem

Use absolute or relative filesystem paths:
data_dir = "/var/amp/data"
manifests_dir = "./manifests"
providers_dir = "./providers"
When to use:
  • Local development and testing
  • Single-node deployments with local storage
  • Fast iteration during development
Limitations:
  • Not suitable for distributed deployments
  • No built-in replication or durability guarantees
  • Limited by local disk capacity

S3-Compatible Storage

Use S3 URLs for AWS S3 or S3-compatible object stores:
data_dir = "s3://my-bucket/amp/data"
manifests_dir = "s3://my-bucket/amp/manifests"
providers_dir = "s3://my-bucket/amp/providers"
URL format: s3://<bucket>/<optional-prefix>

Authentication: Configure via environment variables:
# AWS credentials
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_DEFAULT_REGION="us-east-1"

# Optional: Custom endpoint for S3-compatible stores
export AWS_ENDPOINT="https://s3.example.com"

# Optional: Session token for temporary credentials
export AWS_SESSION_TOKEN="your-session-token"

# Optional: Allow non-TLS connections (not recommended for production)
export AWS_ALLOW_HTTP="true"
S3-Compatible Stores:
  • AWS S3: Use default AWS endpoints
  • MinIO: Set AWS_ENDPOINT to your MinIO server
  • DigitalOcean Spaces: Set AWS_ENDPOINT to https://<region>.digitaloceanspaces.com
  • Cloudflare R2: Set AWS_ENDPOINT to your R2 endpoint
  • Wasabi: Set AWS_ENDPOINT to https://s3.<region>.wasabisys.com
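All three URL schemes above decompose the same way: scheme, bucket (or container), and an optional key prefix. As a rough illustration of that split (this is not Amp's internal parser, just a sketch using the standard library):

```python
from urllib.parse import urlparse

def parse_store_url(url: str) -> tuple[str, str, str]:
    """Split an object store URL like s3://bucket/prefix into
    (scheme, bucket, prefix). Illustrative sketch only."""
    parsed = urlparse(url)
    if parsed.scheme not in ("s3", "gs", "az"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")

print(parse_store_url("s3://my-bucket/amp/data"))
# ('s3', 'my-bucket', 'amp/data')
```

The prefix is optional, so `s3://my-bucket` alone is also valid and yields an empty prefix.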

Google Cloud Storage (GCS)

Use GCS URLs for Google Cloud Storage:
data_dir = "gs://my-bucket/amp/data"
manifests_dir = "gs://my-bucket/amp/manifests"
providers_dir = "gs://my-bucket/amp/providers"
URL format: gs://<bucket>/<optional-prefix>

Authentication: Configure via environment variables:
# Option 1: Service account file path
export GOOGLE_SERVICE_ACCOUNT_PATH="/path/to/service-account.json"

# Option 2: Service account key (JSON string)
export GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account",...}'

# Option 3: Application Default Credentials (ADC)
# No environment variable needed - uses gcloud auth or Compute Engine metadata
Service account permissions:
  • storage.objects.create
  • storage.objects.delete
  • storage.objects.get
  • storage.objects.list

Azure Blob Storage

Use Azure URLs for Azure Blob Storage:
data_dir = "az://my-container/amp/data"
manifests_dir = "az://my-container/amp/manifests"
providers_dir = "az://my-container/amp/providers"
URL format: az://<container>/<optional-prefix>

Authentication: Configure via environment variables:
# Account name and key
export AZURE_STORAGE_ACCOUNT_NAME="myaccount"
export AZURE_STORAGE_ACCOUNT_KEY="account-key"

# Or use connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;..."

Data Directory Structure

The data_dir contains extracted parquet files organized by dataset, table, and revision:
data_dir/
├── namespace_name/              # Dataset (namespace/name)
│   ├── blocks/                  # Table name
│   │   ├── 01930eaf-7b5e-.../  # Revision ID (UUIDv7)
│   │   │   ├── 000000000-a1b2c3d4e5f67890.parquet
│   │   │   ├── 000001000-b2c3d4e5f6789a01.parquet
│   │   │   └── ...
│   │   └── 01930eb0-8c9d-.../  # Another revision
│   ├── transactions/
│   │   └── 01930eaf-7b5e-.../
│   └── logs/
│       └── 01930eaf-7b5e-.../
└── another_dataset/

Path Components

  • Dataset directory: namespace_name (e.g., my_org_eth_mainnet for my_org/eth_mainnet)
  • Table directory: Table name (e.g., blocks, transactions, logs)
  • Revision directory: UUIDv7 revision ID (e.g., 01930eaf-67e5-7b5e-80b0-8d3f2a5c4e1b)
  • Parquet files: Named {block_num:09}-{suffix:016x}.parquet

Revision Management

Each revision is immutable and contains a snapshot of table data:
  • Active revision: Currently queryable revision (tracked in metadata database)
  • Inactive revisions: Previous snapshots (retained for potential recovery)
  • Revision IDs: UUIDv7 format provides temporal ordering
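The temporal ordering comes from the UUIDv7 layout itself: per RFC 9562, the first 48 bits of a UUIDv7 encode milliseconds since the Unix epoch, so lexicographic order is creation order. A small sketch (using the example revision ID from above; the second ID below is hypothetical, constructed only to show the ordering):

```python
from datetime import datetime, timezone

def uuid7_timestamp(revision_id: str) -> datetime:
    """Recover the creation time embedded in a UUIDv7 revision ID.
    RFC 9562: the first 48 bits are milliseconds since the Unix epoch."""
    hex_str = revision_id.replace("-", "")
    ms = int(hex_str[:12], 16)
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Revision IDs sort chronologically because the timestamp leads the UUID
print(uuid7_timestamp("01930eaf-67e5-7b5e-80b0-8d3f2a5c4e1b"))
```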

Parquet File Management

Amp writes data as parquet files with the following characteristics:

File Naming

Files are named with the starting block number and a random suffix:
000000000-a1b2c3d4e5f67890.parquet
│         │
│         └─ Random 16-char hex suffix (for uniqueness)
└─ Starting block number (9 digits, zero-padded)
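The `{block_num:09}-{suffix:016x}.parquet` scheme can be sketched in a few lines; this is an illustration of the naming convention, not Amp's writer code:

```python
import re
import secrets

def parquet_filename(start_block: int) -> str:
    """Build a name in the {block_num:09}-{suffix:016x}.parquet scheme:
    9-digit zero-padded start block plus 16 random hex chars."""
    suffix = secrets.randbits(64)
    return f"{start_block:09d}-{suffix:016x}.parquet"

def parse_parquet_filename(name: str) -> int:
    """Extract the starting block number from a parquet file name."""
    m = re.fullmatch(r"(\d{9})-([0-9a-f]{16})\.parquet", name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(m.group(1))

print(parse_parquet_filename("000001000-b2c3d4e5f6789a01.parquet"))  # 1000
```

Because the block number leads the name, files within a revision directory sort by block range.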

File Size Targets

Configure target file sizes in the [writer] section:
[writer]
# Target 2GB per file (default)
bytes = 2147483648

# Or target specific row count
rows = 1000000

# Overflow multiplier for flexibility
overflow = "1.5"  # Allow files up to 1.5x target size
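Reading the overflow multiplier as a multiplicative cap on the target (which matches "up to 1.5x target size" above), the rollover decision can be sketched as follows; `should_roll_over` is a hypothetical helper, not part of Amp's API:

```python
def should_roll_over(current_bytes: int, target_bytes: int, overflow: float) -> bool:
    """Decide whether a file has hit its relaxed size cap. With the defaults
    above (2 GiB target, overflow 1.5), a file may grow to 3 GiB before a
    new file is started. Sketch only."""
    return current_bytes >= target_bytes * overflow

target = 2_147_483_648  # 2 GiB, the default
print(should_roll_over(2_500_000_000, target, 1.5))  # False: still under the 3 GiB cap
print(should_roll_over(3_300_000_000, target, 1.5))  # True: cap exceeded
```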

Compression

Parquet files support multiple compression algorithms:
[writer]
compression = "zstd(1)"  # Default: zstd with level 1
Available algorithms:
  • zstd(level): Zstandard (levels 1-22, default 1)
  • lz4: LZ4 (fast compression)
  • gzip: Gzip (slower, better compression)
  • brotli(level): Brotli (levels 1-11)
  • snappy: Snappy (fast compression)
  • uncompressed: No compression
Compression tradeoffs:
  • zstd(1): Best balance of speed and compression ratio (recommended)
  • lz4: Fastest compression, lower ratio
  • zstd(3) or gzip: Better compression, slower writes
  • uncompressed: Fastest writes, largest files
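The level-versus-ratio tradeoff is easy to see empirically. Zstandard is not in Python's standard library, so the sketch below uses stdlib zlib (the same codec behind gzip) purely to illustrate how higher levels trade write speed for smaller files; the payload is made-up sample data:

```python
import zlib

# Repetitive sample payload, standing in for columnar data
payload = b"block,transaction,log\n" * 10_000

for level in (1, 6, 9):
    compressed = zlib.compress(payload, level)
    ratio = len(payload) / len(compressed)
    print(f"level {level}: {len(compressed)} bytes ({ratio:.0f}x)")
```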

Bloom Filters

Enable bloom filters for faster point lookups:
[writer]
bloom_filters = true  # Default: false
Bloom filters improve query performance for equality filters (WHERE column = value) but increase file size.
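The reason equality filters benefit: a bloom filter answers "definitely not present" or "maybe present", so files whose filters reject the value can be skipped without being read. A minimal conceptual sketch (parquet defines its own bloom filter format; this is not Amp's or parquet's implementation):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: set k bit positions per value; membership is
    'maybe' only if all k bits are set. Never a false negative."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: str):
        digest = hashlib.sha256(value.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]  # 4 digest bytes per position
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

bf = BloomFilter()
bf.add("0xdeadbeef")
print(bf.might_contain("0xdeadbeef"))  # True -- inserted values are never missed
print(bf.might_contain("0xcafebabe"))  # almost certainly False: the file can be skipped
```

The file-size cost in the doc's warning comes from storing these bit arrays alongside each column chunk.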

Metadata Cache

Configure parquet metadata caching:
[writer]
cache_size_mb = 1024  # Default: 1GB
The metadata cache stores parsed parquet footers in memory, avoiding repeated reads from object storage during query planning.

File Compaction

The compactor merges small files into larger ones to improve query performance:
[writer.compactor]
active = true                # Enable compactor (default: false)
metadata_concurrency = 2     # Concurrent metadata operations
write_concurrency = 2        # Concurrent compaction writes
min_interval = 1.0           # Run every 1 second
cooldown_duration = 1024.0   # Base cooldown in seconds
overflow = "1"               # Eager compaction overflow
bytes = 0                    # Eager compaction threshold (0 = use target size)
rows = 0                     # Eager compaction row threshold
How it works:
  1. Scans for small files below target size
  2. Groups files by block range
  3. Merges files into larger parquet files
  4. Marks old files for deletion
  5. Updates metadata database
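Steps 1-2 amount to a planning pass over the file list. A sketch of that pass, under the simplifying assumptions that block ranges are contiguous and batching stops once a batch reaches the target size (`plan_compaction` is hypothetical, not Amp's API):

```python
def plan_compaction(files: list[tuple[str, int, int]],
                    target_bytes: int) -> list[list[str]]:
    """Group undersized files into merge batches in block order.
    files = (name, start_block, size_bytes). Planning step only."""
    small = sorted((f for f in files if f[2] < target_bytes), key=lambda f: f[1])
    batches, batch, batch_bytes = [], [], 0
    for name, _, size in small:
        batch.append(name)
        batch_bytes += size
        if batch_bytes >= target_bytes:
            batches.append(batch)
            batch, batch_bytes = [], 0
    if len(batch) > 1:  # a lone leftover file gains nothing from merging
        batches.append(batch)
    return batches

files = [("000000000-aa.parquet", 0, 400),
         ("000001000-bb.parquet", 1000, 400),
         ("000002000-cc.parquet", 2000, 300)]
print(plan_compaction(files, target_bytes=1000))
```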

Garbage Collection

The garbage collector removes obsolete files:
[writer.collector]
active = true                    # Enable collector (default: false)
min_interval = 30.0              # Run every 30 seconds
deletion_lock_duration = 1800.0  # Hold lock for 30 minutes
Deletion process:
  1. Identifies files marked for deletion (expired files)
  2. Waits for deletion lock duration to ensure no queries are using them
  3. Deletes files from object storage
  4. Removes metadata entries from database
Only enable garbage collection after compaction is working correctly. Deleted files cannot be recovered.

Mixed Storage Backends

You can use different storage backends for different directories:
# Data in S3
data_dir = "s3://prod-bucket/data"

# Manifests in GCS
manifests_dir = "gs://config-bucket/manifests"

# Providers on local filesystem
providers_dir = "./providers"
This is useful for:
  • Cost optimization: Store large datasets in cheaper object storage
  • Performance: Keep frequently accessed configs on local storage
  • Security: Separate sensitive configs from public datasets

Best Practices

Production Deployments

  1. Use object storage: S3, GCS, or Azure for distributed deployments
  2. Enable versioning: Object store versioning for disaster recovery
  3. Configure lifecycle policies: Automatically archive or delete old revisions
  4. Monitor storage costs: Track storage growth and optimize compression
  5. Test disaster recovery: Verify backup and restore procedures

Performance Optimization

  1. Choose appropriate compression: Balance speed and storage costs
  2. Enable compaction: Reduce small file overhead
  3. Configure metadata cache: Size cache based on dataset count and memory
  4. Use regional endpoints: Minimize latency to object storage
  5. Monitor file counts: Too many small files slow down queries

Security

  1. Use IAM roles: Avoid hardcoded credentials (use instance profiles, workload identity)
  2. Encrypt at rest: Enable object storage encryption
  3. Encrypt in transit: Use HTTPS endpoints
  4. Restrict bucket access: Use bucket policies and VPC endpoints
  5. Rotate credentials: Regular credential rotation for access keys

Troubleshooting

Connection Errors

  • S3 errors: Check AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION
  • GCS errors: Verify service account credentials and permissions
  • Azure errors: Confirm storage account name and key

Permission Errors

Ensure the service account or IAM role has:
  • Read: List objects, get objects
  • Write: Put objects, delete objects
  • List: List buckets/containers

Performance Issues

  • Slow queries: Check metadata cache size, enable compaction
  • High latency: Use regional endpoints closer to compute
  • Many small files: Enable compaction to merge files

Next Steps

Metadata Database

Configure PostgreSQL for metadata

Telemetry

Set up metrics and monitoring
