
Configuration File

Amp uses a TOML configuration file to configure both the extraction and serving of datasets. The configuration file path is specified via the AMP_CONFIG environment variable.
export AMP_CONFIG=/path/to/config.toml

Solo Mode Auto-Discovery

For ampd solo, the configuration file is automatically discovered at .amp/config.toml if it exists. You can override this by passing --config <path> or setting the AMP_CONFIG environment variable. For other commands (server, worker, controller), the --config flag or AMP_CONFIG environment variable is required.
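The lookup for ampd solo can be pictured with the shell sketch below. It is not Amp's actual implementation: resolve_solo_config is a hypothetical helper, and the precedence of --config over AMP_CONFIG is an assumption, since the text above only says that either one overrides auto-discovery.

```shell
# Sketch only: mimics the config lookup described for `ampd solo`.
# Assumption: --config wins over AMP_CONFIG (the docs do not state the order).
resolve_solo_config() {
  cli_path="$1"                      # value given to --config, or empty
  if [ -n "$cli_path" ]; then
    printf '%s\n' "$cli_path"
  elif [ -n "${AMP_CONFIG:-}" ]; then
    printf '%s\n' "$AMP_CONFIG"
  elif [ -f ".amp/config.toml" ]; then
    printf '%s\n' ".amp/config.toml"
  else
    return 1                         # nothing found
  fi
}
```

Calling resolve_solo_config "" from a directory that contains .amp/config.toml, with AMP_CONFIG unset, prints the auto-discovered path.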

Sample Configuration

A complete sample configuration with all available options is provided in the source repository at docs/config.sample.toml. Copy this file and edit it to match your deployment requirements.
A configuration file is not mandatory: every configuration value can instead be supplied through environment variables. See the Environment Variable Overrides section below.

Key Configuration Directories

Amp requires three object storage directories to be configured:
data_dir (string, required)
Where the extracted dataset Parquet tables are stored. Can be initially empty. Supports both filesystem paths and object store URLs (S3, GCS, Azure).
data_dir = "data"
# or
data_dir = "s3://my-bucket/data"
manifests_dir (string, required)
Directory containing dataset definitions (manifest JSON files). This is the input to the extraction process.
manifests_dir = "manifests"
providers_dir (string, required)
Directory containing provider configurations for external services like Firehose and RPC endpoints. Each provider is configured as a separate TOML file.
providers_dir = "providers"
Configuring three separate directories takes a little more setup, but it means dataset definitions, provider configurations, and extracted data can each live in a different location or object store.
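Putting the three directories together, a minimal setup could look like the sketch below. It uses only the keys described above. The scratch directory and the local directory names are placeholders, and since this page does not say whether Amp creates missing local directories itself, the sketch creates them up front and uses absolute paths.

```shell
# Scratch project directory for illustration; use your own path in practice.
project_dir="$(mktemp -d)"
mkdir -p "$project_dir/data" "$project_dir/manifests" "$project_dir/providers"

# Write a minimal config containing only the three required directories.
cat > "$project_dir/config.toml" <<EOF
data_dir = "$project_dir/data"
manifests_dir = "$project_dir/manifests"
providers_dir = "$project_dir/providers"
EOF

# Point Amp at the file.
export AMP_CONFIG="$project_dir/config.toml"
```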

Service Addresses

The following optional configuration keys control the hostname and port that each service binds to:
flight_addr (string, default: "0.0.0.0:1602")
Arrow Flight RPC server address for high-performance binary queries.
flight_addr = "0.0.0.0:1602"
jsonl_addr (string, default: "0.0.0.0:1603")
JSON Lines server address for HTTP-based queries.
jsonl_addr = "0.0.0.0:1603"
admin_api_addr (string, default: "0.0.0.0:1610")
Admin API server address for management operations.
admin_api_addr = "0.0.0.0:1610"
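The defaults bind to 0.0.0.0, that is, all IPv4 interfaces. As a sketch, the snippet below sanity-checks a host:port string and then restricts the Admin API to the loopback interface using the override naming rule from the Environment Variable Overrides section. check_addr is a hypothetical helper, not part of Amp, and 127.0.0.1 is only an example choice.

```shell
# Sketch: sanity-check a "host:port" string before using it as a service address.
check_addr() {
  case "$1" in
    *:*) ;;                    # must contain a colon
    *) return 1 ;;
  esac
  host="${1%:*}"               # text before the last colon
  port="${1##*:}"              # text after the last colon
  [ -n "$host" ] || return 1
  case "$port" in
    ''|*[!0-9]*) return 1 ;;   # port must be all digits
  esac
  [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}

# Bind the Admin API to loopback only; the other addresses keep their defaults.
check_addr "127.0.0.1:1610" && export AMP_CONFIG_ADMIN_API_ADDR="127.0.0.1:1610"
```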

Environment Variable Overrides

Every value in the configuration file can be overridden from the environment by setting a variable whose name is the configuration key prefixed with AMP_CONFIG_.

Top-Level Values

For top-level configuration values, use uppercase with the AMP_CONFIG_ prefix:
# Override data_dir
export AMP_CONFIG_DATA_DIR="s3://my-bucket/data"

# Override manifests_dir
export AMP_CONFIG_MANIFESTS_DIR="gs://my-bucket/manifests"

# Override providers_dir
export AMP_CONFIG_PROVIDERS_DIR="./providers"

Nested Configuration Values

For nested configuration values, use double underscores (__) to represent the nesting hierarchy:
# Override metadata_db.url
export AMP_CONFIG_METADATA_DB__URL="postgresql://user:pass@host/db"

# Override metadata_db.pool_size
export AMP_CONFIG_METADATA_DB__POOL_SIZE=20

# Override writer.compression
export AMP_CONFIG_WRITER__COMPRESSION="zstd(3)"

# Override opentelemetry.metrics_url
export AMP_CONFIG_OPENTELEMETRY__METRICS_URL="http://localhost:4318/v1/metrics"
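The naming rule in the two sections above is mechanical, so it can be expressed as a small helper. The sketch below is illustrative only; config_key_to_env is a hypothetical function name, and it takes the key in dotted form (for example metadata_db.url).

```shell
# Hypothetical helper: derive the override variable name for a config key.
# Dots (nesting) become double underscores, letters are uppercased.
config_key_to_env() {
  printf 'AMP_CONFIG_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/\./__/g')"
}

config_key_to_env data_dir            # prints AMP_CONFIG_DATA_DIR
config_key_to_env metadata_db.url     # prints AMP_CONFIG_METADATA_DB__URL
```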

Mixing Configuration File and Environment Variables

You can use a configuration file for base settings and override specific values with environment variables. This is useful for:
  • Development: Use a local config file with environment-specific overrides
  • Production: Store secrets in environment variables while keeping other config in files
  • CI/CD: Override database URLs and object store paths per environment
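The "file as base, environment as override" pattern can be illustrated with a short sketch. It only mimics the precedence described above for top-level string keys; effective_value is a hypothetical helper, the sed expression is a crude stand-in for real TOML parsing, and none of this is Amp's actual loader.

```shell
# Sketch: an AMP_CONFIG_* variable wins, otherwise the value comes from the file.
effective_value() {
  key="$1"; file="$2"
  env_name="AMP_CONFIG_$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]')"
  if value="$(printenv "$env_name")"; then
    printf '%s\n' "$value"                            # override from the environment
  else
    sed -n "s/^$key *= *\"\(.*\)\"\$/\1/p" "$file"    # fall back to the file
  fi
}

base="$(mktemp)"
printf 'data_dir = "data"\nmanifests_dir = "manifests"\n' > "$base"
export AMP_CONFIG_DATA_DIR="s3://my-bucket/data"

effective_value data_dir "$base"        # prints s3://my-bucket/data
effective_value manifests_dir "$base"   # prints manifests if AMP_CONFIG_MANIFESTS_DIR is unset
```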

Memory and Performance

max_mem_mb (integer, default: 0)
Global memory limit for all queries in MB. A value of 0 means unlimited.
max_mem_mb = 8192  # 8GB total limit
query_max_mem_mb (integer, default: 0)
Per-query memory limit in MB. A value of 0 means unlimited per query.
query_max_mem_mb = 2048  # 2GB per query
spill_location (array, default: [])
Directories where DataFusion writes temporary spill-to-disk files when memory limits are exceeded.
spill_location = ["/tmp/amp-spill", "/mnt/ssd/amp-spill"]
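As a rough illustration of how these settings relate, the sketch below checks a spill directory and does the arithmetic for the example limits. check_spill_dir is a hypothetical helper. It assumes the directories should exist and be writable before Amp starts, which this page does not state either way.

```shell
# Sketch: pre-flight check for a directory listed in spill_location.
check_spill_dir() {
  [ -d "$1" ] && [ -w "$1" ]
}

# With the example limits above, 8192 / 2048 = 4 queries can each use their
# full per-query budget at the same time before the global limit is reached.
max_mem_mb=8192
query_max_mem_mb=2048
echo "queries at full budget: $(( max_mem_mb / query_max_mem_mb ))"   # prints 4
```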

Operational Timing

poll_interval_secs (float, default: 1.0)
Polling interval for new blocks during extraction in seconds.
poll_interval_secs = 0.5  # Poll every 500ms
microbatch_max_interval (integer, default: 100000)
Maximum interval, in blocks, covered by each microbatch when dumping derived datasets.
microbatch_max_interval = 50000
server_microbatch_max_interval (integer, default: 1000)
Maximum interval, in blocks, covered by each microbatch on the streaming server.
server_microbatch_max_interval = 500
keep_alive_interval (integer, default: 30)
Keep-alive interval for streaming server in seconds. Minimum value is 30.
keep_alive_interval = 60
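These timing values are top-level keys, so they can also be set through the AMP_CONFIG_ override rule. The sketch below guards keep_alive_interval against the documented minimum before exporting it. What Amp does with a value below 30 is not specified on this page, so the guard is a precaution rather than a description of Amp's behavior.

```shell
# Sketch: export timing overrides, refusing a keep-alive below the documented minimum.
keep_alive=60
if [ "$keep_alive" -ge 30 ]; then
  export AMP_CONFIG_KEEP_ALIVE_INTERVAL="$keep_alive"
else
  echo "keep_alive_interval must be at least 30" >&2
fi

# Poll for new blocks every 500ms.
export AMP_CONFIG_POLL_INTERVAL_SECS=0.5
```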

Logging

Logging verbosity is controlled by the AMP_LOG environment variable (not in the config file):
# Set log level (error, warn, info, debug, trace)
export AMP_LOG=info
For more fine-grained log filtering, use the standard RUST_LOG environment variable:
# Debug logs from the amp_worker module, warnings and above from everything else
export RUST_LOG=amp_worker=debug,warn
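Because AMP_LOG lives outside the config file, a typo in the level is easy to miss. One way to guard against that is a small wrapper that accepts only the five levels listed above. set_amp_log is a hypothetical helper, not an Amp command.

```shell
# Sketch: accept only the documented levels before exporting AMP_LOG.
set_amp_log() {
  case "$1" in
    error|warn|info|debug|trace) export AMP_LOG="$1" ;;
    *) echo "unknown log level: $1" >&2; return 1 ;;
  esac
}

set_amp_log debug
```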

Configuration Validation

Amp validates the configuration file on startup and will report errors if:
  • Required fields are missing
  • Field types are incorrect
  • Object store URLs are malformed
  • Service addresses are invalid
Check the startup logs for configuration validation errors.
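A quick pre-flight check can catch the most common of these problems, a missing required field, before Amp is started. The sketch below is a hypothetical helper and not a substitute for Amp's own validation: it treats a key as present if it appears at the start of a line in the file or is supplied through an AMP_CONFIG_ override, and the grep is a crude stand-in for real TOML parsing.

```shell
# Sketch: print each required key that is set neither in the file nor via the environment.
missing_required_keys() {
  file="$1"
  for key in data_dir manifests_dir providers_dir; do
    env_name="AMP_CONFIG_$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]')"
    if ! printenv "$env_name" >/dev/null && ! grep -q "^$key *=" "$file"; then
      printf '%s\n' "$key"
    fi
  done
}
```

Running missing_required_keys "$AMP_CONFIG" prints nothing when all three directories are configured.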

Next Steps

  • Metadata Database: Configure PostgreSQL for metadata storage
  • Storage: Set up object storage backends
  • Telemetry: Configure OpenTelemetry and Grafana
