This guide covers best practices for deploying Amp in production environments, including high availability setup, monitoring, security, backup strategies, and performance tuning.
Never use solo mode in production. Always use distributed mode with separate server, worker, and controller components.

Production Deployment Checklist

Use this checklist to ensure your production deployment is complete:
1. Infrastructure

  • PostgreSQL database with adequate resources
  • Object storage configured (S3, GCS, Azure Blob, or local filesystem)
  • Network connectivity between all components
  • Load balancers for server instances
2. Security

  • Controller deployed in private network
  • Database credentials in secrets manager
  • Object store credentials secured
  • TLS/SSL configured for all endpoints
  • Firewall rules limiting access
3. High Availability

  • Multiple server instances behind load balancer
  • Multiple worker instances for failover
  • PostgreSQL replication configured
  • Object store with redundancy
4. Monitoring

  • OpenTelemetry tracing configured
  • Metrics collection enabled
  • Alerting rules configured
  • Log aggregation set up
5. Performance

  • Compaction enabled in configuration
  • Garbage collection active
  • Memory limits configured
  • Database connection pool sized appropriately
6. Operational

  • Backup strategy implemented
  • Disaster recovery plan documented
  • Runbooks created
  • On-call rotation established

Architecture for High Availability

                    ┌─────────────────┐
                    │  Load Balancer  │
                    │  (Flight/JSONL) │
                    └────────┬────────┘

          ┌──────────────────┼──────────────────┐
          │                  │                  │
    ┌─────▼─────┐     ┌─────▼─────┐     ┌─────▼─────┐
    │Server (1) │     │Server (2) │     │Server (3) │
    │1602, 1603 │     │1602, 1603 │     │1602, 1603 │
    └───────────┘     └───────────┘     └───────────┘

    ┌──────────────────────────────────────────────┐
    │           Controller (Private)               │
    │              Port 1610                       │
    └──────────────────────────────────────────────┘

    ┌───────────┐     ┌───────────┐     ┌───────────┐
    │Worker (1) │     │Worker (2) │     │Worker (3) │
    │           │     │           │     │           │
    └───────────┘     └───────────┘     └───────────┘

         ┌────────────────────────────────┐
         │   PostgreSQL (Primary)         │
         │   + Read Replicas (optional)   │
         └────────────────────────────────┘

         ┌────────────────────────────────┐
         │   Object Store (S3/GCS/Azure)  │
         │   Multi-region replication     │
         └────────────────────────────────┘

Security Best Practices

Network Isolation

The controller provides administrative capabilities and must be secured:
  • Deploy in private network only
  • Never expose to public internet
  • Use VPN or bastion host for access
  • Restrict source IPs via firewall rules
  • Place in management/admin subnet
Example AWS Security Group:
resource "aws_security_group" "controller" {
  ingress {
    from_port   = 1610
    to_port     = 1610
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]  # Internal only
  }
}
Query servers can be public-facing but need protection:
  • Deploy in public subnet/DMZ if needed
  • Implement rate limiting
  • Use read-only database credentials
  • Configure query timeouts
  • Monitor for abuse
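Rate limiting and query timeouts can both be enforced at a reverse proxy in front of the JSONL port. This Nginx sketch is illustrative (the zone name, limits, hostnames, and certificate paths are assumptions, not verified Amp configuration):

```nginx
# In the http {} context: allow at most 10 requests/second per client IP.
limit_req_zone $binary_remote_addr zone=amp_queries:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name query.example.com;

    ssl_certificate     /etc/ssl/certs/query.crt;
    ssl_certificate_key /etc/ssl/private/query.key;

    location / {
        limit_req zone=amp_queries burst=20 nodelay;  # absorb short bursts
        proxy_read_timeout 30s;                       # cap long-running queries
        proxy_pass http://amp-server-1:1603;          # JSONL HTTP endpoint
    }
}
```

Requests beyond the burst allowance receive HTTP 503 by default, which keeps abusive clients from starving the query servers.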
Consider adding authentication via:
  • API Gateway (Kong, Nginx, Traefik)
  • Mutual TLS (mTLS)
  • JWT token validation
Workers are internal components:
  • Deploy in private subnet
  • Only outbound connections needed
  • Require database write access
  • Need object store write access
  • No public IP addresses required
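By analogy with the controller security group example above (a sketch, not a verified module), the outbound-only posture of workers maps to a security group with no ingress rules at all:

```hcl
resource "aws_security_group" "worker" {
  # No ingress blocks: workers accept no inbound connections.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"           # all protocols
    cidr_blocks = ["0.0.0.0/0"]  # outbound to database, object store, controller
  }
}
```

In practice the egress CIDR can be tightened further to just the database, object store, and controller subnets.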

Secrets Management

Never commit secrets to version control. Use a secrets management system:
apiVersion: v1
kind: Secret
metadata:
  name: amp-secrets
type: Opaque
data:
  db-url: <base64-encoded-url>
  aws-secret-key: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: amp-worker
spec:
  template:
    spec:
      containers:
      - name: worker
        env:
        - name: AMP_CONFIG_METADATA_DB__URL
          valueFrom:
            secretKeyRef:
              name: amp-secrets
              key: db-url

TLS/SSL Configuration

Secure all network communication:
# config.toml

# PostgreSQL with SSL
[metadata_db]
url = "postgresql://user:password@host:5432/amp?sslmode=require"

# For object store (S3)
# AWS SDK uses TLS by default
For query endpoints, terminate TLS at load balancer or reverse proxy:
# Nginx reverse proxy for Flight server
upstream amp_flight {
    server amp-server-1:1602;
    server amp-server-2:1602;
    server amp-server-3:1602;
}

server {
    listen 443 ssl http2;
    server_name flight.example.com;
    
    ssl_certificate /etc/ssl/certs/flight.crt;
    ssl_certificate_key /etc/ssl/private/flight.key;
    
    location / {
        grpc_pass grpc://amp_flight;
    }
}

High Availability Setup

Component Redundancy

| Component  | Min Instances | Recommended            | Notes                           |
| ---------- | ------------- | ---------------------- | ------------------------------- |
| Server     | 2             | 3+                     | Behind load balancer            |
| Controller | 1             | 1-2                    | Active-passive or active-active |
| Worker     | 2             | 3+                     | For job failover                |
| PostgreSQL | 1             | 1 primary + 2 replicas | With automatic failover         |
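One concrete way to place the recommended 3+ server instances behind a load balancer is an HAProxy backend with health checks. This is an illustrative sketch; server names and addresses are assumptions, and `mode tcp` is used because the Flight port carries gRPC:

```haproxy
# Illustrative HAProxy backend for the Flight query port (1602).
backend amp_flight
    mode tcp                 # gRPC/Flight is TCP-level traffic
    balance roundrobin
    server server1 10.0.1.11:1602 check
    server server2 10.0.1.12:1602 check
    server server3 10.0.1.13:1602 check
```

The `check` keyword enables periodic health checks, so a failed server instance is removed from rotation automatically.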

PostgreSQL High Availability

Use managed PostgreSQL with HA features or set up replication:
resource "aws_db_instance" "amp" {
  identifier     = "amp-postgres"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.r6g.2xlarge"
  
  multi_az               = true  # High availability
  backup_retention_period = 7
  backup_window          = "03:00-04:00"
  maintenance_window     = "sun:04:00-sun:05:00"
  
  allocated_storage     = 500
  storage_type         = "gp3"
  storage_encrypted    = true
}

Object Store Redundancy

Configure multi-region replication:
resource "aws_s3_bucket" "amp_data" {
  bucket = "amp-production-data"
}

resource "aws_s3_bucket_versioning" "amp_data" {
  bucket = aws_s3_bucket.amp_data.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_replication_configuration" "amp_data" {
  bucket = aws_s3_bucket.amp_data.id
  role   = aws_iam_role.replication.arn
  
  rule {
    id     = "replicate-all"
    status = "Enabled"
    
    destination {
      bucket        = aws_s3_bucket.amp_data_backup.arn
      storage_class = "STANDARD_IA"
    }
  }
}

Monitoring and Alerting

OpenTelemetry Configuration

Enable comprehensive observability:
# config.toml
[opentelemetry]
trace_url = "http://otel-collector:4318/v1/traces"
metrics_url = "http://otel-collector:4318/v1/metrics"
trace_ratio = 1.0
metrics_export_interval_secs = 60.0

Key Metrics to Monitor

System resources:
  • CPU Usage: Per component (server, worker, controller)
  • Memory Usage: Track against configured limits
  • Disk I/O: Object store and database throughput
  • Network: Bandwidth usage and latency

Extraction jobs:
  • Active Jobs: Number of extraction jobs running
  • Job Duration: Time to complete jobs
  • Block Processing Rate: Blocks extracted per second
  • Worker Heartbeat: Health signal frequency
  • Failed Jobs: Job failure rate and reasons

Query serving:
  • Query Latency: P50, P95, P99 response times
  • Query Throughput: Queries per second
  • Active Connections: Current query connections
  • Error Rate: Failed queries percentage

Database:
  • Connection Pool: Active/idle connections
  • Query Performance: Slow query log
  • Replication Lag: For read replicas
  • Disk Usage: Database size growth

Object storage:
  • File Count: Parquet files stored
  • Storage Size: Total data volume
  • Write Throughput: Bytes written per second
  • Read Throughput: Query read performance

Alert Rules

Configure alerts for critical conditions:
# Prometheus alert rules
groups:
  - name: amp_alerts
    rules:
      # Worker health
      - alert: WorkerDown
        expr: up{job="amp-worker"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.instance }} is down"
      
      # Job failures
      - alert: HighJobFailureRate
        expr: rate(amp_job_failures_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High job failure rate detected"
      
      # Query latency
      - alert: HighQueryLatency
        expr: histogram_quantile(0.95, rate(amp_query_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile query latency above 10s"
      
      # Database connections
      - alert: DatabaseConnectionPoolExhausted
        expr: amp_db_connections_active / amp_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"

Backup and Recovery

Backup Strategy

1. PostgreSQL Backups

Frequency: Daily full backups + continuous WAL archiving
# Automated backup with pg_dump
pg_dump -h db-host -U amp_user -F c -f amp-backup-$(date +%Y%m%d).dump amp

# Upload to S3
aws s3 cp amp-backup-$(date +%Y%m%d).dump s3://amp-backups/postgres/
Retention: keep daily backups for 30 days and monthly backups for 12 months.
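The retention policy can be enforced with a small cleanup step after each upload. This is a sketch assuming dumps are staged in a local directory before the S3 copy; the paths and filenames are illustrative:

```shell
# Demo setup: a staging directory with one dump older than the 30-day
# window and one recent dump. In production, point BACKUP_DIR at the
# directory where pg_dump writes before the S3 upload.
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/amp-backup-20240101.dump"
touch -d '40 days ago' "$BACKUP_DIR/amp-backup-20240101.dump"   # GNU touch
touch "$BACKUP_DIR/amp-backup-today.dump"

# Enforce retention: remove local dumps last modified more than 30 days ago.
find "$BACKUP_DIR" -name 'amp-backup-*.dump' -mtime +30 -delete
```

Run this from the same cron job as the backup itself so local staging never grows unbounded; S3-side retention is handled separately by lifecycle policies.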
2. Object Store Backups

Parquet files in object store are immutable and versioned:
  • Enable versioning on S3/GCS bucket
  • Configure lifecycle policies for old versions
  • Use cross-region replication for disaster recovery
# S3 lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket amp-production-data \
  --lifecycle-configuration file://lifecycle.json
3. Configuration Backups

Store configuration in version control:
  • config.toml files
  • Provider configurations
  • Dataset manifests
Exclude sensitive values (use environment variables).
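As a sketch of the env-var approach outside Kubernetes (the secret file path and demo value are illustrative), a startup script can read the secret and export the `AMP_CONFIG_METADATA_DB__URL` override used elsewhere in this guide, so the committed config.toml never contains the password:

```shell
# Demo setup: write a throwaway secret file. In production this would be
# a mounted Docker/Kubernetes secret, e.g. under /run/secrets/.
SECRET_FILE=$(mktemp)
printf 'example-password' > "$SECRET_FILE"

# Read the secret at startup and export the env-var override, which takes
# precedence over the value in config.toml.
DB_PASSWORD=$(cat "$SECRET_FILE")
export AMP_CONFIG_METADATA_DB__URL="postgresql://amp:${DB_PASSWORD}@db:5432/amp"
rm -f "$SECRET_FILE"
```

The same pattern works in a systemd unit via `EnvironmentFile=` pointing at a root-only readable file.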

Disaster Recovery

  • Recovery Time Objective (RTO): the target time to restore service after a failure
  • Recovery Point Objective (RPO): the maximum acceptable window of data loss

Recovery Procedures

  1. Identify failure (primary database down)
  2. Promote read replica to primary (if using HA setup)
  3. Update connection strings to new primary
  4. Restart all Amp components to reconnect
  5. Verify worker heartbeats and job resumption
# Update config for new database
export AMP_CONFIG_METADATA_DB__URL="postgresql://new-primary:5432/amp"

# Restart components
systemctl restart amp-controller
systemctl restart amp-server
systemctl restart amp-worker@*

Performance Tuning

Configuration Optimization

# config.toml - Production optimized

# Memory limits
max_mem_mb = 16384          # 16GB global limit
query_max_mem_mb = 4096     # 4GB per query
spill_location = ["/tmp/amp-spill"]

# Timing
poll_interval_secs = 1.0
microbatch_max_interval = 50000
server_microbatch_max_interval = 1000
keep_alive_interval = 60

# Database
[metadata_db]
url = "postgresql://amp:password@db:5432/amp"
pool_size = 20              # Increase for high concurrency
auto_migrate = false        # Disable in production

# Writer optimization
[writer]
compression = "zstd(3)"     # Higher compression for production
bloom_filters = true        # Enable for better query performance
cache_size_mb = 2048        # Larger metadata cache
max_row_group_mb = 512
segment_flush_interval_secs = 300.0

# Compaction (essential for production)
[writer.compactor]
active = true
metadata_concurrency = 4
write_concurrency = 4
min_interval = 1.0
cooldown_duration = 512.0
overflow = "1.2"
bytes = 2147483648          # 2GB target

# Garbage collection
[writer.collector]
active = true
min_interval = 30.0
deletion_lock_duration = 3600.0  # 1 hour

PostgreSQL Tuning

-- postgresql.conf optimizations

-- Memory
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 32MB
maintenance_work_mem = 2GB

-- Checkpoints
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100

-- Query planner
random_page_cost = 1.1
effective_io_concurrency = 200

-- Connections
max_connections = 200

Load Testing

Test your deployment before going live:
# Query load test with Apache Bench
ab -n 10000 -c 50 -p query.json \
  -T application/json \
  http://localhost:1603/

# Worker stress test - deploy multiple datasets
for i in {1..10}; do
  curl -X POST http://localhost:1610/datasets/test/dataset-$i/versions/1.0.0/deploy
done

# Monitor metrics during load
curl http://localhost:1610/metrics

Operational Runbooks

Common Operations

Adding a worker:
  1. Deploy new worker instance with unique node ID
  2. Ensure connectivity to database and object store
  3. Start worker process
  4. Verify registration via Admin API
# Start new worker
ampd worker --config config.toml --node-id worker-new-01

# Verify registration
curl http://localhost:1610/locations | jq '.[] | select(.node_id == "worker-new-01")'
Removing a worker:
  1. Stop assigning new jobs (graceful shutdown)
  2. Wait for current jobs to complete
  3. Send SIGTERM to worker process
  4. Verify deregistration
# Graceful shutdown
kill -TERM $(pidof ampd)

# Verify worker removed
curl http://localhost:1610/locations
Scaling query servers:
  1. Deploy additional server instances
  2. Configure same backend (database + object store)
  3. Add to load balancer pool
  4. Verify health checks pass
# Start additional server
ampd server --config config.toml

# Add to load balancer (example with HAProxy)
echo "server server4 10.0.1.14:1602 check" >> /etc/haproxy/haproxy.cfg
systemctl reload haproxy
Database maintenance (during a maintenance window):
  1. Stop controller to prevent new jobs
  2. Allow workers to complete current jobs
  3. Perform database maintenance
  4. Restart components
# Stop job scheduling
systemctl stop amp-controller

# Wait for workers to finish
curl http://localhost:1610/jobs | jq '.[] | select(.status == "running")'

# Perform maintenance
psql -h db-host -U postgres -c "VACUUM ANALYZE;"

# Restart
systemctl start amp-controller

Cost Optimization

Resource Right-Sizing

| Component  | CPU        | Memory   | Storage      | Notes                     |
| ---------- | ---------- | -------- | ------------ | ------------------------- |
| Server     | 4-8 cores  | 8-16 GB  | Minimal      | Scale horizontally        |
| Controller | 2-4 cores  | 4-8 GB   | Minimal      | Usually single instance   |
| Worker     | 8-16 cores | 16-32 GB | Temp storage | Scale based on throughput |
| PostgreSQL | 8-16 cores | 32-64 GB | 500GB-2TB    | Critical for metadata     |

Object Storage Optimization

  • Use lifecycle policies to move old data to cheaper tiers
  • Enable compression (already configured in writer)
  • Consider intelligent tiering for infrequently accessed data
Example S3 lifecycle policy (lifecycle.json):
{
  "Rules": [
    {
      "Id": "MoveToIA",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    }
  ]
}

Next Steps

  • Monitoring Setup: configure comprehensive monitoring
  • Configuration Reference: complete configuration options
  • Troubleshooting: resolve common production issues
