This guide covers best practices for deploying Amp in production environments, including high availability setup, monitoring, security, backup strategies, and performance tuning.
Never use solo mode in production. Always use distributed mode with separate server, worker, and controller components.

Production Deployment Checklist

Use this checklist to ensure your production deployment is complete:
1. Infrastructure

  • PostgreSQL database with adequate resources
  • Object storage configured (S3, GCS, Azure Blob, or local filesystem)
  • Network connectivity between all components
  • Load balancers for server instances
2. Security

  • Controller deployed in private network
  • Database credentials in secrets manager
  • Object store credentials secured
  • TLS/SSL configured for all endpoints
  • Firewall rules limiting access
3. High Availability

  • Multiple server instances behind load balancer
  • Multiple worker instances for failover
  • PostgreSQL replication configured
  • Object store with redundancy
4. Monitoring

  • OpenTelemetry tracing configured
  • Metrics collection enabled
  • Alerting rules configured
  • Log aggregation set up
5. Performance

  • Compaction enabled in configuration
  • Garbage collection active
  • Memory limits configured
  • Database connection pool sized appropriately
6. Operational

  • Backup strategy implemented
  • Disaster recovery plan documented
  • Runbooks created
  • On-call rotation established

Architecture for High Availability

                    ┌─────────────────┐
                    │  Load Balancer  │
                    │  (Flight/JSONL) │
                    └────────┬────────┘

          ┌──────────────────┼──────────────────┐
          │                  │                  │
    ┌─────▼─────┐     ┌─────▼─────┐     ┌─────▼─────┐
    │Server (1) │     │Server (2) │     │Server (3) │
    │1602, 1603 │     │1602, 1603 │     │1602, 1603 │
    └───────────┘     └───────────┘     └───────────┘

    ┌──────────────────────────────────────────────┐
    │           Controller (Private)               │
    │              Port 1610                       │
    └──────────────────────────────────────────────┘

    ┌───────────┐     ┌───────────┐     ┌───────────┐
    │Worker (1) │     │Worker (2) │     │Worker (3) │
    │           │     │           │     │           │
    └───────────┘     └───────────┘     └───────────┘

         ┌────────────────────────────────┐
         │   PostgreSQL (Primary)         │
         │   + Read Replicas (optional)   │
         └────────────────────────────────┘

         ┌────────────────────────────────┐
         │   Object Store (S3/GCS/Azure)  │
         │   Multi-region replication     │
         └────────────────────────────────┘

Security Best Practices

Network Isolation

The controller provides administrative capabilities and must be secured:
  • Deploy in private network only
  • Never expose to public internet
  • Use VPN or bastion host for access
  • Restrict source IPs via firewall rules
  • Place in management/admin subnet
Example AWS Security Group:
resource "aws_security_group" "controller" {
  ingress {
    from_port   = 1610
    to_port     = 1610
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]  # Internal only
  }
}
Query servers can be public-facing but need protection:
  • Deploy in public subnet/DMZ if needed
  • Implement rate limiting
  • Use read-only database credentials
  • Configure query timeouts
  • Monitor for abuse
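Rate limiting and query timeouts can both be enforced at a reverse proxy in front of the JSONL port. This Nginx sketch is illustrative (the zone name, limits, hostnames, and certificate paths are assumptions, not verified Amp configuration):

```nginx
# In the http {} context: allow at most 10 requests/second per client IP.
limit_req_zone $binary_remote_addr zone=amp_queries:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name query.example.com;

    ssl_certificate     /etc/ssl/certs/query.crt;
    ssl_certificate_key /etc/ssl/private/query.key;

    location / {
        limit_req zone=amp_queries burst=20 nodelay;  # absorb short bursts
        proxy_read_timeout 30s;                       # cap long-running queries
        proxy_pass http://amp-server-1:1603;          # JSONL HTTP endpoint
    }
}
```

Requests beyond the burst allowance receive HTTP 503 by default, which keeps abusive clients from starving the query servers.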
Consider adding authentication via:
  • API Gateway (Kong, Nginx, Traefik)
  • Mutual TLS (mTLS)
  • JWT token validation
Workers are internal components:
  • Deploy in private subnet
  • Only outbound connections needed
  • Require database write access
  • Need object store write access
  • No public IP addresses required
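By analogy with the controller security group example above (a sketch, not a verified module), the outbound-only posture of workers maps to a security group with no ingress rules at all:

```hcl
resource "aws_security_group" "worker" {
  # No ingress blocks: workers accept no inbound connections.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"           # all protocols
    cidr_blocks = ["0.0.0.0/0"]  # outbound to database, object store, controller
  }
}
```

In practice the egress CIDR can be tightened further to just the database, object store, and controller subnets.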

Secrets Management

Never commit secrets to version control. Use a secrets management system:
apiVersion: v1
kind: Secret
metadata:
  name: amp-secrets
type: Opaque
data:
  db-url: <base64-encoded-url>
  aws-secret-key: <base64-encoded-key>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: amp-worker
spec:
  template:
    spec:
      containers:
      - name: worker
        env:
        - name: AMP_CONFIG_METADATA_DB__URL
          valueFrom:
            secretKeyRef:
              name: amp-secrets
              key: db-url

TLS/SSL Configuration

Secure all network communication:
# config.toml

# PostgreSQL with SSL
[metadata_db]
url = "postgresql://user:password@host:5432/amp?sslmode=require"

# For object store (S3)
# AWS SDK uses TLS by default
For query endpoints, terminate TLS at load balancer or reverse proxy:
# Nginx reverse proxy for Flight server
upstream amp_flight {
    server amp-server-1:1602;
    server amp-server-2:1602;
    server amp-server-3:1602;
}

server {
    listen 443 ssl http2;
    server_name flight.example.com;
    
    ssl_certificate /etc/ssl/certs/flight.crt;
    ssl_certificate_key /etc/ssl/private/flight.key;
    
    location / {
        grpc_pass grpc://amp_flight;
    }
}

High Availability Setup

Component Redundancy

| Component  | Min Instances | Recommended            | Notes                           |
| ---------- | ------------- | ---------------------- | ------------------------------- |
| Server     | 2             | 3+                     | Behind load balancer            |
| Controller | 1             | 1-2                    | Active-passive or active-active |
| Worker     | 2             | 3+                     | For job failover                |
| PostgreSQL | 1             | 1 primary + 2 replicas | With automatic failover         |
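One concrete way to place the recommended 3+ server instances behind a load balancer is an HAProxy backend with health checks. This is an illustrative sketch; server names and addresses are assumptions, and `mode tcp` is used because the Flight port carries gRPC:

```haproxy
# Illustrative HAProxy backend for the Flight query port (1602).
backend amp_flight
    mode tcp                 # gRPC/Flight is TCP-level traffic
    balance roundrobin
    server server1 10.0.1.11:1602 check
    server server2 10.0.1.12:1602 check
    server server3 10.0.1.13:1602 check
```

The `check` keyword enables periodic health checks, so a failed server instance is removed from rotation automatically.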

PostgreSQL High Availability

Use managed PostgreSQL with HA features or set up replication:
resource "aws_db_instance" "amp" {
  identifier     = "amp-postgres"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.r6g.2xlarge"
  
  multi_az               = true  # High availability
  backup_retention_period = 7
  backup_window          = "03:00-04:00"
  maintenance_window     = "sun:04:00-sun:05:00"
  
  allocated_storage     = 500
  storage_type         = "gp3"
  storage_encrypted    = true
}

Object Store Redundancy

Configure multi-region replication:
resource "aws_s3_bucket" "amp_data" {
  bucket = "amp-production-data"
}

resource "aws_s3_bucket_versioning" "amp_data" {
  bucket = aws_s3_bucket.amp_data.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_replication_configuration" "amp_data" {
  bucket = aws_s3_bucket.amp_data.id
  role   = aws_iam_role.replication.arn
  
  rule {
    id     = "replicate-all"
    status = "Enabled"
    
    destination {
      bucket        = aws_s3_bucket.amp_data_backup.arn
      storage_class = "STANDARD_IA"
    }
  }
}

Monitoring and Alerting

OpenTelemetry Configuration

Enable comprehensive observability:
# config.toml
[opentelemetry]
trace_url = "http://otel-collector:4318/v1/traces"
metrics_url = "http://otel-collector:4318/v1/metrics"
trace_ratio = 1.0
metrics_export_interval_secs = 60.0

Key Metrics to Monitor

System resources:
  • CPU Usage: Per component (server, worker, controller)
  • Memory Usage: Track against configured limits
  • Disk I/O: Object store and database throughput
  • Network: Bandwidth usage and latency

Extraction jobs:
  • Active Jobs: Number of extraction jobs running
  • Job Duration: Time to complete jobs
  • Block Processing Rate: Blocks extracted per second
  • Worker Heartbeat: Health signal frequency
  • Failed Jobs: Job failure rate and reasons

Query serving:
  • Query Latency: P50, P95, P99 response times
  • Query Throughput: Queries per second
  • Active Connections: Current query connections
  • Error Rate: Failed queries percentage

Database:
  • Connection Pool: Active/idle connections
  • Query Performance: Slow query log
  • Replication Lag: For read replicas
  • Disk Usage: Database size growth

Object storage:
  • File Count: Parquet files stored
  • Storage Size: Total data volume
  • Write Throughput: Bytes written per second
  • Read Throughput: Query read performance

Alert Rules

Configure alerts for critical conditions:
# Prometheus alert rules
groups:
  - name: amp_alerts
    rules:
      # Worker health
      - alert: WorkerDown
        expr: up{job="amp-worker"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.instance }} is down"
      
      # Job failures
      - alert: HighJobFailureRate
        expr: rate(amp_job_failures_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High job failure rate detected"
      
      # Query latency
      - alert: HighQueryLatency
        expr: histogram_quantile(0.95, rate(amp_query_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile query latency above 10s"
      
      # Database connections
      - alert: DatabaseConnectionPoolExhausted
        expr: amp_db_connections_active / amp_db_connections_max > 0.9
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted"

Backup and Recovery

Backup Strategy

1. PostgreSQL Backups

Frequency: Daily full backups + continuous WAL archiving
# Automated backup with pg_dump
pg_dump -h db-host -U amp_user -F c -f amp-backup-$(date +%Y%m%d).dump amp

# Upload to S3
aws s3 cp amp-backup-$(date +%Y%m%d).dump s3://amp-backups/postgres/
Retention: keep daily backups for 30 days and monthly backups for 12 months.
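The retention policy can be enforced with a small cleanup step after each upload. This is a sketch assuming dumps are staged in a local directory before the S3 copy; the paths and filenames are illustrative:

```shell
# Demo setup: a staging directory with one dump older than the 30-day
# window and one recent dump. In production, point BACKUP_DIR at the
# directory where pg_dump writes before the S3 upload.
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/amp-backup-20240101.dump"
touch -d '40 days ago' "$BACKUP_DIR/amp-backup-20240101.dump"   # GNU touch
touch "$BACKUP_DIR/amp-backup-today.dump"

# Enforce retention: remove local dumps last modified more than 30 days ago.
find "$BACKUP_DIR" -name 'amp-backup-*.dump' -mtime +30 -delete
```

Run this from the same cron job as the backup itself so local staging never grows unbounded; S3-side retention is handled separately by lifecycle policies.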
2. Object Store Backups

Parquet files in object store are immutable and versioned:
  • Enable versioning on S3/GCS bucket
  • Configure lifecycle policies for old versions
  • Use cross-region replication for disaster recovery
# S3 lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket amp-production-data \
  --lifecycle-configuration file://lifecycle.json
3. Configuration Backups

Store configuration in version control:
  • config.toml files
  • Provider configurations
  • Dataset manifests
Exclude sensitive values (use environment variables).
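As a sketch of the env-var approach outside Kubernetes (the secret file path and demo value are illustrative), a startup script can read the secret and export the `AMP_CONFIG_METADATA_DB__URL` override used elsewhere in this guide, so the committed config.toml never contains the password:

```shell
# Demo setup: write a throwaway secret file. In production this would be
# a mounted Docker/Kubernetes secret, e.g. under /run/secrets/.
SECRET_FILE=$(mktemp)
printf 'example-password' > "$SECRET_FILE"

# Read the secret at startup and export the env-var override, which takes
# precedence over the value in config.toml.
DB_PASSWORD=$(cat "$SECRET_FILE")
export AMP_CONFIG_METADATA_DB__URL="postgresql://amp:${DB_PASSWORD}@db:5432/amp"
rm -f "$SECRET_FILE"
```

The same pattern works in a systemd unit via `EnvironmentFile=` pointing at a root-only readable file.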

Disaster Recovery

  • Recovery Time Objective (RTO): the target time to restore service after a failure
  • Recovery Point Objective (RPO): the maximum acceptable window of data loss

Recovery Procedures

  1. Identify failure (primary database down)
  2. Promote read replica to primary (if using HA setup)
  3. Update connection strings to new primary
  4. Restart all Amp components to reconnect
  5. Verify worker heartbeats and job resumption
# Update config for new database
export AMP_CONFIG_METADATA_DB__URL="postgresql://new-primary:5432/amp"

# Restart components
systemctl restart amp-controller
systemctl restart amp-server
systemctl restart amp-worker@*

Performance Tuning

Configuration Optimization

# config.toml - Production optimized

# Memory limits
max_mem_mb = 16384          # 16GB global limit
query_max_mem_mb = 4096     # 4GB per query
spill_location = ["/tmp/amp-spill"]

# Timing
poll_interval_secs = 1.0
microbatch_max_interval = 50000
server_microbatch_max_interval = 1000
keep_alive_interval = 60

# Database
[metadata_db]
url = "postgresql://amp:password@db:5432/amp"
pool_size = 20              # Increase for high concurrency
auto_migrate = false        # Disable in production

# Writer optimization
[writer]
compression = "zstd(3)"     # Higher compression for production
bloom_filters = true        # Enable for better query performance
cache_size_mb = 2048        # Larger metadata cache
max_row_group_mb = 512
segment_flush_interval_secs = 300.0

# Compaction (essential for production)
[writer.compactor]
active = true
metadata_concurrency = 4
write_concurrency = 4
min_interval = 1.0
cooldown_duration = 512.0
overflow = "1.2"
bytes = 2147483648          # 2GB target

# Garbage collection
[writer.collector]
active = true
min_interval = 30.0
deletion_lock_duration = 3600.0  # 1 hour

PostgreSQL Tuning

-- postgresql.conf optimizations

-- Memory
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 32MB
maintenance_work_mem = 2GB

-- Checkpoints
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100

-- Query planner
random_page_cost = 1.1
effective_io_concurrency = 200

-- Connections
max_connections = 200

Load Testing

Test your deployment before going live:
# Query load test with Apache Bench
ab -n 10000 -c 50 -p query.json \
  -T application/json \
  http://localhost:1603/

# Worker stress test - deploy multiple datasets
for i in {1..10}; do
  curl -X POST http://localhost:1610/datasets/test/dataset-$i/versions/1.0.0/deploy
done

# Monitor metrics during load
curl http://localhost:1610/metrics

Operational Runbooks

Common Operations

Adding a worker:
  1. Deploy new worker instance with unique node ID
  2. Ensure connectivity to database and object store
  3. Start worker process
  4. Verify registration via Admin API
# Start new worker
ampd worker --config config.toml --node-id worker-new-01

# Verify registration
curl http://localhost:1610/locations | jq '.[] | select(.node_id == "worker-new-01")'
Removing a worker:
  1. Stop assigning new jobs (graceful shutdown)
  2. Wait for current jobs to complete
  3. Send SIGTERM to worker process
  4. Verify deregistration
# Graceful shutdown
kill -TERM $(pidof ampd)

# Verify worker removed
curl http://localhost:1610/locations
Scaling query servers:
  1. Deploy additional server instances
  2. Configure same backend (database + object store)
  3. Add to load balancer pool
  4. Verify health checks pass
# Start additional server
ampd server --config config.toml

# Add to load balancer (example with HAProxy)
echo "server server4 10.0.1.14:1602 check" >> /etc/haproxy/haproxy.cfg
systemctl reload haproxy
Database maintenance (during a maintenance window):
  1. Stop controller to prevent new jobs
  2. Allow workers to complete current jobs
  3. Perform database maintenance
  4. Restart components
# Stop job scheduling
systemctl stop amp-controller

# Wait for workers to finish
curl http://localhost:1610/jobs | jq '.[] | select(.status == "running")'

# Perform maintenance
psql -h db-host -U postgres -c "VACUUM ANALYZE;"

# Restart
systemctl start amp-controller

Cost Optimization

Resource Right-Sizing

| Component  | CPU        | Memory   | Storage      | Notes                     |
| ---------- | ---------- | -------- | ------------ | ------------------------- |
| Server     | 4-8 cores  | 8-16 GB  | Minimal      | Scale horizontally        |
| Controller | 2-4 cores  | 4-8 GB   | Minimal      | Usually single instance   |
| Worker     | 8-16 cores | 16-32 GB | Temp storage | Scale based on throughput |
| PostgreSQL | 8-16 cores | 32-64 GB | 500GB-2TB    | Critical for metadata     |

Object Storage Optimization

  • Use lifecycle policies to move old data to cheaper tiers
  • Enable compression (already configured in writer)
  • Consider intelligent tiering for infrequently accessed data
Example S3 lifecycle policy (lifecycle.json):
{
  "Rules": [
    {
      "Id": "MoveToIA",
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 90,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 365,
          "StorageClass": "GLACIER"
        }
      ]
    }
  ]
}

Next Steps

  • Monitoring Setup: configure comprehensive monitoring
  • Configuration Reference: complete configuration options
  • Troubleshooting: resolve common production issues
