Upgrading Garage

Garage is a stateful clustered application in which all nodes communicate and share data structures. This makes upgrades more complex than for stateless applications, so they require careful planning and execution.

Understanding Upgrade Types

There are two types of upgrades:

Minor upgrade: Protocols and data structures remain the same
Major upgrade: Protocols or data structures changed

Version Numbering

You can identify the upgrade type from the version number:

Major upgrade: First nonzero component changes (e.g., v0.7.2 → v0.8.0)
Minor upgrade: First nonzero component stays the same (e.g., v0.8.0 → v0.8.1)
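
The version rule above can be sketched in a few lines of Python. This is an illustrative helper, not part of Garage: the function names are made up for this example, and it assumes plain numeric versions such as v0.8.1 (no pre-release suffixes) and an upgrade to a newer version.

```python
def parse_version(text):
    """Turn 'v0.8.1' or '0.8.1' into a tuple of ints such as (0, 8, 1)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def series(version):
    """Return the components up to and including the first nonzero one."""
    for index, component in enumerate(version):
        if component != 0:
            return version[: index + 1]
    return version  # all-zero version: treat the whole tuple as the series


def upgrade_type(old, new):
    """Classify an upgrade as 'major' or 'minor' by the rule above."""
    same = series(parse_version(old)) == series(parse_version(new))
    return "minor" if same else "major"


print(upgrade_type("v0.7.2", "v0.8.0"))  # major
print(upgrade_type("v0.8.0", "v0.8.1"))  # minor
```

Comparing everything up to the first nonzero component also classifies v0.9.x → v1.0.x as a major upgrade, because the leading component itself changes.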

Major upgrades must be run between contiguous versions only.

Supported:

v0.7.1 → v0.8.0 ✓
v0.7.0 → v0.8.2 ✓

Not supported:

v0.6.0 → v0.8.0 ✗
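
A pre-flight check for the contiguity rule might look like the sketch below. The ordered list of release series is an assumption assembled from the versions mentioned in this guide; it is not queried from Garage and should be adjusted to the real release history before use. All names are hypothetical.

```python
# Release series in order, as mentioned in this guide. This list is an
# assumption: extend or correct it to match the actual release history.
SERIES_ORDER = [(0, 6), (0, 7), (0, 8), (0, 9), (1,)]


def series(text):
    """Map 'v0.8.2' to (0, 8): components up to the first nonzero one."""
    version = tuple(int(part) for part in text.lstrip("v").split("."))
    for index, component in enumerate(version):
        if component != 0:
            return version[: index + 1]
    return version


def is_supported_upgrade(old, new):
    """True if old -> new stays in one series or moves to the next one."""
    try:
        step = SERIES_ORDER.index(series(new)) - SERIES_ORDER.index(series(old))
    except ValueError:
        return False  # unknown series: refuse rather than guess
    return step in (0, 1)


for old, new in [("v0.7.1", "v0.8.0"), ("v0.7.0", "v0.8.2"), ("v0.6.0", "v0.8.0")]:
    print(old, "->", new, "supported" if is_supported_upgrade(old, new) else "not supported")
```

To go from v0.6.0 to v0.8.0 under this rule, the upgrade would be done in two steps, passing through a v0.7.x release.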

Monitoring Current Versions

The garage_build_info Prometheus metric shows which Garage versions are running in your cluster:

garage_build_info{version="1.0.0"} 1

See the Monitoring guide for more details.
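
During an upgrade it can help to summarize which versions are still running. The sketch below counts versions in Prometheus text output that has already been collected from the nodes. How the text is fetched is left out, and the sample input is invented for illustration.

```python
import re
from collections import Counter

# Matches lines such as: garage_build_info{version="1.0.0"} 1
BUILD_INFO = re.compile(
    r'^garage_build_info\{[^}]*version="([^"]+)"[^}]*\}\s+\S+', re.M
)


def versions_in(metrics_text):
    """Count how many garage_build_info samples report each version."""
    return Counter(BUILD_INFO.findall(metrics_text))


# Hypothetical input: metrics text gathered from three nodes.
sample = "\n".join([
    'garage_build_info{version="1.0.0"} 1',
    'garage_build_info{version="1.0.0"} 1',
    'garage_build_info{version="0.9.4"} 1',
])
counts = versions_in(sample)
print(dict(counts))
if len(counts) > 1:
    print("mixed versions: upgrade still in progress")
```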

Minor Upgrades

Minor upgrades do not require cluster downtime.

Preparation

Read the changelog at git.deuxfleurs.fr/Deuxfleurs/garage/releases
Test on a staging cluster if possible
Check cluster health:

garage repair --all-nodes --yes tables

This runs quickly (less than a minute) and verifies metadata consistency.

Verify repairs completed:

Check daemon logs or run:

garage worker list

Repair workers should be in the Done state.

Upgrade Process

Upgrade nodes one by one:

Stop the Garage daemon
Install the new binary
Update configuration if needed
Restart the daemon
Repeat for next node

The cluster remains available during the entire process. Take your time between nodes to monitor for any issues.
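
The node-by-node loop above could be automated along these lines. This is only a sketch: upgrade_node and is_healthy are hypothetical callables you would write for your own deployment (for example wrappers around your service manager and garage status), and the pause length is an arbitrary default.

```python
import time


def rolling_upgrade(nodes, upgrade_node, is_healthy, pause_seconds=60, sleep=time.sleep):
    """Upgrade nodes one at a time, stopping at the first unhealthy result.

    upgrade_node(node) and is_healthy(node) are supplied by the caller.
    Returns the list of nodes that were upgraded successfully.
    """
    done = []
    for node in nodes:
        upgrade_node(node)  # stop daemon, install binary, update config, restart
        if not is_healthy(node):
            raise RuntimeError(f"{node} unhealthy after upgrade; upgraded so far: {done}")
        done.append(node)
        sleep(pause_seconds)  # leave time between nodes to watch for issues
    return done
```

Stopping at the first unhealthy node keeps the rest of the cluster on the old binary, so a problem affects at most one node at a time.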

Major Upgrades

With careful preparation, a major upgrade can be done with minimal downtime, but the simplest method is to take the cluster offline for the duration of the migration.

Before a major upgrade:

Read the changelog at git.deuxfleurs.fr/Deuxfleurs/garage/releases
Test on a staging cluster first
Locate version-specific migration guides in the “Working Documents” section

Method 1: Full Downtime (Recommended)

This is the safest approach:

Step 1: Preparation

Disable API access (in reverse proxy or configuration)
Verify that the cluster is idle (no active requests)
Check cluster health:

garage repair --all-nodes --yes tables

Step 2: Shutdown and Backup

Stop all Garage nodes
Backup metadata folders on all nodes

Data blocks are immutable and don’t need backing up, but metadata must be preserved to enable rollback.
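
One way to back up a stopped node's metadata folder is to archive it with a timestamp, as in the sketch below. The function name, the archive naming scheme, and the "meta" folder name inside the archive are choices made for this example. The metadata path itself comes from your own Garage configuration.

```python
import tarfile
import time
from pathlib import Path


def backup_metadata(metadata_dir, backup_dir):
    """Archive a stopped node's metadata folder into backup_dir.

    Both paths are placeholders for values from your own deployment.
    Only call this while the Garage daemon on the node is stopped.
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    archive = backup_dir / f"garage-meta-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store the folder under a fixed name so restores are predictable.
        tar.add(str(metadata_dir), arcname="meta")
    return archive
```

Keeping the archive on a different disk or host than the metadata folder protects the rollback path against a local disk failure.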

Step 3: Upgrade

Install new binary and update configuration on all nodes
Start all Garage nodes
Run migrations if needed:

garage migrate

Check version-specific documentation for required migrations.

Step 4: Verification

Check cluster health:

garage repair --all-nodes --yes tables
garage status

Re-enable API access
Monitor cluster load and application behavior

Method 2: Minimal Downtime (Advanced)

Minimal downtime is possible by coordinating a simultaneous restart of all nodes.

The downtime is limited to the time needed for all nodes to stop and start (typically less than a minute).

Step 1: Preparation

Check cluster health:

garage repair --all-nodes --yes tables

Backup metadata on all nodes:

Option A: Snapshot each node individually

Take nodes offline one at a time to back up their metadata folder. You can do all nodes in a single zone at once without impacting global availability.

Never manually copy the metadata folder of a running node.
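
To plan Option A zone by zone, nodes can be grouped by zone so that each batch is stopped, backed up, and restarted together. The node and zone names below are invented, and in practice the mapping would come from your own cluster layout.

```python
from collections import defaultdict


def zone_batches(node_zones):
    """Group node names by zone so one zone can be backed up at a time.

    node_zones maps a node name to its zone; both are illustrative labels.
    """
    batches = defaultdict(list)
    for node, zone in node_zones.items():
        batches[zone].append(node)
    return [sorted(batches[zone]) for zone in sorted(batches)]


# Hypothetical three-zone cluster.
layout = {"node-a": "dc1", "node-b": "dc1", "node-c": "dc2", "node-d": "dc3"}
for batch in zone_batches(layout):
    print("stop, back up, restart together:", batch)
```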

Option B: Use Garage snapshots (v0.9.4+)

garage meta snapshot --all

This creates simultaneous snapshots across all nodes without taking them offline.

If automatic snapshotting is enabled, Garage only keeps the last two snapshots. Consider disabling automatic snapshots until the upgrade is confirmed successful.

Also back up the cluster_layout file from any node (it’s the same on all nodes and can be copied while Garage is running).

Step 2: Stage the New Version

Prepare new binaries and configuration files on all nodes

Step 3: Coordinated Restart

Restart all nodes simultaneously with the new version

If nodes fail to restart simultaneously, some nodes might be temporarily shut out as different RPC protocol versions cannot communicate.
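
A coordinated restart can be triggered from one control machine by issuing the restart to every node in parallel. The sketch below assumes a hypothetical restart_node callable that you provide (for example one that runs a remote command to restart the service), and it reports failures per node so that stragglers can be retried quickly.

```python
from concurrent.futures import ThreadPoolExecutor


def restart_all(nodes, restart_node):
    """Call restart_node(node) for every node at the same time.

    restart_node is supplied by the caller. Returns a dict that maps each
    node to None on success or to the exception raised for that node.
    """
    def attempt(node):
        try:
            restart_node(node)
            return None
        except Exception as error:  # report per-node failures instead of aborting
            return error

    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        results = list(pool.map(attempt, nodes))
    return dict(zip(nodes, results))
```

Any node that appears in the result with an error is still on the old version and should be restarted as soon as possible, since it cannot talk to the upgraded nodes.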

Step 4: Post-Upgrade

Run required migrations per version-specific documentation

Migrations are typically one of two types:

Online: Can run on live nodes during normal operation
Offline: Requires taking nodes offline again one by one

Troubleshooting

Nodes Not Communicating After Upgrade

Cause: Nodes upgraded at different times using incompatible RPC protocols.

Solution: Complete the upgrade on all remaining nodes as quickly as possible.

Migration Fails

Cause: Cluster state incompatible with new version.

Solution:

Review migration logs for specific errors
Restore from metadata backups if necessary
Consult version-specific upgrade documentation

Cluster Performance Degraded

Cause: New version resyncing data or running background migrations.

Solution:

Check garage stats -a for ongoing operations
Monitor garage worker list for active background tasks
Allow time for stabilization (may take hours for large clusters)

Version-Specific Guides

Major version upgrades may require special procedures. Check the “Working Documents” section for:

v0.7.x → v0.8.x migration guide
v0.8.x → v0.9.x migration guide
v0.9.x → v1.0.x migration guide

Best Practices

Always test upgrades in a staging environment first
Back up metadata before any major upgrade
Read the changelog thoroughly
Monitor during and after upgrades
Upgrade during low-traffic periods
Document your upgrade procedure for your specific deployment
Have a rollback plan with metadata backups ready

Rollback Procedure

If an upgrade fails:

Stop all Garage nodes
Restore metadata backups on all nodes
Reinstall previous version binaries
Restore previous configuration files
Restart all nodes
Verify cluster health:

garage status
garage repair --all-nodes --yes tables
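
Restoring a metadata backup on one node might look like the following sketch. It assumes the backup is a tar archive containing a single top-level folder named "meta", which is a convention chosen for this example and not a Garage format. The function name and the ".failed" and ".restore" suffixes are likewise illustrative.

```python
import shutil
import tarfile
from pathlib import Path


def restore_metadata(archive, metadata_dir):
    """Replace metadata_dir with the contents of a backup archive.

    Only call this while the Garage daemon on the node is stopped.
    """
    metadata_dir = Path(metadata_dir)
    staging = metadata_dir.with_name(metadata_dir.name + ".restore")
    failed = metadata_dir.with_name(metadata_dir.name + ".failed")
    if failed.exists():
        raise FileExistsError(f"{failed} already exists; move it away first")
    if staging.exists():
        shutil.rmtree(staging)
    with tarfile.open(str(archive), "r:*") as tar:
        tar.extractall(str(staging))
    if metadata_dir.exists():
        # Keep the failed state aside rather than deleting it outright.
        shutil.move(str(metadata_dir), str(failed))
    shutil.move(str(staging / "meta"), str(metadata_dir))
    shutil.rmtree(staging)
    return metadata_dir
```

Setting the post-upgrade metadata aside instead of deleting it leaves something to inspect if the cause of the failed upgrade needs to be investigated later.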
