Disaster Recovery

Garage is designed to work on old, second-hand hardware, which means drive failures are expected. Fortunately, Garage is fully equipped to handle most common failure scenarios.

Availability Guarantees

With nodes dispersed in 3 or more zones and 3-way replication (recommended):

- The cluster remains fully functional if failures occur in only one zone
- The cluster can handle a complete zone outage (power/Internet)
- No data is lost if failures occur in at most two zones

These guarantees hold only if every node is configured with the correct zone. Check the zone reported for each node with:

garage status

Temporarily disconnected nodes automatically re-synchronize when they come back online. This guide focuses on permanent failures requiring manual intervention.
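The guarantees above can be read as a lookup on the number of zones that contain failed nodes. The sketch below assumes the recommended setup (nodes in 3 or more zones, 3-way replication). The function name `zone_failure_impact` is hypothetical and is not part of the Garage CLI.

```shell
# Hypothetical helper (not a Garage command): maps the number of zones that
# contain failed nodes to the guarantee listed in this section.
# Assumes nodes are spread over 3 or more zones with 3-way replication.
zone_failure_impact() (
  failed_zones=$1
  case "$failed_zones" in
    ''|*[!0-9]*)
      echo "usage: zone_failure_impact <failed_zone_count>" >&2
      exit 2
      ;;
  esac
  if [ "$failed_zones" -le 1 ]; then
    echo "fully functional"
  elif [ "$failed_zones" -eq 2 ]; then
    echo "no data lost (availability not guaranteed)"
  else
    echo "data loss possible"
  fi
)
```

Calling `zone_failure_impact 2` prints the second case: data is safe, but the section above only promises full functionality when failures are confined to one zone.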

Recovery Scenario 1: Removing a Failed Node

Use this when all of the following apply:

- You have no spare parts to replace failed components
- At least 3 nodes remain in the cluster
- You do not plan to replace the node

If you plan to replace the failed hardware, use Scenario 2 or 3 instead. Removing and re-adding nodes causes unnecessary data reshuffling.

Procedure

Get the failed node ID:

garage status

Look for disconnected/unavailable nodes.

Remove the node:

garage layout remove <node_id>

Review the changes:

garage layout show

Apply the changes:

garage layout apply --version <new_version>

Garage will repartition data to ensure 3 copies of everything exist on remaining nodes.
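As a rough illustration, the procedure above can be wrapped in a dry-run helper that checks the "at least 3 nodes remain" precondition and then prints the commands for review instead of running them. The function name `plan_node_removal` and its argument order are assumptions made for this sketch; only the three `garage layout` commands come from the steps above.

```shell
# Hypothetical dry-run helper: prints the removal commands from the procedure
# above so they can be reviewed before being run by hand.
# Arguments: <node_id> <nodes_remaining_after_removal> <new_layout_version>
plan_node_removal() (
  if [ "$#" -ne 3 ]; then
    echo "usage: plan_node_removal <node_id> <remaining_nodes> <new_version>" >&2
    exit 2
  fi
  node_id=$1 remaining=$2 version=$3
  case "$remaining" in
    ''|*[!0-9]*) echo "remaining_nodes must be a number" >&2; exit 2 ;;
  esac
  # Scenario 1 applies only if at least 3 nodes remain in the cluster.
  if [ "$remaining" -lt 3 ]; then
    echo "refusing: only $remaining nodes would remain (at least 3 required)" >&2
    exit 1
  fi
  echo "garage layout remove $node_id"
  echo "garage layout show"
  echo "garage layout apply --version $version"
)
```

The helper never contacts the cluster; the node ID and layout version still have to be read from `garage status` and `garage layout show`.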

Recovery Scenario 2: Data Lost, Metadata Intact

Use this when all of the following apply:

- Only the HDD storing data blocks failed
- The SSD with metadata is still working
- The node configuration is unchanged

This is the easiest recovery scenario - no cluster reconfiguration needed.

Procedure

Replace the failed HDD and mount it at the same path
Restart Garage with the existing configuration
Trigger data resync:

garage repair -a --yes blocks

This re-synchronizes missing data blocks from other nodes.

Monitor progress:

garage stats -a

Look for:

Block manager stats:
  resync queue length: 26541

This number decreases to zero when the node is fully synchronized.

Depending on the amount of data, this process may take several hours to complete.
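Instead of re-running the stats command by hand, the queue length can be extracted and polled. This is a minimal sketch that relies only on the `resync queue length:` line format shown above; the function names and the 60-second default interval are assumptions, not Garage features.

```shell
# Sums every "resync queue length" line found on stdin. With `garage stats -a`
# each node reports its own queue, so the total reaches 0 only once every
# node has caught up.
resync_queue_length() {
  awk -F': *' '/resync queue length:/ { total += $2 } END { print total + 0 }'
}

# Hypothetical polling loop: prints the total until it reaches zero.
# Optional argument: seconds to wait between checks (default 60).
wait_for_resync() (
  while :; do
    remaining=$(garage stats -a | resync_queue_length)
    echo "resync queue length: $remaining"
    [ "$remaining" -eq 0 ] && break
    sleep "${1:-60}"
  done
)
```

Note that empty output also sums to 0, so confirm that `garage stats -a` itself succeeds before trusting a zero from `wait_for_resync`.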

Recovery Scenario 3: Metadata Lost

Use this when any of the following applies:

- Full node failure (both metadata and data lost)
- Only the metadata directory is lost
- The metadata database is corrupted

When Both Metadata and Data Are Lost

Starting from an empty metadata directory means Garage generates a new node ID. You’ll need to update the cluster layout.

Procedure

Set up replacement drives:
- New SSD for metadata (recommended)
- New HDD for data (if needed)
Start Garage on the new node
Verify new node ID:

garage status

The new node shows as NO ROLE ASSIGNED. The old node ID appears in the disconnected section.

Replace the old node with the new one:

garage layout assign <new_node_id> \
  --replace <old_node_id> \
  --zone <zone> \
  --capacity <capacity> \
  --tag <node_tag>

Example:

garage layout assign a11c7cf18af29737 \
  --replace b10c110e4e854e5a \
  --zone dc1 \
  --capacity 2TB \
  --tag node1

Review and apply:

garage layout show
garage layout apply --version <new_version>

Monitor synchronization:

garage stats -a

Garage will synchronize all required data onto the new node.

When Only Metadata Is Corrupted

Use this when:

Your metadata database file is corrupted (e.g., after power outage)
The node didn’t shut down properly
You want to avoid changing node IDs

Recovering from corruption without changing node IDs means data blocks don’t need to be reshuffled - you only need to restore the metadata.

Locate Your Database File

Database location depends on your db_engine setting:

LMDB: <metadata_dir>/db.lmdb/
Sqlite: <metadata_dir>/db.sqlite

See Configuration Reference for details.
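The mapping above is small enough to capture in a helper, which is handy when scripting the recovery options that follow. This is a sketch: the function name `garage_db_path` is hypothetical, and the `lmdb` and `sqlite` values are assumed to match what you set for db_engine in your configuration.

```shell
# Hypothetical helper: prints the database location for a given metadata
# directory and db_engine value, following the two paths listed above.
# Arguments: <metadata_dir> <db_engine>
garage_db_path() (
  metadata_dir=$1 engine=$2
  case "$engine" in
    lmdb)   echo "$metadata_dir/db.lmdb" ;;
    sqlite) echo "$metadata_dir/db.sqlite" ;;
    *)      echo "unsupported db_engine: $engine" >&2; exit 1 ;;
  esac
)
```

For LMDB the printed path is a directory, for Sqlite it is a single file, which is why the deletion commands in the recovery options differ.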

Recovery Options

Option 1: Resync from Other Nodes

If your cluster has 2 or 3 copies of all data:

Stop Garage
Delete the corrupted database:

# For LMDB
rm -rf /path/to/metadata/db.lmdb

# For Sqlite
rm /path/to/metadata/db.sqlite

Restart Garage
Repair metadata tables:

garage repair -a --yes tables

The node receives copies of metadata tables from the network. This may take a few minutes to complete.

Option 2: Restore Garage Snapshot (v0.9.4+)

If you have automatic snapshots enabled:

Locate snapshots:

ls -la <metadata_dir>/snapshots/

Snapshots are named by UTC timestamp (e.g., 2024-03-15T12:13:52Z).

Stop Garage
Restore the snapshot:

For LMDB:

# METADATA_DIR: the metadata directory from your Garage configuration
cd "$METADATA_DIR"
mv db.lmdb db.lmdb.bak
cp -r snapshots/2024-03-15T12:13:52Z db.lmdb

For Sqlite:

# METADATA_DIR: the metadata directory from your Garage configuration
cd "$METADATA_DIR"
mv db.sqlite db.sqlite.bak
cp snapshots/2024-03-15T12:13:52Z db.sqlite

Restart Garage
Resync recent changes:

garage repair -a --yes tables

This runs quickly as only changes since the snapshot need synchronization.

If your cluster is not replicated, you’ll lose all changes since the snapshot was taken.
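When several snapshots exist, you usually want the most recent one. Because the names are UTC timestamps in a fixed format, sorting them as plain strings also sorts them chronologically. The sketch below relies on that; the function name `latest_snapshot` is hypothetical.

```shell
# Reads snapshot names on stdin (one per line) and prints the most recent one.
# Names that do not look like 2024-03-15T12:13:52Z are ignored.
latest_snapshot() {
  grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$' \
    | LC_ALL=C sort \
    | tail -n 1
}
```

For example, `ls "$METADATA_DIR/snapshots" | latest_snapshot` prints the name to use in the restore commands above, or nothing if no snapshot matches.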

Option 3: Restore Filesystem Snapshot

If using ZFS or BTRFS to snapshot your metadata partition:

Refer to their specific documentation for rolling back or copying files from snapshots
Restart Garage
Run garage repair -a --yes tables

Depending on filesystem and database engine properties, snapshots taken during write operations may also be corrupted.

Multiple Simultaneous Failures

If multiple nodes fail simultaneously:

Assess the situation:

garage status

Determine which zones are affected and how many nodes remain.

Check data availability:

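With 3-way replication across 3 or more zones:

1 zone down: Full availability maintained
2 zones down: No data is lost, but reads and writes may fail until enough nodes are back to restore quorum
More than 2 zones down: Data loss is possible, recovery may be partial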

Prioritize recovery:
- First: Restore enough nodes to regain quorum (a majority of the copies of each partition, i.e. 2 of 3 with 3-way replication)
- Second: Verify data integrity with garage repair
- Third: Check for lost blocks with garage block list-errors

If data is lost, see Inspecting Lost Blocks for recovery options.

Preventive Measures

Automatic Metadata Snapshots

Configure automatic snapshots in your config file:

metadata_auto_snapshot_interval = "24h"

Filesystem-Level Snapshots

Use ZFS or BTRFS for your metadata partition:

# ZFS example
zfs snapshot tank/garage-meta@$(date +%Y%m%d-%H%M%S)

# BTRFS example
btrfs subvolume snapshot /mnt/garage-meta /mnt/snapshots/garage-meta-$(date +%Y%m%d-%H%M%S)

Regular Health Checks

Run periodic repairs:

# Weekly metadata check
garage repair --all-nodes --yes tables

# Block verification: start a scrub manually on the local node
# (Garage also schedules scrubs automatically)
garage repair scrub start

Monitoring

Monitor these critical metrics. The values below are what a healthy 3-node cluster should report:

cluster_healthy 1
block_resync_errored_blocks 0
cluster_connected_nodes 3

See Monitoring Guide for complete details.
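A basic health check can be scripted against these three metrics. The sketch below reads Prometheus-style text on stdin and assumes the input comes from a single node, with no timestamp after the value. The function name `check_garage_metrics` is hypothetical, and how you fetch the metrics depends on your deployment.

```shell
# Hypothetical check: exits 0 only if the three metrics listed above are
# present and healthy, non-zero otherwise.
# Argument: expected number of connected nodes.
check_garage_metrics() {
  awk -v want="$1" '
    # Strip an optional {label="..."} set so metric names compare exactly.
    { name = $1; sub(/[{].*/, "", name); value = $NF }
    name == "cluster_healthy"             { healthy = value; seen++ }
    name == "block_resync_errored_blocks" { errored = value; seen++ }
    name == "cluster_connected_nodes"     { nodes = value; seen++ }
    END {
      ok = (seen >= 3 && healthy == 1 && errored == 0 && nodes >= want)
      exit(ok ? 0 : 1)
    }'
}
```

Pipe the metrics text of one node into `check_garage_metrics 3` and use the exit status in a cron job or alerting script. A missing metric counts as a failure, so an unreachable endpoint is not mistaken for a healthy node.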

Testing Your Recovery Plan

Regularly test your recovery procedures in a non-production environment:

Simulate node failure by stopping Garage
Practice metadata restoration from snapshots
Verify data resync procedures
Time recovery operations to understand RTO/RPO
Document lessons learned

Best Practices

Deploy across 3+ zones for maximum resilience
Use SSDs for metadata to reduce corruption risk
Enable automatic snapshots on all nodes
Monitor cluster health continuously
Test recovery procedures regularly
Keep spare hardware for quick replacements
Document your topology including node IDs and zones
Back up the cluster layout configuration file
