Garage is designed to work on old, second-hand hardware, which means drive failures are expected. Fortunately, Garage is fully equipped to handle most common failure scenarios.
Availability Guarantees
With nodes dispersed in 3 or more zones and 3-way replication (recommended):
- Cluster remains fully functional if failures occur in only one zone
- Cluster can handle a complete zone outage (power/Internet)
- No data is lost if failures occur in at most two zones
Temporarily disconnected nodes automatically re-synchronize when they come back online. This guide focuses on permanent failures requiring manual intervention.
Recovery Scenario 1: Removing a Failed Node
Use this when:
- You have no spare parts to replace failed components
- At least 3 nodes remain in the cluster
- You do not plan to replace the node
Procedure
- Get the failed node ID
- Remove the node from the cluster layout
- Review the staged changes
- Apply the changes
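The four steps above might look like the following sketch. The subcommands (`garage status`, `garage layout remove`, `garage layout show`, `garage layout apply --version`) are recalled from the Garage CLI and should be checked against `garage help` for your version. The wrapper function names and the `GARAGE` override variable are hypothetical, added only so the commands can be dry-run.

```shell
# Binary to invoke. Set GARAGE=echo to print the commands instead of running them.
GARAGE="${GARAGE:-garage}"

# Step 1: list cluster nodes; the failed node's ID appears in this output.
show_status() { "$GARAGE" status; }

# Step 2: stage the removal of the failed node (argument: its node ID).
stage_node_removal() { "$GARAGE" layout remove "$1"; }

# Step 3: review the staged layout before committing to it.
review_layout() { "$GARAGE" layout show; }

# Step 4: apply the staged layout (argument: the new layout version number).
apply_layout() { "$GARAGE" layout apply --version "$1"; }
```

Running the steps with `GARAGE=echo` first is a cheap way to confirm the node ID and version number before touching the live layout.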
Recovery Scenario 2: Data Lost, Metadata Intact
Use this when:
- Only the HDD storing data blocks failed
- The SSD with metadata is still working
- Node configuration unchanged
Procedure
- Replace the failed HDD and mount it at the same path
- Restart Garage with the existing configuration
- Trigger a resync of the data blocks
- Monitor progress
Depending on the amount of data, this process may take several hours to complete.
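The resync and monitoring steps could be run along these lines. This is a sketch: `garage repair -a --yes blocks` and `garage stats -a` are recalled from the Garage CLI rather than taken from this page, so verify them with `garage help`. The function names and the `GARAGE` variable are hypothetical.

```shell
# Binary to invoke. Set GARAGE=echo to print the commands instead of running them.
GARAGE="${GARAGE:-garage}"

# Ask all nodes (-a) to check their block stores and queue missing blocks for resync.
trigger_block_resync() { "$GARAGE" repair -a --yes blocks; }

# Print statistics for all nodes; the reported resync queue should shrink over time.
show_resync_progress() { "$GARAGE" stats -a; }
```

Calling `show_resync_progress` periodically gives a rough idea of how much of the resync remains.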
Recovery Scenario 3: Metadata Lost
Use this when:
- Full node failure (both metadata and data lost)
- Only metadata directory lost
- Metadata database corrupted
When Both Metadata and Data Are Lost
Starting from an empty metadata directory means Garage generates a new node ID. You’ll need to update the cluster layout.
Procedure
- Set up replacement drives:
  - New SSD for metadata (recommended)
  - New HDD for data (if needed)
- Start Garage on the new node
- Verify the new node ID: in the status output, the new node is listed with NO ROLE ASSIGNED and the old node ID appears in the disconnected section
- Replace the old node with the new one in the layout
- Review and apply the layout changes
- Monitor synchronization
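The replacement step above can be sketched as follows. The `garage layout assign ... --replace ...` form and its `-c` (capacity) and `-z` (zone) options are recalled from the Garage CLI and should be confirmed with `garage layout assign --help`. The function names, the `GARAGE` variable, and all argument values are placeholders.

```shell
# Binary to invoke. Set GARAGE=echo to print the commands instead of running them.
GARAGE="${GARAGE:-garage}"

# Stage the swap so the new node takes over the old node's role.
# Arguments: new node ID, old node ID, capacity, zone (all supplied by you).
stage_node_replacement() {
  "$GARAGE" layout assign "$1" --replace "$2" -c "$3" -z "$4"
}

# Review the staged layout, then apply it under the given version number.
review_layout() { "$GARAGE" layout show; }
apply_layout() { "$GARAGE" layout apply --version "$1"; }
```

Reusing the old node's capacity and zone keeps the layout otherwise unchanged, which should limit how much data has to move.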
When Only Metadata Is Corrupted
Use this when:
- Your metadata database file is corrupted (e.g., after power outage)
- The node didn’t shut down properly
- You want to avoid changing node IDs
Recovering from corruption without changing node IDs means data blocks don't need to be reshuffled; you only need to restore the metadata.
Locate Your Database File
Database location depends on your `db_engine` setting:
- LMDB: `<metadata_dir>/db.lmdb/`
- Sqlite: `<metadata_dir>/db.sqlite`
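To check which of the two locations applies on a given node, a small helper like the one below could be used. It only tests for the two paths listed above. The function name is hypothetical, and the metadata directory is passed in as an argument rather than read from the Garage config.

```shell
# Print the engine and path of the database found under a metadata directory.
# Argument: the metadata directory configured for this node.
find_metadata_db() {
  metadata_dir="$1"
  if [ -d "$metadata_dir/db.lmdb" ]; then
    echo "lmdb $metadata_dir/db.lmdb"
  elif [ -f "$metadata_dir/db.sqlite" ]; then
    echo "sqlite $metadata_dir/db.sqlite"
  else
    echo "no database found in $metadata_dir" >&2
    return 1
  fi
}
```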
Recovery Options
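Option 1: Resync from Other Nodes
If your cluster has 2 or 3 copies of all data:
- Stop Garage
- Delete the corrupted database
- Restart Garage
- Repair metadata tables with `garage repair -a --yes tables`
Option 2: Restore from a Garage Metadata Snapshot
- Locate snapshots; each one is named after the time it was taken (e.g., 2024-03-15T12:13:52Z)
- Stop Garage
- Restore the snapshot in place of the corrupted database
- Restart Garage
- Resync recent changes with `garage repair -a --yes tables`
Option 3: Restore from a Filesystem-Level Snapshot
If you use filesystem-level snapshots (e.g., ZFS or BTRFS):
- Refer to their specific documentation for rolling back or copying files from snapshots
- Restart Garage
- Run `garage repair -a --yes tables`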
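Restoring a Garage snapshot in place of a corrupted database could be sketched as below. It assumes Garage is already stopped and that a snapshot is a complete copy of the database (a directory for LMDB, a single file for Sqlite). That layout is an assumption to verify on your own node, and the function name and arguments are hypothetical.

```shell
# Swap the corrupted database for a snapshot copy, keeping the old one as .bak.
# Arguments: metadata directory, path to the chosen snapshot.
restore_metadata_snapshot() {
  metadata_dir="$1"
  snapshot="$2"
  if [ -d "$metadata_dir/db.lmdb" ]; then
    db="$metadata_dir/db.lmdb"
  elif [ -f "$metadata_dir/db.sqlite" ]; then
    db="$metadata_dir/db.sqlite"
  else
    echo "no database found in $metadata_dir" >&2
    return 1
  fi
  # Keep the corrupted copy until the recovery has been verified.
  mv "$db" "$db.bak" && cp -r "$snapshot" "$db"
}
```

After restarting Garage, `garage repair -a --yes tables` brings the restored metadata up to date with the other nodes. The `.bak` copy can be deleted once the node is healthy again.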
Multiple Simultaneous Failures
If multiple nodes fail simultaneously:
- Assess the situation
- Check data availability:
  - 1 zone down: Full availability maintained
  - 2 zones down: Quorum is lost, so the cluster is unavailable until nodes are restored; per the guarantees above, existing data survives in the remaining zones
- Prioritize recovery:
  - First: Restore nodes to meet quorum (a majority of the replicas of each piece of data)
  - Second: Verify data integrity with `garage repair`
  - Third: Check for lost blocks with `garage block list-errors`
- If data is lost:
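The assessment and integrity checks from the list above can be sketched as follows. `garage block list-errors` and `garage repair -a --yes tables` appear elsewhere in this guide; `garage status` is recalled from the Garage CLI. The function names and the `GARAGE` variable are hypothetical.

```shell
# Binary to invoke. Set GARAGE=echo to print the commands instead of running them.
GARAGE="${GARAGE:-garage}"

# Assess: see which nodes are connected and which are missing.
assess_cluster() { "$GARAGE" status; }

# Verify: repair metadata tables on all nodes once quorum is back.
repair_tables() { "$GARAGE" repair -a --yes tables; }

# Check: list blocks that could not be resynchronized.
list_block_errors() { "$GARAGE" block list-errors; }
```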
Preventive Measures
Automatic Metadata Snapshots
Configure automatic snapshots in your config file.
Filesystem-Level Snapshots
Use ZFS or BTRFS for your metadata partition.
Regular Health Checks
Run periodic repairs.
Monitoring
Monitor critical metrics.
Testing Your Recovery Plan
Regularly test your recovery procedures in a non-production environment:
- Simulate node failure by stopping Garage
- Practice metadata restoration from snapshots
- Verify data resync procedures
- Time recovery operations to understand RTO/RPO
- Document lessons learned
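For the timing step in the list above, a small wrapper like this one could record how long each recovery operation takes during a drill. It is a generic shell sketch, not part of Garage; the function name and label argument are hypothetical.

```shell
# Run one recovery step and report its duration and exit status.
# Usage: timed_step <label> <command> [args...]
timed_step() {
  label="$1"
  shift
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "$label: $((end - start))s (exit $status)"
  return "$status"
}
```

For example, `timed_step table-repair garage repair -a --yes tables` would print the elapsed seconds once the command returns. That measures only how long the command took to run, so completion of any background work has to be confirmed separately.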
Best Practices
- Deploy across 3+ zones for maximum resilience
- Use SSDs for metadata to reduce corruption risk
- Enable automatic snapshots on all nodes
- Monitor cluster health continuously
- Test recovery procedures regularly
- Keep spare hardware for quick replacements
- Document your topology including node IDs and zones
- Back up cluster layout configuration file