Irreversible death decisions and the finality engine in Styx
Once Styx declares a node dead, the decision is irreversible: the node cannot be resurrected and must rejoin under a new identity. The Finality Engine enforces this.
Finality addresses several critical concerns in distributed systems:
Zombie Nodes: Prevents “dead” nodes from rejoining with stale state
Flapping Identities: Stops nodes from oscillating between alive/dead
Data Consistency: Ensures distributed protocols can safely proceed after declaring a node dead
Split-Brain Prevention: Forces network partition resolution before rejoining
False death is catastrophic. Once declared dead, a node loses its identity. Styx requires overwhelming evidence before making this irreversible decision.
```go
// IsDead checks if a node has been declared dead
// P14: Once dead, always dead
func (e *Engine) IsDead(id types.NodeID) bool {
	e.mu.RLock()
	defer e.mu.RUnlock()
	_, exists := e.dead[id]
	return exists
}
```
Property 14: Death is Final
Once a node is in the death registry, IsDead() will return true forever. There is no way to remove an entry from this registry.
```go
// AttemptResurrection tries to bring back a dead node
// P14: This must ALWAYS fail
func (e *Engine) AttemptResurrection(id types.NodeID) error {
	e.mu.RLock()
	defer e.mu.RUnlock()
	if _, exists := e.dead[id]; exists {
		return ErrResurrection
	}
	return nil // wasn't dead anyway
}
```
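Attempting to resurrect a dead node always fails with ErrResurrection. This is by design. For a node that was never declared dead, AttemptResurrection returns nil because there is nothing to resurrect.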
When a node needs to rejoin after being declared dead, it must use a new identity with an incremented generation counter:
source/types/node_id.go
```go
type NodeID struct {
	// Base identifier (e.g., hash of address or UUID)
	Base uint64
	// Generation counter - incremented on each identity rebirth
	Generation uint64
}

// Rebirth creates a new identity for a reborn node.
// This MUST be used when a node returns after being declared dead.
// The generation counter is incremented, making this a distinct identity.
func (n NodeID) Rebirth() NodeID {
	return NodeID{
		Base:       n.Base,
		Generation: n.Generation + 1,
	}
}
```
```go
// Original node
original := types.NewNodeID(12345) // 000000000000000012345.g0

// Node is declared dead by finality engine
engine.DeclareDeath(original, belief, reports, true)

// Node process restarts and needs to rejoin
reborn := original.Rebirth() // 000000000000000012345.g1

// Reborn node can now join with clean slate
engine.IsDead(original) // true (old identity still dead)
engine.IsDead(reborn)   // false (new identity, not in death registry)
```
The base identifier stays the same, but the generation counter increments. This allows tracking that it’s the “same” physical node while treating it as a distinct logical identity.
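The distinction between physical and logical identity can be sketched in a few lines. The example below redeclares `NodeID` and `Rebirth` so that it runs on its own; `SameOrigin` is a hypothetical helper that is not part of Styx, added only to show how the shared base could be used for tracking. Because `NodeID` is a plain comparable struct, the original and reborn identities are different map keys, which is why the reborn node is absent from a registry keyed by `NodeID`.

```go
package main

import "fmt"

// NodeID mirrors the struct shown in this document.
type NodeID struct {
	Base       uint64
	Generation uint64
}

// Rebirth returns the same base with the generation incremented.
func (n NodeID) Rebirth() NodeID {
	return NodeID{Base: n.Base, Generation: n.Generation + 1}
}

// SameOrigin is a hypothetical helper (an assumption for illustration):
// two identities share an origin when their bases match, whatever
// their generations are.
func SameOrigin(a, b NodeID) bool {
	return a.Base == b.Base
}

func main() {
	original := NodeID{Base: 12345}
	reborn := original.Rebirth()

	// A stand-in death registry keyed by the full identity.
	dead := map[NodeID]struct{}{original: {}}
	_, rebornDead := dead[reborn]

	fmt.Println(original == reborn)           // false: distinct logical identities
	fmt.Println(SameOrigin(original, reborn)) // true: same physical node
	fmt.Println(rebornDead)                   // false: new identity is not in the registry
}
```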
The finality engine returns specific errors for different failure modes:
source/finality/engine.go
```go
var (
	ErrAlreadyDead          = errors.New("node already declared dead")
	ErrInsufficientEvidence = errors.New("insufficient evidence for death declaration")
	ErrSilenceOnly          = errors.New("cannot declare death from silence alone")
	ErrResurrection         = errors.New("cannot resurrect a dead node")
)
```
```go
err := engine.DeclareDeath(nodeID, belief, reports, false)

// errors.Is also matches these sentinels when they have been wrapped.
switch {
case errors.Is(err, finality.ErrAlreadyDead):
	// Node already in death registry
	log.Info("node already declared dead")
case errors.Is(err, finality.ErrSilenceOnly):
	// Only have timeout evidence, need crash reports
	log.Warn("insufficient evidence: timeouts alone cannot declare death")
case errors.Is(err, finality.ErrInsufficientEvidence):
	// Confidence too low, or not enough witnesses, or high disagreement
	log.Warn("evidence does not meet death declaration threshold")
case err == nil:
	// Death successfully declared; the local process keeps running
	log.Info("node declared permanently dead")
}
```