Instance Health Monitoring and Auto-Management

Universe runs two background services that keep the cluster healthy without manual intervention. The InstanceHealthMonitor checks every five seconds whether each ONLINE instance on the local node is still running and marks it OFFLINE if the process has exited. The InstanceCountEnforcer, active only on the Master node, checks every five seconds whether each configuration’s active instance count is below its configured minimum and spawns replacements automatically. Together, these two loops let the cluster recover from process crashes and node failures without operator action, provided a connected node has enough free resources for the replacements.

Instance states

Every instance tracked in the cluster state map carries an InstanceState value that reflects where it is in its lifecycle.

State	Meaning
`CREATING`	The deploy task has been dispatched to a Wrapper node; the instance process has not started yet
`ONLINE`	The instance is running and (optionally) sending regular heartbeats via `PUT /api/instances/{id}/state`
`OFFLINE`	The health monitor detected the process is no longer running, or the Wrapper node disconnected; the instance record is retained in the state map
`STOPPED`	The instance was stopped intentionally via `instance stop` or the stop REST endpoint

OFFLINE instances are kept in the cluster state map deliberately. When a Wrapper node disconnects from Hazelcast, its instance records are not removed: heartbeats stop and the instances are marked OFFLINE. The records remain until you explicitly stop or recreate the instances.

Instance health monitor

InstanceHealthMonitor runs on every node (Master and Wrapper alike). It uses a single-threaded scheduled executor with a five-second fixed-rate interval.

What the check loop does:

1. Reads the local Hazelcast member UUID.
2. Filters all cluster instances to those whose wrapperNodeId matches the local UUID and whose state is ONLINE.
3. For each matching instance, looks up the RuntimeProvider registered under the configuration’s runtime key.
4. Calls runtimeProvider.isRunning(instance.id), which checks whether the screen session, tmux window, or other runtime process is still alive.
5. If the runtime reports the process is gone, calls markOffline().
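
The filtering in the steps above can be illustrated with a small Python sketch. It is not the Universe implementation: the Instance class, the find_dead_instances function, and the dictionary of is-running callbacks are hypothetical stand-ins for the cluster state map and the RuntimeProvider registry, and skipping instances without a registered provider is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instance:
    id: str
    wrapper_node_id: str
    state: str    # "CREATING", "ONLINE", "OFFLINE" or "STOPPED"
    runtime: str  # runtime key from the configuration, e.g. "screen"


def find_dead_instances(
    instances: List[Instance],
    local_uuid: str,
    is_running_by_runtime: Dict[str, Callable[[str], bool]],
) -> List[Instance]:
    """Return local ONLINE instances whose runtime reports them as gone."""
    dead = []
    for inst in instances:
        # Only instances hosted on this node and currently ONLINE are checked.
        if inst.wrapper_node_id != local_uuid or inst.state != "ONLINE":
            continue
        is_running = is_running_by_runtime.get(inst.runtime)
        if is_running is None:
            continue  # assumption: no provider registered, nothing to check
        if not is_running(inst.id):
            dead.append(inst)
    return dead


# Fake runtime: only ef34gh still has a live process.
alive = {"ab12cd": False, "ef34gh": True}
providers = {"screen": lambda instance_id: alive.get(instance_id, False)}
instances = [
    Instance("ab12cd", "node-a", "ONLINE", "screen"),
    Instance("ef34gh", "node-a", "ONLINE", "screen"),
    Instance("zz99zz", "node-b", "ONLINE", "screen"),    # other node, ignored
    Instance("cr00cr", "node-a", "CREATING", "screen"),  # not ONLINE, ignored
]
dead_ids = [i.id for i in find_dead_instances(instances, "node-a", providers)]
print(dead_ids)  # ['ab12cd']
```

Each instance returned by such a pass would then be handed to markOffline().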

markOffline() does the following in order:

1. Releases the instance’s allocated port back to the PortAllocator.
2. Subtracts the instance’s allocatedRamMB and allocatedCpu from the node’s resource tracking.
3. For non-static instances, deletes the working directory at ./running/<instance-id>/.
4. Updates the instance’s state to OFFLINE in the shared Hazelcast IMap.

[WARNING] Instance ab12cd is no longer running (runtime=screen), marking OFFLINE
[INFO]    Cleaned up working directory for dead instance ab12cd
[INFO]    Instance ab12cd marked OFFLINE and resources released

Static instances (those with "static": true) keep their working directory when they are marked OFFLINE. Only non-static instances have ./running/<instance-id>/ deleted.
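
To make the ordering and the static exception concrete, here is a minimal Python sketch of the same four steps. The data structures (a set of used ports, a usage dictionary, a temporary directory standing in for ./running/) and all names are assumptions for illustration, not the real PortAllocator or state map.

```python
from dataclasses import dataclass
from pathlib import Path
import shutil
import tempfile
from typing import Dict, Set


@dataclass
class Instance:
    id: str
    port: int
    allocated_ram_mb: int
    allocated_cpu: int
    static: bool
    state: str = "ONLINE"


def mark_offline(inst: Instance, used_ports: Set[int],
                 node_usage: Dict[str, int], running_root: Path) -> None:
    # 1. Release the port.
    used_ports.discard(inst.port)
    # 2. Return RAM and CPU to the node's tracking.
    node_usage["ram_mb"] -= inst.allocated_ram_mb
    node_usage["cpu"] -= inst.allocated_cpu
    # 3. Delete the working directory, but only for non-static instances.
    if not inst.static:
        shutil.rmtree(running_root / inst.id, ignore_errors=True)
    # 4. Record the new state last.
    inst.state = "OFFLINE"


root = Path(tempfile.mkdtemp())  # stand-in for ./running/
(root / "ab12cd").mkdir()
(root / "st00st").mkdir()
ports = {25565, 25566}
usage = {"ram_mb": 1024, "cpu": 200}
dynamic = Instance("ab12cd", 25565, 512, 100, static=False)
static = Instance("st00st", 25566, 512, 100, static=True)

mark_offline(dynamic, ports, usage, root)
mark_offline(static, ports, usage, root)
print((root / "ab12cd").exists(), (root / "st00st").exists())  # False True
```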

Instance count enforcer

InstanceCountEnforcer runs only on the Master node. If isMasterNode is false in config.json, the service logs a message and exits without scheduling anything.

What the enforcement loop does:

1. Reads all loaded configurations from the cluster state map.
2. Skips configurations with static: true or minimumServiceCount ≤ 0.
3. Counts instances for each configuration whose state is ONLINE or CREATING.
4. If the count is below minimumServiceCount, calls InstanceCreationService.createInstance() for each missing instance.

Resource-aware node selection happens inside createInstance(): the Master evaluates all connected Wrapper members and picks the first one with enough free RAM and CPU to satisfy the configuration’s ramMB and cpu requirements. If no node has sufficient headroom, the auto-spawn attempt fails and a warning is logged.

[WARNING] Config 'default' has 0 active instance(s), minimum=2. Spawning 2...
[SUCCESS] Auto-spawned instance ab12cd for config 'default' on node a1b2c3d4-...
[SUCCESS] Auto-spawned instance ef34gh for config 'default' on node 9f8e7d6c-...
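
The counting logic of the enforcement loop can be sketched in Python as follows. This is an illustration under assumptions: configurations are plain dictionaries, instances are (configuration name, state) pairs, and missing_instances is a hypothetical helper rather than part of InstanceCountEnforcer.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Only these states count toward minimumServiceCount.
ACTIVE_STATES = {"ONLINE", "CREATING"}


def missing_instances(configs: List[dict],
                      instances: List[Tuple[str, str]]) -> Dict[str, int]:
    """Map configuration name to the number of instances to spawn."""
    active = Counter(name for name, state in instances if state in ACTIVE_STATES)
    deficits = {}
    for cfg in configs:
        minimum = cfg.get("minimumServiceCount", 0)
        # Static configurations and non-positive minimums are skipped.
        if cfg.get("static", False) or minimum <= 0:
            continue
        deficit = minimum - active[cfg["name"]]
        if deficit > 0:
            deficits[cfg["name"]] = deficit
    return deficits


configs = [
    {"name": "default", "minimumServiceCount": 2},
    {"name": "lobby", "minimumServiceCount": 1},
    {"name": "hub", "minimumServiceCount": 3, "static": True},
]
instances = [
    ("default", "OFFLINE"),  # does not count
    ("default", "STOPPED"),  # does not count
    ("lobby", "CREATING"),   # counts as active
]
deficits = missing_instances(configs, instances)
print(deficits)  # {'default': 2}
```

In this example only default needs replacements, which matches the "Spawning 2..." log line above.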

Enforcer configuration — set minimumServiceCount in ./configuration/<name>.json:

{
  "name": "default",
  "minimumServiceCount": 2,
  "ramMB": 512,
  "cpu": 100
}

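With this configuration, the enforcer keeps at least two default instances running or being created. Because it checks every five seconds and depends on a node having enough free RAM and CPU, the active count can briefly drop below the minimum after a crash, or stay below it while no node has capacity.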

Checking instance state

Use instance info <id> to inspect the current state of any instance, including its last heartbeat timestamp and the PID reported by the runtime:

instance info ab12cd

=== Instance ab12cd ===
  Configuration: default
  Static: false
  State: ONLINE
  Host: 127.0.0.1:25565
  Wrapper: a1b2c3d4-e5f6-7890-abcd-ef1234567890
  PID: 94321
  Last heartbeat: 1746960000000
  Working dir: ./running/ab12cd
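
The Last heartbeat value in this output looks like a Unix timestamp in milliseconds. Assuming that is the case (the output does not state the unit), a short Python sketch converts it to a readable UTC time and computes its age. Both helper names are hypothetical.

```python
from datetime import datetime, timezone


def heartbeat_to_utc(last_heartbeat_ms: int) -> datetime:
    """Convert an epoch-milliseconds heartbeat to an aware UTC datetime."""
    return datetime.fromtimestamp(last_heartbeat_ms / 1000, tz=timezone.utc)


def heartbeat_age_seconds(last_heartbeat_ms: int, now_ms: int) -> float:
    """Seconds elapsed between the heartbeat and the given 'now'."""
    return (now_ms - last_heartbeat_ms) / 1000


ts = heartbeat_to_utc(1746960000000)
print(ts.isoformat())  # 2025-05-11T10:40:00+00:00
print(heartbeat_age_seconds(1746960000000, 1746960012000))  # 12.0
```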

REST API: instance state and heartbeats

External processes (such as a Minecraft plugin running inside the instance) report their health by calling the state endpoint:

PUT /api/instances/{id}/state

This endpoint updates both the instance state and the lastHeartbeat timestamp in the shared IMap. A rising lastHeartbeat value is the signal that the application inside the instance is healthy, not just that the OS process is alive.

List all instances (including their current states):

curl http://localhost:7000/api/instances

Update state and heartbeat from inside an instance:

curl -X PUT http://localhost:7000/api/instances/ab12cd/state \
  -H "Content-Type: application/json" \
  -d '{"state": "ONLINE"}'
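
The same call can be made from code. The following Python sketch uses only the standard library. The URL, method, header, and body mirror the curl command above; the function names and the five-second timeout are assumptions, and the shape of the response is not specified in this document.

```python
import json
import urllib.request


def build_heartbeat_request(base_url: str, instance_id: str,
                            state: str = "ONLINE") -> urllib.request.Request:
    """Build the PUT request shown in the curl example."""
    body = json.dumps({"state": state}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/instances/{instance_id}/state",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send_heartbeat(base_url: str, instance_id: str, timeout: float = 5.0) -> int:
    """Send one heartbeat and return the HTTP status code."""
    request = build_heartbeat_request(base_url, instance_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Building the request does not touch the network.
req = build_heartbeat_request("http://localhost:7000", "ab12cd")
print(req.get_method(), req.full_url)
```

A process inside the instance would call send_heartbeat() on a timer so that lastHeartbeat keeps rising.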

Cluster resilience when a Wrapper disconnects

When a Wrapper node loses its Hazelcast connection — due to a network partition, container restart, or host failure — the following happens:

1. The Hazelcast cluster detects the member departure and fires a memberRemoved event.
2. ResilienceMembershipListener on the Master immediately marks all instances that were running on the disconnected Wrapper as OFFLINE in the shared IMap.
3. Node resource tracking for the disconnected member is cleared.
4. The InstanceInfo records are retained, not removed, so external services can see what was running.

Because OFFLINE instances no longer count toward minimumServiceCount, the InstanceCountEnforcer detects the shortfall on its next pass and automatically spawns replacements on available nodes.
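
The effect of a member departure on the shared state can be sketched in Python. This is a simplified model, not the ResilienceMembershipListener code: plain dictionaries stand in for the IMap and the resource tracking, handle_member_removed is a hypothetical name, and leaving STOPPED instances untouched is an assumption.

```python
from typing import Dict, List


def handle_member_removed(member_uuid: str,
                          instances: Dict[str, dict],
                          node_resources: Dict[str, dict]) -> List[str]:
    """Mark the departed Wrapper's instances OFFLINE and clear its resources."""
    affected = []
    for instance_id, info in instances.items():
        if info["wrapperNodeId"] != member_uuid:
            continue
        # Assumption: intentionally stopped or already offline records stay as-is.
        if info["state"] in ("STOPPED", "OFFLINE"):
            continue
        info["state"] = "OFFLINE"
        affected.append(instance_id)
    # Resource tracking for the departed member is dropped entirely.
    node_resources.pop(member_uuid, None)
    return affected


instances = {
    "ab12cd": {"wrapperNodeId": "node-a", "state": "ONLINE"},
    "ef34gh": {"wrapperNodeId": "node-b", "state": "ONLINE"},
    "st00st": {"wrapperNodeId": "node-a", "state": "STOPPED"},
}
resources = {
    "node-a": {"usedRamMB": 512, "usedCpu": 100},
    "node-b": {"usedRamMB": 512, "usedCpu": 100},
}
affected = handle_member_removed("node-a", instances, resources)
print(affected)        # ['ab12cd']
print(len(instances))  # 3, records are retained
```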

Resource-aware node selection

When the Master needs to place a new instance — either from instance create or from the count enforcer — it evaluates every connected Hazelcast member:

Each member tracks its consumed RAM (usedRamMB) and CPU (usedCpu) in the cluster state map via NodeResources.
The Master subtracts used resources from the node’s total capacity and selects the first member that can satisfy the configuration’s ramMB and cpu requirements.
When an instance is marked OFFLINE or STOPPED, its allocatedRamMB and allocatedCpu are returned to the node’s available pool via ClusterStateService.removeNodeResources().

To prevent a single node from being overloaded, set realistic ramMB and cpu values in each configuration. The enforcer respects these limits and will log a warning rather than over-commit a node.
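
The selection rule described above is a first-fit search. A minimal Python sketch, assuming each node exposes a total capacity alongside its used values (the Node class, its field names, and the place helper are hypothetical):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    uuid: str
    total_ram_mb: int
    total_cpu: int
    used_ram_mb: int = 0
    used_cpu: int = 0


def select_node(nodes: List[Node], ram_mb: int, cpu: int) -> Optional[Node]:
    """Return the first node with enough free RAM and CPU, or None."""
    for node in nodes:
        free_ram = node.total_ram_mb - node.used_ram_mb
        free_cpu = node.total_cpu - node.used_cpu
        if free_ram >= ram_mb and free_cpu >= cpu:
            return node
    return None


def place(nodes: List[Node], ram_mb: int, cpu: int) -> Optional[str]:
    """Reserve resources on the selected node and return its UUID."""
    node = select_node(nodes, ram_mb, cpu)
    if node is None:
        return None  # no headroom anywhere: the caller would log a warning
    node.used_ram_mb += ram_mb
    node.used_cpu += cpu
    return node.uuid


nodes = [
    Node("node-a", total_ram_mb=1024, total_cpu=200, used_ram_mb=768, used_cpu=100),
    Node("node-b", total_ram_mb=2048, total_cpu=400),
]
first = place(nodes, 512, 100)   # node-a has only 256 MB free
second = place(nodes, 256, 100)  # fits exactly into node-a
third = place(nodes, 4096, 100)  # larger than any node
print(first, second, third)      # node-b node-a None
```

First-fit is simple and predictable, but it tends to fill earlier members before later ones. Whether Universe orders members in any particular way is not stated here.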

Monitoring summary

Health monitor

Runs on every node. 5-second interval. Marks instances OFFLINE when the runtime process exits and releases ports and resources.

Count enforcer

Runs on Master only. 5-second interval. Auto-spawns instances when the active count falls below minimumServiceCount.

Instance states

Four states: CREATING, ONLINE, OFFLINE, STOPPED. Inspect with instance info <id> or GET /api/instances.

Heartbeat API

External processes signal health via PUT /api/instances/{id}/state. The lastHeartbeat field is updated on each call.
