HealthMonitor: state persistence and crash recovery

Cloud Repositorio includes a background HealthMonitor thread that periodically saves the YAML database to disk. This provides basic durability against process crashes: if the orchestrator exits unexpectedly, the most recently saved state is available on the next startup.

HealthMonitor

HealthMonitor runs as a daemon thread so it does not prevent the process from exiting. It calls db.save() on a fixed interval, then sleeps until the next cycle. Errors during save are logged but do not stop the monitor.

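import logging
import threading
import time

logger = logging.getLogger(__name__)


class HealthMonitor:
    def __init__(self, db, interval=15):
        self.db = db
        self.interval = interval  # seconds between saves
        self.running = False
        self.thread = None

    def start(self):
        self.running = True
        # daemon=True: the thread never blocks interpreter shutdown
        self.thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self.thread.start()
        logger.info(f"Health monitor started ({self.interval}s interval)")

    def stop(self):
        self.running = False
        if self.thread:
            # The loop may still be sleeping, so wait at most 5 seconds
            self.thread.join(timeout=5)

    def _monitor_loop(self):
        while self.running:
            try:
                self.db.save()
            except Exception as e:
                logger.error(f"Monitor error: {e}")
            # Sleep outside the try block so a failing save is retried
            # after the interval instead of in a tight loop
            time.sleep(self.interval)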

The monitor is started in CLI.run() before the login prompt and stopped (via monitor.stop()) when the loop exits. The default interval is 15 seconds and can be changed by passing a different value to the interval parameter when constructing HealthMonitor.
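One consequence of the loop above is that stop() joins with a 5-second timeout while the thread may still be sleeping for up to the full interval, so the join can time out before the loop notices the flag. The following is a minimal sketch of a variant that wakes up as soon as stop() is called, using threading.Event. The class names EventHealthMonitor and CountingDB are hypothetical and not part of the project; CountingDB only stands in for the real database object.

```python
import logging
import threading

logger = logging.getLogger(__name__)


class EventHealthMonitor:
    """Hypothetical variant of HealthMonitor that reacts to stop() immediately."""

    def __init__(self, db, interval=15):
        self.db = db
        self.interval = interval
        self._stop_event = threading.Event()
        self.thread = None

    def start(self):
        self._stop_event.clear()
        self.thread = threading.Thread(target=self._monitor_loop, daemon=True)
        self.thread.start()
        logger.info("Health monitor started (%ss interval)", self.interval)

    def stop(self):
        self._stop_event.set()
        if self.thread:
            self.thread.join(timeout=5)

    def _monitor_loop(self):
        while not self._stop_event.is_set():
            try:
                self.db.save()
            except Exception:
                logger.exception("Monitor error")
            # wait() sleeps for the interval but returns early once stop() sets the event
            self._stop_event.wait(self.interval)


class CountingDB:
    """Stand-in for the real database object (assumption): counts save() calls."""

    def __init__(self):
        self.saves = 0

    def save(self):
        self.saves += 1
```

With this shape, the interval can be long without delaying shutdown, because the wait is interrupted rather than slept through.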

State persistence

All mutable state is stored in database.yaml. The file is written atomically by db.save() on every monitor tick. It contains:

Users — username, SHA-256 password hash, VM quota, used_vms count, and a list of owned slice IDs.
Slices — full topology including every VM’s interfaces, MAC addresses, VLAN assignments, QCOW2 image paths, QEMU PIDs, and statuses; link records; and the VLAN pool state.
Workers — capacity specs (max VMs, RAM, cores, disk) and current usage counters.
ID counters — next_vm_id and next_vlan_id to ensure unique IDs survive restarts.
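The source states that db.save() writes the file atomically but does not show how. A common way to get that property, sketched here under the assumption that the serialized YAML text is already available as a string, is to write to a temporary file in the same directory and then rename it over the target with os.replace. The function name atomic_write_text is hypothetical and may not match what db.save() actually does.

```python
import os
import tempfile


def atomic_write_text(path, text):
    """Replace `path` so readers see the old or the new content, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".yaml")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies mid-write, only the temporary file is incomplete and database.yaml keeps its previous content.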

Representative structure:

workers_list: ["10.0.10.1", "10.0.10.2", "10.0.10.3"]

users:
  admin:
    username: admin
    password_hash: 8c6976e5b5410415bde908bd4dee15dfb167a9c873fc4bb8a81f6f2ab448a918
    quota_vms: 10
    used_vms: 2
    slices: [1000]

workers:
  10.0.10.1:
    max_vms: 10
    max_ram_gb: 1
    max_cores: 2
    max_disk_gb: 500
    used_ram_gb: 0.5
    used_cores: 1
    used_disk_gb: 1

slices:
  1000:
    slice_id: 1000
    owner: admin
    status: running
    vlan_pool_start: 100
    vlan_pool_end: 119
    vlan_pool_used: [100]
    vms:
      - vm_id: 1001
        name: web
        worker_ip: 10.0.10.1
        vnc_port: 5901
        status: running
        pid: "12345"
        qcow_image: ~/vm_images/web_img.qcow2
        flavor:
          cores: 1
          ram_gb: 0.5
          disk_gb: 1
          image: /tmp/vm_images/cirros-0.6.2-x86_64-disk.img
        interfaces:
          - name: eth0
            mac: "52:54:00:03:e9:00"
            vlan_id: 400
            link_id: null
          - name: eth1
            mac: "52:54:00:03:e9:01"
            vlan_id: 100
            link_id: 1
    links:
      - link_id: 1
        vlan_id: 100
        vm1_id: 1001
        vm1_interface: eth1
        vm2_id: 1002
        vm2_interface: eth1

next_vm_id: 1002
next_vlan_id: 100
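To illustrate how the fields in a slice record relate to each other, here is a small consistency check over an already-loaded slice dictionary. The invariants it checks are inferred from the sample above and are assumptions, not documented guarantees: a link's VLAN lies inside the slice's pool and is listed in vlan_pool_used, and an interface with a link_id carries the same vlan_id as that link. The function name check_slice is hypothetical.

```python
def check_slice(slice_rec):
    """Return a list of human-readable problems found in one slice record."""
    problems = []
    low, high = slice_rec["vlan_pool_start"], slice_rec["vlan_pool_end"]
    used = set(slice_rec["vlan_pool_used"])
    links = {link["link_id"]: link for link in slice_rec.get("links", [])}

    for link in links.values():
        vlan = link["vlan_id"]
        if not low <= vlan <= high:
            problems.append(f"link {link['link_id']}: VLAN {vlan} outside pool")
        if vlan not in used:
            problems.append(f"link {link['link_id']}: VLAN {vlan} not marked used")

    for vm in slice_rec.get("vms", []):
        for iface in vm["interfaces"]:
            if iface["link_id"] is None:
                continue  # not attached to a link (eth0 in the sample)
            link = links.get(iface["link_id"])
            if link is None:
                problems.append(f"{vm['name']}/{iface['name']}: unknown link")
            elif link["vlan_id"] != iface["vlan_id"]:
                problems.append(f"{vm['name']}/{iface['name']}: VLAN mismatch")
    return problems


# Subset of the sample record shown above
sample = {
    "vlan_pool_start": 100,
    "vlan_pool_end": 119,
    "vlan_pool_used": [100],
    "vms": [{
        "name": "web",
        "interfaces": [
            {"name": "eth0", "vlan_id": 400, "link_id": None},
            {"name": "eth1", "vlan_id": 100, "link_id": 1},
        ],
    }],
    "links": [{"link_id": 1, "vlan_id": 100}],
}
```

Calling check_slice(sample) returns an empty list for this record; a check like this could be run after a manual edit or restore, before the orchestrator is started again.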

Startup backup

When the orchestrator starts, main.py copies the current database to a backup file before loading it:

import shutil

db_path = "database.yaml"
db_backup = "database.yaml.backup"

# Snapshot the on-disk state before it is loaded
shutil.copy(db_path, db_backup)

This gives you a single-level rollback point that reflects the on-disk state at the most recent startup, whether or not the previous run shut down cleanly.
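Note that shutil.copy raises FileNotFoundError when the source file does not exist, which would be the case on a very first run. Whether main.py guards against this is not stated, so the following is only a sketch of one way to do it. The function name backup_database is hypothetical; the default paths are the ones shown above.

```python
import shutil
from pathlib import Path


def backup_database(db_path="database.yaml", db_backup="database.yaml.backup"):
    """Copy the database to its backup path; return False if there is nothing to copy."""
    source = Path(db_path)
    if not source.is_file():
        return False  # first run: no database yet, so no backup either
    # copy2 also preserves the modification time of the source file
    shutil.copy2(source, db_backup)
    return True
```

Preserving the modification time makes it easy to see when the snapshot's contents were last saved.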

Manual recovery

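To roll back to the startup snapshot, stop the orchestrator and copy the backup over the live database. Do this before starting the orchestrator again, because every startup overwrites database.yaml.backup with the current database.yaml: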

cp database.yaml.backup database.yaml

Then restart the orchestrator. It loads the restored state, and the monitor resumes its periodic saves from there.

The backup is written only once at startup, not on every periodic save. If the orchestrator ran for a long time before crashing, database.yaml.backup may be significantly out of date. Use it as a last resort rather than a reliable restore point.
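Because the rollback discards whatever was in database.yaml, it can be useful to keep the crashed file for inspection before overwriting it. A minimal sketch, to be run only while the orchestrator is stopped; the function name restore_from_backup and the ".crashed" suffix are hypothetical and not part of the project:

```python
import shutil
from pathlib import Path


def restore_from_backup(db_path="database.yaml", db_backup="database.yaml.backup"):
    """Restore the startup snapshot, keeping the current database alongside it."""
    db, backup = Path(db_path), Path(db_backup)
    if not backup.is_file():
        raise FileNotFoundError(f"no backup found at {backup}")
    crashed = None
    if db.is_file():
        # Keep the state being replaced, e.g. database.yaml.crashed
        crashed = db.with_name(db.name + ".crashed")
        shutil.copy2(db, crashed)
    shutil.copy2(backup, db)
    return crashed
```

The return value is the path of the preserved file, or None if there was no live database to preserve.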

Logging

All components (CLI, OrchestratorAPI, DeploymentAPI, VMLauncher, QCOWManager, HealthMonitor) use Python’s logging module. The root logger is configured in cli.py:

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

Key events logged at INFO level:

Event	Sample message
Worker list loaded	(logged at startup by database load)
Slice created	`Slice 1000 created for admin (VLAN pool: 100-119)`
VM created	`VM web created (VNC: 5901, Flavor: {...}, Internet: True, Data IFs: 1)`
VLAN configuration	`Configuring VLAN 100 for Link 1`
QEMU launch	`Launching VM web on 10.0.10.1: sudo qemu-system-x86_64 ...`
VM started	`VM 1001 started with PID 12345`
dnsmasq / DHCP	(logged by VLANManager when configuring DHCP for a VLAN)
Health monitor	`Health monitor started (15s interval)`
Monitor save error	`Monitor error: <exception message>`

To see detailed QEMU command lines, TAP interface operations, SSH output, and other low-level events, change the basicConfig level to logging.DEBUG in cli.py line 9:

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(message)s")
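If editing cli.py for every debugging session is inconvenient, the level could instead be read from the environment. This is a sketch only: the variable name ORCH_LOG_LEVEL and the helper resolve_log_level are hypothetical and do not exist in the project.

```python
import logging
import os

# Level names accepted from the environment (case-insensitive)
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(name, default=logging.INFO):
    """Map a level name to its numeric value, falling back to `default`."""
    return LEVELS.get(str(name).strip().upper(), default)


logging.basicConfig(
    level=resolve_log_level(os.environ.get("ORCH_LOG_LEVEL", "INFO")),
    format="%(asctime)s - %(levelname)s - %(message)s",
)
```

Unknown or missing values fall back to INFO, which matches the default configuration shown above.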
