Worker nodes, discovery, and round-robin scheduling

Worker nodes are SSH-accessible machines running QEMU/KVM and Open vSwitch where virtual machines are actually launched. The orchestrator maintains a registry of worker specs collected at startup and uses a round-robin index to spread VMs evenly across all available workers. The compute and network roles can overlap: 10.0.10.3 runs VMs and also acts as the dedicated network node for VLAN gateway and DHCP operations.

Worker discovery

At startup, WorkerDiscovery.discover_all() iterates the workers_list from database.yaml and SSHs to each node to collect hardware specs:

Metric	Command	Field
CPU cores	`nproc`	`max_cores`
RAM (GB)	`free -g`	`max_ram_gb`
Disk space	`df /tmp`	`max_disk_gb`

ssh <user>@<worker-ip> nproc
ssh <user>@<worker-ip> 'free -g | grep Mem | awk "{print \$2}"'
ssh <user>@<worker-ip> 'df /tmp -B G | tail -1 | awk "{print \$2}" | tr -d G'
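The three probes above could be wrapped in a small helper along the following lines. This is a minimal sketch, not the actual WorkerDiscovery code: `ssh_run`, `probe_worker`, the injectable `run` callable, the `user` argument, and the `BatchMode` option are all assumptions made for illustration.

```python
import subprocess

# Remote commands taken from the probes above; keys match the spec fields.
PROBES = {
    "max_cores": "nproc",
    "max_ram_gb": "free -g | grep Mem | awk '{print $2}'",
    "max_disk_gb": "df /tmp -B G | tail -1 | awk '{print $2}' | tr -d G",
}


def ssh_run(user, ip, command, timeout=10):
    """Run one command over SSH and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", f"{user}@{ip}", command],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"{ip}: exit code {result.returncode}")
    return result.stdout


def probe_worker(ip, user, run=ssh_run):
    """Return a spec dict for one worker, or None if it cannot be probed."""
    specs = {"ip": ip}
    try:
        for field, command in PROBES.items():
            specs[field] = int(run(user, ip, command).strip())
    except (RuntimeError, ValueError, OSError, subprocess.TimeoutExpired):
        return None  # unreachable or unparsable output: skip this worker
    return specs
```

Passing `run` as a parameter keeps the parsing logic testable without a live SSH connection; a caller would collect the non-None results into the registry.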

Workers that are unreachable (SSH timeout or non-zero exit) are silently skipped: they do not appear in the workers registry and receive no VMs during that session. The default worker list in database.yaml is:

workers_list: ["10.0.10.1", "10.0.10.2", "10.0.10.3"]

Worker specs schema

After discovery, each worker is stored under the workers key in database.yaml:

workers:
  10.0.10.1:
    ip: 10.0.10.1
    max_vms: 10
    max_cores: 2
    max_ram_gb: 1
    max_disk_gb: 500
    used_cores: 0
    used_ram_gb: 0
    used_disk_gb: 0

All three workers share the same schema. The used_* fields start at 0 after each discovery run.
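As an illustration of the schema, the sketch below builds a registry entry from probed values and zeroes the used_* counters. The function names are hypothetical, and the `max_vms=10` default simply mirrors the example above; where the real system takes that value from is not stated here.

```python
def make_worker_entry(ip, max_cores, max_ram_gb, max_disk_gb, max_vms=10):
    """Build one registry entry; usage counters always start at 0."""
    return {
        "ip": ip,
        "max_vms": max_vms,  # assumed default, copied from the example entry
        "max_cores": max_cores,
        "max_ram_gb": max_ram_gb,
        "max_disk_gb": max_disk_gb,
        "used_cores": 0,
        "used_ram_gb": 0,
        "used_disk_gb": 0,
    }


def build_registry(probed_specs):
    """Key each entry by worker IP, matching the layout of the workers key."""
    return {spec["ip"]: make_worker_entry(**spec) for spec in probed_specs}


registry = build_registry([
    {"ip": "10.0.10.1", "max_cores": 2, "max_ram_gb": 1, "max_disk_gb": 500},
])
```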

Round-robin scheduling

OrchestratorAPI.get_next_worker() selects the target worker for each new VM using a simple round-robin index:

def get_next_worker(self):
    worker = self.workers[self.round_robin_idx % len(self.workers)]
    self.round_robin_idx += 1
    return worker

self.workers is loaded from workers_list in the database at initialization. With three workers and sequential VM additions, the assignment pattern is 10.0.10.1 → 10.0.10.2 → 10.0.10.3 → 10.0.10.1, and so on. The index is held only in memory and is not persisted between sessions, so it resets to 0 on each restart.
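The wrap-around and the reset on restart can be seen in a small standalone demo. The `RoundRobinDemo` class is a hypothetical stand-in that copies only the selection logic shown above; it is not the OrchestratorAPI class itself.

```python
class RoundRobinDemo:
    """Stand-in holding just the worker list and the in-memory index."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.round_robin_idx = 0  # not persisted, so every new instance starts at 0

    def get_next_worker(self):
        worker = self.workers[self.round_robin_idx % len(self.workers)]
        self.round_robin_idx += 1
        return worker


workers = ["10.0.10.1", "10.0.10.2", "10.0.10.3"]

scheduler = RoundRobinDemo(workers)
assignments = [scheduler.get_next_worker() for _ in range(4)]
print(assignments)
# ['10.0.10.1', '10.0.10.2', '10.0.10.3', '10.0.10.1']

# A new instance models a restart: scheduling begins at the first worker again.
restarted = RoundRobinDemo(workers)
first_after_restart = restarted.get_next_worker()
```

Note that the selection looks only at the index, not at the used_* counters, so a worker is chosen in turn regardless of how loaded it is.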

Network node

Worker 10.0.10.3 serves a dual role: it accepts QEMU VMs from the round-robin scheduler just like the other workers, and it is also the network node targeted by VLANManager for all OVS gateway and DHCP namespace operations. This means:

All gw_vlan{id} OVS ports are created on 10.0.10.3.
All ns-dhcp-vlan{id} network namespaces and dnsmasq processes run on 10.0.10.3.
IP forwarding and MASQUERADE rules for VLAN 400 internet access are applied on 10.0.10.3.

The network node IP is hardcoded as the default in VLANManager:

class VLANManager:
    def __init__(self, remote_executor, network_node_ip="10.0.10.3"):
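Because the address is only a constructor default, a caller could in principle point the manager at a different node. The sketch below uses a stand-in class with the same signature; the attribute names and the two naming helpers are hypothetical, and the name patterns are taken from the list above.

```python
class VLANManagerSketch:
    """Stand-in mirroring the constructor signature shown above."""

    def __init__(self, remote_executor, network_node_ip="10.0.10.3"):
        self.remote_executor = remote_executor  # assumed attribute name
        self.network_node_ip = network_node_ip  # assumed attribute name

    def gateway_port(self, vlan_id):
        """OVS gateway port name for a VLAN (hypothetical helper)."""
        return f"gw_vlan{vlan_id}"

    def dhcp_namespace(self, vlan_id):
        """DHCP namespace name for a VLAN (hypothetical helper)."""
        return f"ns-dhcp-vlan{vlan_id}"


default_mgr = VLANManagerSketch(remote_executor=None)
custom_mgr = VLANManagerSketch(remote_executor=None, network_node_ip="10.0.10.2")
```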

Resource accounting is approximate. The system tracks used_cores, used_ram_gb, and used_disk_gb per worker in database.yaml. WorkerDiscovery sets these counters to 0 at startup and they are incremented as VMs are added, but they are not decremented when a VM or slice is deleted. After several create/delete cycles the counters will overstate actual usage. Restart the manager to reset them via a fresh discovery run.
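The drift can be reproduced with a toy model. This is a sketch of the described behavior only: `add_vm` and `delete_vm` are hypothetical names, and the per-VM sizes are made up for the demo.

```python
def add_vm(worker, cores, ram_gb, disk_gb):
    """Adding a VM increments the usage counters."""
    worker["used_cores"] += cores
    worker["used_ram_gb"] += ram_gb
    worker["used_disk_gb"] += disk_gb


def delete_vm(worker, cores, ram_gb, disk_gb):
    """Deleting a VM leaves the counters untouched, as described above."""
    return None


worker = {"used_cores": 0, "used_ram_gb": 0, "used_disk_gb": 0}
running_vms = 0

# Three create/delete cycles with an illustrative 1-core, 1 GB, 10 GB VM.
for _ in range(3):
    add_vm(worker, cores=1, ram_gb=1, disk_gb=10)
    running_vms += 1
    delete_vm(worker, cores=1, ram_gb=1, disk_gb=10)
    running_vms -= 1

print(running_vms, worker)
```

After the loop no VMs are running, yet the counters still report three cores, 3 GB of RAM, and 30 GB of disk in use.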

To add or remove workers, edit the workers_list key in database.yaml before starting the manager. Workers added to the list are probed during discover_all() at the next startup and, if reachable, begin receiving VMs immediately. Removing a worker from the list prevents new VMs from being scheduled to it but does not affect VMs already running on that host.
