Cloud Repositorio: architecture and core components

Cloud Repositorio follows a centralized orchestrator pattern: a single Python process coordinates VM lifecycle and networking across a cluster of SSH-accessible worker nodes. All control-plane logic (scheduling, VLAN allocation, state persistence) runs on the orchestrator machine. Worker nodes are treated as dumb hypervisors: they execute commands sent over SSH, and the orchestrator queries their resource specs at startup.

Components

OrchestratorAPI

The top-level control plane for slice, VM, and link lifecycle. Accepts calls from the CLI and dispatches work to DeploymentAPI, VMLauncher, and VLANManager. Implements round-robin scheduling to distribute VMs across available workers.

DeploymentAPI

Handles per-VM provisioning: copies the base QCOW2 image to the target worker via SSH, creates the VM model object with assigned interfaces and VNC port, and records the image path for later cleanup.

VLANManager

Configures OVS on the network node (10.0.10.3). Creates VLAN gateway ports, starts dnsmasq in a dedicated network namespace (ns-dhcp-vlanN) for DHCP, and installs MASQUERADE iptables rules for outbound internet access.

VMLauncher

Builds and runs the qemu-system-x86_64 command line on the target worker over SSH. Creates TAP interfaces, attaches them to the OVS bridge with the correct VLAN tag, and daemonizes the process. Returns the QEMU PID on success.

RemoteExecutor

Thin wrapper around subprocess that executes shell commands on remote hosts via SSH. execute_direct() uses -o StrictHostKeyChecking=no; the SIGINT cleanup in main.py uses -o BatchMode=yes. Used by every component that interacts with worker nodes or the network node.
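
As an illustration of the wrapper described above, here is a minimal sketch of an SSH command builder plus a subprocess call. The function names build_ssh_command and execute, the options parameter, and the timeout value are hypothetical and are not taken from the project source; only the two -o options come from this page.

```python
import subprocess
from typing import List, Sequence, Tuple


def build_ssh_command(
    host: str,
    command: str,
    options: Sequence[str] = ("StrictHostKeyChecking=no",),
) -> List[str]:
    """Return the argv that runs `command` on `host` over SSH.

    Each entry in `options` is passed as `-o <entry>`, for example
    "StrictHostKeyChecking=no" or "BatchMode=yes".
    """
    argv = ["ssh"]
    for option in options:
        argv += ["-o", option]
    argv += [host, command]
    return argv


def execute(host: str, command: str, timeout: float = 30.0) -> Tuple[int, str, str]:
    """Run `command` on `host` and return (exit code, stdout, stderr)."""
    proc = subprocess.run(
        build_ssh_command(host, command),
        capture_output=True,
        text=True,
        timeout=timeout,  # assumed value; avoids hanging on a dead node
    )
    return proc.returncode, proc.stdout, proc.stderr
```

Passing the argv as a list avoids a local shell, so only the remote shell interprets the command string.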

Database

Thread-safe YAML state store backed by database.yaml. Holds users, slices, worker specs, and auto-incrementing ID counters. Writes are serialized with threading.RLock. The HealthMonitor flushes state to disk every 15 seconds.

Request flow

The following steps describe what happens when OrchestratorAPI.deploy_slice() is called.

Configure per-link VLANs

For each Link in the slice, VLANManager.create_vlan_with_gateway() runs on the network node (10.0.10.3):

Adds an OVS internal port to br-int tagged with the link’s VLAN ID.
Assigns a gateway IP whose third octet is the VLAN ID modulo 256 (e.g., VLAN 100 → 192.168.100.1/24).
Creates a network namespace named ns-dhcp-vlanN.
Starts dnsmasq inside the namespace to serve DHCP leases to VMs on that VLAN.

vlan_id = link.get("vlan_id")
cidr = f"192.168.{vlan_id % 256}.0/24"
gateway_ip = f"192.168.{vlan_id % 256}.1"
self.vlan_manager.create_vlan_with_gateway(vlan_id, cidr, gateway_ip, dhcp_enabled=True)
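
The address derivation in the excerpt above can be isolated and checked on its own. The sketch below is illustrative only: derive_vlan_addressing and gateway_commands are hypothetical helpers, the gw_vlanN port name is assumed from the cleanup patterns listed later on this page, and the dnsmasq startup is left out because its exact arguments are not documented here.

```python
import ipaddress
from typing import List, Tuple


def derive_vlan_addressing(vlan_id: int) -> Tuple[str, str]:
    """Map a VLAN ID to (cidr, gateway_ip) using the 192.168.<id % 256>.0/24 rule."""
    octet = vlan_id % 256
    network = ipaddress.ip_network(f"192.168.{octet}.0/24")
    gateway = network.network_address + 1  # first host address, e.g. .1
    return str(network), str(gateway)


def gateway_commands(vlan_id: int, bridge: str = "br-int") -> List[str]:
    """Shell commands that could create the gateway port and DHCP namespace."""
    cidr, gateway_ip = derive_vlan_addressing(vlan_id)
    prefix_len = cidr.split("/")[1]
    port = f"gw_vlan{vlan_id}"        # assumed port naming
    namespace = f"ns-dhcp-vlan{vlan_id}"
    return [
        # Internal OVS port tagged with the link's VLAN ID.
        f"ovs-vsctl --may-exist add-port {bridge} {port} tag={vlan_id} "
        f"-- set interface {port} type=internal",
        f"ip addr add {gateway_ip}/{prefix_len} dev {port}",
        f"ip link set {port} up",
        f"ip netns add {namespace}",
    ]
```

Because of the modulo, VLAN IDs that differ by 256 (for example 100 and 356) map to the same subnet, so two such VLANs would overlap if they were ever allocated at the same time.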

Configure internet access (VLAN 400)

If any VM in the slice has internet access enabled, VLANManager configures VLAN 400 on the network node:

Creates an OVS gateway port for 10.60.7.0/24 with gateway 10.60.7.1.
Starts a dnsmasq DHCP namespace for VLAN 400.
Installs an iptables MASQUERADE rule so that traffic from 10.60.7.0/24 is SNATed through the network node’s uplink interface.

VMs that have internet enabled are connected to VLAN 400 via their eth0 management interface.
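
A rough sketch of the NAT step follows, assuming the uplink interface name is supplied by the caller (this page does not name it). masquerade_commands is a hypothetical helper. The check-before-append idiom and the ip_forward sysctl are additions for illustration and may not match what VLANManager actually runs.

```python
from typing import List

INTERNET_CIDR = "10.60.7.0/24"  # VLAN 400 subnet from this page


def masquerade_commands(uplink: str, cidr: str = INTERNET_CIDR) -> List[str]:
    """Shell commands that enable forwarding and add one MASQUERADE rule.

    `iptables -C` checks whether the rule exists, so running the
    commands again does not append a duplicate rule.
    """
    rule = f"POSTROUTING -s {cidr} -o {uplink} -j MASQUERADE"
    return [
        "sysctl -w net.ipv4.ip_forward=1",  # assumed prerequisite for routing
        f"iptables -t nat -C {rule} || iptables -t nat -A {rule}",
    ]
```

For example, masquerade_commands("eth0") yields a rule that rewrites the source address of traffic from 10.60.7.0/24 leaving through eth0, where eth0 is only a placeholder for the real uplink.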

Launch VMs

For each VM in the slice, VMLauncher.launch_vm() runs on the assigned worker node:

Creates one TAP interface per VM network interface.
Attaches each TAP to the OVS bridge br-int with the interface’s VLAN tag using ovs-vsctl set port.
Runs qemu-system-x86_64 with the provisioned QCOW2 image, KVM acceleration, the configured RAM and CPU count, and a VNC server bound to 0.0.0.0:<vnc_port>.
The QEMU process is daemonized (-daemonize) so it survives the SSH session.
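
The launch steps above can be sketched as two command builders. Everything named here (Nic, VmSpec, tap_commands, qemu_command, the virtio device models, and the use of -pidfile to recover the PID) is an assumption made for illustration; the page only states that TAPs are attached with a VLAN tag and that QEMU runs daemonized with KVM, RAM, CPU, and VNC settings.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class Nic:
    tap: str       # TAP device name on the worker, e.g. "tap0"
    vlan_id: int   # VLAN tag applied on the OVS port


@dataclass
class VmSpec:
    name: str
    image_path: str
    ram_mb: int
    cpus: int
    vnc_display: int  # QEMU display number; TCP port is 5900 + display
    nics: List[Nic] = field(default_factory=list)


def tap_commands(nic: Nic, bridge: str = "br-int") -> List[str]:
    """Commands that create one TAP and attach it to the bridge with a tag."""
    return [
        f"ip tuntap add dev {nic.tap} mode tap",
        f"ip link set {nic.tap} up",
        f"ovs-vsctl --may-exist add-port {bridge} {nic.tap}",
        f"ovs-vsctl set port {nic.tap} tag={nic.vlan_id}",
    ]


def qemu_command(vm: VmSpec, pidfile: str) -> str:
    """Build a daemonized qemu-system-x86_64 command line as one string."""
    argv = [
        "qemu-system-x86_64", "-enable-kvm",
        "-name", vm.name,
        "-m", str(vm.ram_mb),
        "-smp", str(vm.cpus),
        "-drive", f"file={vm.image_path},format=qcow2,if=virtio",
        "-vnc", f"0.0.0.0:{vm.vnc_display}",
        "-daemonize", "-pidfile", pidfile,  # pidfile lets the caller read the PID
    ]
    for index, nic in enumerate(vm.nics):
        argv += [
            "-netdev", f"tap,id=net{index},ifname={nic.tap},script=no,downscript=no",
            "-device", f"virtio-net-pci,netdev=net{index}",
        ]
    return shlex.join(argv)
```

Note that QEMU's -vnc option takes a display number rather than a raw TCP port, so a display of 1 listens on port 5901.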

Update database state

After each VM starts successfully, deploy_slice() updates the in-memory database:

Sets vm["status"] = "running" and vm["pid"] = <qemu_pid> for each VM.
Sets slice_data["status"] = "running" for the slice.
Calls db.update_slice(), which acquires the threading.RLock and writes the updated record to the in-memory state. The change reaches database.yaml the next time the HealthMonitor flushes, which happens within 15 seconds.
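
The update step can be sketched with a small in-memory store. SliceStore and mark_running are hypothetical names, and the record layout (a "vms" list whose entries carry an "id") is assumed for the example rather than taken from the project.

```python
import threading
from typing import Any, Dict


class SliceStore:
    """In-memory slice map guarded by a re-entrant lock."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._slices: Dict[int, Dict[str, Any]] = {}

    def update_slice(self, slice_id: int, slice_data: Dict[str, Any]) -> None:
        with self._lock:
            self._slices[slice_id] = slice_data

    def get_slice(self, slice_id: int) -> Dict[str, Any]:
        with self._lock:
            return self._slices[slice_id]


def mark_running(
    store: SliceStore,
    slice_id: int,
    slice_data: Dict[str, Any],
    pids: Dict[int, int],
) -> None:
    """Record each VM's QEMU PID, then mark the slice as running."""
    for vm in slice_data["vms"]:
        vm["status"] = "running"
        vm["pid"] = pids[vm["id"]]
    slice_data["status"] = "running"
    store.update_slice(slice_id, slice_data)
```

An RLock, unlike a plain Lock, can be re-acquired by the thread that already holds it, which matters if one locked method calls another.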

State model

All persistent state lives in a single YAML file, database.yaml, with the following top-level keys:

Key	Description
`users`	Map of username → user record (password hash, quota, slice list).
`workers`	Map of worker IP → resource specs (cores, RAM, disk, used amounts).
`workers_list`	Ordered list of worker IPs used for round-robin scheduling.
`slices`	Map of slice ID → full slice record (VMs, links, VLAN pool, status).
`next_vm_id`	Auto-incrementing integer used for both VM IDs and slice IDs (starts at 1000).
`next_vlan_id`	Auto-incrementing integer for globally unique VLAN allocation (starts at 100).

Writes are serialized using threading.RLock to prevent concurrent modification from the HealthMonitor background thread and the CLI foreground thread. On startup, main.py copies database.yaml to database.yaml.backup before any writes occur. The HealthMonitor saves the in-memory state back to disk every 15 seconds while the orchestrator is running.
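
The backup-then-periodic-flush behavior could look roughly like the sketch below. backup_database and PeriodicFlusher are hypothetical names, the serializer is injected so the example does not depend on a YAML library, and the write-to-temp-then-rename step is an added safeguard that the project may not use.

```python
import shutil
import threading
from pathlib import Path
from typing import Callable


def backup_database(path: Path) -> Path:
    """Copy the state file to <name>.backup before any writes occur."""
    backup = path.with_name(path.name + ".backup")
    if path.exists():
        shutil.copy2(path, backup)
    return backup


class PeriodicFlusher:
    """Writes the shared state to disk every `interval` seconds."""

    def __init__(self, state: dict, lock, path: Path,
                 serialize: Callable[[dict], str], interval: float = 15.0) -> None:
        self._state = state
        self._lock = lock              # the same RLock that guards writers
        self._path = path
        self._serialize = serialize
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def flush(self) -> None:
        with self._lock:               # take a consistent snapshot
            text = self._serialize(self._state)
        tmp = self._path.with_name(self._path.name + ".tmp")
        tmp.write_text(text)
        tmp.replace(self._path)        # rename over the old file

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop_event.wait(self._interval):
            self.flush()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread.is_alive():
            self._thread.join()
```

With PyYAML installed, yaml.safe_dump can be passed as the serialize argument to produce YAML output.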

Worker topology

The default cluster consists of three nodes:

Node	Role
`10.0.10.1`	Compute (VM hosting)
`10.0.10.2`	Compute (VM hosting)
`10.0.10.3`	Compute + network node (OVS)

At startup, WorkerDiscovery.discover_all() connects to each node over SSH and runs three commands to populate resource specs:

nproc                                              # CPU core count
free -g | grep Mem | awk '{print $2}'              # Total RAM in GB
df /tmp -BG | tail -1 | awk '{print $2}' | tr -d G # Total size in GB of the filesystem holding /tmp
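
A minimal sketch of turning the three command outputs into a resource spec is shown here. WorkerSpec, DEFAULT_SPEC, and parse_worker_spec are hypothetical names; the fallback values are the defaults this page gives for unreachable nodes, and applying them per field is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkerSpec:
    cores: int
    ram_gb: int
    disk_gb: int


# Fallback values documented for nodes that cannot be probed.
DEFAULT_SPEC = WorkerSpec(cores=2, ram_gb=1, disk_gb=500)


def _to_int(text: Optional[str], default: int) -> int:
    """Parse command output as an integer, or return `default` on failure."""
    try:
        return int(text.strip())
    except (AttributeError, ValueError):  # None or non-numeric output
        return default


def parse_worker_spec(
    nproc_out: Optional[str],
    mem_out: Optional[str],
    disk_out: Optional[str],
) -> WorkerSpec:
    """Build a WorkerSpec from the stdout of the three discovery commands."""
    return WorkerSpec(
        cores=_to_int(nproc_out, DEFAULT_SPEC.cores),
        ram_gb=_to_int(mem_out, DEFAULT_SPEC.ram_gb),
        disk_gb=_to_int(disk_out, DEFAULT_SPEC.disk_gb),
    )
```

Passing None for an output models a command that never ran because the SSH connection failed.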

The results are stored in database.yaml under the workers key and used for capacity tracking. If a node is unreachable, WorkerDiscovery logs an error and skips probing that node; default values (2 cores, 1 GB RAM, 500 GB disk) are used as a fallback.

VMs are distributed across all nodes in workers_list using round-robin. Because the list includes 10.0.10.3, the network node may also host VMs. The round_robin_idx counter is kept in memory on OrchestratorAPI and is not persisted, so the rotation position is lost when the orchestrator restarts.
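
Round-robin placement over workers_list can be sketched as follows. RoundRobinScheduler and next_worker are hypothetical names; only the round_robin_idx counter and the node addresses come from this page.

```python
from typing import List, Sequence


class RoundRobinScheduler:
    """Hands out workers in list order, wrapping around at the end."""

    def __init__(self, workers: Sequence[str]) -> None:
        if not workers:
            raise ValueError("workers_list must not be empty")
        self._workers: List[str] = list(workers)
        self.round_robin_idx = 0  # in memory only, not persisted

    def next_worker(self) -> str:
        worker = self._workers[self.round_robin_idx % len(self._workers)]
        self.round_robin_idx += 1
        return worker


scheduler = RoundRobinScheduler(["10.0.10.1", "10.0.10.2", "10.0.10.3"])
placements = [scheduler.next_worker() for _ in range(4)]
# The fourth VM wraps around to the first worker.
```

This form of round-robin ignores remaining capacity, so a worker that is already full still receives its turn unless a separate capacity check rejects it.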

The orchestrator process must remain running for slice cleanup to work correctly. Pressing Ctrl+C triggers the SIGINT handler in main.py, which:

Calls orchestrator.delete_slice() on the active slice. This stops all QEMU processes on their respective workers via pkill -9 qemu-system-x86_64 over SSH and removes VLAN configurations on the network node.
Runs a broad pkill -9 qemu-system-x86_64 on every node in workers_list as a safety net.
Deletes all ns-dhcp-vlan* network namespaces and removes gw_vlan*/dhcp_v* OVS ports from br-int on 10.0.10.3.
Removes database.yaml.backup and any local QCOW2/ISO files.
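
The cluster-wide part of the shutdown sequence above can be sketched like this. cleanup_cluster and install_sigint_handler are hypothetical names, the run callable stands in for whatever SSH helper the orchestrator uses, and listing namespaces and ports before deleting them is one possible approach rather than the project's confirmed method. Slice deletion and local file removal are omitted.

```python
import signal
import sys
from typing import Callable, Iterable

# (host, command) -> stdout of the command; an assumed interface.
Run = Callable[[str, str], str]


def cleanup_cluster(
    run: Run,
    workers: Iterable[str],
    network_node: str = "10.0.10.3",
    bridge: str = "br-int",
) -> None:
    """Kill stray QEMU processes and remove DHCP namespaces and OVS ports."""
    for host in workers:
        run(host, "pkill -9 qemu-system-x86_64")

    # `ip netns list` prints one namespace per line, optionally with "(id: N)".
    for line in run(network_node, "ip netns list").splitlines():
        fields = line.split()
        if fields and fields[0].startswith("ns-dhcp-vlan"):
            run(network_node, f"ip netns del {fields[0]}")

    for port in run(network_node, f"ovs-vsctl list-ports {bridge}").split():
        if port.startswith(("gw_vlan", "dhcp_v")):
            run(network_node, f"ovs-vsctl --if-exists del-port {bridge} {port}")


def install_sigint_handler(cleanup: Callable[[], None]) -> None:
    """Run `cleanup` once when Ctrl+C is pressed, then exit."""
    def handler(signum, frame):
        cleanup()
        sys.exit(0)

    signal.signal(signal.SIGINT, handler)
```

A handler registered this way never runs for SIGKILL, because that signal cannot be caught.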

If the orchestrator process is killed with SIGKILL or crashes ungracefully, QEMU processes and VLAN namespaces on the worker nodes will not be cleaned up automatically. You will need to run the cleanup steps manually or restart the orchestrator and delete the slice through the CLI.
