Pause and Resume Kubernetes Sandboxes with Snapshots

OpenSandbox supports pausing and resuming Kubernetes-backed sandboxes without losing filesystem state. When you pause a sandbox, the controller commits its root filesystem as an OCI image to a configured registry, then releases the underlying cluster resources (Pods and pooled allocations). When you resume, the same sandbox ID is reused — the controller rewrites the pod template to the latest snapshot image and recreates the runtime. This lets you free cluster capacity between agent tasks while retaining the exact filesystem state produced by prior work.

Pause and resume is a Kubernetes-only feature. It requires the OpenSandbox controller to be deployed and a reachable OCI registry for storing snapshots. The feature is not available in Docker (single-host) mode.

Overview

Aspect	Behavior
Pause	Creates an internal `SandboxSnapshot`, commits the running container root filesystem as an OCI image, then quiesces the sandbox runtime and releases Pods and pooled allocations
Resume	Reuses the same `BatchSandbox`, rewrites its template to the latest snapshot image, and recreates the runtime from that image
Sandbox ID	Stable across pause/resume cycles — callers use the same ID throughout the sandbox lifetime
Replica support	Currently limited to `BatchSandbox.spec.replicas=1`

Sandbox Lifecycle States

The sandbox transitions through both stable and intermediate states during pause and resume:

State	Type	Description
`Running`	Stable	Sandbox is active and processing requests
`Pausing`	Intermediate	Pause in progress — snapshot commit is coordinated through an internal `SandboxSnapshot` resource
`Paused`	Stable	Sandbox is paused, the latest rootfs snapshot is ready, and runtime Pods and pooled allocations have been released
`Resuming`	Intermediate	Resume in progress — the controller is rewriting the sandbox template to the latest snapshot image and recreating the runtime
`Failed`	Stable	Operation failed — check `reason` and `message` for details
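The states above can be read as a small state machine. The following is a minimal sketch only: the state names come from the table, but the transition map is an assumption inferred from the descriptions (in particular, treating `Failed` as terminal is a guess, not documented controller behavior), and `SandboxState`, `can_transition`, and `is_stable` are illustrative names rather than SDK types.

```python
from enum import Enum


class SandboxState(Enum):
    RUNNING = "Running"
    PAUSING = "Pausing"
    PAUSED = "Paused"
    RESUMING = "Resuming"
    FAILED = "Failed"


# Stable states as listed in the table; the rest are intermediate.
STABLE = {SandboxState.RUNNING, SandboxState.PAUSED, SandboxState.FAILED}

# Assumed transitions, inferred from the state descriptions.
# This is not a controller specification.
TRANSITIONS = {
    SandboxState.RUNNING: {SandboxState.PAUSING},
    SandboxState.PAUSING: {SandboxState.PAUSED, SandboxState.FAILED},
    SandboxState.PAUSED: {SandboxState.RESUMING},
    SandboxState.RESUMING: {SandboxState.RUNNING, SandboxState.FAILED},
    SandboxState.FAILED: set(),  # assumption: no automatic recovery
}


def can_transition(src: SandboxState, dst: SandboxState) -> bool:
    """Return True if dst is a direct successor of src in the assumed map."""
    return dst in TRANSITIONS[src]


def is_stable(state: SandboxState) -> bool:
    """Return True for states the sandbox can rest in."""
    return state in STABLE
```

A client that polls sandbox status could use `is_stable` to decide when to stop waiting, and `can_transition` to flag unexpected status changes in logs.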

SandboxSnapshot Internal States

For detailed progress tracking during a pause, inspect the internal SandboxSnapshot resource:

Phase	Description
`Pending`	Snapshot request accepted; waiting to resolve source Pod or create commit Job
`Committing`	Commit Job is running and pushing snapshot images to the registry
`Succeed`	Snapshot is ready and will be used for the next resume
`Failed`	Snapshot creation failed
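To track a pause until the snapshot settles, a caller can poll the phase until it reaches `Succeed` or `Failed`. The sketch below is deliberately independent of any Kubernetes client: `get_phase` is a hypothetical callable that you would implement yourself to read the phase of the `SandboxSnapshot` resource, and the timeout and interval values are illustrative defaults.

```python
import time
from typing import Callable

# Phases after which the snapshot will not change again (from the table above).
TERMINAL_PHASES = {"Succeed", "Failed"}


def wait_for_snapshot(
    get_phase: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_phase() until a terminal phase is seen or the timeout expires.

    get_phase is supplied by the caller (hypothetical); it should return one of
    "Pending", "Committing", "Succeed", or "Failed".
    """
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"snapshot still in phase {phase!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the helper easy to test without real delays.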

What Is Preserved

Item	Preserved?
Root filesystem contents	✅ Yes — committed as OCI image
Environment variables	✅ Yes — from `BatchSandbox` template
Running processes / memory	❌ No — process state is not checkpointed
Explicit volume mounts	Depends on volume type

Key Design Principle

Controller-level configuration: the registry URL and the push/pull secrets are configured on the Kubernetes controller manager, not in ~/.sandbox.toml. SDK users and API callers need no code changes to use pause and resume; they call pause() and resume() for the existing sandbox ID.

Pause and resume is currently limited to BatchSandbox.spec.replicas=1. Server-created Kubernetes sandboxes use replicas: 1 by default. If you create BatchSandbox CRs directly with a different replica count, the controller will reject the pause request.

Prerequisites

Before using pause and resume, ensure the following are in place:

Kubernetes Runtime

Your OpenSandbox server must be running in Kubernetes mode with the controller deployed to the cluster.

OCI Registry

An OCI-compatible registry (Docker Hub, GHCR, Harbor, or a private registry:2 instance) must be accessible from cluster nodes for push and from the kubelet for pull on resume.

Registry Secrets

Kubernetes Secrets of type kubernetes.io/dockerconfigjson must exist in the sandbox namespace for both push (commit Job) and pull (resumed Pod).
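As an illustration of what such a Secret contains, the sketch below builds a `kubernetes.io/dockerconfigjson` manifest using only the Python standard library. The Secret name matches the example flag value used later on this page; the namespace `sandboxes`, the registry host, and the credentials are placeholders, and `docker_config_secret` is an illustrative helper, not part of OpenSandbox.

```python
import base64
import json


def docker_config_secret(name, namespace, registry, username, password):
    """Build a Secret manifest of type kubernetes.io/dockerconfigjson as a dict."""
    # The "auth" field is base64("username:password"), as in a Docker config.json.
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    docker_config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        # Secret data values must be base64-encoded.
        "data": {
            ".dockerconfigjson": base64.b64encode(
                json.dumps(docker_config).encode()
            ).decode()
        },
    }


# Placeholder values for illustration only.
manifest = docker_config_secret(
    "registry-snapshot-push-secret", "sandboxes",
    "registry.example.com", "robot", "s3cret",
)
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON manifests, the printed output can be piped to `kubectl apply -f -`. In practice, avoid committing real credentials to source control.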

Controller Configured

The controller manager must be started with the --snapshot-registry flag pointing to your registry, and with --snapshot-push-secret and --resume-pull-secret naming the corresponding registry Secrets.

Controller Configuration Reference

Configure the controller manager deployment with snapshot flags:

- --snapshot-registry=registry.example.com/sandboxes
- --snapshot-registry-insecure=false
- --snapshot-push-secret=registry-snapshot-push-secret
- --resume-pull-secret=registry-pull-secret

Flag	Default	Description
`--snapshot-registry`	`""`	Required. OCI registry prefix. Images are stored as `<registry>/<sandboxName>-<container>:snap-gen<N>`.
`--snapshot-registry-insecure`	`false`	Enables insecure registry mode for snapshot push. Use only for HTTP or self-signed local registries.
`--snapshot-push-secret`	`""`	Kubernetes Secret name for pushing snapshots. Must be `kubernetes.io/dockerconfigjson` type.
`--resume-pull-secret`	`""`	Kubernetes Secret name injected into resumed sandboxes for pulling snapshot images.
`--image-committer-image`	`"image-committer:dev"`	Image used by commit Jobs.
`--commit-job-timeout`	`"10m"`	Timeout for commit Jobs.
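The image naming pattern `<registry>/<sandboxName>-<container>:snap-gen<N>` from the table can be made concrete with a small helper. This is a sketch for illustration: `snapshot_image_ref` is a hypothetical function, and the sandbox name `my-sandbox` and container name `main` in the usage line are made-up values.

```python
def snapshot_image_ref(registry: str, sandbox_name: str, container: str, generation: int) -> str:
    """Format a snapshot reference as <registry>/<sandboxName>-<container>:snap-gen<N>."""
    if generation < 1:
        raise ValueError("generation must be >= 1")
    # Tolerate a trailing slash on the registry prefix.
    return f"{registry.rstrip('/')}/{sandbox_name}-{container}:snap-gen{generation}"


ref = snapshot_image_ref("registry.example.com/sandboxes", "my-sandbox", "main", 2)
print(ref)  # registry.example.com/sandboxes/my-sandbox-main:snap-gen2
```

A helper like this can be useful when locating a sandbox's snapshots in the registry for auditing or cleanup.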

Helm Chart Values

The opensandbox-controller Helm chart exposes the snapshot-related controller values directly:

controller.snapshot.imageCommitterImage
controller.snapshot.commitJobTimeout
controller.snapshot.registry
controller.snapshot.registryInsecure
controller.snapshot.snapshotPushSecret
controller.snapshot.resumePullSecret

For the all-in-one opensandbox chart, use the same values under the opensandbox-controller.* prefix.
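One way to apply these values is with Helm's `--set` flag. The sketch below generates the argument list for either chart; the value keys are the ones listed above, the example values mirror the flag examples earlier on this page, and `helm_set_args` is an illustrative helper rather than an OpenSandbox tool.

```python
# Example values, mirroring the controller flag examples on this page.
SNAPSHOT_VALUES = {
    "controller.snapshot.registry": "registry.example.com/sandboxes",
    "controller.snapshot.registryInsecure": "false",
    "controller.snapshot.snapshotPushSecret": "registry-snapshot-push-secret",
    "controller.snapshot.resumePullSecret": "registry-pull-secret",
}


def helm_set_args(values: dict, all_in_one: bool = False) -> list:
    """Return ['--set', 'key=value', ...] for the given chart values.

    With all_in_one=True, keys are nested under the opensandbox-controller.*
    prefix used by the all-in-one opensandbox chart.
    """
    prefix = "opensandbox-controller." if all_in_one else ""
    args = []
    for key, value in values.items():
        args += ["--set", f"{prefix}{key}={value}"]
    return args


print(" ".join(helm_set_args(SNAPSHOT_VALUES, all_in_one=True)))
```

The resulting arguments can be appended to a `helm install` or `helm upgrade` command for the chart you deploy; a values file with the same keys works equally well.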

Usage

Once the controller manager is configured and the server is running, pause and resume work through the standard Lifecycle API with no SDK changes required.

Create a sandbox normally

Create a sandbox using the standard API. No special parameters are needed to enable pause/resume support.

import asyncio
from opensandbox import Sandbox

async def main():
    sandbox = await Sandbox.create(
        image="opensandbox/code-interpreter:latest",
    )
    print(f"Sandbox ID: {sandbox.id}")

asyncio.run(main())

Pause the sandbox

Call pause() to commit the root filesystem as an OCI snapshot and release cluster resources. The call returns when the sandbox reaches the Paused state.

await sandbox.pause()
print("Sandbox is now paused — cluster resources released")

Resume from snapshot

Call resume() using the same sandbox ID. The controller rewrites the pod template to the latest snapshot image and recreates the runtime. The returned object has the same sandbox ID.

resumed = await sandbox.resume()
print(f"Resumed sandbox ID: {resumed.id}")  # Same ID as before

Use the sandbox normally

After resume() returns, the sandbox is in Running state with the same filesystem state from before the pause. Running processes and in-memory state are not restored.

result = await resumed.commands.run("ls /workspace")
print(result.logs.stdout)

Full SDK Examples

import asyncio
from opensandbox import Sandbox

async def main():
    # Create
    sandbox = await Sandbox.create(
        image="opensandbox/code-interpreter:latest",
    )
    sandbox_id = sandbox.id

    # Do some work
    await sandbox.commands.run("echo 'hello' > /workspace/output.txt")

    # Pause — releases cluster resources
    await sandbox.pause()
    print(f"Sandbox {sandbox_id} paused")

    # ... time passes, cluster resources are freed ...

    # Resume — restores filesystem from OCI snapshot
    resumed = await sandbox.resume()
    print(f"Sandbox {sandbox_id} resumed")

    # Filesystem state is intact
    result = await resumed.commands.run("cat /workspace/output.txt")
    print(result.logs.stdout)  # "hello"

    await resumed.kill()

asyncio.run(main())

Multiple Pause/Resume Cycles

Pause and resume can be repeated. Each pause cycle produces a new snapshot image tag (snap-gen1, snap-gen2, and so on). The controller always uses the latest snapshot for the next resume. This means you can safely run a long-lived agent workflow across many pause/resume cycles, accumulating filesystem changes across each run.
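When inspecting a registry by hand, the generation number in the tag identifies the most recent snapshot. The following sketch picks the highest generation from a list of tags; it is a client-side illustration only (how the controller itself tracks the latest snapshot is not described here), and `latest_snapshot_tag` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional

_SNAP_TAG = re.compile(r"^snap-gen(\d+)$")


def latest_snapshot_tag(tags: Iterable[str]) -> Optional[str]:
    """Return the snap-gen<N> tag with the highest N, or None if there is none."""
    best = None  # (generation, tag)
    for tag in tags:
        match = _SNAP_TAG.match(tag)
        if match is None:
            continue  # ignore tags that are not snapshots
        generation = int(match.group(1))
        if best is None or generation > best[0]:
            best = (generation, tag)
    return None if best is None else best[1]


print(latest_snapshot_tag(["snap-gen2", "snap-gen10", "latest", "snap-gen1"]))  # snap-gen10
```

Comparing generations numerically matters: a plain string sort would rank `snap-gen2` above `snap-gen10`.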
