feat(kubernetes): expose agent-sandbox operatingMode (suspend/resume) for idle scale-to-zero with PVC retention #2035

Description

@austin-shih

Problem Statement

Disposable per-user sandboxes backed by durable PVCs (as proposed in open PR
#2034) give great continuity, but for lifecycle management OpenShell on
Kubernetes exposes create/delete and no suspend/resume. There's no way to free a
sandbox's compute when idle while retaining its identity and PVC. For
deployments with many provisioned-but-often-idle users, keeping every pod
running 24/7 is a large, mostly-wasted compute cost; deleting on idle instead
loses the sandbox identity (and, without #2034, the data). We want: idle → free
compute; next login → resume with state intact.

This is the Kubernetes-driver realization of the general capability proposed in
#1823 (checkpoint/pause/resume), scoped concretely to what agent-sandbox already
implements.

Proposed Design

Surface agent-sandbox's existing suspend/resume capability through the OpenShell
Kubernetes driver and gateway/CLI — the lifecycle analog of how #2034 surfaced
pod-template/volume config through driver_config.kubernetes.

agent-sandbox already implements this in its Sandbox CRD and controller:

  • v1beta1: spec.operatingMode: Running | Suspended (default Running).
  • v1alpha1: the equivalent is spec.replicas (0 = suspended; the API
    conversion maps Suspended ↔ replicas=0).
  • On Suspended, the controller deletes the backing Pod (frees CPU/memory)
    while leaving the Sandbox object and its PVCs in place — PVCs are reconciled
    independently and removed only when the Sandbox itself is deleted. Status surfaces
    a Suspended condition (PodTerminated / PodNotTerminated).
  • On Running, the controller recreates the Pod and reattaches the same PVC(s).
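
The version mapping above can be sketched in Rust. This is a minimal illustration, not the agent-sandbox or OpenShell code: the names OperatingMode, ApiVersion, to_replicas, from_replicas and spec_patch are hypothetical, and it assumes that Running corresponds to replicas=1 in v1alpha1 (the source only states that Suspended maps to replicas=0). The patch bodies are written as JSON merge patches, which is also an assumption about how the driver would update the Sandbox CR.

```rust
/// Desired operating mode, mirroring the v1beta1 `spec.operatingMode` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperatingMode {
    Running,
    Suspended,
}

/// Sandbox CRD versions discussed above (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApiVersion {
    V1Alpha1,
    V1Beta1,
}

impl OperatingMode {
    /// v1alpha1 equivalent: replicas=0 means suspended.
    /// Assumption: Running is represented as replicas=1.
    pub fn to_replicas(self) -> u32 {
        match self {
            OperatingMode::Running => 1,
            OperatingMode::Suspended => 0,
        }
    }

    /// Inverse mapping: zero replicas is Suspended, anything else is Running.
    pub fn from_replicas(replicas: u32) -> Self {
        if replicas == 0 {
            OperatingMode::Suspended
        } else {
            OperatingMode::Running
        }
    }

    /// String form used in `spec.operatingMode`.
    pub fn as_str(self) -> &'static str {
        match self {
            OperatingMode::Running => "Running",
            OperatingMode::Suspended => "Suspended",
        }
    }
}

/// Build a JSON merge-patch body that sets the desired mode for a CRD version.
pub fn spec_patch(version: ApiVersion, mode: OperatingMode) -> String {
    match version {
        ApiVersion::V1Beta1 => {
            format!(r#"{{"spec":{{"operatingMode":"{}"}}}}"#, mode.as_str())
        }
        ApiVersion::V1Alpha1 => {
            format!(r#"{{"spec":{{"replicas":{}}}}}"#, mode.to_replicas())
        }
    }
}
```

For example, spec_patch(ApiVersion::V1Beta1, OperatingMode::Suspended) produces {"spec":{"operatingMode":"Suspended"}}, and the v1alpha1 call with the same mode produces {"spec":{"replicas":0}}.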

What OpenShell would add:

  • Driver: set operatingMode (v1beta1) / replicas=0 (v1alpha1) on the managed
    Sandbox CR to suspend, and flip back to resume.
  • Gateway: keep the sandbox registered across a suspend (don't treat the absent
    Pod as a dead sandbox) and re-route on resume when the Pod returns.
  • Interface: a lifecycle op (e.g. openshell sandbox suspend|resume) and/or an
    idle policy; resume triggered by the controlling app on session start.
  • Existing seam: OpenShell already defines a StopSandbox RPC in the
    compute-driver contract (proto/compute_driver.proto), but it is currently
    unimplemented for the Kubernetes driver
    (crates/openshell-driver-kubernetes/src/grpc.rs) — a natural hook for wiring
    suspend, with resume as its counterpart.
  • Pairs with open PR #2034 (feat(kubernetes): support PVC subPath driver
    config): that PR proposes the durable, caller-owned per-user PVC; this
    proposal adds the lifecycle to free its compute while keeping the data.
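
The gateway behavior described in the list above (keep the registration across a suspend, re-route when the Pod returns) could look roughly like the following. It is a sketch under stated assumptions, not OpenShell's gateway: Registry, SandboxState, PodEvent and is_idle are hypothetical names, and it assumes the gateway records the suspend before the driver updates the Sandbox CR, so that the resulting Pod deletion is not mistaken for a dead sandbox.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Gateway-side view of a sandbox (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SandboxState {
    /// Pod is up and traffic can be routed to `endpoint`.
    Running { endpoint: String },
    /// Pod is intentionally absent; registration and PVC are retained.
    Suspended,
    /// Resume was requested; waiting for the Pod to come back.
    Resuming,
}

/// Pod-level events the gateway might observe (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum PodEvent {
    Ready { endpoint: String },
    Gone,
}

#[derive(Default)]
pub struct Registry {
    sandboxes: HashMap<String, SandboxState>,
}

impl Registry {
    pub fn register(&mut self, id: &str, endpoint: &str) {
        let state = SandboxState::Running { endpoint: endpoint.to_string() };
        self.sandboxes.insert(id.to_string(), state);
    }

    /// Mark a sandbox suspended. Assumed to be called before the CR is patched.
    pub fn suspend(&mut self, id: &str) -> bool {
        match self.sandboxes.get_mut(id) {
            Some(state) => {
                *state = SandboxState::Suspended;
                true
            }
            None => false,
        }
    }

    /// Request a resume; only valid for a suspended sandbox.
    pub fn resume(&mut self, id: &str) -> bool {
        match self.sandboxes.get_mut(id) {
            Some(state) if *state == SandboxState::Suspended => {
                *state = SandboxState::Resuming;
                true
            }
            _ => false,
        }
    }

    pub fn on_pod_event(&mut self, id: &str, event: PodEvent) {
        match event {
            // A Ready Pod (re)establishes routing unless the sandbox is suspended.
            PodEvent::Ready { endpoint } => {
                if let Some(state) = self.sandboxes.get_mut(id) {
                    if *state != SandboxState::Suspended {
                        *state = SandboxState::Running { endpoint };
                    }
                }
            }
            // A missing Pod is only a dead sandbox if it was expected to run.
            PodEvent::Gone => {
                if matches!(self.sandboxes.get(id), Some(SandboxState::Running { .. })) {
                    self.sandboxes.remove(id);
                }
            }
        }
    }

    /// Endpoint to route to, if the sandbox is currently running.
    pub fn route(&self, id: &str) -> Option<&str> {
        match self.sandboxes.get(id) {
            Some(SandboxState::Running { endpoint }) => Some(endpoint.as_str()),
            _ => None,
        }
    }

    pub fn state(&self, id: &str) -> Option<&SandboxState> {
        self.sandboxes.get(id)
    }
}

/// Simplest possible idle policy: suspend once idle time reaches a threshold.
pub fn is_idle(idle_for: Duration, threshold: Duration) -> bool {
    idle_for >= threshold
}
```

In this sketch a Gone event for a suspended sandbox leaves the entry in place, while the same event for a running sandbox removes it. The controlling app would call resume on session start and begin routing again once the Ready event arrives with the new endpoint.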

Alternatives Considered

  • Always-on pods (status quo): simplest, but pays compute for every provisioned
    user, not just active ones — expensive at scale.
  • Delete + recreate on idle/login: frees compute, but churns the sandbox identity
    and pays a full cold-create each login; with #2034 the data survives, but
    registration/orphan handling is messier than a first-class suspend.
  • In-place pod restart only: doesn't free compute.
  • operatingMode suspend/resume is preferable: it's a first-class primitive
    already modeled and implemented in agent-sandbox; OpenShell only needs to expose
    and drive it.

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request

Related: #1823 (general checkpoint/pause/resume design), #1551 (VM-driver
suspend/resume). Builds on open PR #2034 by @mjamiv.

Metadata

Assignees

No one assigned

    Labels

    state:triage-needed (Opened without agent diagnostics and needs triage)
