docs(rfc): add RFC-0011 multi-player support design#1980

Open
derekwaynecarr wants to merge 1 commit into
NVIDIA:mainfrom
derekwaynecarr:decarr/multi-player-design

@derekwaynecarr

Collaborator

Summary

This RFC introduces multi-player support for OpenShell by adding namespaces as hard isolation boundaries, expanding the role model to five roles (Platform Admin, Namespace Admin, Operator, User, Service
Account), and threading ownership through the sandbox lifecycle. The Kubernetes compute driver gains two namespace mapping modes — managed (default), which creates gateway-scoped Kubernetes namespaces
(openshell-{gateway-id}-{namespace}), and operator mode for 1:1 passthrough to pre-existing namespaces. The design preserves backwards compatibility for single-player support via a default namespace.
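
A minimal sketch of the two mapping modes described above, for illustration only. The function and enum names are hypothetical and do not come from the RFC. Rejecting names over 63 characters is an assumption based on the Kubernetes DNS-label limit; the RFC does not say how the driver would handle that case.

```python
from enum import Enum


class MappingMode(Enum):
    MANAGED = "managed"    # driver creates gateway-scoped Kubernetes namespaces
    OPERATOR = "operator"  # 1:1 passthrough to pre-existing Kubernetes namespaces


def k8s_namespace_for(mode: MappingMode, gateway_id: str, namespace: str) -> str:
    """Map an OpenShell namespace to a Kubernetes namespace name (hypothetical helper)."""
    if mode is MappingMode.MANAGED:
        name = f"openshell-{gateway_id}-{namespace}"
    else:
        name = namespace
    # Kubernetes namespace names are DNS labels, limited to 63 characters.
    # Raising here is an assumption, not behavior specified by the RFC.
    if len(name) > 63:
        raise ValueError(f"Kubernetes namespace name too long: {name!r}")
    return name
```

Because managed mode prefixes the gateway identifier, two gateways sharing a cluster can each have a `team-a` namespace without colliding.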

Related Issue

#1977

Changes

  • Namespaces as first-class hard isolation boundaries for sandboxes, providers, and policies, with a default namespace for backwards compatibility
  • Expanded role model from two-tier (admin/user) to five roles: Platform Admin, Namespace Admin, Operator, User, Service Account
  • Ownership tracking via created_by on ObjectMeta, with owner-scoped access guards on all sandbox operations
  • Kubernetes namespace mapping with two modes: managed (default, creates openshell-{gateway-id}-{namespace-name}) and operator (1:1 name passthrough to pre-existing K8s namespaces)
  • Multi-gateway cluster support via gateway-identifier-scoped Kubernetes namespace naming to avoid collisions
  • Provider credential scoping to namespaces, with delegation from Namespace Admins to users/service accounts
  • Policy inheritance where Namespace Admins can tighten (but not loosen) gateway-wide defaults
  • Multi-provider OIDC with identity federation, plus API key authentication for service accounts
  • Control-plane audit trail via OCSF ApiActivity events on every mutating gRPC call, with session attribution back to the creating principal
  • Per-namespace quotas for concurrent sandboxes, GPU allocations, and sandbox lifetime
  • Cost attribution metadata tagging sandbox consumption with owner, namespace, and labels
  • Sandbox sharing within namespaces (read-only or exec access) without global visibility
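
One way to read the "tighten but not loosen" rule for policy inheritance is as a subset check. The sketch below assumes, purely for illustration, that a policy is a set of allowed egress hosts; the actual OpenShell policy schema is not shown in this PR, and `is_tightening` is a hypothetical name.

```python
def is_tightening(gateway_default: frozenset, namespace_policy: frozenset) -> bool:
    """Return True if the namespace policy allows nothing beyond the gateway default.

    Assumption: a policy is modeled as a set of allowed egress hosts.
    """
    return namespace_policy <= gateway_default


gateway_default = frozenset({"api.example.com", "pypi.org"})
tighter = frozenset({"pypi.org"})                   # removes a host: accepted
looser = frozenset({"pypi.org", "evil.example"})    # adds a host: rejected
```

Under this model a Namespace Admin could submit `tighter` but not `looser`.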

Testing

  • [x] mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • [x] Follows Conventional Commits
  • [x] Commits are signed off (DCO)
  • [x] Architecture docs updated (if applicable)

@derekwaynecarr derekwaynecarr requested review from a team, maxamillion and mrunalp as code owners June 23, 2026 13:37

- **Phase 1: Namespace and ownership model.** Add `namespace` and `created_by`
fields to `ObjectMeta` in the proto. Implement namespace-scoped storage and
filtering in gRPC handlers. Create the `default` namespace for backwards
compatibility. Sandbox name uniqueness shifts from globally unique to

Contributor

Is it critical to implement this in a backward compatible way right now?

Collaborator

If we didn't create a default namespace what would be the single-player UX?

Collaborator Author

The spirit of the default namespace is that a user never thinks about namespaces at all when using openshell in a single-player setup. The default (or some other token) is just there to make sure that adding multiplayer support introduces no friction into the single-player experience.

Collaborator

We could also (automatically) create a namespace per user account in single player mode. This sets us up to have a single gateway for the workstation while supporting different user accounts on that workstation.

Contributor

I was mostly thinking about the upgrade path from N-1 to N (it's ok to require users to destroy their old setup and start over still). I think I naively assume we'll have some form of authentication for gateway users and can create them a user-associated namespace automatically. This probably works out to be about the same as default though.


### Kubernetes Compute Driver: Namespace Mapping

OpenShell namespaces are a logical concept. When the Kubernetes compute driver

Contributor

I think it'd be useful to outline what behaviors/patterns we hope to enable and control via namespaces. I found myself inventing reasons that it'd be useful to have namespaces, but I think a list of practical applications would be helpful.

Collaborator Author

The simplest example I have is the friction we hit when doing a team-level gateway setup. Within a team, it's common for users to have their own dedicated API keys to access claude or codex, and these are private to the individual. That friction leads folks towards wanting a gateway per trust/security domain when a common gateway with some credential segmentation would satisfy. This proposal enables that concept. It would also now be safe to share a sandbox (for connect/exec actions) among users in shared coding sessions, etc., since the literal credentials are left outside the sandbox.

Collaborator

@derekwaynecarr in that scenario, as described in the Provider Credential Scoping section, would a namespace admin have to create and lifecycle-manage every user's credentials in their namespace, or would each user have their own namespace and therefore be a namespace admin?


- What is the identity mapping strategy for multi-provider OIDC? If a user
authenticates via both corporate SSO and GitHub, how are those identities
linked to a single internal principal?

Contributor

Should this even be a thing? Intuitively, this feels like 2 principals to me, or at least I'd not be surprised if it were treated that way. Granting me the same permissions twice and/or sharing with myself (my two principals) feels acceptable.

- Should per-namespace quota limits be hard (reject sandbox creation) or soft

Contributor

Sounds like a reasonable configuration option (to be implemented at any time)

also be namespace-scoped from the start, or should they remain global and be
extended later as the organizational model matures?

- In operator mode, should the driver validate that the target Kubernetes

Contributor

Fail seems right here. Even if you check, it could go away by the time you try to make it.

|------|-------------|
| **Platform Admin** | Manages gateway configuration, auth providers, compute drivers, and quotas. Full visibility across all namespaces. |
| **Namespace Admin** | Manages users, providers, policies, and quotas within a single namespace. Cannot change gateway infra or access other namespaces. |
| **Operator** | Read-only view of all sandboxes and audit logs across namespaces for monitoring, incident response, and compliance. Cannot create or modify sandboxes. |

Contributor

I still like the term Auditor for this. Maybe operator means the same in other (kube?) communities?

Collaborator Author

i like auditor as well.

Collaborator

+1 auditor

Will auditor/operator have elevated security privileges? It could be a security concern for Sandboxed applications with sensitive data/credentials.
At the same time, how are they helping enforce compliance?

currently does not have a durable store beyond configuration files.

- Which resources beyond sandboxes are namespace-scoped? Sandboxes are the
primary namespaced resource. Should settings, policies, and provider configs

Contributor

Earlier in the doc you said this:

Providers belong to a namespace.

Should probably ask the agent to make sure the entire document agrees with itself.

Collaborator Author

haha, good catch! will update.

Signed-off-by: Derek Carr <decarr@redhat.com>
@derekwaynecarr derekwaynecarr force-pushed the decarr/multi-player-design branch from 3713b9b to 85e9054 Compare June 23, 2026 14:16
@drew

drew commented Jun 23, 2026

Collaborator

Could we rename this to RFC-0011? I'll reserve the number in our tracker 😄.

- Gives a clear security boundary (namespace) without over-modeling
organizational hierarchy.
- Allows multiple overlapping groupings within a namespace via labels.
- Reuses Kubernetes-style patterns that users already understand.

Collaborator

I'm not necessarily against this, but I think assuming users of AI agents already understand Kubernetes design patterns might be a stretch.

unique-within-namespace. Existing sandboxes are backfilled into the `default`
namespace. All existing single-player behavior continues unchanged.
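
A rough sketch of what the Phase 1 uniqueness shift and backfill could look like, assuming an in-memory store keyed by `(namespace, name)`. `SandboxStore`, `backfill`, and the empty-string convention for legacy records are hypothetical; only the `namespace` and `created_by` fields on `ObjectMeta` and the `default` namespace come from the RFC text.

```python
from dataclasses import dataclass

DEFAULT_NAMESPACE = "default"


@dataclass
class ObjectMeta:
    name: str
    namespace: str = ""   # assumption: empty on records written before namespaces existed
    created_by: str = ""


def backfill(meta: ObjectMeta) -> ObjectMeta:
    """Place a legacy record into the default namespace."""
    if not meta.namespace:
        meta.namespace = DEFAULT_NAMESPACE
    return meta


class SandboxStore:
    """Toy store where names are unique within a namespace, not globally."""

    def __init__(self) -> None:
        self._by_key = {}

    def create(self, meta: ObjectMeta) -> None:
        key = (meta.namespace, meta.name)
        if key in self._by_key:
            raise ValueError(f"sandbox {meta.name!r} already exists in {meta.namespace!r}")
        self._by_key[key] = meta
```

With this keying, `team-a/dev` and `team-b/dev` can coexist, while a second `team-a/dev` is rejected.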

- **Phase 2: Kubernetes driver — managed mode (default).** The driver creates

Collaborator

I was thinking about building out a podman driver version of this too. The scenario I have in mind is where a user/student/homelabber who wants to tinker with their team or learn on their own with minimal setup and admin overhead could spin up a Linux VM or cloud instance, install the openshell-gateway rpm, run the openshell gateway service, and create an openshell namespace for themselves or each member of their team. This obviously won't scale and is a single point of failure, but it could be an interesting means to test it out and provides an easy/simple path to "my agents aren't running on my laptop".

Basically one local linux user openshell, one rootless Podman socket, one gateway process, many OpenShell namespaces, one Podman network per OpenShell namespace, one workspace volume per sandbox, no arbitrary bind mounts, namespace-scoped volumes/providers, gateway-enforced RBAC/quotas, OIDC/API-key auth required, gateway enforced quotas, OCSF attribution, and a strict shared-host driver mode that the user would have to opt into.

Or maybe that's a fool's errand and the answer is just to show people how to do this with kind, minikube, k3s, or microshift. Thoughts? 🤔

@johntmyers johntmyers changed the title docs(rfc): add RFC 1977 multi-player support design docs(rfc): add RFC-0011 multi-player support design Jun 25, 2026
- Gives a clear security boundary (namespace) without over-modeling
organizational hierarchy.
- Allows multiple overlapping groupings within a namespace via labels.
- Reuses Kubernetes-style patterns that users already understand.

Collaborator

Alternatively, namespace might become overloaded and ambiguous w.r.t. Kubernetes. workspace or project might be a more product-oriented term to use.

Disambiguating also keeps the door open if we want to evolve this unit in a way that doesn't match Kubernetes namespace semantics. For example, maybe a Sandbox can create a temp workspace to spawn multiple subagents in. In that case, we might not want a 1:1 mapping between OpenShell and Kubernetes namespaces.

FWIW I don't have a strong objection to using namespace either, just curious to explore alternatives since naming is hard and this is largely a one way door.

Collaborator Author

@drew naming is hard, i am happy with workspace as well.

Comment on lines +134 to +135
their namespace when creating a sandbox. Users cannot see raw credential
material; they reference providers by name. Namespace Admins grant specific

Collaborator

"Users cannot see raw credential material": should other roles be able to see raw credential material? E.g., a service account role can decrypt credentials to be used by the supervisor.

Comment on lines +110 to +111
- **Users** can only exec into, delete, or view sandboxes they own within their
namespace.
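
The owner-scoped guard for the User role could be sketched roughly as below. The `Principal` and `Sandbox` types and the `user_can_access` name are hypothetical; only `created_by` and the namespace boundary come from the RFC. Other roles (Namespace Admin, Platform Admin) would need additional branches that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    subject: str     # authenticated identity, e.g. an OIDC subject
    namespace: str   # namespace the request is scoped to


@dataclass(frozen=True)
class Sandbox:
    name: str
    namespace: str
    created_by: str  # mirrors created_by on ObjectMeta


def user_can_access(principal: Principal, sandbox: Sandbox) -> bool:
    """A User may exec into, delete, or view only sandboxes they own in their namespace."""
    return (
        sandbox.namespace == principal.namespace
        and sandbox.created_by == principal.subject
    )
```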

Collaborator

Suggested change
Before:
- **Users** can only exec into, delete, or view sandboxes they own within their
namespace.
After:
- **Users** can only create, exec into, delete, or view sandboxes they own within their
namespace.

Is it correct to say users have full CRUD access over all resources that are owned by them?

Users should be granted access to Sandboxes using standard K8s resource RBAC, correct?

Comment on lines +119 to +122
A User can share a sandbox with another user within the same namespace
(read-only or exec access) without making it globally visible. Platform Admins
can grant targeted cross-namespace access for specific use cases (e.g., a shared
services namespace).

Collaborator

I'm curious how we represent this in the data model and enable it in the UX. E.g., do we want to create an explicit share method? Do we capture and store a list of principals each resource is shared with?

Also wondering about transitive resources. For example, if I shared a Sandbox with providers attached, do those transitive providers also become shared?

Collaborator Author

@drew i had imagined a share method with a list of principals, and likely some type of shared-with-me grpc endpoint to see those explicit sandboxes.
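
To make that idea concrete, here is a minimal sketch of a share method backed by a per-sandbox list of principals, plus a shared-with-me lookup. Every name here (`share`, `shared_with_me`, `Access`, the `shared_with` field) is hypothetical, and the sketch says nothing about whether attached providers are shared transitively.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    READ_ONLY = "read-only"
    EXEC = "exec"


@dataclass
class Sandbox:
    name: str
    namespace: str
    created_by: str
    # principal subject -> access level granted by the owner
    shared_with: dict = field(default_factory=dict)


def share(sandbox: Sandbox, owner: str, grantee: str,
          grantee_namespace: str, access: Access) -> None:
    """Grant another principal in the same namespace access to a sandbox."""
    if owner != sandbox.created_by:
        raise PermissionError("only the owner can share a sandbox")
    if grantee_namespace != sandbox.namespace:
        raise PermissionError("sharing is limited to the sandbox's namespace")
    sandbox.shared_with[grantee] = access


def shared_with_me(sandboxes: list, subject: str) -> list:
    """Names of sandboxes explicitly shared with the given principal."""
    return [s.name for s in sandboxes if subject in s.shared_with]
```

Cross-namespace grants by a Platform Admin, which the RFC text mentions, would need a separate path and are left out of this sketch.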

Comment on lines +149 to +161
### Audit Trail

- **Control-plane audit log.** Every mutating gRPC call (`CreateSandbox`,
`DeleteSandbox`, `CreateProvider`, `UpdatePolicy`) emits an OCSF
`ConfigStateChange` or `ApiActivity` event with the authenticated principal,
action, target resource, and timestamp. Builds on the existing OCSF
infrastructure.
- **Session attribution.** Sandbox activity (network, process, SSH events)
tagged with the creating principal's subject, so security teams can trace
sandbox behavior back to a human or service account.
- **Audit log export.** Structured OCSF JSONL shipped to SIEM/log aggregation.
Operators can query "who created sandbox X" or "what did user Y do between T1
and T2."
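
For illustration, a control-plane audit record emitted as one JSONL line might look roughly like the sketch below. The field layout loosely follows OCSF API Activity naming (`actor`, `api.operation`, `resources`) but is simplified and not validated against the OCSF schema; `api_activity_event` is a hypothetical helper, not part of the existing OCSF infrastructure mentioned above.

```python
import json
from datetime import datetime, timezone


def api_activity_event(principal: str, action: str, resource: str, now=None) -> str:
    """Build a simplified, OCSF-style audit record as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    record = {
        "class_name": "API Activity",
        "time": int(now.timestamp() * 1000),   # milliseconds since the Unix epoch
        "actor": {"user": {"name": principal}},
        "api": {"operation": action},          # e.g. the gRPC method name
        "resources": [{"name": resource}],
    }
    return json.dumps(record, sort_keys=True)


line = api_activity_event("alice", "CreateSandbox", "team-a/dev")
```

A query such as "who created sandbox X" then reduces to filtering these lines on `api.operation` and `resources[].name`.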

Collaborator

This feels somewhat orthogonal to the multi-player design. I wonder if we could break this out into a separate effort. Does anything related to auditing depend on the outcome of multi-player mode?

Comment on lines +165 to +167
- **Per-namespace quotas.** Max concurrent sandboxes, max GPU allocations, max
sandbox lifetime per namespace. Enforced at the gateway before sandbox
creation.
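
A sketch of a gateway-side admission check for the concurrent-sandbox and GPU limits, run before the compute driver is asked to create anything. All type and function names are hypothetical, and the sketch assumes hard limits (creation is rejected); whether limits should be hard or soft is an open question in the RFC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamespaceQuota:
    max_concurrent_sandboxes: int
    max_gpus: int


@dataclass(frozen=True)
class NamespaceUsage:
    running_sandboxes: int
    gpus_allocated: int


class QuotaExceeded(Exception):
    """Raised when admitting a sandbox would exceed a namespace limit."""


def admit_sandbox(quota: NamespaceQuota, usage: NamespaceUsage, requested_gpus: int) -> None:
    """Reject the request if it would push the namespace over either limit."""
    if usage.running_sandboxes + 1 > quota.max_concurrent_sandboxes:
        raise QuotaExceeded("concurrent sandbox limit reached")
    if usage.gpus_allocated + requested_gpus > quota.max_gpus:
        raise QuotaExceeded("GPU allocation limit reached")
```

The sandbox-lifetime limit is omitted because it would be enforced over time rather than at admission.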

Collaborator

How do you see us representing these values? Maybe as properties on the namespace data model?

Collaborator Author

i actually was debating dropping some of these quota elements, or thinking about them in terms of an out-of-band add-on. I had put the use cases here primarily to drive discussion as we figure out project boundary. In general, I think to protect the openshell control/data-plane, we will need some way to protect against DoS or abuse of a target compute driver, so some type of quota system is potentially useful.

+1 to keeping this an out-of-band add-on rather than core, and to framing it as
DoS/abuse protection rather than chargeback. That's how we run our multi-tenant agent
fleet at DeepInfra: quota lives in the control plane and the compute layer stays a simple
driver.

Comment on lines +234 to +236
- **Agent orchestration.** One agent's service account creates sandboxes for
sub-agents, each getting their own sandbox principal. The parent service
account retains visibility.

Collaborator

Does the parent service account create "sub" service accounts?

Does each sandbox have to use a unique service account, or can one be shared while still maintaining a unique agent identity, for example by using the pod name in its SPIFFE ID?

@dhirajsb

dhirajsb commented Jul 1, 2026

We are missing a persona in this approach. Correct me if I'm wrong, but the user persona as modeled sounds more like an end user who's directly interacting with an agent in a Sandbox.

OpenShell's multi-user architecture should support agentic app development for app developers. It should leverage K8s RBAC for access control in app namespaces. It should also decouple policy management through the Gateway in such a way that app developers or app workloads (or compromised workloads) can't override certain org/platform-wide security policies enforced via the Gateway.

I hope that makes sense.

Namespace Admin tightens a policy, does it retroactively affect shared
sandboxes that were created under the looser policy?

- What is the storage backend for API keys and quota state? The gateway

I think this may already be answered on main. There's a durable object store now (SQLite/Postgres, with optimistic concurrency), and #1577 layered a reconciler lease on top of it for HA. API key and quota state could build on that, so this question might be closable, or narrowed to just what those records should look like.

7 participants