feat(cli): add GPU count requests #1812
🌿 Preview your docs: https://nvidia-preview-pr-1812.docs.buildwithfern.com/openshell

/ok to test abe5b79

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.

Landing #1815 first should simplify the changes here. |
BREAKING CHANGE: SandboxSpec.gpu and DriverSandboxSpec.gpu were replaced with resource_requirements.gpu, changing protobuf field 9 from a bool to a message for both public and driver APIs.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
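
The shape change in this commit can be sketched in Rust roughly as follows. This is only an illustration, not the generated protobuf code: the type names `ResourceRequirements` and `GpuResourceRequirements` and the `gpu.count` path appear in this PR, while the field types, the derives, and the `gpu_limit` helper are assumptions. The "bare request means one GPU" default follows the PR description's note on `nvidia.com/gpu` limits.

```rust
/// Hypothetical mirror of the nested GPU requirement message.
/// `count: None` models a bare `--gpu` (the driver default request).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct GpuResourceRequirements {
    pub count: Option<u32>,
}

/// Hypothetical mirror of the message that replaces the flat `gpu: bool`.
/// `gpu: None` means no GPU was requested at all.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ResourceRequirements {
    pub gpu: Option<GpuResourceRequirements>,
}

/// Assumed helper: the GPU limit implied by a request.
/// No request -> None, bare request -> 1, counted request -> that count.
pub fn gpu_limit(req: &ResourceRequirements) -> Option<u32> {
    req.gpu.as_ref().map(|gpu| gpu.count.unwrap_or(1))
}
```

Keeping presence (`Option<GpuResourceRequirements>`) and count (`Option<u32>`) in one nested value is what lets callers pass a single object instead of a separate flag and number.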
abe5b79 to 06c69dd
PR Review Status
Validation: this is maintainer-authored, project-valid GPU CLI/API/runtime work that aligns GPU sandbox intent with structured resource requirements and the related resource-requirements RFC direction.
Review findings:
Docs: Fern docs were updated under
Next state:

Re-check After CI Update
I re-evaluated latest head
Disposition: partially resolved.
Remaining items:
Next state:
06c69dd to 87c9a8c
Re-check After Author Update
I re-evaluated latest head
Disposition: not resolved.
Remaining items:
Next state:
Re-check After CI Update
I re-evaluated latest head
Disposition: partially resolved.
Remaining items:
Next state:
The breaking proto change is intentional. However, I will defer to @drew on whether we should rather reserve the previous
Re-check After Author Update
I re-evaluated latest head
Disposition: not resolved.
Remaining items:
Next state:
Pass the coupled GPU requirement object through the CLI sandbox_create boundary instead of splitting presence and count into separate arguments.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

Pass ResourceRequirements into the provisioning timeout message helper so GPU hints are derived from the same nested request object used to create the sandbox.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

Thread Option<GpuResourceRequirements> through driver validation and rendering helpers instead of splitting GPU presence and count into separate arguments.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

Require exact driver GPU device lists to be tied to a GPU request, allow a single exact device to use the default countless request, and require explicit matching counts for multi-device lists.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
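
The exact-device rule from the last commit above could look something like the following Rust sketch. It is a guess at the logic, not the PR's code: `validate_exact_devices`, its signature, and the error strings are hypothetical, and `GpuResourceRequirements` is redeclared here with an assumed `count: Option<u32>` field so the snippet stands alone. Rejecting a single device paired with a count other than one is also an assumption; the commit message only spells out the countless single-device case and the multi-device case.

```rust
/// Assumed shape of the GPU request; `count: None` is the default countless request.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct GpuResourceRequirements {
    pub count: Option<u32>,
}

/// Hypothetical check for an exact device list (for example CDI devices or
/// VM GPU device IDs) against the GPU request it must be tied to.
pub fn validate_exact_devices(
    gpu: Option<&GpuResourceRequirements>,
    devices: &[String],
) -> Result<(), String> {
    // No exact devices configured: nothing to validate.
    if devices.is_empty() {
        return Ok(());
    }
    // An exact device list must be tied to a GPU request.
    let Some(gpu) = gpu else {
        return Err("exact GPU device list requires a GPU request".to_string());
    };
    match (devices.len(), gpu.count) {
        // A single exact device may use the default countless request.
        (1, None) => Ok(()),
        // An explicit count must match the list length.
        (n, Some(count)) if n as u64 == u64::from(count) => Ok(()),
        (n, Some(count)) => Err(format!(
            "GPU count {count} does not match {n} exact device(s)"
        )),
        // Multi-device lists need an explicit count.
        (n, None) => Err(format!(
            "{n} exact devices require an explicit matching GPU count"
        )),
    }
}
```

Taking `Option<&GpuResourceRequirements>` rather than a separate flag and count mirrors the earlier commits in this series, which thread the nested request object through validation helpers.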
87c9a8c to da6fbd8
Re-check After Author Update
I re-evaluated latest head
Disposition: not resolved.
Remaining items:
Next state:
Re-check After CI and Independent Review
I re-evaluated latest head
Disposition: not resolved.
Remaining items:
Next state:
Summary
Adds structured GPU resource requirements for sandbox creation and updates the CLI/API path so `openshell sandbox create --gpu [COUNT]` records GPU intent in `ResourceRequirements.gpu`.

This is a breaking proto compatibility change: the public and compute-driver sandbox specs now carry `resource_requirements` and reserve the previous flat GPU fields. The shape aligns the implementation with `rfc/0004-sandbox-resource-requirements/README.md` and #1360.

Related Issue
Part of #1444. Related to #1338, #1156, and #1360. Follow-up GPU support preflight semantics are tracked in #1807.

Changes
- … `ResourceRequirements.gpu.count` to the public and compute-driver protos, and reserve the previous GPU device fields.
- … `--gpu` for the driver default request and `--gpu COUNT` for counted requests.
- … `nvidia.com/gpu` limits from GPU requirements; default `--gpu` requests one GPU, and `--gpu COUNT` requests that count.
- … `driver_config`: Docker/Podman use `cdi_devices`, and VM uses `gpu_device_ids`.
- … `--gpu`, and multi-device exact lists require `--gpu COUNT` matching the list length.

Testing
- `mise run pre-commit` passes

Focused GPU checks also run during local review:
- `/Users/elezar/.cargo/bin/cargo test -p openshell-core gpu::tests`
- `/Users/elezar/.cargo/bin/cargo test -p openshell-driver-docker cdi_device`
- `/Users/elezar/.cargo/bin/cargo test -p openshell-driver-podman cdi_device`
- `/Users/elezar/.cargo/bin/cargo test -p openshell-driver-vm gpu_device`

Checklist