Propagate CUDA device IDs in Python DLPack tensors#2239
Conversation
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR updates the DLPack export layer to correctly detect and report the CUDA device ID associated with tensor pointers, replacing hardcoded zero with computed device introspection. Implementation extends CUDA runtime imports and adds device-querying helpers; tests validate the detection across CPU and GPU allocations with multi-GPU awareness. ChangesDLPack CUDA Device-ID Detection
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cuvs/cuvs/common/cydlpack.pyx (1)
95-105:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winCRITICAL: avoid allocating
DLManagedTensorbefore the fallible device query.Line 97 allocates
dlm, and Line 105 can now raise via_dlpack_device_id_c(ary). That exception path leaks the rawDLManagedTensorallocation because there is no cleanup before unwinding.Suggested fix
cdef DLManagedTensor* dlpack_c(ary): # todo(dgd): add checking options/parameters cdef DLDeviceType dev_type cdef DLDevice dev cdef DLDataType dtype cdef DLTensor tensor cdef uintptr_t tensor_ptr = <uintptr_t>ary.ai_["data"][0] + cdef int device_id = _dlpack_device_id_c(ary) cdef DLManagedTensor* dlm = \ <DLManagedTensor*>stdlib.malloc(sizeof(DLManagedTensor)) if ary.from_cai: dev_type = DLDeviceType.kDLCUDA @@ dev.device_type = dev_type - dev.device_id = _dlpack_device_id_c(ary) + dev.device_id = device_idAs per coding guidelines, "Catch memory leaks from improper resource management" and "Ensure GIL handling is correct for CUDA operations and exceptions are handled correctly across Python/C++ boundary in Cython bindings."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs/cuvs/common/cydlpack.pyx` around lines 95 - 105, The DLManagedTensor allocation (dlm via stdlib.malloc) is done before the fallible device query (_dlpack_device_id_c(ary)), which can raise and leak dlm; move the allocation of DLManagedTensor (dlm) until after you determine dev_type and successfully call _dlpack_device_id_c(ary), or wrap the device-id call in a try/except and free the malloc'ed dlm on error; specifically update the logic around ary.from_cai, dev.device_type/dev.device_id and the DLManagedTensor allocation/initialization so that dlm is only malloc'ed after _dlpack_device_id_c(ary) returns successfully (or ensure you call stdlib.free(dlm) before propagating any exception).Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@python/cuvs/cuvs/common/cydlpack.pyx`:
- Around line 95-105: The DLManagedTensor allocation (dlm via stdlib.malloc) is
done before the fallible device query (_dlpack_device_id_c(ary)), which can
raise and leak dlm; move the allocation of DLManagedTensor (dlm) until after you
determine dev_type and successfully call _dlpack_device_id_c(ary), or wrap the
device-id call in a try/except and free the malloc'ed dlm on error; specifically
update the logic around ary.from_cai, dev.device_type/dev.device_id and the
DLManagedTensor allocation/initialization so that dlm is only malloc'ed after
_dlpack_device_id_c(ary) returns successfully (or ensure you call
stdlib.free(dlm) before propagating any exception).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 356ff87a-af66-432c-a5bf-9316c0ad6c95
📒 Files selected for processing (2)
python/cuvs/cuvs/common/cydlpack.pyxpython/cuvs/cuvs/tests/test_device_tensor_view.py
Summary
cudaPointerGetAttributes()for CUDA array-interface inputs and keep host arrays on device ID 0.Why
The CUDA array interface does not carry a device ordinal, and the RAFT wrapper used here only preserves the interface dictionary. Hard-coding
DLDevice.device_id = 0mislabels arrays allocated on GPU 1 or higher, which can send downstream C API work through the wrong device context.Validation
pre-commit run --files python/cuvs/cuvs/common/cydlpack.pyx python/cuvs/cuvs/tests/test_device_tensor_view.pygit diff --checkpython3 -m compileall python/cuvs/cuvs/tests/test_device_tensor_view.pyI could not run the targeted pytest locally because this base Python environment does not have
pytestinstalled; the new CuPy regression is skipped automatically on single-GPU machines.