Skip to content

CAGRA: fix concurrent initialization and usage of dataset descriptor#2237

Open
achirkin wants to merge 3 commits into
mainfrom
fix-cagra-concurrent-dataset-descriptor-init-use
Open

CAGRA: fix concurrent initialization and usage of dataset descriptor#2237
achirkin wants to merge 3 commits into
mainfrom
fix-cagra-concurrent-dataset-descriptor-init-use

Conversation

@achirkin

@achirkin achirkin commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

When CAGRA index runs search in multiple independent streams using the same raft::resources handle, it could happen that the dataset descriptor kernel in one stream finishes later than its result is used in CAGRA search in another stream.
Currently, we protect against the concurrent initialization on the host only. The PR adds stream ordering to make the search kernel wait for the initialization on the device side.

Note, this is all singe-device concurrency; the dataset descriptors are not shared between GPUs, because they are cached in raft::resources custom resource, and we enforce one-resources-handle-per-device.

Possibly related bugs: #1720, https://github.com/rapidsai/dlfw/issues/286

@achirkin achirkin self-assigned this Jun 11, 2026
@achirkin achirkin added bug Something isn't working non-breaking Introduces a non-breaking change labels Jun 11, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@achirkin

Copy link
Copy Markdown
Contributor Author

/ok to test

@achirkin achirkin marked this pull request as ready for review June 15, 2026 13:10
@achirkin achirkin requested a review from a team as a code owner June 15, 2026 13:10
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 970383cf-6909-44c4-8bbe-59fb80a60a44

📥 Commits

Reviewing files that changed from the base of the PR and between 6672103 and 25b4494.

📒 Files selected for processing (1)
  • cpp/src/neighbors/detail/cagra/compute_distance.hpp

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes
    • Improved synchronization reliability for device memory initialization to ensure correct operation across concurrent workloads.

Walkthrough

The change adds a cudaEvent_t ready_event member to dataset_descriptor_host::state. The event is created at construction, recorded on the initialization stream after device descriptor allocation completes, and destroyed in the destructor. The get method now calls cudaStreamWaitEvent when the requesting stream differs from the initialization stream.

Changes

Cross-stream device descriptor synchronization

Layer / File(s) Summary
ready_event lifecycle and cross-stream wait
cpp/src/neighbors/detail/cagra/compute_distance.hpp
state gains a cudaEvent_t ready_event created with cudaEventCreateWithFlags in the constructor and destroyed in the destructor. eval records the event on the init stream after the device descriptor is allocated and initialized. get calls cudaStreamWaitEvent before returning the descriptor pointer when the caller stream differs from the init stream.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: fixing concurrent initialization and usage of the dataset descriptor in CAGRA, which matches the core objective of adding stream ordering to prevent race conditions.
Description check ✅ Passed The description clearly explains the concurrency issue in CAGRA when multiple streams use the same resources handle, and describes the fix (adding device-side stream ordering) that matches the code changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-cagra-concurrent-dataset-descriptor-init-use

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant