CAGRA: fix concurrent initialization and usage of dataset descriptor by achirkin · Pull Request #2237 · rapidsai/cuvs

achirkin · 2026-06-11T15:21:22Z

When a CAGRA index runs searches in multiple independent streams that share the same raft::resources handle, the dataset descriptor initialization kernel in one stream can finish after its result has already been used by a CAGRA search in another stream.
Currently, we protect against concurrent initialization on the host only. This PR adds stream ordering so that the search kernel waits for the initialization on the device side.

Note that this is all single-device concurrency: the dataset descriptors are not shared between GPUs because they are cached in a raft::resources custom resource, and we enforce one resources handle per device.

Possibly related bugs: #1720, https://github.com/rapidsai/dlfw/issues/286

copy-pr-bot · 2026-06-11T15:21:26Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

achirkin · 2026-06-11T15:21:31Z

/ok to test

coderabbitai · 2026-06-15T13:13:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 970383cf-6909-44c4-8bbe-59fb80a60a44

📥 Commits

Reviewing files that changed from the base of the PR and between 6672103 and 25b4494.

📒 Files selected for processing (1)

cpp/src/neighbors/detail/cagra/compute_distance.hpp

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Improved synchronization reliability for device memory initialization to ensure correct operation across concurrent workloads.

Walkthrough

The change adds a cudaEvent_t ready_event member to dataset_descriptor_host::state. The event is created at construction, recorded on the initialization stream after device descriptor allocation completes, and destroyed in the destructor. The get method now calls cudaStreamWaitEvent when the requesting stream differs from the initialization stream.

Changes

Cross-stream device descriptor synchronization

| Layer / File(s) | Summary |
|---|---|
| `ready_event` lifecycle and cross-stream wait `cpp/src/neighbors/detail/cagra/compute_distance.hpp` | `state` gains a `cudaEvent_t ready_event` created with `cudaEventCreateWithFlags` in the constructor and destroyed in the destructor. `eval` records the event on the init stream after the device descriptor is allocated and initialized. `get` calls `cudaStreamWaitEvent` before returning the descriptor pointer when the caller stream differs from the init stream. |
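
The record-then-wait ordering described above can be illustrated with a host-only model. This is a minimal sketch, not the code from this PR: each "stream" is modeled as a `std::thread`, the `ready_event` is modeled as a `std::shared_future<void>`, and the names `descriptor_state_model`, `init_on_stream`, `get_from_other_stream`, and `run_two_streams` are hypothetical. One difference to keep in mind: `cudaStreamWaitEvent` does not block the host thread and only orders work on the device, whereas the `wait()` below blocks the thread that plays the role of the stream.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical host-only stand-in for dataset_descriptor_host::state.
struct descriptor_state_model {
  int descriptor = 0;            // stands in for the device-side descriptor
  std::promise<void> init_done;  // fulfilled once initialization has finished
  // Models `cudaEvent_t ready_event`; declared after init_done on purpose,
  // because members are initialized in declaration order.
  std::shared_future<void> ready_event{init_done.get_future()};
};

// Models `eval`: slow initialization on the init "stream", then record the event.
inline void init_on_stream(descriptor_state_model& st) {
  std::this_thread::sleep_for(std::chrono::milliseconds(20));  // pretend the init kernel is slow
  st.descriptor = 42;
  st.init_done.set_value();  // plays the role of cudaEventRecord on the init stream
}

// Models `get` called from a different "stream": wait for the event, then read.
inline int get_from_other_stream(const descriptor_state_model& st) {
  std::shared_future<void> ev = st.ready_event;  // per-thread copy of the shared future
  assert(ev.valid());
  ev.wait();  // plays the role of cudaStreamWaitEvent on the caller stream
  return st.descriptor;
}

// Runs both "streams" concurrently and returns what the search side observed.
inline int run_two_streams() {
  descriptor_state_model st;
  int seen = -1;
  std::thread search_stream([&] { seen = get_from_other_stream(st); });
  std::thread init_stream([&] { init_on_stream(st); });
  search_stream.join();
  init_stream.join();
  return seen;
}
```

In this model, `run_two_streams()` returns 42 even though the search side starts first, because the read is ordered after the write by the wait. Removing the `ev.wait()` line would let the search side read the descriptor before it is initialized, which is the host-side analogue of the race this PR addresses.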

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately captures the main change: fixing concurrent initialization and usage of the dataset descriptor in CAGRA, which matches the core objective of adding stream ordering to prevent race conditions. |
| Description check | ✅ Passed | The description clearly explains the concurrency issue in CAGRA when multiple streams use the same resources handle, and describes the fix (adding device-side stream ordering) that matches the code changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Make concurrent streams wait on the dataset descriptor initialization stream

afc63a7

achirkin self-assigned this Jun 11, 2026

achirkin added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels Jun 11, 2026

github-project-automation Bot added this to Unstructured Data Processing Jun 11, 2026

Merge branch 'main' into fix-cagra-concurrent-dataset-descriptor-init-use

25b4494

achirkin marked this pull request as ready for review June 15, 2026 13:10

achirkin requested a review from a team as a code owner June 15, 2026 13:10

Merge branch 'main' into fix-cagra-concurrent-dataset-descriptor-init-use

d07c3c5

achirkin wants to merge 3 commits into main from fix-cagra-concurrent-dataset-descriptor-init-use
