[FEA] Roaring-bitmap prefilters for brute-force search (shared and per-query) by maxwbuckley · Pull Request #2240 · rapidsai/cuvs

maxwbuckley · 2026-06-12T05:52:23Z

Closes #1972.

Problem

cuVS encodes search prefilters as flat bit arrays: bitset_filter
(n_rows bits shared by all queries) and bitmap_filter
(n_queries x n_rows bits, one row per query). Below ~40-50%
selectivity — the dominant production regime (tag predicates, ACLs,
recency cutoffs) — this leaves performance and memory on the table:

- The filter costs n_rows/8 bytes regardless of how few bits are set
  (bitmap_filter: times n_queries; 125 GB at n_queries=1000,
  n_rows=1e9).
- The sparse pipeline pays a count() reduction + host sync per search
  to pick a branch, then converts bitmap→CSR twice (once in cuVS,
  once again inside raft::sparse::linalg::masked_matmul).
- There is no path that computes only the selected rows densely: the
  mid-selectivity regime (1-30%) runs either a full-dataset GEMM + mask
  or a structure-blind SDDMM, both far from optimal.
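
For reference, the flat-filter sizes quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: `bitset_bytes` and `bitmap_bytes` are hypothetical helper names, and rounding to whole bytes is an assumption (the real filters may round to wider words).

```cpp
#include <cassert>
#include <cstdint>

// Bytes for a flat bitset over n_rows: one bit per row, rounded up to whole bytes.
// (Assumption: byte granularity; the actual cuVS types may pad to wider words.)
constexpr std::uint64_t bitset_bytes(std::uint64_t n_rows) { return (n_rows + 7) / 8; }

// Bytes for a flat [n_queries, n_rows] bitmap: one bit per (query, row) pair.
constexpr std::uint64_t bitmap_bytes(std::uint64_t n_queries, std::uint64_t n_rows)
{
  return (n_queries * n_rows + 7) / 8;
}

// 10M rows -> 1.25 MB shared bitset.
static_assert(bitset_bytes(10000000ULL) == 1250000ULL, "1.25 MB at 10M rows");
// 1000 queries x 1e9 rows -> 125 GB per-query bitmap.
static_assert(bitmap_bytes(1000ULL, 1000000000ULL) == 125000000000ULL, "125 GB");
```

Neither number depends on how many bits are actually set, which is the cost the roaring representation is meant to avoid.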

What this PR adds

A compressed Roaring-bitmap filter representation plus a
selectivity-driven three-regime dispatch for brute-force search.

API (mirrors the existing filter API)

#include <cuvs/core/roaring.hpp>
#include <cuvs/neighbors/roaring_filter.hpp>

// build once per filter; cardinality + container metadata stay host-side
auto bm = cuvs::core::from_sorted_ids(res, ids, n_ids, n_rows);

// one filter shared by all queries (counterpart of bitset_filter)
auto f = cuvs::neighbors::filtering::roaring_filter(bm);
cuvs::neighbors::brute_force::search(res, params, index, queries,
                                     neighbors, distances, f);

// one filter per query (counterpart of bitmap_filter)
std::vector<const cuvs::core::gpu_roaring*> per_query = {...};
auto mf = cuvs::neighbors::filtering::roaring_matrix_filter(
  per_query.data(), n_queries);
cuvs::neighbors::brute_force::search(res, params, index, queries,
                                     neighbors, distances, mf);

cuvs::core::gpu_roaring is RAII (rmm-backed), movable, and also
provides set_and/set_or/multi_and/multi_or for composing predicate
bitmaps on-GPU and to_bitset() for interop with the existing filters.
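
To make the intended semantics of predicate composition concrete, here is a host-side sketch over plain sorted id lists. It does not use `gpu_roaring` or its signatures (which are not shown in this description); `id_list`, `ids_and`, `ids_or` and `ids_multi_and` are hypothetical names that only illustrate what AND/OR of two or more predicates should yield.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

using id_list = std::vector<std::uint32_t>;  // sorted, unique row ids

// AND of two predicates: rows that pass both.
id_list ids_and(const id_list& a, const id_list& b)
{
  id_list out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
  return out;
}

// OR of two predicates: rows that pass at least one.
id_list ids_or(const id_list& a, const id_list& b)
{
  id_list out;
  std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
  return out;
}

// AND across many predicates, folded left to right. An empty input yields an empty list.
id_list ids_multi_and(const std::vector<id_list>& lists)
{
  if (lists.empty()) { return {}; }
  id_list acc = lists.front();
  for (std::size_t i = 1; i < lists.size(); ++i) { acc = ids_and(acc, lists[i]); }
  return acc;
}
```

A reference like this is also a convenient oracle when testing the on-GPU set operations, assuming their results can be read back as sorted ids.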

Dispatch (no count kernels — cardinalities are known on the host)

selectivity s	path	why
s ≤ t_sparse(dim)	container ids → CSR (indptr free from host cardinalities, one emission kernel) → `sddmm` → sparse `select_k`	nnz-proportional work
t_sparse < s < 0.45 (shared filter)	ids → chunked row gather → dense GEMM → select_k → id remap	computes |filter| columns instead of n_rows
s ≥ 0.45	decompress to bitset / bit-matrix, delegate to the existing dense masked pipeline	already optimal when most rows pass
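
One reading of "indptr free from host cardinalities" is that the CSR row pointers are just a prefix sum over per-query filter cardinalities, so no device-side count is needed. The sketch below assumes that interpretation; `indptr_from_cardinalities` is a hypothetical name, not a function from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build CSR row pointers from per-query filter cardinalities (known on the host).
// indptr[q] is the offset of query q's first selected column; indptr.back() is nnz.
std::vector<std::int64_t> indptr_from_cardinalities(const std::vector<std::int64_t>& card)
{
  std::vector<std::int64_t> indptr(card.size() + 1, 0);
  for (std::size_t q = 0; q < card.size(); ++q) {
    indptr[q + 1] = indptr[q] + card[q];
  }
  return indptr;
}
```

For a shared filter every query has the same cardinality c, so the row pointers reduce to 0, c, 2c, and so on.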

t_sparse is dimension-dependent because cusparse SDDMM degrades with
dim while dense GEMM does not: measured crossovers are ~3% at d=128
and ~0.1% at d=512 (RTX 5090); encoded as dim >= 256 ? 0.001 : 0.03.
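
The shared-filter dispatch described above can be sketched as a small host function. The thresholds (0.45 and `dim >= 256 ? 0.001 : 0.03`) come from this description; the names `regime`, `sparse_threshold` and `pick_regime` are hypothetical, and the per-query case is not covered because the table only lists the gather path for the shared filter.

```cpp
#include <cassert>
#include <cstdint>

enum class regime { sparse_sddmm, gather_gemm, dense_delegate };

// Dimension-dependent sparse/mid crossover, as encoded in the PR description.
inline double sparse_threshold(int dim) { return dim >= 256 ? 0.001 : 0.03; }

// Pick a pipeline from the filter cardinality, which is known on the host.
// Precondition: n_rows > 0 and 0 <= cardinality <= n_rows.
inline regime pick_regime(std::int64_t cardinality, std::int64_t n_rows, int dim)
{
  const double s = static_cast<double>(cardinality) / static_cast<double>(n_rows);
  if (s >= 0.45) { return regime::dense_delegate; }              // decompress + existing path
  if (s <= sparse_threshold(dim)) { return regime::sparse_sddmm; }  // CSR + SDDMM
  return regime::gather_gemm;                                    // gather + dense GEMM
}
```

Because the selectivity is computed from host-side metadata, choosing a branch needs no count kernel and no host sync.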

Measured (RTX 5090, fp32 IP, k=10, recall 1.0 in every cell)

Shared filter, 10M x 512d, 64 queries, vs bitset_filter today:

s	uniform	clustered
0.01%	4.5x	5.2x
0.1%	11.0x	11.3x
1%	18.7x	18.7x
10%	18.2x	18.4x
30%	1.7x	1.6x
≥50%	1.0x (delegates)	1.0x
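
The 1-10% rows in this table fall in the gather regime. As an illustration of what that pipeline computes (not of how the PR implements it), here is a single-query CPU reference for inner product: gather the selected rows, score them densely, take the top k, then remap compact indices back to original row ids. `gather_search_ip` and `hit` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using hit = std::pair<std::int64_t, float>;  // (row id, inner-product score)

// dataset: row-major [n_rows, dim]; ids: row ids that pass the filter.
// Returns up to k hits carrying ORIGINAL row ids, highest score first.
std::vector<hit> gather_search_ip(const std::vector<float>& dataset, std::size_t dim,
                                  const std::vector<std::int64_t>& ids,
                                  const std::vector<float>& query, std::size_t k)
{
  // 1) Gather: copy only the selected rows into a compact [ids.size(), dim] matrix.
  std::vector<float> gathered(ids.size() * dim);
  for (std::size_t j = 0; j < ids.size(); ++j) {
    const float* row = dataset.data() + static_cast<std::size_t>(ids[j]) * dim;
    std::copy_n(row, dim, gathered.begin() + static_cast<std::ptrdiff_t>(j * dim));
  }
  // 2) Dense product over the compact matrix (the GEMM step, for one query).
  std::vector<hit> scored(ids.size());
  for (std::size_t j = 0; j < ids.size(); ++j) {
    float acc = 0.0f;
    for (std::size_t d = 0; d < dim; ++d) { acc += gathered[j * dim + d] * query[d]; }
    scored[j] = hit{static_cast<std::int64_t>(j), acc};  // compact index for now
  }
  // 3) select_k: keep the k best scores.
  k = std::min(k, scored.size());
  std::partial_sort(scored.begin(), scored.begin() + static_cast<std::ptrdiff_t>(k),
                    scored.end(),
                    [](const hit& a, const hit& b) { return a.second > b.second; });
  scored.resize(k);
  // 4) id remap: compact index -> original row id.
  for (auto& h : scored) { h.first = ids[static_cast<std::size_t>(h.first)]; }
  return scored;
}
```

The dense step touches ids.size() rows rather than n_rows, which is where the savings in the mid-selectivity range would come from.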

Per-query filters, 10M x 128d, 256 queries, vs bitmap_filter today:
2.8-6.6x at s ≤ 1% (and 1.4-3.1x across 1-6.25%), with the dense
[n_queries, n_rows] bitmap never materialized below the dense regime.

Filter memory at 10M rows: bitset fixed 1.25 MB; roaring 3 KB at 0.01%
uniform, 2-64 KB clustered (the common production shape), parity only
when genuinely dense.
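
As a rough cross-check on these sizes, the following sketch estimates the payload of a Roaring bitmap built from sorted ids, assuming the standard Roaring layout: ids are grouped by their high 16 bits, a group with at most 4096 ids is stored as a sorted array of 16-bit values, and a larger group as a 65536-bit (8 KB) bitmap. Whether `gpu_roaring` uses exactly this layout is an assumption; `roaring_estimate` and `estimate_roaring` are hypothetical names, and per-container metadata is not counted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct roaring_estimate {
  std::size_t containers;     // number of non-empty 2^16-wide chunks
  std::size_t payload_bytes;  // array/bitmap payload only, no metadata
};

// sorted_ids must be sorted and unique.
roaring_estimate estimate_roaring(const std::vector<std::uint32_t>& sorted_ids)
{
  roaring_estimate est{0, 0};
  std::size_t i = 0;
  while (i < sorted_ids.size()) {
    const std::uint32_t key = sorted_ids[i] >> 16;  // chunk = high 16 bits
    std::size_t card = 0;
    while (i < sorted_ids.size() && (sorted_ids[i] >> 16) == key) {
      ++card;
      ++i;
    }
    est.containers += 1;
    // Array container: 2 bytes per id. Bitmap container: fixed 8192 bytes.
    est.payload_bytes += (card <= 4096) ? card * 2 : 8192;
  }
  return est;
}
```

Under these assumptions, 1000 uniformly spread ids (0.01% of 10M rows) give about 2 KB of array payload across at most 153 containers, which is in the same range as the 3 KB quoted above once some per-container metadata is added.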

Tests

cpp/tests/neighbors/brute_force_roaring.cu (added to NEIGHBORS_TEST):
20 parameterized cases cross-validating roaring_filter against
bitset_filter and roaring_matrix_filter against bitmap_filter
(tie-tolerant distance comparison + filter-membership assertions) across
InnerProduct / L2Expanded / CosineExpanded, d ∈ {64, 512}, and
selectivities exercising all three dispatch regimes. All pass.
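
The two checks named above could look roughly like the sketch below. It is not the test code from this PR: `distances_match` and `all_in_filter` are hypothetical helpers, and comparing sorted distance lists within a tolerance is one assumed way to be tie-tolerant (rows with equal distance may be returned in either order, so ids are not compared directly).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tie-tolerant comparison: the two result sets agree if their sorted distances
// match within eps. Arguments are taken by value because they are sorted locally.
bool distances_match(std::vector<float> a, std::vector<float> b, float eps)
{
  if (a.size() != b.size()) { return false; }
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::fabs(a[i] - b[i]) > eps) { return false; }
  }
  return true;
}

// Filter-membership assertion: every returned id must be in the filter.
// allowed must be sorted so binary_search is valid.
bool all_in_filter(const std::vector<std::int64_t>& returned,
                   const std::vector<std::int64_t>& allowed)
{
  return std::all_of(returned.begin(), returned.end(), [&](std::int64_t id) {
    return std::binary_search(allowed.begin(), allowed.end(), id);
  });
}
```

The membership check matters on its own: a result can have correct distances and still contain a row the filter should have excluded when two rows tie.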

Limitations / follow-ups

- float32 datasets only (this matches the existing filtered path's
  limitation for half); half/int8 support is a follow-up.
- from_sorted_ids builds containers host-side (construction is
  one-time per filter; a device-side builder is a follow-up).
- Negated (complement) bitmaps are rejected by the filters.
- CRoaring portable-serialization import (ecosystem interop) is a
  natural follow-up, since the format is container-compatible.
- RUN containers are supported by the read paths but not yet produced
  by construction.
- IVF-Flat per-candidate roaring filtering (measured +12-22% at 5-10%
  selectivity) is a separate follow-up PR.

Files

New:

cpp/include/cuvs/core/roaring.hpp — gpu_roaring, roaring_view
(device test()), construction, set ops, decompression, CSR emission
cpp/src/core/roaring/roaring.cu — implementation
cpp/include/cuvs/neighbors/roaring_filter.hpp — the two filters
cpp/src/neighbors/detail/knn_brute_force_roaring.cuh — dispatch +
the three pipelines
cpp/tests/neighbors/brute_force_roaring.cu

Modified:

cpp/include/cuvs/neighbors/common.hpp — FilterType::{Roaring, RoaringMatrix}
cpp/src/neighbors/detail/knn_brute_force.cuh — include + two
dynamic_cast dispatch cases in detail::search; default argument of
brute_force_search_filtered moved to its first declaration
cpp/CMakeLists.txt, cpp/tests/CMakeLists.txt — source/test
registration

🤖 Generated with Claude Code

…query)

Adds cuvs::neighbors::filtering::roaring_filter (counterpart of bitset_filter) and roaring_matrix_filter (counterpart of bitmap_filter) over a cuvs::core::gpu_roaring compressed filter type, with a selectivity-driven three-regime brute-force dispatch:

- very sparse: container ids -> CSR (indptr free from host-side cardinalities, one emission kernel, no count syncs) -> sddmm -> sparse select_k
- mid (shared filter): gather selected rows -> dense GEMM -> select_k -> id remap (computes |filter| columns instead of n_rows)
- dense (s >= 0.45): decompress and delegate to the existing bitset/bitmap pipeline

Sparse/mid threshold is dimension-dependent (~3% at d=128, ~0.1% at d=512, measured on RTX 5090).

Measured vs the stock paths: 11-19x for the shared filter at 0.1-10% selectivity (10Mx512d, Q=64), 2.8-6.6x for per-query filters at s<=1% (10Mx128d, Q=256), parity at >=50% via delegation; recall 1.0 everywhere.

Filter memory scales with filter cardinality/structure (3KB-64KB typical at 10M rows vs a fixed 1.25MB bitset; the dense [n_queries, n_rows] bit matrix is never materialized below the dense regime).

Tests: 20 parameterized gtest cases cross-validating both filters against bitset_filter/bitmap_filter across IP/L2/Cosine, d in {64,512}, and all three dispatch regimes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

copy-pr-bot · 2026-06-12T05:52:27Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-project-automation Bot added this to Unstructured Data Processing Jun 12, 2026

maxwbuckley wants to merge 1 commit into rapidsai:main from maxwbuckley:roaring-filters
