[FEA] Roaring-bitmap prefilters for brute-force search (shared and per-query) #2240

Draft
maxwbuckley wants to merge 1 commit into rapidsai:main from maxwbuckley:roaring-filters

@maxwbuckley

Contributor

[FEA] Roaring-bitmap prefilters for brute-force search (shared and per-query)

Closes #1972.

Problem

cuVS encodes search prefilters as flat bit arrays: bitset_filter
(n_rows bits shared by all queries) and bitmap_filter
(n_queries x n_rows bits, one row per query). Below ~40-50%
selectivity — the dominant production regime (tag predicates, ACLs,
recency cutoffs) — this leaves performance and memory on the table:

  1. The filter costs n_rows/8 bytes regardless of how few bits are set
    (bitmap_filter: times n_queries; 125 GB at n_queries=1000,
    n_rows=1e9).
  2. The sparse pipeline pays a count() reduction + host sync per search
    to pick a branch, then converts bitmap→CSR twice (once in cuVS,
    once again inside raft::sparse::linalg::masked_matmul).
  3. There is no path that computes only the selected rows densely — the
    mid-selectivity regime (1-30%) runs either a full-dataset GEMM + mask
    or a structure-blind SDDMM, both far from optimal.
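
The footprint in item 1 follows directly from the one-bit-per-row layout. A minimal sketch of that arithmetic (the helper names are hypothetical and not part of cuVS):

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed by a flat bitset over n_rows: one bit per row, rounded up
// to a whole byte. Independent of how many bits are actually set.
inline std::uint64_t flat_bitset_bytes(std::uint64_t n_rows)
{
  return (n_rows + 7) / 8;
}

// Bytes needed by a flat [n_queries, n_rows] bitmap: one bit per
// (query, row) pair, rounded up to a whole byte.
inline std::uint64_t flat_bitmap_bytes(std::uint64_t n_queries, std::uint64_t n_rows)
{
  assert(n_queries > 0);
  return (n_queries * n_rows + 7) / 8;
}
```

With n_rows=1e9 this gives 125 MB for the shared bitset and, at n_queries=1000, the 125 GB quoted above.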

What this PR adds

A compressed Roaring-bitmap filter representation plus a
selectivity-driven three-regime dispatch for brute-force search.

API (mirrors the existing filter API)

#include <cuvs/core/roaring.hpp>
#include <cuvs/neighbors/roaring_filter.hpp>

// build once per filter; cardinality + container metadata stay host-side
auto bm = cuvs::core::from_sorted_ids(res, ids, n_ids, n_rows);

// one filter shared by all queries (counterpart of bitset_filter)
auto f = cuvs::neighbors::filtering::roaring_filter(bm);
cuvs::neighbors::brute_force::search(res, params, index, queries,
                                     neighbors, distances, f);

// one filter per query (counterpart of bitmap_filter)
std::vector<const cuvs::core::gpu_roaring*> per_query = {...};
auto mf = cuvs::neighbors::filtering::roaring_matrix_filter(
  per_query.data(), n_queries);
cuvs::neighbors::brute_force::search(res, params, index, queries,
                                     neighbors, distances, mf);

cuvs::core::gpu_roaring is RAII (rmm-backed), movable, and also
provides set_and/set_or/multi_and/multi_or for composing predicate
bitmaps on-GPU and to_bitset() for interop with the existing filters.
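
For readers unfamiliar with the set-op and interop semantics, here is a host-side reference sketch on plain sorted id lists. It is not the gpu_roaring implementation; the function names are hypothetical, and the 32-bit word layout of the bitset is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Reference semantics of AND-ing two filters: rows present in both.
// Inputs are sorted, duplicate-free row ids.
std::vector<std::uint32_t> ids_and(const std::vector<std::uint32_t>& a,
                                   const std::vector<std::uint32_t>& b)
{
  std::vector<std::uint32_t> out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
  return out;
}

// Reference semantics of expanding a compressed filter to a flat bitset.
// Assumed layout: bit (id % 32) of word (id / 32) marks row `id` as allowed.
std::vector<std::uint32_t> ids_to_bitset(const std::vector<std::uint32_t>& ids,
                                         std::uint32_t n_rows)
{
  std::vector<std::uint32_t> words((static_cast<std::size_t>(n_rows) + 31) / 32, 0u);
  for (std::uint32_t id : ids) {
    assert(id < n_rows);
    words[id / 32] |= (1u << (id % 32));
  }
  return words;
}
```

A device implementation would operate container by container rather than on flat id lists; the sketch only pins down what the results should contain.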

Dispatch (no count kernels — cardinalities are known on the host)

| selectivity s | path | why |
|---|---|---|
| s ≤ t_sparse(dim) | container ids → CSR (indptr free from host cardinalities, one emission kernel) → sddmm → sparse select_k | nnz-proportional work |
| t_sparse < s < 0.45 (shared filter) | ids → chunked row gather → dense GEMM → select_k → id remap | computes \|filter\| columns instead of n_rows |
| s ≥ 0.45 | decompress to bitset / bit-matrix, delegate to the existing dense masked pipeline | already optimal when most rows pass |
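
"indptr free from host cardinalities" means the CSR row offsets need no device-side count: they are a prefix sum over per-query cardinalities that the host already knows. A minimal sketch, assuming one cardinality per query (the function name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build CSR row offsets for a [n_queries, n_rows] mask from the number of
// allowed rows per query. indptr[q] .. indptr[q + 1] is the slice of the
// column-index array that belongs to query q.
std::vector<std::int64_t> indptr_from_cardinalities(const std::vector<std::int64_t>& card)
{
  std::vector<std::int64_t> indptr(card.size() + 1, 0);
  for (std::size_t q = 0; q < card.size(); ++q) {
    assert(card[q] >= 0);
    indptr[q + 1] = indptr[q] + card[q];
  }
  return indptr;
}
```

For a shared filter every entry of `card` is the same value c, so indptr[q] reduces to q * c; only the column indices then need a kernel.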

t_sparse is dimension-dependent because cusparse SDDMM degrades with
dim while dense GEMM does not: measured crossovers are ~3% at d=128
and ~0.1% at d=512 (RTX 5090); encoded as dim >= 256 ? 0.001 : 0.03.
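
The regime choice described above can be sketched as a pure host-side function. The thresholds are the ones stated in this description; the enum and function names are hypothetical and may not match the code in knn_brute_force_roaring.cuh.

```cpp
#include <cassert>
#include <cstdint>

enum class Regime { Sparse, MidGather, Dense };

// Sparse/mid crossover as encoded in this PR: dim >= 256 ? 0.001 : 0.03.
inline double t_sparse(int dim) { return dim >= 256 ? 0.001 : 0.03; }

// Pick a pipeline from the filter cardinality, which is known on the host,
// so no count kernel or device-to-host sync is needed.
inline Regime pick_regime(std::uint64_t cardinality, std::uint64_t n_rows, int dim)
{
  assert(n_rows > 0 && cardinality <= n_rows);
  const double s = static_cast<double>(cardinality) / static_cast<double>(n_rows);
  if (s <= t_sparse(dim)) return Regime::Sparse;  // CSR -> sddmm -> sparse select_k
  if (s < 0.45) return Regime::MidGather;         // gather -> GEMM -> select_k -> remap
  return Regime::Dense;                           // decompress and delegate
}
```

The table lists the mid regime for the shared filter only; how per-query filters are routed in that selectivity range is not covered by this sketch.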

Measured (RTX 5090, fp32 IP, k=10, recall 1.0 in every cell)

Shared filter, 10M x 512d, 64 queries, vs bitset_filter today:

| s | uniform | clustered |
|---|---|---|
| 0.01% | 4.5x | 5.2x |
| 0.1% | 11.0x | 11.3x |
| 1% | 18.7x | 18.7x |
| 10% | 18.2x | 18.4x |
| 30% | 1.7x | 1.6x |
| ≥50% | 1.0x (delegates) | 1.0x |

Per-query filters, 10M x 128d, 256 queries, vs bitmap_filter today:
2.8-6.6x at s ≤ 1% (and 1.4-3.1x across 1-6.25%), with the dense
[n_queries, n_rows] bitmap never materialized below the dense regime.

Filter memory at 10M rows: bitset fixed 1.25 MB; roaring 3 KB at 0.01%
uniform, 2-64 KB clustered (the common production shape), parity only
when genuinely dense.
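
To see why the compressed size tracks cardinality, here is a rough payload estimate under the standard Roaring layout: ids are grouped by their high 16 bits, a group of up to 4096 ids is stored as 2-byte entries, and a larger group as an 8 KiB bitmap. That gpu_roaring uses exactly these thresholds is an assumption; the function name is hypothetical, and key/offset metadata and RUN containers are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Estimate container payload bytes for sorted, duplicate-free row ids.
std::uint64_t roaring_payload_bytes(const std::vector<std::uint32_t>& sorted_ids)
{
  assert(std::is_sorted(sorted_ids.begin(), sorted_ids.end()));
  std::uint64_t bytes = 0;
  std::size_t i = 0;
  while (i < sorted_ids.size()) {
    const std::uint32_t key = sorted_ids[i] >> 16;  // container key = high 16 bits
    std::size_t j = i;
    while (j < sorted_ids.size() && (sorted_ids[j] >> 16) == key) ++j;
    const std::uint64_t card = j - i;               // ids in this container
    bytes += (card <= 4096) ? 2 * card : 8192;      // array vs bitmap container
    i = j;
  }
  return bytes;
}
```

At 10M rows and 0.01% selectivity there are 1,000 ids, so the array payload is about 2 KB before metadata, the same order as the 3 KB figure above.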

Tests

cpp/tests/neighbors/brute_force_roaring.cu (added to NEIGHBORS_TEST):
20 parameterized cases cross-validating roaring_filter against
bitset_filter and roaring_matrix_filter against bitmap_filter
(tie-tolerant distance comparison + filter-membership assertions) across
InnerProduct / L2Expanded / CosineExpanded, d ∈ {64, 512}, and
selectivities exercising all three dispatch regimes. All pass.
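
A sketch of what the per-query check amounts to. It is not the code in brute_force_roaring.cu; the function name, signature and tolerance handling are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compare one query's results against a reference path. Distances are
// compared after sorting, so equal-distance neighbors may appear in a
// different order or be different rows. Every returned id must still
// pass the filter (`sorted_allowed_ids` is sorted ascending).
bool results_match(const std::vector<float>& ref_dist,
                   const std::vector<float>& got_dist,
                   const std::vector<std::int64_t>& got_ids,
                   const std::vector<std::int64_t>& sorted_allowed_ids,
                   float tol)
{
  assert(ref_dist.size() == got_dist.size() && got_dist.size() == got_ids.size());
  std::vector<float> a = ref_dist;
  std::vector<float> b = got_dist;
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (std::fabs(a[i] - b[i]) > tol) return false;  // distance mismatch
  }
  for (std::int64_t id : got_ids) {
    if (!std::binary_search(sorted_allowed_ids.begin(), sorted_allowed_ids.end(), id)) {
      return false;  // neighbor that the filter should have excluded
    }
  }
  return true;
}
```

Comparing distances instead of ids keeps the check stable when several rows are equidistant from the query.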

Limitations / follow-ups

  • float32 datasets only (the existing filtered path has the same
    limitation for half); half/int8 support is a follow-up.
  • from_sorted_ids builds containers host-side (construction is
    one-time per filter; a device-side builder is a follow-up).
  • Negated (complement) bitmaps are rejected by the filters.
  • CRoaring portable-serialization import (ecosystem interop) is a
    natural follow-up — the format is container-compatible.
  • RUN containers are supported by the read paths but not yet produced
    by construction.
  • IVF-Flat per-candidate roaring filtering (measured +12-22% at 5-10%
    selectivity) is a separate follow-up PR.

Files

New:

  • cpp/include/cuvs/core/roaring.hpp: gpu_roaring, roaring_view
    (device test()), construction, set ops, decompression, CSR emission
  • cpp/src/core/roaring/roaring.cu: implementation
  • cpp/include/cuvs/neighbors/roaring_filter.hpp: the two filters
  • cpp/src/neighbors/detail/knn_brute_force_roaring.cuh: dispatch +
    the three pipelines
  • cpp/tests/neighbors/brute_force_roaring.cu

Modified:

  • cpp/include/cuvs/neighbors/common.hpp: FilterType::{Roaring, RoaringMatrix}
  • cpp/src/neighbors/detail/knn_brute_force.cuh: include + two
    dynamic_cast dispatch cases in detail::search; default argument of
    brute_force_search_filtered moved to its first declaration
  • cpp/CMakeLists.txt, cpp/tests/CMakeLists.txt: source/test
    registration

🤖 Generated with Claude Code

…query)

Adds cuvs::neighbors::filtering::roaring_filter (counterpart of
bitset_filter) and roaring_matrix_filter (counterpart of bitmap_filter)
over a cuvs::core::gpu_roaring compressed filter type, with a
selectivity-driven three-regime brute-force dispatch:

- very sparse: container ids -> CSR (indptr free from host-side
  cardinalities, one emission kernel, no count syncs) -> sddmm ->
  sparse select_k
- mid (shared filter): gather selected rows -> dense GEMM -> select_k
  -> id remap (computes |filter| columns instead of n_rows)
- dense (s >= 0.45): decompress and delegate to the existing
  bitset/bitmap pipeline
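
The mid-regime steps can be illustrated with a single-query CPU reference for inner product. This is a sketch of the data flow only: the function name is hypothetical, and the actual pipeline works on the GPU with chunked gathers and a GEMM.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns up to k (score, original_row_id) pairs, highest inner product first.
std::vector<std::pair<float, std::uint32_t>> filtered_topk_ip(
  const std::vector<float>& dataset,      // row-major [n_rows, dim]
  std::size_t dim,
  const std::vector<float>& query,        // [dim]
  const std::vector<std::uint32_t>& ids,  // rows that pass the filter
  std::size_t k)
{
  assert(dim > 0 && query.size() == dim && dataset.size() % dim == 0);
  const std::size_t n_rows = dataset.size() / dim;

  // 1. Gather: copy only the selected rows into a compact matrix.
  std::vector<float> gathered(ids.size() * dim);
  for (std::size_t i = 0; i < ids.size(); ++i) {
    assert(ids[i] < n_rows);
    std::copy_n(dataset.begin() + static_cast<std::ptrdiff_t>(ids[i] * dim),
                dim,
                gathered.begin() + static_cast<std::ptrdiff_t>(i * dim));
  }

  // 2. Dense scores over |filter| rows instead of n_rows.
  std::vector<std::pair<float, std::uint32_t>> scored(ids.size());
  for (std::size_t i = 0; i < ids.size(); ++i) {
    float dot = 0.0f;
    for (std::size_t d = 0; d < dim; ++d) dot += gathered[i * dim + d] * query[d];
    scored[i] = {dot, static_cast<std::uint32_t>(i)};  // local (gathered) index
  }

  // 3. select_k on the compact score vector.
  k = std::min(k, scored.size());
  std::partial_sort(scored.begin(), scored.begin() + static_cast<std::ptrdiff_t>(k),
                    scored.end(),
                    [](const auto& x, const auto& y) { return x.first > y.first; });
  scored.resize(k);

  // 4. Id remap: translate local indices back to original row ids.
  for (auto& s : scored) s.second = ids[s.second];
  return scored;
}
```

The remap in step 4 is what lets steps 2 and 3 ignore the filter entirely: excluded rows never enter the compact matrix.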

Sparse/mid threshold is dimension-dependent (~3% at d=128, ~0.1% at
d=512, measured on RTX 5090). Measured vs the stock paths: 11-19x for
the shared filter at 0.1-10% selectivity (10Mx512d, Q=64), 2.8-6.6x for
per-query filters at s<=1% (10Mx128d, Q=256), parity at >=50% via
delegation; recall 1.0 everywhere. Filter memory scales with filter
cardinality/structure (3KB-64KB typical at 10M rows vs a fixed 1.25MB
bitset; the dense [n_queries, n_rows] bit matrix is never materialized
below the dense regime).

Tests: 20 parameterized gtest cases cross-validating both filters
against bitset_filter/bitmap_filter across IP/L2/Cosine, d in {64,512},
and all three dispatch regimes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Development

Successfully merging this pull request may close these issues.

[FEA] Roaring Bitmap support
