Improve balanced k-means rebalancing by anaruse · Pull Request #2222 · rapidsai/cuvs

anaruse · 2026-06-09T05:48:27Z

Motivation

The current balanced k-means rebalancing heuristic is not always effective at reducing oversized partitions.

This PR improves the rebalancing heuristic, introduces separate lower and upper balance tolerances, and adds a C++ example for evaluating partition balance.

Changes

Balanced k-means parameters

Added separate lower and upper balance tolerance parameters:

balance_lower_tolerance
balance_upper_tolerance

The thresholds are computed as:

lower_threshold = average_partition_size * balance_lower_tolerance
upper_threshold = average_partition_size * balance_upper_tolerance

This allows underflow and overflow handling to be tuned independently.
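As a rough illustration of the two formulas above, the following C++ sketch computes both thresholds and counts the partitions that fall outside them. The `balance_report` struct and the `evaluate_balance` function are hypothetical names used only for this illustration; they are not part of the cuVS API. The strict `<` and `>` comparisons are an assumption based on the `underflow=... (< 250)` and `overflow=... (> 4000)` format of the example output posted later in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type for this sketch; not part of the cuVS API.
struct balance_report {
  double lower_threshold;
  double upper_threshold;
  std::int64_t underflow;  // partitions smaller than lower_threshold
  std::int64_t overflow;   // partitions larger than upper_threshold
};

// Computes both thresholds from a floating-point average and counts the
// partitions that fall outside them.
inline balance_report evaluate_balance(const std::vector<std::int64_t>& partition_sizes,
                                       double balance_lower_tolerance,
                                       double balance_upper_tolerance)
{
  assert(!partition_sizes.empty());
  std::int64_t total = 0;
  for (std::int64_t size : partition_sizes) { total += size; }

  // Floating-point division: integer division would truncate the average.
  const double average_partition_size =
    static_cast<double>(total) / static_cast<double>(partition_sizes.size());

  balance_report report{average_partition_size * balance_lower_tolerance,
                        average_partition_size * balance_upper_tolerance,
                        0,
                        0};
  for (std::int64_t size : partition_sizes) {
    if (static_cast<double>(size) < report.lower_threshold) { ++report.underflow; }
    if (static_cast<double>(size) > report.upper_threshold) { ++report.overflow; }
  }
  return report;
}
```

With an average partition size of 1000, tolerances of 0.5 and 2.0 give thresholds of 500 and 2000, so loosening one side does not change how the other side is counted.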

Rebalancing improvements

Improved the rebalancing heuristic by explicitly pairing underfull and overfull partitions.

New centroids are created near oversized partitions, resulting in a more balanced final partition size distribution.
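The pairing idea can be sketched on the host side as follows. This is only an illustration under stated assumptions, not the code in this PR: the function name `pair_partitions`, the "smallest underfull with largest overfull" matching order, and the choice to stop when either side runs out are all hypothetical. Where the new centroid is placed relative to the oversized partition is not covered here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns (receiver, donor) index pairs. The smallest underfull partition is
// matched with the largest overfull one, the next smallest with the next
// largest, and so on until either list is exhausted. Illustrative only.
inline std::vector<std::pair<std::int64_t, std::int64_t>> pair_partitions(
  const std::vector<std::int64_t>& sizes, double lower_threshold, double upper_threshold)
{
  assert(lower_threshold <= upper_threshold);
  std::vector<std::int64_t> underfull;
  std::vector<std::int64_t> overfull;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    const double size = static_cast<double>(sizes[i]);
    if (size < lower_threshold) {
      underfull.push_back(static_cast<std::int64_t>(i));
    } else if (size > upper_threshold) {
      overfull.push_back(static_cast<std::int64_t>(i));
    }
  }

  // Receivers: smallest partitions first (stable, so ties keep index order).
  std::stable_sort(underfull.begin(), underfull.end(), [&sizes](std::int64_t a, std::int64_t b) {
    return sizes[a] < sizes[b];
  });
  // Donors: largest partitions first.
  std::stable_sort(overfull.begin(), overfull.end(), [&sizes](std::int64_t a, std::int64_t b) {
    return sizes[a] > sizes[b];
  });

  std::vector<std::pair<std::int64_t, std::int64_t>> pairs;
  const std::size_t n_pairs = std::min(underfull.size(), overfull.size());
  for (std::size_t k = 0; k < n_pairs; ++k) {
    pairs.emplace_back(underfull[k], overfull[k]);
  }
  return pairs;
}
```

Because donors are selected only from partitions above the upper threshold, an empty partition can never be chosen as a donor in this sketch.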

Example updates

Added a C++ example for evaluating balanced k-means partition balance. The example:

accepts multiple lower and upper tolerance values in a single run
runs regular k-means once as a baseline
reports partition size statistics for regular and balanced k-means
reports underflow and overflow partition counts
prints histograms using a shared range for easier comparison
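The partition size statistics listed above can be derived from a label array roughly as follows. This is a minimal sketch, not the example program itself: `partition_stats`, `count_partition_sizes`, and `summarize` are hypothetical names, and the use of the population standard deviation (dividing by the number of partitions) is an assumption, since the thread does not say which variant the example reports.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary type for this sketch.
struct partition_stats {
  std::int64_t min;
  std::int64_t max;
  double median;
  double mean;
  double stddev;  // population standard deviation (assumption)
};

// Counts how many vectors were assigned to each partition.
inline std::vector<std::int64_t> count_partition_sizes(const std::vector<std::uint32_t>& labels,
                                                       std::uint32_t n_partitions)
{
  std::vector<std::int64_t> sizes(n_partitions, 0);
  for (std::uint32_t label : labels) {
    assert(label < n_partitions);
    ++sizes[label];
  }
  return sizes;
}

// Takes the sizes by value because they are sorted locally.
inline partition_stats summarize(std::vector<std::int64_t> sizes)
{
  assert(!sizes.empty());
  std::sort(sizes.begin(), sizes.end());
  const std::size_t n = sizes.size();

  // For an even count, the median is the mean of the two middle values.
  const double median = (n % 2 == 1)
                          ? static_cast<double>(sizes[n / 2])
                          : 0.5 * static_cast<double>(sizes[n / 2 - 1] + sizes[n / 2]);

  double sum = 0.0;
  for (std::int64_t s : sizes) { sum += static_cast<double>(s); }
  const double mean = sum / static_cast<double>(n);

  double squared_deviation = 0.0;
  for (std::int64_t s : sizes) {
    const double d = static_cast<double>(s) - mean;
    squared_deviation += d * d;
  }
  const double stddev = std::sqrt(squared_deviation / static_cast<double>(n));

  return partition_stats{sizes.front(), sizes.back(), median, mean, stddev};
}
```

The `min/mean` and `max/mean` ratios then follow directly from the returned fields.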

Documentation

Updated the C++ API documentation and example documentation for the new balanced k-means parameters and defaults.

Testing

Built libcuvs successfully
Built the balanced k-means C++ example successfully
Evaluated SIFT-1M, GloVe, Wiki, and DEEP datasets
Compared partition size statistics between regular and balanced k-means

No issue is closed by this PR.

- Add balance tolerance and centroid offset parameters
- Rework center adjustment to split oversized partitions more effectively
- Document tolerance limits for heuristic rebalancing
- Add a balanced k-means example with regular k-means comparison

Split balanced k-means tolerance into lower and upper bounds so users can control underflow and overflow thresholds independently. Update the balanced k-means example to evaluate multiple tolerance combinations in one run and report clearer partition size statistics, including shared-range histograms. Also update the documentation for the new parameters and their defaults.

copy-pr-bot · 2026-06-09T05:48:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-09T05:53:19Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b8081179-aeb9-4b10-b587-dab763d48611

📥 Commits

Reviewing files that changed from the base of the PR and between 30f4502 and 519316d.

📒 Files selected for processing (2)

cpp/src/cluster/detail/kmeans_balanced.cuh
examples/cpp/src/balanced_kmeans_example.cu

🚧 Files skipped from review as they are similar to previous changes (2)

examples/cpp/src/balanced_kmeans_example.cu
cpp/src/cluster/detail/kmeans_balanced.cuh

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added configurable balanced k-means tuning: balance_lower_tolerance, balance_upper_tolerance, and centroid_offset.
- Introduced a runnable “Balanced k-means” example that partitions data and reports partition statistics and comparisons vs standard k-means.
Documentation
- Updated K-Means configuration documentation to include the new balanced k-means parameters and explain their impact.

Walkthrough

This PR adds three balanced-kmeans hyperparameters (balance_lower_tolerance, balance_upper_tolerance, centroid_offset) and refactors center adjustment from single-threshold to paired donor/receiver cluster logic, updating kernels, host logic, EM integration, documentation, a complete example program, and CMake build configuration.

Changes

Balanced K-Means Parameter Extension and Algorithm Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| New Parameter Types and Configuration `cpp/include/cuvs/cluster/kmeans.hpp` | Three new `float` fields (`balance_lower_tolerance`, `balance_upper_tolerance`, `centroid_offset`) are added to `balanced_params` with default values and documented valid ranges. |
| Paired Cluster Center Adjustment Kernel and Host Function `cpp/src/cluster/detail/kmeans_balanced.cuh` | Adds STL includes; replaces the per-cluster adjust kernel with a paired receiver/donor kernel that uses `centroid_offset`; rewrites `adjust_centers` to compute avg/tolerance bounds, sort/select cluster pairs on host, upload indices, launch the paired kernel, and verify all pairs updated. |
| Balancing EM Loop and Call Site Integration `cpp/src/cluster/detail/kmeans_balanced.cuh` | `balancing_em_iters` signature updated to accept separate lower/upper tolerances and `centroid_offset` with `RAFT_EXPECTS` range checks; call sites in `build_clusters` and `build_hierarchical` updated to use the new tolerance parameters. |
| Algorithm Documentation Updates `cpp/src/cluster/kmeans_balanced.cuh` | Rewords "Balancing" documentation to describe underfull/overfull cluster adjustment and directional movement logic instead of single-threshold wording. |
| User-Facing Documentation `examples/README.md`, `fern/pages/cluster/kmeans.md` | Documents the three new parameters in examples README (with command invocation, supported data types/formats, and CLI flag meanings) and in the kmeans documentation parameter table. |
| Balanced K-Means Example Program Implementation `examples/cpp/src/balanced_kmeans_example.cu` | Adds complete CUDA C++ example with argp CLI parsing, dataset auto-detection/loading (BIGANN/XVECS), partition-size statistics, regular k-means baseline (float only), balanced k-means runs for each tolerance pair, partition histograms, and balance-improvement metrics. |
| Example Build Configuration `examples/cpp/CMakeLists.txt` | Adds `BALANCED_KMEANS_EXAMPLE` target compiling `src/balanced_kmeans_example.cu` and links it to `cuvs::cuvs` and optional `conda_env`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

rapidsai/cuvs#2005: Exposes balanced_params through PQ's kmeans_params_variant to allow Product Quantization to configure and invoke balanced k-means, building on the extended parameter structure from this PR.

Suggested labels

improvement, non-breaking, C++

Suggested reviewers

tarang-jain
viclafargue

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Improve balanced k-means rebalancing' is directly related to the main change: improving the rebalancing heuristic with separate tolerances and a new example. |
| Description check | ✅ Passed | The description comprehensively covers all major changes including new parameters, rebalancing improvements, example additions, and documentation updates. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

🧹 Nitpick comments (1)

examples/README.md (1)

34-34: 💤 Low value

Minor grammar: hyphenate compound adjective.

"one third" should be "one-third" when used as a compound adjective modifying "to three times."

✏️ Suggested fix

-outside roughly one third to three times the average partition size. Very strict upper tolerance
+outside roughly one-third to three times the average partition size. Very strict upper tolerance

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/README.md` at line 34, Replace the unhyphenated compound adjective
"one third" with "one-third" in the README sentence that currently reads
"outside roughly one third to three times the average partition size. Very
strict upper tolerance" so it correctly uses the hyphenated form for a compound
adjective.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b5c286d-1519-4be8-97e0-23556feb5e12

📥 Commits

Reviewing files that changed from the base of the PR and between a71b165 and d11ff26.

📒 Files selected for processing (7)

cpp/include/cuvs/cluster/kmeans.hpp
cpp/src/cluster/detail/kmeans_balanced.cuh
cpp/src/cluster/kmeans_balanced.cuh
examples/README.md
examples/cpp/CMakeLists.txt
examples/cpp/src/balanced_kmeans_example.cu
fern/pages/cluster/kmeans.md

anaruse · 2026-06-09T05:55:29Z

I evaluated the updated balanced k-means implementation on the Wiki dataset (1M vectors, 1000 partitions).

The previous implementation (roughly corresponding to balance_lower_tolerance=0.25 and balance_upper_tolerance=4.0) already improves partition balance significantly, reducing the maximum partition size from 15.4× the average to 3.75× and lowering the standard deviation from 1039 to 374.

The new default configuration (balance_lower_tolerance=0.333, balance_upper_tolerance=3.0) further improves the partition size distribution, reducing the maximum partition size to 2.73× the average and lowering the standard deviation to 334.

For applications that require tighter balance, more aggressive settings (balance_lower_tolerance=0.5, balance_upper_tolerance=2.0) can further reduce the maximum partition size to 2.34× the average and lower the standard deviation to 268.

These results show that the updated rebalancing heuristic produces a more balanced partition size distribution and that the newly exposed lower and upper balance tolerances provide a useful mechanism for trading off balance strictness against flexibility.

# dtype: float
# partitions: 1000
# iterations: 20
# balance_lower_tolerances: 0.25 0.333 0.5
# balance_upper_tolerances: 4 3 2
# centroid_offset: 0.01
Partitioning 1000000 vectors with 768 dimensions into 1000 balanced partitions

Regular k-means partition size statistics: min=1, max=15367, median=786.5, mean=1000, standard deviation=1039.03, min/mean=0.001, max/mean=15.367
Regular k-means partition size histogram:
  [       1,      155]   12 | ##
  [     154,      309]   37 | ########
  [     308,      463]  115 | ###########################
  [     462,      617]  152 | ####################################
  [     616,      771]  166 | ########################################
  [     770,      925]  140 | #################################
  [     924,     1078]  101 | ########################
  [    1077,     1232]   77 | ##################
  [    1231,     1386]   51 | ############
  [    1385,     1540]   38 | #########
  [    1539,     1694]   20 | ####
  [    1693,     1848]   19 | ####
  [    1847,     2002]   17 | ####
  [    2001,     2155]   11 | ##
  [    2154,     2309]    4 | #
  [    2308,     2463]    1 | #
  [    2462,     2617]    6 | #
  [    2616,     2771]    0 | 
  [    2770,     2925]    6 | #
  [    2924,     3079]    4 | #
  (    3079,      inf]   23 | #####

# balance_lower_tolerance: 0.25
# balance_upper_tolerance: 4
Balanced k-means partition size statistics: min=228, max=3751, median=945, mean=1000, standard deviation=373.842, min/mean=0.228, max/mean=3.751, underflow=1 (< 250), overflow=0 (> 4000)
Balanced k-means partition size histogram:
  [       1,      155]    0 | 
  [     154,      309]    4 | #
  [     308,      463]   32 | #####
  [     462,      617]   73 | #############
  [     616,      771]  150 | ###########################
  [     770,      925]  220 | ########################################
  [     924,     1078]  183 | #################################
  [    1077,     1232]  129 | #######################
  [    1231,     1386]   86 | ###############
  [    1385,     1540]   51 | #########
  [    1539,     1694]   26 | ####
  [    1693,     1848]   16 | ##
  [    1847,     2002]   11 | ##
  [    2001,     2155]    9 | #
  [    2154,     2309]    2 | #
  [    2308,     2463]    1 | #
  [    2462,     2617]    2 | #
  [    2616,     2771]    1 | #
  [    2770,     2925]    2 | #
  [    2924,     3079]    0 | 
  (    3079,      inf]    2 | #
Balance improvement: max/mean 15.367 -> 3.751, standard deviation 1039.03 -> 373.842

# balance_lower_tolerance: 0.333
# balance_upper_tolerance: 3
Balanced k-means partition size statistics: min=183, max=2730, median=975.5, mean=1000, standard deviation=333.736, min/mean=0.183, max/mean=2.73, underflow=1 (< 333), overflow=0 (> 3000)
Balanced k-means partition size histogram:
  [       1,      155]    0 | 
  [     154,      309]    1 | #
  [     308,      463]   26 | #####
  [     462,      617]   84 | ################
  [     616,      771]  142 | ############################
  [     770,      925]  179 | ###################################
  [     924,     1078]  201 | ########################################
  [    1077,     1232]  156 | ###############################
  [    1231,     1386]  104 | ####################
  [    1385,     1540]   46 | #########
  [    1539,     1694]   29 | #####
  [    1693,     1848]   14 | ##
  [    1847,     2002]    6 | #
  [    2001,     2155]    6 | #
  [    2154,     2309]    2 | #
  [    2308,     2463]    3 | #
  [    2462,     2617]    0 | 
  [    2616,     2771]    1 | #
  [    2770,     2925]    0 | 
  [    2924,     3079]    0 | 
  (    3079,      inf]    0 | 
Balance improvement: max/mean 15.367 -> 2.73, standard deviation 1039.03 -> 333.736

# balance_lower_tolerance: 0.5
# balance_upper_tolerance: 2
Balanced k-means partition size statistics: min=357, max=2336, median=983.5, mean=1000, standard deviation=267.705, min/mean=0.357, max/mean=2.336, underflow=8 (< 500), overflow=1 (> 2000)
Balanced k-means partition size histogram:
  [       1,      155]    0 | 
  [     154,      309]    0 | 
  [     308,      463]    3 | #
  [     462,      617]   61 | ##########
  [     616,      771]  140 | #######################
  [     770,      925]  201 | #################################
  [     924,     1078]  241 | ########################################
  [    1077,     1232]  168 | ###########################
  [    1231,     1386]  107 | #################
  [    1385,     1540]   43 | #######
  [    1539,     1694]   22 | ###
  [    1693,     1848]   10 | #
  [    1847,     2002]    3 | #
  [    2001,     2155]    0 | 
  [    2154,     2309]    0 | 
  [    2308,     2463]    1 | #
  [    2462,     2617]    0 | 
  [    2616,     2771]    0 | 
  [    2770,     2925]    0 | 
  [    2924,     3079]    0 | 
  (    3079,      inf]    0 | 
Balance improvement: max/mean 15.367 -> 2.336, standard deviation 1039.03 -> 267.705
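The histograms above use the same bin range for every run, with a final open-ended bin for larger partitions. A minimal sketch of that idea is shown below. It does not attempt to reproduce the example's exact bin edges, and how the example picks the shared range is not shown in this thread, so `range_min`, `range_max`, and `n_bins` are plain inputs here. The function name `shared_range_histogram` and the choice to clamp below-range sizes into the first bin are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Counts sizes into n_bins equal-width bins over [range_min, range_max],
// plus one trailing overflow bin for sizes above range_max.
// The returned vector therefore has n_bins + 1 entries.
inline std::vector<std::int64_t> shared_range_histogram(const std::vector<std::int64_t>& sizes,
                                                        std::int64_t range_min,
                                                        std::int64_t range_max,
                                                        int n_bins)
{
  assert(n_bins > 0);
  assert(range_max > range_min);
  std::vector<std::int64_t> counts(static_cast<std::size_t>(n_bins) + 1, 0);
  const double width = static_cast<double>(range_max - range_min) / n_bins;

  for (std::int64_t size : sizes) {
    if (size > range_max) {
      ++counts[n_bins];  // overflow bin: (range_max, inf)
      continue;
    }
    if (size < range_min) {
      ++counts[0];  // assumption: clamp below-range sizes into the first bin
      continue;
    }
    int bin = static_cast<int>(static_cast<double>(size - range_min) / width);
    if (bin >= n_bins) { bin = n_bins - 1; }  // size == range_max goes in the last regular bin
    ++counts[bin];
  }
  return counts;
}
```

Calling the function with the same range and bin count for the regular and balanced results makes the rows line up, which is what allows the side-by-side comparison in the output above.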

dantegd

The improvement approach looks solid to me, really nice! I just had some questions and smaller comments about the example and the code.

dantegd · 2026-06-12T18:02:59Z

+                                   raft::make_const_mdspan(regular_labels.view()),
+                                   balance_lower_tolerances.front(),
+                                   balance_upper_tolerances.front());
+    print_partition_size_summary("Regular k-means", regular_reference_stats.value());


Small thing: the regular baseline prints its summary only once, while balanced stats are evaluated per tolerance pair. Would it be useful to report regular k-means underflow/overflow counts for each lower/upper tolerance combination too?

That is a good point. Regular k-means itself does not use these tolerance values, but I originally reported the regular baseline underflow/overflow counts for each balanced k-means tolerance pair to make the comparison direct.

In practice, that made the output quite verbose when evaluating multiple tolerance combinations, so I changed the example to print the regular k-means summary once as a baseline and then print the balanced k-means stats for each tolerance pair.

Compute balance thresholds from a floating-point average, avoid pairing against empty donor clusters, and perform candidate index arithmetic with int64_t intermediates. Fix xvec dataset handling in the balanced k-means example by accounting for per-row dimension headers and reading each row from the beginning.
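For context on the xvec fix described above: in the `.fvecs` / `.bvecs` / `.ivecs` family of formats, every row starts with a 32-bit integer giving its dimension, followed by that many elements. The sketch below reads such a stream while accounting for the per-row header. It is not the loader from the example; `read_xvecs` is a hypothetical name, and the code assumes a little-endian host and rows that all share one dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads all rows from an xvecs-style stream into one row-major buffer.
// T is the element type (e.g. float for .fvecs, std::uint8_t for .bvecs).
template <typename T>
std::vector<T> read_xvecs(std::istream& in, std::int32_t& dim_out)
{
  std::vector<T> data;
  dim_out          = 0;
  std::int32_t dim = 0;
  // Each row begins with its own 4-byte dimension header.
  while (in.read(reinterpret_cast<char*>(&dim), sizeof(dim))) {
    if (dim <= 0) { throw std::runtime_error("invalid row dimension"); }
    if (dim_out == 0) {
      dim_out = dim;
    } else if (dim != dim_out) {
      throw std::runtime_error("inconsistent row dimension");
    }
    const std::size_t offset = data.size();
    data.resize(offset + static_cast<std::size_t>(dim));
    const auto n_bytes = static_cast<std::streamsize>(sizeof(T) * static_cast<std::size_t>(dim));
    if (!in.read(reinterpret_cast<char*>(data.data() + offset), n_bytes)) {
      throw std::runtime_error("truncated row");
    }
  }
  if (in.gcount() != 0) { throw std::runtime_error("truncated row header"); }
  // Postcondition: the buffer holds whole rows only.
  assert(dim_out == 0 || data.size() % static_cast<std::size_t>(dim_out) == 0);
  return data;
}

// Convenience overload for in-memory buffers, handy for quick checks.
template <typename T>
std::vector<T> read_xvecs(const std::string& bytes, std::int32_t& dim_out)
{
  std::istringstream in(bytes, std::ios::binary);
  return read_xvecs<T>(in, dim_out);
}
```

The number of rows is `data.size() / dim_out`. Skipping the header, or computing row offsets as `row * dim * sizeof(T)` without adding the 4 header bytes per row, shifts every row after the first.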

anaruse · 2026-06-15T08:00:11Z

Thanks for the review! I addressed the main feedback in the latest commit and replied to the regular baseline reporting comment. Could you take another look when you have a chance?

anaruse added 2 commits June 9, 2026 12:48

anaruse requested review from a team as code owners June 9, 2026 05:48

github-project-automation Bot added this to Unstructured Data Processing Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

anaruse added 2 commits June 10, 2026 12:51

Merge branch 'main' into main.improve_balanced_kmeans

d4de499

Fix balanced k-means example wording

30f4502

dantegd requested changes Jun 12, 2026
