
Improve balanced k-means rebalancing #2222

Open
anaruse wants to merge 5 commits into rapidsai:main from anaruse:main.improve_balanced_kmeans

@anaruse

@anaruse anaruse commented Jun 9, 2026

Contributor

Motivation

The current balanced k-means rebalancing heuristic is not always effective at reducing oversized partitions.

This PR improves the rebalancing heuristic, introduces separate lower and upper balance tolerances, and adds a C++ example for evaluating partition balance.

Changes

Balanced k-means parameters

Added separate lower and upper balance tolerance parameters:

  • balance_lower_tolerance
  • balance_upper_tolerance

The thresholds are computed as:

lower_threshold = average_partition_size * balance_lower_tolerance
upper_threshold = average_partition_size * balance_upper_tolerance

This allows underflow and overflow handling to be tuned independently.
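
As a rough illustration of the threshold formulas above, here is a minimal host-side sketch in plain C++. It is not the code in this PR: `BalanceCounts` and `count_out_of_range` are hypothetical names, and the strict `<` / `>` comparisons are an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of cuVS): thresholds plus out-of-range counts.
struct BalanceCounts {
  double lower_threshold;
  double upper_threshold;
  std::size_t underflow;  // partitions smaller than lower_threshold
  std::size_t overflow;   // partitions larger than upper_threshold
};

inline BalanceCounts count_out_of_range(const std::vector<std::int64_t>& sizes,
                                        double balance_lower_tolerance,
                                        double balance_upper_tolerance)
{
  BalanceCounts out{0.0, 0.0, 0, 0};
  if (sizes.empty()) { return out; }

  // Floating-point average, so the thresholds are not affected by
  // integer-division truncation.
  double total = 0.0;
  for (std::int64_t s : sizes) { total += static_cast<double>(s); }
  const double average_partition_size = total / static_cast<double>(sizes.size());

  out.lower_threshold = average_partition_size * balance_lower_tolerance;
  out.upper_threshold = average_partition_size * balance_upper_tolerance;

  // Assumption: strict comparisons decide underflow and overflow.
  for (std::int64_t s : sizes) {
    const double v = static_cast<double>(s);
    if (v < out.lower_threshold) { ++out.underflow; }
    if (v > out.upper_threshold) { ++out.overflow; }
  }
  return out;
}
```

Because the two tolerances are independent arguments, tightening the upper bound does not change which partitions count as underfull, and vice versa.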

Rebalancing improvements

Improved the rebalancing heuristic by explicitly pairing underfull and overfull partitions.

New centroids are created near oversized partitions, resulting in a more balanced final partition size distribution.
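
The pairing idea can be sketched on the host as follows. This is only an assumption-laden illustration, not the kernel or host logic in this PR: `pair_partitions` is a hypothetical name, and matching the smallest underfull partition with the largest overfull one is just one plausible selection rule. Placing the new centroid near the donor is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: returns (receiver, donor) index pairs, where the
// receiver is an underfull partition and the donor is an overfull one.
inline std::vector<std::pair<std::size_t, std::size_t>> pair_partitions(
  const std::vector<std::int64_t>& sizes, double lower_threshold, double upper_threshold)
{
  std::vector<std::size_t> receivers;
  std::vector<std::size_t> donors;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    const double s = static_cast<double>(sizes[i]);
    if (s < lower_threshold) {
      receivers.push_back(i);
    } else if (s > upper_threshold) {
      donors.push_back(i);
    }
  }

  // Smallest receivers first, largest donors first; stable sort keeps ties
  // in index order so the result is deterministic.
  std::stable_sort(receivers.begin(), receivers.end(),
                   [&](std::size_t a, std::size_t b) { return sizes[a] < sizes[b]; });
  std::stable_sort(donors.begin(), donors.end(),
                   [&](std::size_t a, std::size_t b) { return sizes[a] > sizes[b]; });

  // Each partition takes part in at most one pair per adjustment step.
  const std::size_t n_pairs = std::min(receivers.size(), donors.size());
  std::vector<std::pair<std::size_t, std::size_t>> pairs;
  pairs.reserve(n_pairs);
  for (std::size_t i = 0; i < n_pairs; ++i) {
    pairs.emplace_back(receivers[i], donors[i]);
  }
  return pairs;
}
```

Under this rule, the receiver's centroid would then be moved next to the donor's centroid so that the following assignment step splits the oversized partition.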

Example updates

Added a C++ example for evaluating balanced k-means partition balance. The example:

  • accepts multiple lower and upper tolerance values in a single run
  • runs regular k-means once as a baseline
  • reports partition size statistics for regular and balanced k-means
  • reports underflow and overflow partition counts
  • prints histograms using a shared range for easier comparison
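
For reference, the partition size statistics listed above could be computed along these lines. This is a minimal sketch, not the example's actual code: `PartitionStats` and `compute_partition_stats` are hypothetical names, and using the population standard deviation (dividing by n) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of a partition size distribution.
struct PartitionStats {
  std::int64_t min;
  std::int64_t max;
  double median;
  double mean;
  double stddev;
};

// Takes the sizes by value so the caller's vector is not reordered.
inline PartitionStats compute_partition_stats(std::vector<std::int64_t> sizes)
{
  assert(!sizes.empty());
  std::sort(sizes.begin(), sizes.end());
  const std::size_t n = sizes.size();

  PartitionStats st{};
  st.min = sizes.front();
  st.max = sizes.back();
  // For an even count, the median is the mean of the two middle values.
  st.median = (n % 2 == 1)
                ? static_cast<double>(sizes[n / 2])
                : 0.5 * (static_cast<double>(sizes[n / 2 - 1]) + static_cast<double>(sizes[n / 2]));

  double sum = 0.0;
  for (std::int64_t s : sizes) { sum += static_cast<double>(s); }
  st.mean = sum / static_cast<double>(n);

  double sq = 0.0;
  for (std::int64_t s : sizes) {
    const double d = static_cast<double>(s) - st.mean;
    sq += d * d;
  }
  // Assumption: population standard deviation (divide by n, not n - 1).
  st.stddev = std::sqrt(sq / static_cast<double>(n));
  return st;
}
```

The `min/mean` and `max/mean` ratios follow directly from these fields.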

Documentation

Updated the C++ API documentation and example documentation for the new balanced k-means parameters and defaults.

Testing

  • Built libcuvs successfully
  • Built the balanced k-means C++ example successfully
  • Evaluated SIFT-1M, GloVe, Wiki, and DEEP datasets
  • Compared partition size statistics between regular and balanced k-means

No issue is closed by this PR.

anaruse added 2 commits June 9, 2026 12:48
  - Add balance tolerance and centroid offset parameters
  - Rework center adjustment to split oversized partitions more effectively
  - Document tolerance limits for heuristic rebalancing
  - Add a balanced k-means example with regular k-means comparison
Split balanced k-means tolerance into lower and upper bounds so users can
control underflow and overflow thresholds independently. Update the balanced
k-means example to evaluate multiple tolerance combinations in one run and
report clearer partition size statistics, including shared-range histograms.

Also update the documentation for the new parameters and their defaults.
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026


Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b8081179-aeb9-4b10-b587-dab763d48611

📥 Commits

Reviewing files that changed from the base of the PR and between 30f4502 and 519316d.

📒 Files selected for processing (2)
  • cpp/src/cluster/detail/kmeans_balanced.cuh
  • examples/cpp/src/balanced_kmeans_example.cu
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/cpp/src/balanced_kmeans_example.cu
  • cpp/src/cluster/detail/kmeans_balanced.cuh

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added configurable balanced k-means tuning: balance_lower_tolerance, balance_upper_tolerance, and centroid_offset.
    • Introduced a runnable “Balanced k-means” example that partitions data and reports partition statistics and comparisons vs standard k-means.
  • Documentation
    • Updated K-Means configuration documentation to include the new balanced k-means parameters and explain their impact.

Walkthrough

This PR adds three balanced-kmeans hyperparameters (balance_lower_tolerance, balance_upper_tolerance, centroid_offset) and refactors center adjustment from single-threshold to paired donor/receiver cluster logic, updating kernels, host logic, EM integration, documentation, a complete example program, and CMake build configuration.

Changes

Balanced K-Means Parameter Extension and Algorithm Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| New Parameter Types and Configuration: cpp/include/cuvs/cluster/kmeans.hpp | Three new float fields (balance_lower_tolerance, balance_upper_tolerance, centroid_offset) are added to balanced_params with default values and documented valid ranges. |
| Paired Cluster Center Adjustment Kernel and Host Function: cpp/src/cluster/detail/kmeans_balanced.cuh | Adds STL includes; replaces the per-cluster adjust kernel with a paired receiver/donor kernel that uses centroid_offset; rewrites adjust_centers to compute avg/tolerance bounds, sort/select cluster pairs on host, upload indices, launch the paired kernel, and verify all pairs updated. |
| Balancing EM Loop and Call Site Integration: cpp/src/cluster/detail/kmeans_balanced.cuh | balancing_em_iters signature updated to accept separate lower/upper tolerances and centroid_offset with RAFT_EXPECTS range checks; call sites in build_clusters and build_hierarchical updated to use the new tolerance parameters. |
| Algorithm Documentation Updates: cpp/src/cluster/kmeans_balanced.cuh | Rewords "Balancing" documentation to describe underfull/overfull cluster adjustment and directional movement logic instead of single-threshold wording. |
| User-Facing Documentation: examples/README.md, fern/pages/cluster/kmeans.md | Documents the three new parameters in examples README (with command invocation, supported data types/formats, and CLI flag meanings) and in the kmeans documentation parameter table. |
| Balanced K-Means Example Program Implementation: examples/cpp/src/balanced_kmeans_example.cu | Adds complete CUDA C++ example with argp CLI parsing, dataset auto-detection/loading (BIGANN/XVECS), partition-size statistics, regular k-means baseline (float only), balanced k-means runs for each tolerance pair, partition histograms, and balance-improvement metrics. |
| Example Build Configuration: examples/cpp/CMakeLists.txt | Adds BALANCED_KMEANS_EXAMPLE target compiling src/balanced_kmeans_example.cu and links it to cuvs::cuvs and optional conda_env. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Possibly related PRs

  • rapidsai/cuvs#2005: Exposes balanced_params through PQ's kmeans_params_variant to allow Product Quantization to configure and invoke balanced k-means, building on the extended parameter structure from this PR.

Suggested labels

improvement, non-breaking, C++


Suggested reviewers

  • tarang-jain
  • viclafargue
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Improve balanced k-means rebalancing' is directly related to the main change: improving the rebalancing heuristic with separate tolerances and a new example. |
| Description check | ✅ Passed | The description comprehensively covers all major changes including new parameters, rebalancing improvements, example additions, and documentation updates. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
examples/README.md (1)

34-34: 💤 Low value

Minor grammar: hyphenate compound adjective.

"one third" should be "one-third" when used as a compound adjective modifying "to three times."

✏️ Suggested fix
-outside roughly one third to three times the average partition size. Very strict upper tolerance
+outside roughly one-third to three times the average partition size. Very strict upper tolerance
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/README.md` at line 34, Replace the unhyphenated compound adjective
"one third" with "one-third" in the README sentence that currently reads
"outside roughly one third to three times the average partition size. Very
strict upper tolerance" so it correctly uses the hyphenated form for a compound
adjective.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b5c286d-1519-4be8-97e0-23556feb5e12

📥 Commits

Reviewing files that changed from the base of the PR and between a71b165 and d11ff26.

📒 Files selected for processing (7)
  • cpp/include/cuvs/cluster/kmeans.hpp
  • cpp/src/cluster/detail/kmeans_balanced.cuh
  • cpp/src/cluster/kmeans_balanced.cuh
  • examples/README.md
  • examples/cpp/CMakeLists.txt
  • examples/cpp/src/balanced_kmeans_example.cu
  • fern/pages/cluster/kmeans.md

@anaruse

anaruse commented Jun 9, 2026

Contributor Author

I evaluated the updated balanced k-means implementation on the Wiki dataset (1M vectors, 1000 partitions).

The previous implementation (roughly corresponding to balance_lower_tolerance=0.25 and balance_upper_tolerance=4.0) already improves partition balance significantly, reducing the maximum partition size from 15.4× the average to 3.75× and lowering the standard deviation from 1039 to 374.

The new default configuration (balance_lower_tolerance=0.333, balance_upper_tolerance=3.0) further improves the partition size distribution, reducing the maximum partition size to 2.73× the average and lowering the standard deviation to 334.

For applications that require tighter balance, more aggressive settings (balance_lower_tolerance=0.5, balance_upper_tolerance=2.0) can further reduce the maximum partition size to 2.34× the average and lower the standard deviation to 268.

These results show that the updated rebalancing heuristic produces a more balanced partition size distribution and that the newly exposed lower and upper balance tolerances provide a useful mechanism for trading off balance strictness against flexibility.

# dtype: float
# partitions: 1000
# iterations: 20
# balance_lower_tolerances: 0.25 0.333 0.5
# balance_upper_tolerances: 4 3 2
# centroid_offset: 0.01
Partitioning 1000000 vectors with 768 dimensions into 1000 balanced partitions

Regular k-means partition size statistics: min=1, max=15367, median=786.5, mean=1000, standard deviation=1039.03, min/mean=0.001, max/mean=15.367
Regular k-means partition size histogram:
  [       1,      155]   12 | ##
  [     154,      309]   37 | ########
  [     308,      463]  115 | ###########################
  [     462,      617]  152 | ####################################
  [     616,      771]  166 | ########################################
  [     770,      925]  140 | #################################
  [     924,     1078]  101 | ########################
  [    1077,     1232]   77 | ##################
  [    1231,     1386]   51 | ############
  [    1385,     1540]   38 | #########
  [    1539,     1694]   20 | ####
  [    1693,     1848]   19 | ####
  [    1847,     2002]   17 | ####
  [    2001,     2155]   11 | ##
  [    2154,     2309]    4 | #
  [    2308,     2463]    1 | #
  [    2462,     2617]    6 | #
  [    2616,     2771]    0 | 
  [    2770,     2925]    6 | #
  [    2924,     3079]    4 | #
  (    3079,      inf]   23 | #####

# balance_lower_tolerance: 0.25
# balance_upper_tolerance: 4
Balanced k-means partition size statistics: min=228, max=3751, median=945, mean=1000, standard deviation=373.842, min/mean=0.228, max/mean=3.751, underflow=1 (< 250), overflow=0 (> 4000)
Balanced k-means partition size histogram:
  [       1,      155]    0 | 
  [     154,      309]    4 | #
  [     308,      463]   32 | #####
  [     462,      617]   73 | #############
  [     616,      771]  150 | ###########################
  [     770,      925]  220 | ########################################
  [     924,     1078]  183 | #################################
  [    1077,     1232]  129 | #######################
  [    1231,     1386]   86 | ###############
  [    1385,     1540]   51 | #########
  [    1539,     1694]   26 | ####
  [    1693,     1848]   16 | ##
  [    1847,     2002]   11 | ##
  [    2001,     2155]    9 | #
  [    2154,     2309]    2 | #
  [    2308,     2463]    1 | #
  [    2462,     2617]    2 | #
  [    2616,     2771]    1 | #
  [    2770,     2925]    2 | #
  [    2924,     3079]    0 | 
  (    3079,      inf]    2 | #
Balance improvement: max/mean 15.367 -> 3.751, standard deviation 1039.03 -> 373.842

# balance_lower_tolerance: 0.333
# balance_upper_tolerance: 3
Balanced k-means partition size statistics: min=183, max=2730, median=975.5, mean=1000, standard deviation=333.736, min/mean=0.183, max/mean=2.73, underflow=1 (< 333), overflow=0 (> 3000)
Balanced k-means partition size histogram:
  [       1,      155]    0 | 
  [     154,      309]    1 | #
  [     308,      463]   26 | #####
  [     462,      617]   84 | ################
  [     616,      771]  142 | ############################
  [     770,      925]  179 | ###################################
  [     924,     1078]  201 | ########################################
  [    1077,     1232]  156 | ###############################
  [    1231,     1386]  104 | ####################
  [    1385,     1540]   46 | #########
  [    1539,     1694]   29 | #####
  [    1693,     1848]   14 | ##
  [    1847,     2002]    6 | #
  [    2001,     2155]    6 | #
  [    2154,     2309]    2 | #
  [    2308,     2463]    3 | #
  [    2462,     2617]    0 | 
  [    2616,     2771]    1 | #
  [    2770,     2925]    0 | 
  [    2924,     3079]    0 | 
  (    3079,      inf]    0 | 
Balance improvement: max/mean 15.367 -> 2.73, standard deviation 1039.03 -> 333.736

# balance_lower_tolerance: 0.5
# balance_upper_tolerance: 2
Balanced k-means partition size statistics: min=357, max=2336, median=983.5, mean=1000, standard deviation=267.705, min/mean=0.357, max/mean=2.336, underflow=8 (< 500), overflow=1 (> 2000)
Balanced k-means partition size histogram:
  [       1,      155]    0 | 
  [     154,      309]    0 | 
  [     308,      463]    3 | #
  [     462,      617]   61 | ##########
  [     616,      771]  140 | #######################
  [     770,      925]  201 | #################################
  [     924,     1078]  241 | ########################################
  [    1077,     1232]  168 | ###########################
  [    1231,     1386]  107 | #################
  [    1385,     1540]   43 | #######
  [    1539,     1694]   22 | ###
  [    1693,     1848]   10 | #
  [    1847,     2002]    3 | #
  [    2001,     2155]    0 | 
  [    2154,     2309]    0 | 
  [    2308,     2463]    1 | #
  [    2462,     2617]    0 | 
  [    2616,     2771]    0 | 
  [    2770,     2925]    0 | 
  [    2924,     3079]    0 | 
  (    3079,      inf]    0 | 
Balance improvement: max/mean 15.367 -> 2.336, standard deviation 1039.03 -> 267.705
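
For readers who want to reproduce the shared-range histograms above, the binning can be sketched roughly as below. This is not the example's actual code: `shared_range_histogram` and `bar_length` are hypothetical names, and the edge handling and bar scaling (floor to a maximum width of 40, with at least one `#` for a non-empty bin) are assumptions inferred from the printed output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts sizes into n_bins equal-width bins over [lo, hi], plus one trailing
// overflow bin for values above hi. Passing the same lo/hi/n_bins for every
// run keeps the histograms directly comparable.
inline std::vector<std::size_t> shared_range_histogram(const std::vector<std::int64_t>& sizes,
                                                       double lo,
                                                       double hi,
                                                       std::size_t n_bins)
{
  assert(n_bins > 0 && hi > lo);
  std::vector<std::size_t> counts(n_bins + 1, 0);
  const double width = (hi - lo) / static_cast<double>(n_bins);
  for (std::int64_t s : sizes) {
    const double v = static_cast<double>(s);
    if (v > hi) {
      ++counts[n_bins];  // overflow bin
      continue;
    }
    if (v < lo) {
      ++counts[0];  // assumption: values below lo are clamped into the first bin
      continue;
    }
    std::size_t b = static_cast<std::size_t>((v - lo) / width);
    if (b >= n_bins) { b = n_bins - 1; }  // v == hi lands in the last regular bin
    ++counts[b];
  }
  return counts;
}

// Bar length in '#' characters, scaled so the fullest bin gets max_width.
inline std::size_t bar_length(std::size_t count, std::size_t max_count, std::size_t max_width = 40)
{
  if (count == 0 || max_count == 0) { return 0; }
  const std::size_t len = count * max_width / max_count;  // integer floor
  return std::max<std::size_t>(len, 1);                   // non-empty bins stay visible
}
```

Using the regular k-means range for all runs is what makes the shrinking overflow bin visible across tolerance settings.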

@dantegd dantegd left a comment

Member


The improvement approach looks solid to me, really nice! I just had some questions and smaller comments about the example and the code.

Comment thread cpp/src/cluster/detail/kmeans_balanced.cuh Outdated
Comment thread examples/cpp/src/balanced_kmeans_example.cu Outdated
Comment thread examples/cpp/src/balanced_kmeans_example.cu Outdated
raft::make_const_mdspan(regular_labels.view()),
balance_lower_tolerances.front(),
balance_upper_tolerances.front());
print_partition_size_summary("Regular k-means", regular_reference_stats.value());

Member


Small thing: the regular baseline prints its summary only once, while balanced stats are evaluated per tolerance pair. Would it be useful to also report regular k-means underflow/overflow counts for each lower/upper tolerance combination?

Contributor Author


That is a good point. Regular k-means itself does not use these tolerance values, but I originally reported the regular baseline underflow/overflow counts for each balanced k-means tolerance pair to make the comparison direct.

In practice, that made the output quite verbose when evaluating multiple tolerance combinations, so I changed the example to print the regular k-means summary once as a baseline and then print the balanced k-means stats for each tolerance pair.

Comment thread cpp/src/cluster/detail/kmeans_balanced.cuh
Comment thread cpp/src/cluster/detail/kmeans_balanced.cuh
Comment thread cpp/src/cluster/detail/kmeans_balanced.cuh Outdated
Compute balance thresholds from a floating-point average, avoid pairing
against empty donor clusters, and perform candidate index arithmetic with
int64_t intermediates.

Fix xvec dataset handling in the balanced k-means example by accounting
for per-row dimension headers and reading each row from the beginning.
@anaruse

anaruse commented Jun 15, 2026

Contributor Author

Thanks for the review! I addressed the main feedback in the latest commit and replied to the regular baseline reporting comment. Could you take another look when you have a chance?
