Improve balanced k-means rebalancing #2222
- Add balance tolerance and centroid offset parameters
- Rework center adjustment to split oversized partitions more effectively
- Document tolerance limits for heuristic rebalancing
- Add a balanced k-means example with regular k-means comparison
Split balanced k-means tolerance into lower and upper bounds so users can control underflow and overflow thresholds independently. Update the balanced k-means example to evaluate multiple tolerance combinations in one run and report clearer partition size statistics, including shared-range histograms. Also update the documentation for the new parameters and their defaults.
CodeRabbit walkthrough: This PR adds three balanced-kmeans hyperparameters (…).
Changes: Balanced K-Means Parameter Extension and Algorithm Refactoring
Nitpick comments (1)
examples/README.md, line 34 (low value): Minor grammar: hyphenate compound adjective.
"one third" should be "one-third" when used as a compound adjective modifying "to three times."
Suggested fix:
-outside roughly one third to three times the average partition size. Very strict upper tolerance
+outside roughly one-third to three times the average partition size. Very strict upper tolerance
Files selected for processing (7):
- cpp/include/cuvs/cluster/kmeans.hpp
- cpp/src/cluster/detail/kmeans_balanced.cuh
- cpp/src/cluster/kmeans_balanced.cuh
- examples/README.md
- examples/cpp/CMakeLists.txt
- examples/cpp/src/balanced_kmeans_example.cu
- fern/pages/cluster/kmeans.md
I evaluated the updated balanced k-means implementation on the Wiki dataset (1M vectors, 1000 partitions).

The previous implementation (roughly corresponding to balance_lower_tolerance=0.25 and balance_upper_tolerance=4.0) already improves partition balance significantly, reducing the maximum partition size from 15.4× the average to 3.75× and lowering the standard deviation from 1039 to 374.

The new default configuration (balance_lower_tolerance=0.333, balance_upper_tolerance=3.0) further improves the partition size distribution, reducing the maximum partition size to 2.73× the average and lowering the standard deviation to 334.

For applications that require tighter balance, more aggressive settings (balance_lower_tolerance=0.5, balance_upper_tolerance=2.0) can further reduce the maximum partition size to 2.34× the average and lower the standard deviation to 268.

These results show that the updated rebalancing heuristic produces a more balanced partition size distribution and that the newly exposed lower and upper balance tolerances provide a useful mechanism for trading off balance strictness against flexibility.
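For readers who want to reproduce the two metrics quoted above (maximum partition size relative to the average, and standard deviation of partition sizes), here is a minimal host-side sketch. `PartitionStats` and `partition_size_stats` are hypothetical names for illustration and are not taken from the example in this PR. The comment does not say whether the reported standard deviation is the population or the sample form; this sketch assumes the population form.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical summary type (not from the PR's example code).
struct PartitionStats {
  double average;           // mean partition size
  double max_over_average;  // largest partition size divided by the mean
  double stddev;            // population standard deviation (assumption)
};

// Computes balance metrics from per-partition sizes.
inline PartitionStats partition_size_stats(const std::vector<std::int64_t>& sizes)
{
  PartitionStats stats{0.0, 0.0, 0.0};
  if (sizes.empty()) { return stats; }

  const double n = static_cast<double>(sizes.size());
  double total   = 0.0;
  for (const std::int64_t s : sizes) { total += static_cast<double>(s); }
  stats.average = total / n;

  double squared_diff = 0.0;
  for (const std::int64_t s : sizes) {
    const double d = static_cast<double>(s) - stats.average;
    squared_diff += d * d;
  }
  stats.stddev = std::sqrt(squared_diff / n);

  const std::int64_t largest = *std::max_element(sizes.begin(), sizes.end());
  if (stats.average > 0.0) {
    stats.max_over_average = static_cast<double>(largest) / stats.average;
  }
  return stats;
}
```

With 1M vectors and 1000 partitions the average is 1000, so a `max_over_average` of 2.73 would correspond to a largest partition of roughly 2730 vectors.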
dantegd left a comment:
The improvement approach looks solid to me, really nice! Just had some questions and smaller comments around the example and code
      raft::make_const_mdspan(regular_labels.view()),
      balance_lower_tolerances.front(),
      balance_upper_tolerances.front());
    print_partition_size_summary("Regular k-means", regular_reference_stats.value());
Small thing: the regular baseline prints its summary only once, while the balanced stats are evaluated per tolerance pair. Would it be useful to report regular k-means underflow/overflow counts for each lower/upper tolerance combination too?
That is a good point. Regular k-means itself does not use these tolerance values, but I originally reported the regular baseline underflow/overflow counts for each balanced k-means tolerance pair to make the comparison direct.
In practice, that made the output quite verbose when evaluating multiple tolerance combinations, so I changed the example to print the regular k-means summary once as a baseline and then print the balanced k-means stats for each tolerance pair.
Compute balance thresholds from a floating-point average, avoid pairing against empty donor clusters, and perform candidate index arithmetic with int64_t intermediates. Fix xvec dataset handling in the balanced k-means example by accounting for per-row dimension headers and reading each row from the beginning.
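To illustrate the dataset-handling fix, here is a minimal sketch of reading a row range from a file with per-row dimension headers. It assumes the common `.fvecs` layout (each row is a 4-byte `int32` dimension followed by that many `float32` values) and a host whose byte order matches the file. `read_fvecs_rows` is a hypothetical helper, not the reader used in the example. The point it shows is that the byte offset of a row must include the header of every preceding row, and that each read starts at the row's header rather than at its first component.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads `n_rows` rows starting at `first_row` into a
// row-major buffer. Assumes every row is [int32 dim][dim x float32].
inline std::vector<float> read_fvecs_rows(const std::string& path,
                                          std::int64_t first_row,
                                          std::int64_t n_rows,
                                          std::int32_t& dim_out)
{
  std::ifstream in(path, std::ios::binary);
  if (!in) { throw std::runtime_error("cannot open " + path); }

  // The first row's header gives the dimension shared by all rows.
  std::int32_t dim = 0;
  in.read(reinterpret_cast<char*>(&dim), sizeof(dim));
  if (!in || dim <= 0) { throw std::runtime_error("invalid dimension header"); }

  // A row occupies its 4-byte header plus dim floats; use 64-bit arithmetic
  // so offsets do not overflow on large files.
  const std::int64_t row_bytes = static_cast<std::int64_t>(sizeof(std::int32_t)) +
                                 static_cast<std::int64_t>(dim) *
                                   static_cast<std::int64_t>(sizeof(float));

  std::vector<float> data(static_cast<std::size_t>(n_rows) * static_cast<std::size_t>(dim));
  for (std::int64_t i = 0; i < n_rows; ++i) {
    // Seek to the beginning of the row, header included.
    in.seekg((first_row + i) * row_bytes, std::ios::beg);
    std::int32_t row_dim = 0;
    in.read(reinterpret_cast<char*>(&row_dim), sizeof(row_dim));
    if (!in || row_dim != dim) { throw std::runtime_error("unexpected row header"); }
    in.read(reinterpret_cast<char*>(data.data() + i * dim),
            static_cast<std::streamsize>(dim) * static_cast<std::streamsize>(sizeof(float)));
    if (!in) { throw std::runtime_error("truncated row"); }
  }
  dim_out = dim;
  return data;
}
```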
Thanks for the review! I addressed the main feedback in the latest commit and replied to the regular baseline reporting comment. Could you take another look when you have a chance?
Motivation
The current balanced k-means rebalancing heuristic is not always effective at reducing oversized partitions.
This PR improves the rebalancing heuristic, introduces separate lower and upper balance tolerances, and adds a C++ example for evaluating partition balance.
Changes
Balanced k-means parameters
Added separate lower and upper balance tolerance parameters:
- balance_lower_tolerance
- balance_upper_tolerance

The thresholds are computed as:
This allows underflow and overflow handling to be tuned independently.
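As a rough illustration of how the two tolerances could translate into size bounds, the sketch below assumes each threshold is the tolerance multiplied by the average partition size. That reading is consistent with the README wording quoted in the review (defaults of 0.333 and 3.0 described as roughly one-third to three times the average) and with the commit note about using a floating-point average, but the exact expression in `kmeans_balanced.cuh` may differ. `BalanceThresholds` and `compute_balance_thresholds` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical type, not part of the cuVS API.
struct BalanceThresholds {
  double lower;  // partitions smaller than this count as underfull
  double upper;  // partitions larger than this count as overfull
};

// Assumed form: threshold = tolerance * average partition size.
inline BalanceThresholds compute_balance_thresholds(std::int64_t n_rows,
                                                    std::int64_t n_clusters,
                                                    double balance_lower_tolerance,
                                                    double balance_upper_tolerance)
{
  // Divide in floating point so the average is not truncated.
  const double average = static_cast<double>(n_rows) / static_cast<double>(n_clusters);
  return {balance_lower_tolerance * average, balance_upper_tolerance * average};
}
```

Under this assumption, 1M vectors in 1000 partitions with tolerances 0.5 and 2.0 would flag partitions below 500 or above 2000 vectors.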
Rebalancing improvements
Improved the rebalancing heuristic by explicitly pairing underfull and overfull partitions.
New centroids are created near oversized partitions, resulting in a more balanced final partition size distribution.
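The pairing idea can be sketched on the host as follows. This is not the GPU implementation from this PR: the matching order (smallest underfull partition paired with the largest overfull one) and the one-to-one pairing are assumptions made for illustration, and `pair_partitions` is a hypothetical name. In the described heuristic, each pair would then let the underfull partition's centroid be recreated near the overfull partition, which this sketch does not model.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative only: returns (underfull index, overfull index) pairs.
inline std::vector<std::pair<std::size_t, std::size_t>> pair_partitions(
  const std::vector<std::int64_t>& sizes, double lower_threshold, double upper_threshold)
{
  std::vector<std::size_t> underfull;
  std::vector<std::size_t> overfull;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    const double s = static_cast<double>(sizes[i]);
    if (s < lower_threshold) {
      underfull.push_back(i);
    } else if (s > upper_threshold) {
      overfull.push_back(i);
    }
  }

  // Assumed ordering: smallest underfull first, largest overfull first.
  std::stable_sort(underfull.begin(), underfull.end(),
                   [&sizes](std::size_t a, std::size_t b) { return sizes[a] < sizes[b]; });
  std::stable_sort(overfull.begin(), overfull.end(),
                   [&sizes](std::size_t a, std::size_t b) { return sizes[a] > sizes[b]; });

  // Pair one-to-one; any unmatched partitions are left for a later iteration.
  const std::size_t n_pairs = std::min(underfull.size(), overfull.size());
  std::vector<std::pair<std::size_t, std::size_t>> pairs;
  pairs.reserve(n_pairs);
  for (std::size_t k = 0; k < n_pairs; ++k) {
    pairs.emplace_back(underfull[k], overfull[k]);
  }
  return pairs;
}
```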
Example updates
Added a C++ example for evaluating balanced k-means partition balance. The example:
Documentation
Updated the C++ API documentation and example documentation for the new balanced k-means parameters and defaults.
Testing
No issue is closed by this PR.