Skip to content

[common] Fix BinaryRowSerializer reuse buffer never shrinking#8160

Merged
JingsongLi merged 3 commits into
apache:masterfrom
yugan95:fix/binaryrow-serializer-shrink
Jun 11, 2026
Merged

[common] Fix BinaryRowSerializer reuse buffer never shrinking#8160
JingsongLi merged 3 commits into
apache:masterfrom
yugan95:fix/binaryrow-serializer-shrink

Conversation

@yugan95

@yugan95 yugan95 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Purpose

Linked issue: #7620

During external merge sort, each merge channel holds a BinaryRow reuse instance via BinaryRowSerializer.deserialize(reuse, source). When a large record is deserialized, the backing MemorySegment grows to fit it but is never shrunk for subsequent small records. With max-num-file-handles (default 128) channels each retaining a 100MB+ buffer, memory usage explodes into OOM.

Changes

  • BinaryRowSerializer: add shrink logic with combined cap and ratio hysteresis in deserialize(BinaryRow reuse, DataInputView source) — reallocate only when both conditions are met:
  bufferSize > MAX_RETAINED_REUSE_BUFFER_SIZE (4MB)
      && bufferSize > recordLength * SHRINK_RATIO (4)
  • 100MB buffer + 5MB record → ratio 20x > 4x → shrink
  • 10MB buffer + 5MB record → ratio 2x < 4x → retain (no thrashing)
  • 3MB buffer + tiny record → below 4MB cap → retain

The ratio-based approach avoids thrashing for sustained medium-to-large records while still reclaiming memory after a spike. Consistent with the approach used in #8159.

Tests

BinaryRowSerializerShrinkTest — 6 test cases covering:

  • Spike followed by small record → shrink
  • Spike followed by medium record → shrink (ratio check)
  • Buffer proportional to record size → retain (ratio < 4x, even if > 4MB)
  • Small buffer → retain
  • Buffer grows when needed
  • Consecutive large records → retain (no thrashing)

API and Format

N/A

Documentation

N/A

@yugan95

yugan95 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Updated shrink logic to use a combined fixed-cap and ratio check, consistent with #8159:

  bufferSize > 4MB && bufferSize > recordLength * 4

This replaces the previous fixed-threshold-only approach (bufferSize > 4MB && length < 4MB) which would miss cases like a 100MB buffer with 5MB records (ratio 20x, clearly worth shrinking but previously retained because 5MB > 4MB).

Tests updated to 6 cases covering the ratio-based scenarios.

* Maximum retained reuse buffer size in bytes. Buffers exceeding this cap are eligible for
* shrinking when the shrink ratio condition is also met.
*/
private static final int MAX_RETAINED_REUSE_BUFFER_SIZE = 4 * 1024 * 1024; // 4MB

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reuse fields in #8159

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Reused RowHelper.MAX_RETAINED_REUSE_BUFFER_SIZE and RowHelper.SHRINK_RATIO from #8159, removed the duplicate definitions in BinaryRowSerializer.

@yugan95 yugan95 force-pushed the fix/binaryrow-serializer-shrink branch from badb0b8 to c55d98b Compare June 11, 2026 07:05
@yugan95 yugan95 requested a review from JingsongLi June 11, 2026 07:09

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 64b0534 into apache:master Jun 11, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants