Skip to content

[common] Fix OOM when writing/compacting table with large records#7621

Closed
yugan95 wants to merge 3 commits into
apache:masterfrom
yugan95:record-0410
Closed

[common] Fix OOM when writing/compacting table with large records#7621
yugan95 wants to merge 3 commits into
apache:masterfrom
yugan95:record-0410

Conversation

@yugan95

@yugan95 yugan95 commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Purpose

Linked issue: close #7620
Fix OOM when writing table with large records (100MB+) and many buckets (e.g. 256) due to unbounded buffer growth in sort, merge and compaction paths. Each bucket's writer independently holds its own sort buffer, merge channels, and compaction readers. When a large record inflates an internal reuse buffer, that bloated buffer is retained per-bucket, causing memory usage to quickly exceed available heap.

Heap dump analysis identified three independent root causes:

1. Sort path — RowHelper internal buffer never shrinks

RowHelper.reuseWriter grows its internal MemorySegment for large records, but BinaryRowWriter.reset() only resets the cursor without releasing the oversized segment. Additionally, InternalRowSerializer.serialize() can exit via EOFException (a normal signal when the sort buffer is full), skipping any cleanup of the bloated buffer.

2. Merge path — BinaryRowSerializer.deserialize(reuse) only grows, never shrinks

Each merge channel holds a BinaryRow reuse instance. When a large record is deserialized, the backing MemorySegment grows to fit it but is never shrunk for subsequent small records. With max-num-file-handles (default 128) channels each retaining a 100MB+ buffer, memory usage explodes.

3. Compaction read path — HeapBytesVector.reserveBytes() integer overflow

reserveBytes() computes newCapacity * 2 using plain multiplication. When newCapacity exceeds ~1.07 billion bytes, this overflows Integer.MAX_VALUE, causing NegativeArraySizeException or silent data corruption.

Changes

  1. RowHelper: add resetIfTooLarge() with hysteresis — release internal buffer only when the segment exceeds 4MB and the last written record is smaller than 4MB. This avoids thrashing for sustained large-record workloads while reclaiming memory when the workload transitions back to small records.
  2. InternalRowSerializer: call resetIfTooLarge() in finally block of serialize() and serializeToPages() to handle EOFException exit path.
  3. BinaryRowSerializer: add shrink logic with hysteresis in deserialize(reuse) — reallocate only when existing buffer > 4MB and current record < 4MB.
  4. HeapBytesVector: promote to long before capacity arithmetic, extract calculateNewBytesCapacity(long) helper, cap at MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8, return exact required capacity when doubling would exceed max, throw clear error on overflow.

Note: The Parquet config pass-through (RowDataParquetBuilder) has been moved to #7956 as it is an independent enhancement.

Tests

  • RowHelperTest — validates hysteresis: buffer retained for large records, released only on transition to small records
  • BinaryRowSerializerShrinkTest — validates hysteresis: buffer retained for consecutive large records, shrunk on transition to small records
  • HeapBytesVectorReserveBytesTest — validates overflow-safe reserveBytes() growth, calculateNewBytesCapacity edge cases (doubling, exact capacity, overflow), and data correctness

API and Format

N/A — no public API or format changes.

Documentation

N/A

@yugan95 yugan95 changed the title [core] Fix OOM when writing/compacting table with large records [common] Fix OOM when writing/compacting table with large records Apr 10, 2026
@yugan95 yugan95 changed the title [common] Fix OOM when writing/compacting table with large records [common][format] Fix OOM when writing/compacting table with large records Apr 10, 2026
@JingsongLi

Copy link
Copy Markdown
Contributor

cc @LsomeYeah to take a review.

@leaves12138 leaves12138 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed fix. The RowHelper / InternalRowSerializer cleanup and BinaryRowSerializer shrink path look reasonable to me, and the focused common tests pass locally:

mvn -pl paimon-common -Dtest=RowHelperTest,HeapBytesVectorReserveBytesTest,BinaryRowSerializerShrinkTest -DskipITs -Dcheckstyle.skip -Drat.skip=true -Dspotless.check.skip=true -DfailIfNoTests=false test

However, I think the HeapBytesVector.reserveBytes part still needs changes before merging:

  1. When newCapacity > MAX_ARRAY_SIZE >> 1, the code allocates MAX_ARRAY_SIZE bytes. For example, a ~1.1GB required capacity would try to allocate ~2GB, which can create a new OOM risk. This also contradicts the comment saying it falls back to the exact required capacity. It should allocate the exact required capacity when doubling would exceed the max.

  2. The required capacity is still computed with int arithmetic before entering reserveBytes: bytesAppended + length in putByteArray and start.length * value.length in fill. These can overflow to a negative or smaller positive value before the MAX_ARRAY_SIZE guard sees them, so the guard can be skipped and the failure later becomes an unclear ArrayIndexOutOfBoundsException / bad offset path. Please compute required capacity with long (or make reserveBytes accept long) and throw the clear error before downcasting.

It would also be good to add tests for these arithmetic edge cases without requiring huge allocations, e.g. by extracting the capacity-growth calculation into a small helper or otherwise testing the overflow checks directly.

@yugan95

yugan95 commented May 14, 2026

Copy link
Copy Markdown
Contributor Author

@leaves12138 Thanks for the review! All three points are addressed in the updated commit:

  1. Exact capacity when doubling exceeds max

When requiredCapacity > MAX_ARRAY_SIZE >> 1, the code now allocates the exact required capacity instead of MAX_ARRAY_SIZE. This avoids the unnecessary ~2GB allocation for a ~1.1GB request.

  1. Long arithmetic for required capacity

reserveBytes now accepts long. The callers compute required capacity with long arithmetic:

putByteArray: (long) bytesAppended + length
fill: (long) start.length * value.length
If the result exceeds MAX_ARRAY_SIZE, a clear RuntimeException is thrown before any downcast to int, so an int overflow can never silently bypass the guard.

  1. Extracted helper + edge-case tests

The capacity-growth logic is extracted into a package-private static int calculateNewBytesCapacity(long) method, tested directly with 7 new cases covering:

Normal doubling for small values
Boundary at exactly MAX_ARRAY_SIZE >> 1 (still doubles)
Just above MAX_ARRAY_SIZE >> 1 (returns exact capacity, not MAX_ARRAY_SIZE)
Exactly MAX_ARRAY_SIZE (returns exact)
Above MAX_ARRAY_SIZE (throws)
Simulated int-overflow values via long (throws)
No huge allocations needed — all edge cases are tested purely through the helper.

@yugan95 yugan95 requested a review from leaves12138 May 14, 2026 08:04

@LsomeYeah LsomeYeah left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed fix! The heap-dump-based root cause breakdown is very clear. The HeapBytesVector overflow fix and Parquet config pass-through both look great.

We have a few concerns about the RowHelper / BinaryRowSerializer parts though — would love to hear your thoughts:

  1. 4MB threshold lacks a basis. Data patterns vary a lot across workloads — a table with uniform 1MB records, one with mixed 1KB/100MB records, and one with sustained 5MB records all behave very differently around a fixed 4MB cutoff.
  2. Reuse mechanism effectively defeated near the threshold. For sustained 5–10MB records, every call triggers release-and-rebuild or shrink-and-reallocate, paying allocation cost on every record. This defeats the purpose of these reuse buffers.
  3. Hot-path behavior change with no opt-out. The finally { resetIfTooLarge() } and the new shrink branch change default behavior for every user on upgrade, not only those experiencing OOM.
  4. Prior art. Worth checking how Spark's UnsafeExternalSorter or similar engines handle reuse-buffer growth — there may be a more principled heuristic (hysteresis, memory-pressure-based release) that avoids the thrash pattern.

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent bug analysis. The heap dump investigation identifying four independent root causes is thorough. Comments per fix:

1. RowHelper buffer release (resetIfTooLarge):

  • The finally block in InternalRowSerializer.serialize() is critical — EOFException as a normal signal is a subtle but real issue. Good catch.
  • The 4MB threshold is reasonable. Consider making it configurable via a system property for operators with different memory profiles, or at least document why 4MB was chosen (e.g., 256 buckets × 4MB = 1GB baseline, which leaves headroom).
  • resetIfTooLarge() checks reuseWriter.getSegments().size() — is this the number of segments or the total bytes? If it's segment count, the threshold of 4MB doesn't directly match.

2. BinaryRowSerializer shrink:

  • The shrink-on-read approach (replacing oversized buffer during deserialize) is correct. The threshold matches RowHelper's.
  • Note: shrinking on every deserialize when the current record is small but the buffer is > 4MB means we're allocating every time until a large record appears again. This is acceptable since the alternative (OOM) is worse.

3. HeapBytesVector overflow fix:

  • Promoting to long before multiplication is the right fix. The calculateNewBytesCapacity method with MAX_ARRAY_SIZE fallback is clean.
  • Good that you made this static and package-visible for testing.

4. Parquet config pass-through:

  • This is a separate concern from the OOM fix. Could this be a standalone PR? It changes the behavior for all Parquet writes, not just large-record scenarios.

Test coverage looks comprehensive with dedicated test classes for each fix. Well done.

One concern: The resetIfTooLarge() adds overhead on every serialize call (a null check + size check). For normal-sized records this should be negligible, but verify there's no measurable regression in microbenchmarks for small-record-heavy workloads.

@yugan95

yugan95 commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

@LsomeYeah Thanks for the thoughtful review! All four points are addressed in the updated commit:

1. 4MB threshold basis

The threshold is derived from the production scenario that surfaced this bug: 256 buckets × 4MB = 1GB baseline, which leaves reasonable headroom in a typical 4–8GB TaskManager heap. 4MB is high enough to avoid interfering with normal workloads (most records are well under 1MB) while preventing the pathological accumulation we saw in heap dumps (256 × 100MB+ = tens of GB).

2. Thrashing near the threshold — fixed with hysteresis

Good catch — this was a real gap. I've updated both RowHelper.resetIfTooLarge() and BinaryRowSerializer.deserialize(reuse) with a hysteresis guard:

  • RowHelper: only releases when buffer > threshold && lastRecord < threshold
  • BinaryRowSerializer: only shrinks when buffer > threshold && currentRecordLength < threshold

This means:

  • Sustained large records (5–10MB): buffer is retained, no thrashing
  • Occasional large record → back to small records: buffer is released, OOM protection kicks in
  • Normal small records (< 4MB): buffer never exceeds threshold, the check is a complete no-op

Tests updated to verify the hysteresis behavior in both classes.

3. Hot-path overhead

With hysteresis, the overhead on the hot path is 1 null check + 2 int comparisons — all on fields already in L1 cache from the preceding serialize(). For normal-sized records the buffer never reaches 4MB, so resetIfTooLarge() short-circuits at the size check. The JIT will inline this. No branch is taken and no allocation occurs.

4. Prior art

The hysteresis approach here is inspired by the same principle behind Spark's UnsafeExternalSorter — retain capacity when the workload demands it, release when it doesn't. A full memory-pressure-based release (integrating with Paimon's managed memory pool) would be a larger scope change; the simple hysteresis guard covers the production failure mode while keeping the change minimal and safe.

Also, the Parquet config pass-through has been moved to #7956.

@yugan95

yugan95 commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

@JingsongLi Thanks for the detailed per-fix comments! Addressing each point:

1. RowHelper — getSegments().size() semantics

Good question. AbstractBinaryWriter.getSegments() returns a single MemorySegment (not a list/array — the method name is slightly misleading), and MemorySegment.size() returns the byte size of the underlying buffer. So the comparison reuseWriter.getSegments().size() > 4MB correctly checks byte count, and the threshold matches.

I've also added a hysteresis guard per @LsomeYeah's feedback: resetIfTooLarge() now additionally checks reuseRow.getSizeInBytes() < threshold, so it only releases when the last written record was small. This avoids thrashing for sustained large-record workloads while still reclaiming memory when the workload transitions back.

Configurability: I investigated making the threshold a proper table-level option via CoreOptions. However, these thresholds live in RowHelper and BinaryRow Serializer in the paimon-common module, which is entirely config-free by design — all 27 serializer classes are parameterized solely by schema types (DataType, RowType, numFields), and none imports any Options or Configuration class. To route a CoreOptions value down to these classes, the full call chain would need plumbing changes:

CoreOptions (paimon-api)
  -> KeyValueFileStoreWrite / MergeSorter (paimon-core, has CoreOptions)
    -> SortBufferWriteBuffer / BinaryInMemorySortBuffer (no config today)
      -> InternalRowSerializer (paimon-common, only accepts DataType[])
        -> RowHelper (only accepts List<DataType>)
        -> BinaryRowSerializer (only accepts int numFields)

This would break the config-free invariant of paimon-common's serializer layer and touch a wide set of classes across modules. I think a fixed 4MB is sufficient for the initial fix — the value is grounded in a concrete production scenario (256 buckets x 4MB = 1GB), and the hysteresis guard makes the threshold much less sensitive. If a follow-up is needed, it could be done as a standalone PR.

2. BinaryRowSerializer shrink — now improved with hysteresis. The length < threshold guard means consistent large-record workloads reuse the buffer without reallocation, addressing the "allocating every time" concern you noted.

3. HeapBytesVector — addressed in commit f1954ff.

4. Parquet config pass-through — moved to #7956.

5. Performance overhead — the check is 1 null check + 2 int comparisons, all on hot fields already in cache. For small-record workloads (buffer < 4MB), resetIfTooLarge() short-circuits at the first size check — no branch taken, no allocation. This is comparable to an array bounds check and will be inlined by the JIT. Happy to run a JMH benchmark if the team wants hard numbers, but I don't expect a measurable delta.

@yugan95 yugan95 changed the title [common][format] Fix OOM when writing/compacting table with large records [common] Fix OOM when writing/compacting table with large records May 25, 2026
JingsongLi pushed a commit that referenced this pull request May 25, 2026
…n RowDataParquetBuilder (#7956)

Pass through parquet statistics and page-size-check configuration in
`RowDataParquetBuilder`.

Currently `RowDataParquetBuilder` does not forward the following Parquet
config keys to the writer:
  - `parquet.statistics.truncate.length`
  - `parquet.columnindex.truncate.length`
  - `parquet.page.size.row.check.min`
  - `parquet.page.size.row.check.max`

Without these, users cannot tune Parquet page-size checking behavior or
control the truncation length of statistics and column indexes. This is
especially relevant for tables with large records, where the default
page-size check thresholds can lead to oversized pages.

Split out from #7621 per reviewer feedback, as this is an independent
enhancement.
@yugan95 yugan95 requested review from JingsongLi and LsomeYeah May 26, 2026 02:00

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is the most critical link, can you break it down into smaller points to contribute? Each PR only completes one simple task, and each task is difficult to choose.

@yugan95

yugan95 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@JingsongLi Good suggestion — I've split this into three independent PRs, one per root cause:

  1. HeapBytesVector reserveBytes() integer overflow — promote to long arithmetic, extract calculateNewBytesCapacity(long) helper, cap at MAX_ARRAY_SIZE. [common] Fix integer overflow in HeapBytesVector.reserveBytes() #8158
  2. RowHelper / InternalRowSerializer buffer release — add resetIfTooLarge() with hysteresis in finally block to handle EOFException exit path. [common] Fix RowHelper internal buffer never shrinking for large records #8159
  3. BinaryRowSerializer reuse buffer shrink — add hysteresis shrink in deserialize(reuse) to reclaim oversized buffers when workload transitions to small records. [common] Fix BinaryRowSerializer reuse buffer never shrinking #8160

The Parquet config pass-through is already merged in #7956.

Each PR is self-contained with its own tests and no cross-dependencies.

@yugan95

yugan95 commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

All fixes have been split into independent PRs and merged:

  1. [format] Pass through parquet statistics and page-size-check config in RowDataParquetBuilder #7956 — Parquet config pass-through
  2. [common] Fix integer overflow in HeapBytesVector.reserveBytes() #8158 — HeapBytesVector reserveBytes() integer overflow
  3. [common] Fix RowHelper internal buffer never shrinking for large records #8159 — RowHelper internal buffer never shrinking
  4. [common] Fix BinaryRowSerializer reuse buffer never shrinking #8160 — BinaryRowSerializer reuse buffer never shrinking

Closing this PR. Thanks for the reviews!

@yugan95 yugan95 closed this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] OOM when writing table with large records (100MB+) due to unbounded buffer growth in sort, merge and compaction paths

4 participants