[python] Support schema evolution of nested struct sub-fields by TheR1sing3un · Pull Request #8187 · apache/paimon

TheR1sing3un · 2026-06-10T02:25:45Z

Purpose

Follow-up to #8126, which made read-time schema evolution align top-level columns by field id. This extends the same id-based alignment to sub-fields inside a ROW (including a ROW nested in an ARRAY/MAP).

Before this PR, nested sub-field evolution didn't work: adding a sub-field silently created a top-level column, and rename/drop/update-type failed, because only the last name in the path was matched.

Now a dotted path like mv.value is resolved recursively, so for a column mv ROW<version BIGINT, value STRING>:

add a sub-field → old rows read NULL for it;
rename a sub-field → data follows the field id, not the name;
drop a sub-field → its old data is not revived;
update type of a sub-field → cast at read time.

# mv ROW<version BIGINT, value STRING>
catalog.alter_table("db.t", [SchemaChange.rename_column(["mv", "value"], "val")])
# files written under mv.value are read back as mv.val with the same data

How it works:

Nested sub-fields get globally-unique field ids at create time; highestFieldId is computed recursively so nested and top-level ids never collide.
Schema changes (add / rename / drop / update-type / update-nullability / update-comment) recurse along the field-name path, transparently through ARRAY/MAP wrappers.
Update-type changes are validated against the cast-support rules.
The read path aligns nested sub-fields by id — reorder, pad missing with NULL, follow renames, cast changed types — recursing into struct / array / map.
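The read-path alignment described above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the `Field`, `Row`, and `Array` classes, the `CASTS` table, and the `align` function are hypothetical names, and rows are modeled as dicts rather than Arrow arrays.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Field:
    id: int
    name: str
    type: Any  # an atomic type name such as "BIGINT", or a Row / Array


@dataclass
class Row:
    fields: List[Field]


@dataclass
class Array:
    element: Any


# Hypothetical subset of read-time casts; the real cast rules are richer.
CASTS = {("INT", "BIGINT"): int, ("BIGINT", "STRING"): str}


def align(value, file_type, table_type):
    """Reshape a value written under file_type into table_type, matching by field id."""
    if value is None:
        return None
    if isinstance(file_type, Row) and isinstance(table_type, Row):
        old_by_id = {f.id: f for f in file_type.fields}
        aligned = {}
        # Iterating the table schema reorders fields and drops removed ones.
        for new in table_type.fields:
            old = old_by_id.get(new.id)
            if old is None:
                aligned[new.name] = None  # added sub-field: pad with NULL
            else:
                # Renamed sub-field: read under the old name, emit the new one.
                aligned[new.name] = align(value.get(old.name), old.type, new.type)
        return aligned
    if isinstance(file_type, Array) and isinstance(table_type, Array):
        return [align(v, file_type.element, table_type.element) for v in value]
    if file_type == table_type:
        return value
    if isinstance(file_type, str) and isinstance(table_type, str):
        cast = CASTS.get((file_type, table_type))
        if cast is not None:
            return cast(value)
    raise ValueError(f"no cast from {file_type} to {table_type}")


# File written as mv ROW<version BIGINT, value STRING>; the table has since
# renamed value to val, changed version to STRING, and added note.
file_mv = Row([Field(2, "version", "BIGINT"), Field(3, "value", "STRING")])
table_mv = Row([
    Field(3, "val", "STRING"),
    Field(2, "version", "STRING"),
    Field(4, "note", "STRING"),
])
aligned = align({"version": 7, "value": "a"}, file_mv, table_mv)
```

Because lookup is keyed on `id`, a dropped sub-field simply never appears in the output, even if a later sub-field reuses its name.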

Tests

New cases in schema_evolution_nested_read_test.py:

nested add / rename / drop / update-type read round-trips, on append-only and primary-key tables;
sub-fields of ARRAY<ROW> and MAP<.,ROW>;
the nested field-id model (global uniqueness, recursive highestFieldId, duplicate detection);
the type-cast support rules.

Read-time schema evolution previously aligned only top-level columns by field id; sub-fields inside a ROW (and a ROW nested in an ARRAY/MAP) could not evolve: adding one silently created a top-level column, and rename/drop/update-type raised because the schema manager only handled the last path element.

- Assign globally-unique ids to nested sub-fields at create time and compute highestFieldId recursively, so nested ids never collide with top-level ones.
- Recurse schema changes along the dotted field-name path (transparently through ARRAY/MAP wrappers) for add/rename/drop/update-type/update-nullability/update-comment, allocating new ids from the persisted highestFieldId.
- Validate update-column-type against the cast-support rules.
- Align nested sub-fields by field id at read time: reorder, pad missing with NULL, follow renames, and cast changed types, recursing into struct/array/map.

Add tests covering nested add/rename/drop/update-type round-trips (append-only and primary-key), ARRAY<ROW>/MAP<.,ROW> sub-fields, the id model, and the cast rules.
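A rough sketch of the recursive id model, assuming a simplified dict-based schema layout. The `kind`, `element`, `key`, and `value` keys and all function names here are illustrative and do not reflect the actual schema representation:

```python
from collections import Counter


def iter_field_ids(fields):
    """Yield every field id in the schema, including nested sub-field ids."""
    for field in fields:
        yield field["id"]
        yield from _ids_in_type(field["type"])


def _ids_in_type(data_type):
    if not isinstance(data_type, dict):
        return  # atomic type: carries no nested ids
    kind = data_type["kind"]
    if kind == "ROW":
        yield from iter_field_ids(data_type["fields"])
    elif kind == "ARRAY":
        yield from _ids_in_type(data_type["element"])
    elif kind == "MAP":
        yield from _ids_in_type(data_type["key"])
        yield from _ids_in_type(data_type["value"])


def highest_field_id(fields):
    """Highest id across top-level and nested fields, or -1 for an empty schema."""
    return max(iter_field_ids(fields), default=-1)


def duplicate_ids(fields):
    """Ids that appear more than once anywhere in the schema."""
    counts = Counter(iter_field_ids(fields))
    return sorted(i for i, n in counts.items() if n > 1)


schema = [
    {"id": 0, "name": "id", "type": "INT"},
    {"id": 1, "name": "mv", "type": {"kind": "ROW", "fields": [
        {"id": 2, "name": "version", "type": "BIGINT"},
        {"id": 3, "name": "value", "type": "STRING"},
    ]}},
    {"id": 4, "name": "tags", "type": {"kind": "ARRAY", "element": {
        "kind": "ROW", "fields": [{"id": 5, "name": "c", "type": "INT"}],
    }}},
]

# A newly added sub-field would take the next id after the recursive maximum.
next_id = highest_field_id(schema) + 1
```

Computing the maximum over only the top-level ids here would give 4, so a new column would collide with the nested field `c`.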

Nested-leaf projection on append-only reads pushed the leaf path down by the LATEST name, bypassing the per-file field-id normalization: after a sub-field rename the old file's leaf read NULL, and after a sub-field type change old and new batches carried different types and failed to concatenate. Mirror the merge path instead: widen the projection to the full top-level columns so the field-id normalization applies (rename follows the id, missing sub-fields pad NULL, types are cast), then extract the requested leaf paths back to the user's flat schema - batch-level via NestedLeafBatchReader, or row-level via OuterProjectionRecordReader when a post-read filter is involved. Add regression tests projecting a renamed and a type-changed sub-field across old and new files.
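The widen-then-extract flow can be illustrated with plain dict rows. This is only a sketch under assumptions: the function names are hypothetical, the per-file field-id normalization is taken as already applied to the rows, and it does not show how NestedLeafBatchReader or OuterProjectionRecordReader are actually implemented.

```python
def widen_to_top_level(leaf_paths):
    """Collect the distinct top-level columns that cover the requested leaf paths."""
    top_level = []
    for path in leaf_paths:
        if path[0] not in top_level:
            top_level.append(path[0])
    return top_level


def extract_leaf(row, path):
    """Walk a leaf path through nested dicts; a NULL struct yields a NULL leaf."""
    current = row
    for name in path:
        if current is None:
            return None
        current = current[name]
    return current


def project_leaves(normalized_rows, leaf_paths):
    """Flatten normalized rows back to the requested leaf columns."""
    return [
        tuple(extract_leaf(row, path) for path in leaf_paths)
        for row in normalized_rows
    ]


leaf_paths = [("mv", "val"), ("mv", "version")]
columns_to_read = widen_to_top_level(leaf_paths)

# Rows from an old and a new file, assumed to be already normalized to the
# latest schema (the old file stored the leaf as mv.value).
normalized_rows = [
    {"mv": {"version": 1, "val": "a"}},
    {"mv": {"version": 2, "val": "b"}},
    {"mv": None},
]
flat = project_leaves(normalized_rows, leaf_paths)
```

Extracting only after normalization means the leaf is looked up under its latest name in every file, which is what pushing the latest name down to old files could not guarantee.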

update_column_type from ROW/ARRAY/MAP to STRING passes validation (the cast rules allow constructed types to character strings), but reading an old file failed with ArrowNotImplementedError because struct/list/map cannot be cast to utf8 directly. Render the string form during per-file alignment instead, matching the engine's cast rules: ROW as '{v1, v2}', ARRAY as '[e1, e2]', MAP as '{k1 -> v1, k2 -> v2}', with sub-values rendered recursively, NULL sub-values as the literal 'null', and NULL containers staying NULL. Add round-trip tests for ROW/ARRAY/MAP to STRING, NULL semantics, and a nested sub-field changed to STRING.
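The rendering rules listed above can be sketched as follows. This is an illustration rather than the PR's Arrow-based code: it models a ROW as a Python tuple, an ARRAY as a list, and a MAP as a dict, and renders scalars with `str`, which is a simplification that only suits integers and strings.

```python
def render(value):
    """Render a top-level ROW/ARRAY/MAP value; a NULL container stays NULL."""
    if value is None:
        return None
    return _render_inner(value)


def _render_inner(value):
    if value is None:
        return "null"  # NULL sub-values become the literal 'null'
    if isinstance(value, tuple):  # ROW
        return "{" + ", ".join(_render_inner(v) for v in value) + "}"
    if isinstance(value, list):  # ARRAY
        return "[" + ", ".join(_render_inner(v) for v in value) + "]"
    if isinstance(value, dict):  # MAP
        pairs = (
            f"{_render_inner(k)} -> {_render_inner(v)}" for k, v in value.items()
        )
        return "{" + ", ".join(pairs) + "}"
    return str(value)  # scalar: simplified rendering


row_text = render((1, "a"))
array_text = render([1, None, 3])
map_text = render({"k1": (1, None), "k2": None})
```

Splitting the top-level entry point from the recursive helper is what keeps the two NULL behaviors apart: only a nested NULL is turned into text.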

…kens

Two follow-ups on the nested schema-evolution path:

- update_column_type from VECTOR (or MULTISET) to STRING passed validation but old files failed on read: there is no string rendering for them. Narrow the cast rule so only ROW/ARRAY/MAP - the constructed types the read path can render - are accepted as string sources.
- The nested path walker consumed the ARRAY/MAP wrapper token by position without checking it, so an invalid path like ['arr', 'wrong', 'c'] was accepted and mutated the schema exactly like ['arr', 'element', 'c']. Require 'element' for arrays and 'value' for maps before descending.

Add tests for the rejected vector alter (the column still reads), the narrowed cast rules, and wrong wrapper tokens on ARRAY<ROW> / MAP<.,ROW>.
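A minimal sketch of a path walker that checks the wrapper token before descending. The dict-based schema layout and the `resolve` / `_descend` names are assumptions for illustration, and this version only resolves paths that end at a ROW sub-field.

```python
# Token that must appear in the path before descending through each wrapper.
WRAPPER_TOKEN = {"ARRAY": "element", "MAP": "value"}


def resolve(fields, path):
    """Return the field addressed by a name path such as ['arr', 'element', 'c']."""
    name, rest = path[0], path[1:]
    field = next((f for f in fields if f["name"] == name), None)
    if field is None:
        raise ValueError(f"unknown field {name!r}")
    return field if not rest else _descend(field["type"], rest)


def _descend(data_type, rest):
    kind = data_type["kind"] if isinstance(data_type, dict) else data_type
    if kind == "ROW":
        return resolve(data_type["fields"], rest)
    expected = WRAPPER_TOKEN.get(kind)
    if expected is None:
        raise ValueError(f"cannot descend into {kind}")
    if rest[0] != expected:
        # Checking the token, instead of skipping it by position, rejects bad paths.
        raise ValueError(f"expected {expected!r} for {kind}, got {rest[0]!r}")
    if len(rest) == 1:
        raise ValueError("this sketch only resolves ROW sub-fields")
    return _descend(data_type[expected], rest[1:])


schema = [
    {"id": 1, "name": "arr", "type": {"kind": "ARRAY", "element": {
        "kind": "ROW", "fields": [{"id": 2, "name": "c", "type": "INT"}],
    }}},
    {"id": 3, "name": "m", "type": {"kind": "MAP", "key": "STRING", "value": {
        "kind": "ROW", "fields": [{"id": 4, "name": "d", "type": "INT"}],
    }}},
]

target = resolve(schema, ["arr", "element", "c"])
```

With this check, `resolve(schema, ["arr", "wrong", "c"])` raises instead of silently reaching the same field as the valid path.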

…sliced arrays, gate null-to-not-null

Self-review findings on the nested schema-evolution path:

- update_column_type between same-root constructed types (e.g. ROW<a INT> -> ROW<a BIGINT, c STRING>) was accepted: the replacement carried caller-supplied nested ids that corrupt the id model and old rows read all-NULL; a VECTOR length change was accepted but unreadable. Reject non-identical constructed-to-constructed casts - reshaping goes through sub-field / 'element' / 'value' paths, which keep working.
- The list/map rebuilds in the alignment and string-rendering paths read offsets/raw buffers directly, which errors on a sliced ListArray and silently misaligns rows on a sliced MapArray; re-materialize sliced inputs first.
- Converting a nullable column to NOT NULL was silently accepted; it is now rejected by default and opt-in via 'alter-column-null-to-not-null.disabled' = 'false'.

Also add an end-to-end test for the array 'element' type promotion path.
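The two validation rules from these findings might look roughly like this. It is a sketch under assumptions: types are compared as plain strings, STRING is treated as the only character-string target, and the function names are hypothetical; only the option key is taken from the commit message.

```python
import re

CONSTRUCTED = {"ROW", "ARRAY", "MAP", "VECTOR", "MULTISET"}
STRING_RENDERABLE = {"ROW", "ARRAY", "MAP"}  # types the read path can render
NULL_TO_NOT_NULL_DISABLED = "alter-column-null-to-not-null.disabled"


def type_root(type_text):
    """Root keyword of a type string, e.g. 'ROW<a INT>' gives 'ROW'."""
    return re.split(r"[<(\s]", type_text.strip(), maxsplit=1)[0].upper()


def constructed_cast_allowed(source, target):
    """Decide an update-type change whose source is a constructed type."""
    src, dst = type_root(source), type_root(target)
    if src not in CONSTRUCTED:
        raise ValueError("this sketch only covers constructed source types")
    if dst == "STRING":
        return src in STRING_RENDERABLE
    # Anything else must be identical: reshaping goes through sub-field paths.
    return source == target


def check_nullability_change(old_nullable, new_nullable, options):
    """Reject nullable -> NOT NULL unless the option is explicitly 'false'."""
    disabled = options.get(NULL_TO_NOT_NULL_DISABLED, "true").lower() == "true"
    if old_nullable and not new_nullable and disabled:
        raise ValueError(
            f"nullable to NOT NULL is rejected unless "
            f"'{NULL_TO_NOT_NULL_DISABLED}' = 'false'"
        )


reshape_ok = constructed_cast_allowed("ROW<a INT>", "ROW<a BIGINT, c STRING>")
row_to_string_ok = constructed_cast_allowed("ROW<a INT>", "STRING")
```

Here `reshape_ok` is `False` and `row_to_string_ok` is `True`; adding the sub-field `c` would instead be expressed as a schema change on the path `['<column>', 'c']`.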

TheR1sing3un marked this pull request as ready for review June 10, 2026 02:25

JingsongLi reviewed Jun 10, 2026

Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py

TheR1sing3un requested a review from JingsongLi June 10, 2026 07:47

JingsongLi reviewed Jun 11, 2026

Comment thread paimon-python/pypaimon/casting/data_type_casts.py Outdated

TheR1sing3un requested a review from JingsongLi June 11, 2026 03:00

JingsongLi reviewed Jun 11, 2026

Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py

JingsongLi reviewed Jun 11, 2026

Comment thread paimon-python/pypaimon/schema/schema_manager.py Outdated

TheR1sing3un added 2 commits June 11, 2026 13:59

TheR1sing3un requested a review from JingsongLi June 11, 2026 08:16

TheR1sing3un wants to merge 5 commits into apache:master from TheR1sing3un:python-nested-schema-evolution
