[python] Support schema evolution of nested struct sub-fields #8187

Open
TheR1sing3un wants to merge 5 commits into apache:master from TheR1sing3un:python-nested-schema-evolution


@TheR1sing3un TheR1sing3un commented Jun 10, 2026

Member

Purpose

Follow-up to #8126, which made read-time schema evolution align top-level columns by field id. This extends the same id-based alignment to sub-fields inside a ROW (including a ROW nested in an ARRAY/MAP).

Before this PR, nested sub-field evolution didn't work: adding a sub-field silently created a top-level column, and rename/drop/update-type failed, because only the last name in the path was matched.

Now a dotted path like mv.value is resolved recursively, so for a column mv ROW<version BIGINT, value STRING>:

  • add a sub-field → old rows read NULL for it;
  • rename a sub-field → data follows the field id, not the name;
  • drop a sub-field → its old data is not revived;
  • update type of a sub-field → cast at read time.

```python
# mv ROW<version BIGINT, value STRING>
catalog.alter_table("db.t", [SchemaChange.rename_column(["mv", "value"], "val")])
# files written under mv.value are read back as mv.val with the same data
```

How it works:

  • Nested sub-fields get globally-unique field ids at create time; highestFieldId is computed recursively so nested and top-level ids never collide.
  • Schema changes (add / rename / drop / update-type / update-nullability / update-comment) recurse along the field-name path, transparently through ARRAY/MAP wrappers.
  • update column type is validated against the cast-support rules.
  • The read path aligns nested sub-fields by id — reorder, pad missing with NULL, follow renames, cast changed types — recursing into struct / array / map.
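
The id-based alignment described in the last bullet can be illustrated with a minimal, pure-Python sketch. It is not the PR's implementation (which operates on Arrow batches): the `Field` class, the `CASTS` table, and `align_row` are hypothetical names introduced here, rows are modelled as plain dicts, and only ROW nesting is covered (ARRAY/MAP recursion is omitted).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Field:
    """Hypothetical schema field: a stable id, a current name, a type tag."""
    id: int
    name: str
    type: str  # e.g. "BIGINT", "STRING", or "ROW"
    children: List["Field"] = field(default_factory=list)  # used only for ROW


# Illustrative cast table (an assumption; the real cast rules are much richer).
CASTS: Dict[Tuple[str, str], Callable[[Any], Any]] = {
    ("INT", "BIGINT"): int,
    ("INT", "STRING"): str,
    ("BIGINT", "STRING"): str,
}


def align_row(value: Optional[dict],
              file_fields: List[Field],
              table_fields: List[Field]) -> Optional[dict]:
    """Reshape a row written under file_fields into the table_fields layout."""
    if value is None:
        return None
    by_id = {f.id: f for f in file_fields}
    out = {}
    for target in table_fields:          # output follows the table's field order
        source = by_id.get(target.id)
        if source is None:               # sub-field added later: pad with NULL
            out[target.name] = None
            continue
        v = value.get(source.name)       # look up by the OLD name, emit the NEW one
        if source.type == "ROW" and target.type == "ROW":
            v = align_row(v, source.children, target.children)
        elif v is not None and source.type != target.type:
            v = CASTS[(source.type, target.type)](v)  # type changed: cast on read
        out[target.name] = v
    return out                           # dropped ids are never emitted


# File written as mv ROW<version BIGINT, value STRING>
old_mv = Field(0, "mv", "ROW", [Field(1, "version", "BIGINT"),
                                Field(2, "value", "STRING")])
# Table now: value renamed to val, version changed to STRING, note added
new_mv = Field(0, "mv", "ROW", [Field(2, "val", "STRING"),
                                Field(1, "version", "STRING"),
                                Field(3, "note", "STRING")])

aligned = align_row({"mv": {"version": 7, "value": "a"}}, [old_mv], [new_mv])
```

Because lookup is keyed on `id`, a rename needs no data rewrite, and a dropped sub-field disappears simply because no target field carries its id.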

Tests

New cases in schema_evolution_nested_read_test.py:

  • nested add / rename / drop / update-type read round-trips, on append-only and primary-key tables;
  • sub-fields of ARRAY<ROW> and MAP<.,ROW>;
  • the nested field-id model (global uniqueness, recursive highestFieldId, duplicate detection);
  • the type-cast support rules.
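
As a rough illustration of the nested field-id model these tests exercise, the sketch below assigns ids, computes the highest id recursively, and detects duplicates. All names (`Field`, `assign_ids`, `highest_field_id`, `check_unique_ids`) are hypothetical, and the depth-first numbering order is an assumption made for the example; the PR may number fields differently.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Field:
    """Hypothetical field node; children is non-empty only for nested ROWs."""
    id: int
    name: str
    children: List["Field"] = field(default_factory=list)


def iter_fields(fields: List[Field]) -> Iterator[Field]:
    """Yield every field, top-level and nested, depth-first."""
    for f in fields:
        yield f
        yield from iter_fields(f.children)


def assign_ids(spec: list, start: int = 0) -> Tuple[List[Field], int]:
    """Build fields from (name, child_spec) pairs, drawing ids from one counter."""
    fields, next_id = [], start
    for name, child_spec in spec:
        fid = next_id
        next_id += 1
        children, next_id = assign_ids(child_spec, next_id)
        fields.append(Field(fid, name, children))
    return fields, next_id


def highest_field_id(fields: List[Field]) -> int:
    """Highest id across all nesting levels; -1 for an empty schema."""
    return max((f.id for f in iter_fields(fields)), default=-1)


def check_unique_ids(fields: List[Field]) -> None:
    """Raise ValueError if any id is used twice anywhere in the tree."""
    seen = set()
    for f in iter_fields(fields):
        if f.id in seen:
            raise ValueError(f"duplicate field id {f.id} (field {f.name!r})")
        seen.add(f.id)


# t(id, mv ROW<version, value>, ts)
schema, _ = assign_ids([("id", []),
                        ("mv", [("version", []), ("value", [])]),
                        ("ts", [])])
check_unique_ids(schema)
next_new_id = highest_field_id(schema) + 1  # id for a sub-field added later
```

Allocating new ids from the recursive maximum is what keeps a later `add` from reusing an id that already belongs to a nested sub-field.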

Read-time schema evolution previously aligned only top-level columns by field
id; sub-fields inside a ROW (and a ROW nested in an ARRAY/MAP) could not evolve:
adding one silently created a top-level column, and rename/drop/update-type
raised an error because the schema manager only handled the last path element.

- Assign globally-unique ids to nested sub-fields at create time and compute
  highestFieldId recursively, so nested ids never collide with top-level ones.
- Recurse schema changes along the dotted field-name path (transparently
  through ARRAY/MAP wrappers) for add/rename/drop/update-type/update-nullability/
  update-comment, allocating new ids from the persisted highestFieldId.
- Validate update-column-type against the cast-support rules.
- Align nested sub-fields by field id at read time: reorder, pad missing with
  NULL, follow renames, and cast changed types, recursing into struct/array/map.

Add tests covering nested add/rename/drop/update-type round-trips (append-only
and primary-key), ARRAY<ROW>/MAP<.,ROW> sub-fields, the id model, and the cast
rules.
@TheR1sing3un TheR1sing3un marked this pull request as ready for review June 10, 2026 02:25
Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py
Nested-leaf projection on append-only reads pushed the leaf path down by
the LATEST name, bypassing the per-file field-id normalization: after a
sub-field rename the old file's leaf read NULL, and after a sub-field type
change old and new batches carried different types and failed to
concatenate.

Mirror the merge path instead: widen the projection to the full top-level
columns so the field-id normalization applies (rename follows the id,
missing sub-fields pad NULL, types are cast), then extract the requested
leaf paths back to the user's flat schema - batch-level via
NestedLeafBatchReader, or row-level via OuterProjectionRecordReader when a
post-read filter is involved.

Add regression tests projecting a renamed and a type-changed sub-field
across old and new files.
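
The widen-then-extract approach in this commit can be sketched in pure Python as follows. This is only an illustration: `widen_projection`, `extract_leaf`, and `project_leaves` are hypothetical helpers (they are not `NestedLeafBatchReader` or `OuterProjectionRecordReader`), rows are plain dicts that are assumed to have already gone through field-id normalization, and naming the output columns by dotted path is an assumption.

```python
from typing import Any, Dict, List, Optional, Sequence


def widen_projection(leaf_paths: Sequence[Sequence[str]]) -> List[str]:
    """Top-level columns needed to serve the leaf paths, in first-seen order."""
    columns: List[str] = []
    for path in leaf_paths:
        if path[0] not in columns:
            columns.append(path[0])
    return columns


def extract_leaf(row: Optional[Dict[str, Any]], path: Sequence[str]) -> Any:
    """Walk a path through nested dicts; a NULL parent yields a NULL leaf."""
    current: Any = row
    for name in path:
        if current is None:
            return None
        current = current.get(name)
    return current


def project_leaves(aligned_rows: List[Dict[str, Any]],
                   leaf_paths: Sequence[Sequence[str]]) -> List[Dict[str, Any]]:
    """Flatten already-aligned rows down to the requested leaf columns."""
    return [{".".join(p): extract_leaf(row, p) for p in leaf_paths}
            for row in aligned_rows]


leaf_paths = [["mv", "val"], ["id"]]
columns_to_read = widen_projection(leaf_paths)

# Rows as they might look after per-file alignment (old and new files alike
# now expose the latest name, mv.val).
aligned_rows = [
    {"id": 1, "mv": {"version": 1, "val": "a"}},
    {"id": 2, "mv": None},
]
flat = project_leaves(aligned_rows, leaf_paths)
```

Extracting the leaf only after alignment means the latest name is never pushed down into a file that was written under the old name.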
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 10, 2026 07:47
Comment thread paimon-python/pypaimon/casting/data_type_casts.py Outdated
update_column_type from ROW/ARRAY/MAP to STRING passes validation (the
cast rules allow constructed types to character strings), but reading an
old file failed with ArrowNotImplementedError because struct/list/map
cannot be cast to utf8 directly.

Render the string form during per-file alignment instead, matching the
engine's cast rules: ROW as '{v1, v2}', ARRAY as '[e1, e2]', MAP as
'{k1 -> v1, k2 -> v2}', with sub-values rendered recursively, NULL
sub-values as the literal 'null', and NULL containers staying NULL.

Add round-trip tests for ROW/ARRAY/MAP to STRING, NULL semantics, and a
nested sub-field changed to STRING.
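
The rendering rules listed above can be shown with a small pure-Python sketch. It is an illustration only: the PR renders Arrow arrays, whereas this hypothetical `render` function models a ROW as a `tuple`, an ARRAY as a `list`, and a MAP as a `dict`, and falls back to `str()` for scalars, which will not match the engine's formatting for every atomic type.

```python
from typing import Any, Optional


def render(value: Any, top_level: bool = True) -> Optional[str]:
    """Render a constructed value as a string.

    Modelling assumptions: tuple = ROW, list = ARRAY, dict = MAP.
    A NULL container stays NULL; a NULL sub-value becomes the text 'null'.
    """
    if value is None:
        return None if top_level else "null"
    if isinstance(value, tuple):   # ROW -> {v1, v2}
        return "{" + ", ".join(render(v, False) for v in value) + "}"
    if isinstance(value, list):    # ARRAY -> [e1, e2]
        return "[" + ", ".join(render(v, False) for v in value) + "]"
    if isinstance(value, dict):    # MAP -> {k1 -> v1, k2 -> v2}
        pairs = (f"{render(k, False)} -> {render(v, False)}"
                 for k, v in value.items())
        return "{" + ", ".join(pairs) + "}"
    return str(value)              # scalar: simplified rendering


row_text = render((1, "a"))
array_text = render([1, None, 3])
map_text = render({"k1": 1, "k2": None})
nested_text = render((1, [2, 3], {"k": (4, None)}))
```

The `top_level` flag is what separates the two NULL cases: only the outermost call may return `None`.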
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 11, 2026 03:00
Comment thread paimon-python/pypaimon/read/reader/data_file_batch_reader.py
Comment thread paimon-python/pypaimon/schema/schema_manager.py Outdated
…kens

Two follow-ups on the nested schema-evolution path:

- update_column_type from VECTOR (or MULTISET) to STRING passed validation
  but old files failed on read: there is no string rendering for them.
  Narrow the cast rule so only ROW/ARRAY/MAP - the constructed types the
  read path can render - are accepted as string sources.

- The nested path walker consumed the ARRAY/MAP wrapper token by position
  without checking it, so an invalid path like ['arr', 'wrong', 'c'] was
  accepted and mutated the schema exactly like ['arr', 'element', 'c'].
  Require 'element' for arrays and 'value' for maps before descending.

Add tests for the rejected vector alter (the column still reads), the
narrowed cast rules, and wrong wrapper tokens on ARRAY<ROW> / MAP<.,ROW>.
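
The wrapper-token check from the second bullet might look roughly like the sketch below. The `Node` class and `resolve` function are hypothetical stand-ins for the schema manager's path walker; only the rule itself ('element' for arrays, 'value' for maps) comes from this commit.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class Node:
    """Hypothetical type node: kind is 'ROW', 'ARRAY', 'MAP', or an atomic name."""
    kind: str
    fields: Dict[str, "Node"] = field(default_factory=dict)  # ROW sub-fields
    child: Optional["Node"] = None  # ARRAY element type or MAP value type


WRAPPER_TOKEN = {"ARRAY": "element", "MAP": "value"}


def resolve(root: Node, path: Sequence[str]) -> Node:
    """Follow a field-name path, validating each ARRAY/MAP wrapper token."""
    node = root
    for token in path:
        if node.kind == "ROW":
            if token not in node.fields:
                raise ValueError(f"unknown sub-field {token!r}")
            node = node.fields[token]
        elif node.kind in WRAPPER_TOKEN:
            expected = WRAPPER_TOKEN[node.kind]
            if token != expected:  # previously consumed by position, unchecked
                raise ValueError(
                    f"expected {expected!r} to enter {node.kind}, got {token!r}")
            node = node.child
        else:
            raise ValueError(f"cannot descend into {node.kind} with {token!r}")
    return node


# t(arr ARRAY<ROW<c INT>>, m MAP<STRING, ROW<d INT>>)
table = Node("ROW", fields={
    "arr": Node("ARRAY", child=Node("ROW", fields={"c": Node("INT")})),
    "m": Node("MAP", child=Node("ROW", fields={"d": Node("INT")})),
})

leaf = resolve(table, ["arr", "element", "c"])
```

With the check in place, `resolve(table, ["arr", "wrong", "c"])` raises instead of silently resolving to the same leaf.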
…sliced arrays, gate null-to-not-null

Self-review findings on the nested schema-evolution path:

- update_column_type between same-root constructed types (e.g. ROW<a INT>
  -> ROW<a BIGINT, c STRING>) was accepted: the replacement carried
  caller-supplied nested ids that corrupt the id model and old rows read
  all-NULL; a VECTOR length change was accepted but unreadable. Reject
  non-identical constructed-to-constructed casts - reshaping goes through
  sub-field / 'element' / 'value' paths, which keep working.

- The list/map rebuilds in the alignment and string-rendering paths read
  offsets/raw buffers directly, which errors on a sliced ListArray and
  silently misaligns rows on a sliced MapArray; re-materialize sliced
  inputs first.

- Converting a nullable column to NOT NULL was silently accepted; it is
  now rejected by default and opt-in via
  'alter-column-null-to-not-null.disabled' = 'false'.

Also add an end-to-end test for the array 'element' type promotion path.
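
Two of the validation rules from this commit, the constructed-type cast rule and the null-to-not-null gate, can be sketched as below. The function names and the string-based type syntax are assumptions for illustration; only the option key and the accept/reject behaviour are taken from the commit messages. Casts starting from atomic types are outside the sketch.

```python
from typing import Dict

CONSTRUCTED = {"ROW", "ARRAY", "MAP", "MULTISET", "VECTOR"}
STRING_RENDERABLE = {"ROW", "ARRAY", "MAP"}  # types the read path can render
NULL_TO_NOT_NULL_DISABLED = "alter-column-null-to-not-null.disabled"


def type_root(type_str: str) -> str:
    """'ROW<a INT>' -> 'ROW', 'VARCHAR(10)' -> 'VARCHAR' (illustrative syntax)."""
    return type_str.split("<", 1)[0].split("(", 1)[0].strip().upper()


def validate_constructed_update(old: str, new: str) -> None:
    """Reject type updates from constructed types that the read path cannot serve."""
    if old == new:
        return
    old_root, new_root = type_root(old), type_root(new)
    if old_root not in CONSTRUCTED:
        return  # atomic source: left to the ordinary cast rules, not modelled
    if new_root == "STRING" and old_root in STRING_RENDERABLE:
        return  # rendered as text at read time
    raise ValueError(
        f"cannot update {old} to {new}; reshape constructed types through "
        "sub-field / 'element' / 'value' paths instead")


def validate_nullability(old_nullable: bool, new_nullable: bool,
                         options: Dict[str, str]) -> None:
    """Allow nullable -> NOT NULL only when the 'disabled' option is 'false'."""
    disabled = options.get(NULL_TO_NOT_NULL_DISABLED, "true").lower() == "true"
    if old_nullable and not new_nullable and disabled:
        raise ValueError("converting a nullable column to NOT NULL is disabled")


validate_constructed_update("ROW<a INT>", "STRING")                  # accepted
validate_nullability(True, False, {NULL_TO_NOT_NULL_DISABLED: "false"})  # opted in
```

Rejecting whole-type replacement keeps nested ids under the schema manager's control, since reshaping then has to go through the path-based changes that allocate ids themselves.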
@TheR1sing3un TheR1sing3un requested a review from JingsongLi June 11, 2026 08:16

2 participants