[WIP][SPARK-55206][PYTHON][SQL] Transpilation minimal functional implementation with python by holdenk · Pull Request #56327 · apache/spark

holdenk · 2026-06-04T16:30:25Z

What changes were proposed in this pull request?

This is the first PR adding initial transpilation from Python to Catalyst and the framework for others to plug in different transpilers which can target other targets. This replaces #53547 and is the first PR as part of the transpilation SPIP https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?tab=t.0#heading=h.iz92aps6qbo5

Why are the changes needed?

Python UDF performance has been a perinial challenge in Spark, and while PyArrow makes the data copy slightly less expensive it's still much slower than native code. Additionally, Python UDF usage has increased substantailly with Pandas on Spark and it is especially hard for folks to write catalyst expressions here.

Does this PR introduce any user-facing change?

New configuration flags are user visible (default to off).

How was this patch tested?

New unit tests that trigger always and hypothesis tests that trigger only selectively.

Was this patch authored or co-authored using generative AI tooling?

Hypothesis teests were generated by Claude 4.7, eq semantics null and truthiness w/claude 4.6, GH code review (unknown backing model) on itiial draft). snake camel swap with sonnet.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

…file Co-authored-by: Holden Karau <holden@pigscanfly.ca>

… class but reset the python worker logs between Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

… and transpialtion subsequently disabled or ANSI mode swapped and instead just ignore all of the transpiled elems Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

…s still there or not. Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

* [SPARK][PYTHON] Refuse to lower non-boolean and/or/not in UDF transpiler; fix Column truthiness in transpile path; surface column in CANNOT_CONVERT_COLUMN_INTO_BOOL The UDF transpiler previously lowered Python `and` / `or` / `not` to Spark's bitwise `&` / `|` / `~` regardless of operand type. For non-bool operands (e.g. `x or 0` against an int column) this silently diverges from Python's truthiness semantics. Detect statically-non-boolean operands (numeric / string literals, arithmetic BinOps, USub/UAdd, IfExp with both arms non-boolean) and refuse to lower; the UDF then falls back to interpreted Python. Also: - Fix `if transpiled_column:` triggering CANNOT_CONVERT_COLUMN_INTO_BOOL during transpilation (which broke test_udf_transpile_basic). Use `is not None` instead. - Include the offending column's repr in the CANNOT_CONVERT_COLUMN_INTO_BOOL error so users can tell which column triggered it. - Add unit tests for boolean and/or transpilation, non-boolean fallback, and the improved error message; add hypothesis tests for both_positive / either_positive. * Fix ruff F841 and CI test failures in transpiler boolean tests - Remove leftover unused ``transpiled_param_names`` local in udf.py (the class attribute ``self._transpiled_param_names`` already covers this; the local was a leftover from an earlier refactor). - Rewrite the unit-test UDFs ``both_positive`` / ``either_positive`` / the matching hypothesis UDFs as single-statement bodies so the transpiler (which only handles one top-level statement) actually lowers them. The previous form short-circuited out before the BoolOp path was exercised. - Stop relying on schema inference for None-only Row inputs in ``test_udf_transpile_falls_back_for_non_boolean_short_circuit``. Pass an explicit LongType schema so ``createDataFrame`` doesn't fail with CANNOT_DETERMINE_TYPE on the ``or_zero_none`` case. - Use a non-null long strategy for the boolean hypothesis tests since the interpreted Python body would raise on a None operand. * Fix ruff format issues in udf.py and test_udf_transpile_unit.py The previous lint failure was masked by ruff check failing first (F841 unused variable). With check passing, ruff format now flags the surrounding region: a stray blank line in the unit test file and a multi-line conditional in udf.py that fits on one line. Apply ruff format auto-fix. * Support comparison operators in transpiler; fix mypy FunctionDef typing Two CI fixes: 1. The new ``test_udf_transpile_boolean_and_or_lowered`` test exercises ``x > 0 and y > 0``, but the transpiler's Compare handler only covered ``Is`` / ``IsNot``. Add Eq, NotEq, Lt, LtE, Gt, GtE so the bool-typed and/or path actually transpiles instead of bailing on the first comparison. Drop the now-stale ``less_than_zero`` case from the ``falls_back_for_unsupported_patterns`` matrix. 2. mypy 1.8.0 (CI) couldn't pick the right ``ast.FunctionDef`` overload from kwargs when ``decorator_list=[]`` was an unannotated empty literal (it inferred ``list[Never]``). Annotate the intermediate list locals so the call resolves to the documented overload. * Drop ast.FunctionDef call to Any to dodge mypy overload mismatch The typed overloads for ``ast.FunctionDef`` in mypy 1.8.0's typeshed require keyword-only ``type_params`` on Python 3.12+, but that field isn't accepted at runtime on every supported Python (it was added in 3.12, before that the constructor rejects it). Setting the constructor to ``Any`` keeps the runtime behaviour unchanged while sidestepping the typeshed overload resolution that the previous typed-locals approach couldn't satisfy. * Raise on NULL in transpiled value comparisons; restore lt test and exercise None Python raises ``TypeError`` when an operand of ``<``, ``<=``, ``>``, ``>=``, ``==`` or ``!=`` is ``None``, but Spark's three-valued logic silently returns ``NULL`` -- a silent semantic divergence. The transpiler now wraps these comparisons with ``when(left.isNull() | right.isNull(), raise_error(...)).otherwise(...)`` so the rewritten plan fails loudly to match the Python source. Code that guards against NULL upstream (``if x is not None: x > 0``) hits the otherwise branch and is unaffected. Tests: - Unit: ``test_udf_transpile_less_than_zero`` restores the ``Lt`` case as a positive transpile assertion (previously in the unsupported-patterns matrix); ``test_udf_transpile_compare_with_none_raises`` pins the new raise behaviour for unguarded comparisons. - Hypothesis: ``_run`` now treats "both sides raised" as equivalent so the boolean hypothesis tests (``both_positive`` / ``either_positive``) can be re-armed with the full nullable ``_long_strategy`` and a ``_BOOLEAN_PAIR_EDGES`` tuple that explicitly seeds NULL combos. * Address Copilot review comments - ``_lower_eq``: Python's ``None == x`` / ``None != x`` returns a bool, not a TypeError. Stop routing ``ast.Eq`` / ``ast.NotEq`` through the raise-on-NULL guard and instead emit the four ``when`` branches matching Python's equality semantics (both NULL -> True/False, one NULL -> False/True, otherwise -> ``==`` / ``!=``). The ``raise_error`` guard now only fires on ordering comparisons (``<``, ``<=``, ``>``, ``>=``). - ``_is_definitely_non_boolean``: narrow the ``ast.BinOp`` match from a bare catch-all to the arithmetic / shift operators we actually want to flag. Bitwise ``&`` / ``|`` / ``^`` are deliberately left out so e.g. ``not ((x > 0) & (y > 0))`` keeps being recognised as boolean. - ``Column.__nonzero__`` (connect): use ``repr(self._expr)`` rather than the dunder directly, in keeping with the standard API. - Hoist ``_BOOLEAN_PAIR_EDGES`` to module-level alongside ``_LONG_PAIR_EDGES`` so the decorator references a stable module attribute rather than a class-body local. --------- Co-authored-by: Claude <noreply@anthropic.com>

…improve warnings. Done with Claude

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

…l in Scala for parent is udf and reduce scope remove some intermedite vals for debugging we don't need anymore, and use MDC logging.

holdenk · 2026-06-04T17:49:02Z

CC @devin-petersohn would love your input and review, I know it's still in draft phase though. I'm going to try and finish it up over the coming week

gaogaotiantian · 2026-06-04T19:21:25Z

I have a question - how correct and maintainable we want this to be? Python ast modules changes from version to version (not like huge changes but they do change. New nodes will be created and old nodes can be deprecated). We need to support a wide range of Python versions, do we build multiple versions of ast transpilers for different python versions?

Also, do we expect the users to "try it out" and see if it kind of works, or we aim to be "always correct or raising an error"? There's a huge difference.

For example:

        case ast.Compare():
            # All comparison operators produce bool.
            return True

Nope, pd.Series([1]) < pd.Series([0]) returns a pd.Series.

        case ast.BoolOp():
            # and / or of booleans produces bool.
            return True

Nope, 0 or 2 == 2

The non-boolean check for operations are not fully valid either - any user defined type can return a True for + - which is not that uncommon.

Like I mentioned when this SPIP was firstly discussed - Python is a super dynamic language and having a transpiler that is always correct on where it claims to be correct is not easy. So I want to confirm the purpose of the SPIP - do we want to be always correct, or correct most of the time to cover more transpilable UDFs, with the cost that we could produce wrong result?

holdenk · 2026-06-05T03:21:31Z

For ast module changes I think we're ok with just not transpiling new / different items. For the comparison operators right now we're limiting the UDF transpilation logic to scalar inputs we can scope it down more to also exclude non scalar captures (and expand back out later if we want for arrow / pandas types but handle them explicitly). To be clear this is. WIP PR but was talking with Scott about this in the context of some of his work so I wanted to raise the current draft for folks while
I'm finishing it up / getting it ready for a full review over the coming week.

sfc-gh-hkarau and others added 30 commits June 1, 2026 19:50

Add basic Python UDF transpilation

35864eb

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Drop comment, we've factored the transpile tests out into a seperate …

531ceae

…file Co-authored-by: Holden Karau <holden@pigscanfly.ca>

We use REusedSQLTestCase which reuses the SparkSession for the entire…

17226db

… class but reset the python worker logs between Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Style fixes w/ruff

6e41a3e

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Avoid leaving us with unevaluable nodes incase they somehow get added…

2b7360c

… and transpialtion subsequently disabled or ANSI mode swapped and instead just ignore all of the transpiled elems Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Add missing tree pattern

ff8ac69

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Improve udf unit tests to check the physical plan to see if the UDF i…

7cb6854

…s still there or not. Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Style

64dfa1e

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Fix type errors

f607f66

Co-authored-by: Holden Karau <holden@pigscanfly.ca>

Fix hypothesis

0c3c597

Fix self referential type

4493b8e

eh 4.2 is too tight, target 4.3 also add JIRAs to the follow ups and …

0f29f62

…improve warnings. Done with Claude

Better truthiness

cc627b3

Fix ansimode in type coercion since it is required to be on.

a4952d7

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix missing space

3ea0745

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Update truthiness

4d25820

Apply suggestions from code review

717718b

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Improve truthiness handling

a1e1da8

Improve coverage of transpilation via hypothesis

c79dc41

Improve coverage of transpilation via hypothesis

8dc5aec

Add eq semantics unit test

238baa7

Better error message on UDFPython fucntion invalid param name

70f4225

Cleanup message

3bc4b4c

Improve test coverage fix nested transpilation.

cd55bb1

Some small style fixes.

2a3adc2

Ask claude to do a review, caught the snake_case and swapped to camme…

c3ea150

…l in Scala for parent is udf and reduce scope remove some intermedite vals for debugging we don't need anymore, and use MDC logging.

Back out bad ruff formatting changes

c3f6183

Back out bad test_melt ruff format.

5ae933b

More bad format

7e275bc

sfc-gh-hkarau added 7 commits June 1, 2026 20:07

more bad format

b21cdd7

ugh formats

1642330

Add internal error condition for placeholder errors.

c190b7f

Add helpers for invalid placeholders

d2279bc

Use the helper

f06d750

Testing to make sure agg UDAF parties on

56ac3bb

Limit transpilation for now.

0b7a81f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP][SPARK-55206][PYTHON][SQL] Transpilation minimal functional implementation with python#56327

[WIP][SPARK-55206][PYTHON][SQL] Transpilation minimal functional implementation with python#56327
holdenk wants to merge 37 commits into
apache:masterfrom
holdenk:r3-SPARK-55206-transpilation-minimal-functional-impl-python

holdenk commented Jun 4, 2026 •

edited

Loading

Uh oh!

holdenk commented Jun 4, 2026

Uh oh!

gaogaotiantian commented Jun 4, 2026

Uh oh!

holdenk commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

holdenk commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

holdenk commented Jun 4, 2026

Uh oh!

gaogaotiantian commented Jun 4, 2026

Uh oh!

holdenk commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

holdenk commented Jun 4, 2026 •

edited

Loading