Skip to content

[WIP][SPARK-55206][PYTHON][SQL] Transpilation minimal functional implementation with python#56327

Draft
holdenk wants to merge 37 commits into
apache:masterfrom
holdenk:r3-SPARK-55206-transpilation-minimal-functional-impl-python
Draft

[WIP][SPARK-55206][PYTHON][SQL] Transpilation minimal functional implementation with python#56327
holdenk wants to merge 37 commits into
apache:masterfrom
holdenk:r3-SPARK-55206-transpilation-minimal-functional-impl-python

Conversation

@holdenk
Copy link
Copy Markdown
Contributor

@holdenk holdenk commented Jun 4, 2026

What changes were proposed in this pull request?

This is the first PR adding initial transpilation from Python to Catalyst and the framework for others to plug in different transpilers which can target other targets. This replaces #53547 and is the first PR as part of the transpilation SPIP https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?tab=t.0#heading=h.iz92aps6qbo5

Why are the changes needed?

Python UDF performance has been a perinial challenge in Spark, and while PyArrow makes the data copy slightly less expensive it's still much slower than native code. Additionally, Python UDF usage has increased substantailly with Pandas on Spark and it is especially hard for folks to write catalyst expressions here.

Does this PR introduce any user-facing change?

New configuration flags are user visible (default to off).

How was this patch tested?

New unit tests that trigger always and hypothesis tests that trigger only selectively.

Was this patch authored or co-authored using generative AI tooling?

Hypothesis teests were generated by Claude 4.7, eq semantics null and truthiness w/claude 4.6, GH code review (unknown backing model) on itiial draft). snake camel swap with sonnet.

sfc-gh-hkarau and others added 30 commits June 1, 2026 19:50
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…file

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… class but reset the python worker logs between

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… and transpialtion subsequently disabled or ANSI mode swapped and instead just ignore all of the transpiled elems

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…s still there or not.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
* [SPARK][PYTHON] Refuse to lower non-boolean and/or/not in UDF transpiler; fix Column truthiness in transpile path; surface column in CANNOT_CONVERT_COLUMN_INTO_BOOL

The UDF transpiler previously lowered Python `and` / `or` / `not` to
Spark's bitwise `&` / `|` / `~` regardless of operand type. For non-bool
operands (e.g. `x or 0` against an int column) this silently diverges
from Python's truthiness semantics. Detect statically-non-boolean
operands (numeric / string literals, arithmetic BinOps, USub/UAdd,
IfExp with both arms non-boolean) and refuse to lower; the UDF then
falls back to interpreted Python.

Also:
- Fix `if transpiled_column:` triggering CANNOT_CONVERT_COLUMN_INTO_BOOL
  during transpilation (which broke test_udf_transpile_basic). Use
  `is not None` instead.
- Include the offending column's repr in the
  CANNOT_CONVERT_COLUMN_INTO_BOOL error so users can tell which column
  triggered it.
- Add unit tests for boolean and/or transpilation, non-boolean
  fallback, and the improved error message; add hypothesis tests for
  both_positive / either_positive.

* Fix ruff F841 and CI test failures in transpiler boolean tests

- Remove leftover unused ``transpiled_param_names`` local in udf.py
  (the class attribute ``self._transpiled_param_names`` already covers
  this; the local was a leftover from an earlier refactor).
- Rewrite the unit-test UDFs ``both_positive`` / ``either_positive`` /
  the matching hypothesis UDFs as single-statement bodies so the
  transpiler (which only handles one top-level statement) actually
  lowers them. The previous form short-circuited out before the
  BoolOp path was exercised.
- Stop relying on schema inference for None-only Row inputs in
  ``test_udf_transpile_falls_back_for_non_boolean_short_circuit``.
  Pass an explicit LongType schema so ``createDataFrame`` doesn't
  fail with CANNOT_DETERMINE_TYPE on the ``or_zero_none`` case.
- Use a non-null long strategy for the boolean hypothesis tests
  since the interpreted Python body would raise on a None operand.

* Fix ruff format issues in udf.py and test_udf_transpile_unit.py

The previous lint failure was masked by ruff check failing first
(F841 unused variable). With check passing, ruff format now
flags the surrounding region: a stray blank line in the unit
test file and a multi-line conditional in udf.py that fits on
one line. Apply ruff format auto-fix.

* Support comparison operators in transpiler; fix mypy FunctionDef typing

Two CI fixes:

1. The new ``test_udf_transpile_boolean_and_or_lowered`` test exercises
   ``x > 0 and y > 0``, but the transpiler's Compare handler only
   covered ``Is`` / ``IsNot``. Add Eq, NotEq, Lt, LtE, Gt, GtE so the
   bool-typed and/or path actually transpiles instead of bailing on
   the first comparison. Drop the now-stale ``less_than_zero`` case
   from the ``falls_back_for_unsupported_patterns`` matrix.

2. mypy 1.8.0 (CI) couldn't pick the right ``ast.FunctionDef``
   overload from kwargs when ``decorator_list=[]`` was an unannotated
   empty literal (it inferred ``list[Never]``). Annotate the
   intermediate list locals so the call resolves to the documented
   overload.

* Drop ast.FunctionDef call to Any to dodge mypy overload mismatch

The typed overloads for ``ast.FunctionDef`` in mypy 1.8.0's typeshed
require keyword-only ``type_params`` on Python 3.12+, but that field
isn't accepted at runtime on every supported Python (it was added
in 3.12, before that the constructor rejects it). Setting the
constructor to ``Any`` keeps the runtime behaviour unchanged while
sidestepping the typeshed overload resolution that the previous
typed-locals approach couldn't satisfy.

* Raise on NULL in transpiled value comparisons; restore lt test and exercise None

Python raises ``TypeError`` when an operand of ``<``, ``<=``, ``>``,
``>=``, ``==`` or ``!=`` is ``None``, but Spark's three-valued logic
silently returns ``NULL`` -- a silent semantic divergence. The
transpiler now wraps these comparisons with
``when(left.isNull() | right.isNull(), raise_error(...)).otherwise(...)``
so the rewritten plan fails loudly to match the Python source. Code
that guards against NULL upstream (``if x is not None: x > 0``) hits
the otherwise branch and is unaffected.

Tests:
- Unit: ``test_udf_transpile_less_than_zero`` restores the ``Lt`` case
  as a positive transpile assertion (previously in the
  unsupported-patterns matrix); ``test_udf_transpile_compare_with_none_raises``
  pins the new raise behaviour for unguarded comparisons.
- Hypothesis: ``_run`` now treats "both sides raised" as equivalent so
  the boolean hypothesis tests (``both_positive`` / ``either_positive``)
  can be re-armed with the full nullable ``_long_strategy`` and a
  ``_BOOLEAN_PAIR_EDGES`` tuple that explicitly seeds NULL combos.

* Address Copilot review comments

- ``_lower_eq``: Python's ``None == x`` / ``None != x`` returns a
  bool, not a TypeError. Stop routing ``ast.Eq`` / ``ast.NotEq``
  through the raise-on-NULL guard and instead emit the four
  ``when`` branches matching Python's equality semantics (both
  NULL -> True/False, one NULL -> False/True, otherwise -> ``==``
  / ``!=``). The ``raise_error`` guard now only fires on ordering
  comparisons (``<``, ``<=``, ``>``, ``>=``).

- ``_is_definitely_non_boolean``: narrow the ``ast.BinOp`` match
  from a bare catch-all to the arithmetic / shift operators we
  actually want to flag. Bitwise ``&`` / ``|`` / ``^`` are
  deliberately left out so e.g. ``not ((x > 0) & (y > 0))`` keeps
  being recognised as boolean.

- ``Column.__nonzero__`` (connect): use ``repr(self._expr)``
  rather than the dunder directly, in keeping with the standard
  API.

- Hoist ``_BOOLEAN_PAIR_EDGES`` to module-level alongside
  ``_LONG_PAIR_EDGES`` so the decorator references a stable
  module attribute rather than a class-body local.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…l in Scala for parent is udf and reduce scope remove some intermedite vals for debugging we don't need anymore, and use MDC logging.
@holdenk
Copy link
Copy Markdown
Contributor Author

holdenk commented Jun 4, 2026

CC @devin-petersohn would love your input and review, I know it's still in draft phase though. I'm going to try and finish it up over the coming week

@gaogaotiantian
Copy link
Copy Markdown
Contributor

I have a question - how correct and maintainable we want this to be? Python ast modules changes from version to version (not like huge changes but they do change. New nodes will be created and old nodes can be deprecated). We need to support a wide range of Python versions, do we build multiple versions of ast transpilers for different python versions?

Also, do we expect the users to "try it out" and see if it kind of works, or we aim to be "always correct or raising an error"? There's a huge difference.

For example:

        case ast.Compare():
            # All comparison operators produce bool.
            return True

Nope, pd.Series([1]) < pd.Series([0]) returns a pd.Series.

        case ast.BoolOp():
            # and / or of booleans produces bool.
            return True

Nope, 0 or 2 == 2

The non-boolean check for operations are not fully valid either - any user defined type can return a True for + - which is not that uncommon.

Like I mentioned when this SPIP was firstly discussed - Python is a super dynamic language and having a transpiler that is always correct on where it claims to be correct is not easy. So I want to confirm the purpose of the SPIP - do we want to be always correct, or correct most of the time to cover more transpilable UDFs, with the cost that we could produce wrong result?

@holdenk
Copy link
Copy Markdown
Contributor Author

holdenk commented Jun 5, 2026

For ast module changes I think we're ok with just not transpiling new / different items. For the comparison operators right now we're limiting the UDF transpilation logic to scalar inputs we can scope it down more to also exclude non scalar captures (and expand back out later if we want for arrow / pandas types but handle them explicitly). To be clear this is. WIP PR but was talking with Scott about this in the context of some of his work so I wanted to raise the current draft for folks while
I'm finishing it up / getting it ready for a full review over the coming week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants