feat: Support IEEE 754 negative zero semantics#22835
Conversation
|
run benchmark tpch tpcds |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing array_nan (1a564ec) to 883c38e (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing array_nan (1a564ec) to 883c38e (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
run benchmark tpch tpcds |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing array_nan (beb8750) to 883c38e (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing array_nan (beb8750) to 883c38e (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
mbutrovich
left a comment
There was a problem hiding this comment.
Thanks for tackling this, @comphead!
One thing I wanted to ask about: optimizing for the common case where an array has no -0.0.
normalize_float_zero allocates unconditionally
PrimitiveArray::unary always allocates a fresh values buffer of length n, even when there's no -0.0 to fix — which is the overwhelmingly common case. This now sits on some hot paths:
apply_cmp— every float=/</>/<=/>=/IS [NOT] DISTINCTcomparison, on both operands.eq_dyn_null— float hash-join key verification.GroupValuesRows::intern— a deep copy of each float column on every batch.
Would it be worth gating the copy behind a SIMD-reducible pre-scan, so we only allocate when a -0.0 is actually present?
// f64 case; sign-bit-only pattern == -0.0
let needs = arr.values().iter().any(|v| v.to_bits() == 0x8000_0000_0000_0000);
if !needs { return Arc::clone(array); }.any() over a contiguous slice should vectorize to a read-only OR-reduction, so the common case drops from "allocate + write n" to just a scan, and we pay the unary cost only when there's something to normalize. The per-native-value canonicalize path in the primitive group-by builders already avoids allocation entirely — this would bring the array path closer to that.
Small question on coverage
The fix covers hash joins via eq_dyn_null + create_hashes, but JoinKeyComparator builds via make_comparator (arrow totalOrder), which isn't normalized. Is sort-merge join reachable for float equi-keys, and if so is it covered? The current tests look like they exercise the hash-join plan only.
Either way, this is a solid fix — mostly just wondering whether we can keep the no--0.0 path allocation-free.
| /// (PostgreSQL / IEEE 754 equality) require them to compare equal, so | ||
| /// callers normalize before invoking those kernels. | ||
| /// | ||
| /// The common case — no `-0.0` present — is allocation-free: a single |
There was a problem hiding this comment.
Parenthetical in em dashes is awkward.
mbutrovich
left a comment
There was a problem hiding this comment.
Minor nit, thanks @comphead.
|
Thanks @mbutrovich for the review |
|
ive only skimmed this, but i wonder how we'll enforce/ensure we remember to apply this normalization each time for different operation/functions/execution paths 🤔 for example, this fix isn't applied to > select array_replace([0.0, -0.0], 0.0, 1.0);
+-------------------------------------------------------------------------+
| array_replace(make_array(Float64(0),Float64(-0)),Float64(0),Float64(1)) |
+-------------------------------------------------------------------------+
| [1.0, -0.0] |
+-------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.postgres for comparison: postgres=# select array_replace(ARRAY[0.0, -0.0], 0.0, 1.0);
array_replace
---------------
{1.0,1.0}
(1 row) |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#22826 . - Closes apache#22490 - Closes apache#11108 ## Rationale for this change SQL (per PG and IEEE 754) treats `+0.0` and `-0.0` as equal in `=`, `IS DISTINCT FROM`, DISTINCT, GROUP BY, UNION/INTERSECT/EXCEPT, equi-joins, and `array_*` set ops. DataFusion treated them as distinct because: - Arrow's `cmp::eq`/`gt`/`lt` use IEEE 754 totalOrder for floats — `arrow-ord-58.3.0/src/cmp.rs:71-75` explicitly says *"please normalize zeros before calling this kernel"*. - Arrow's `RowConverter` row-encodes floats with totalOrder; ±0 produce different bytes. - DataFusion's primitive float hashing used raw `to_bits()` / `to_ne_bytes()`, so ±0 hashed to different buckets. ## What changes are included in this PR? **Helper** in `datafusion/common/src/utils/mod.rs`: - `normalize_float_zero(&ArrayRef) -> ArrayRef` — rewrites `-0.0 → +0.0` for Float16/32/64 via `PrimitiveArray::unary`; `Arc::clone` for non-float. NaN payloads preserved (`bits << 1 == 0` matches only ±0). - `normalize_float_zero_scalar(ScalarValue) -> ScalarValue` — symmetric for scalars. **Applied at six boundary sites** where DataFusion hands float data to Arrow: | Site | Fixes | |---|---| | `physical-expr-common/src/datum.rs::apply_cmp` | BinaryExpr `=`, `<`, `>`, `IS DISTINCT FROM` | | `physical-plan/src/joins/utils.rs::eq_dyn_null` | HashJoin row equality → INNER JOIN, INTERSECT, EXCEPT | | `physical-plan/src/aggregates/group_values/row.rs::intern` | Multi-column row-encoded GROUP BY | | `functions-nested/src/set_ops.rs::general_array_distinct` | `array_distinct` | | `functions-nested/src/set_ops.rs::generic_set_lists` | `array_union`, `array_intersect` | | `functions-nested/src/except.rs::general_except` | `array_except` | **Float hash macros** normalize for consistency: - `datafusion/common/src/hash_utils.rs::hash_float_value!` — `create_hashes`, hash joins, shuffle. - `datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs::hash_float!` — single-column primitive GROUP BY fast path. **Single-column primitive GROUP BY / DISTINCT** (`GroupValuesPrimitive::intern`, `PrimitiveGroupValueBuilder::{append_val,vectorized_append,equal_to,vectorized_equal_to_*}`) canonicalize the input via a new default-identity `canonicalize` method on the local `HashValue` trait (float override only). Trait visibility lifted to `pub` so the multi-column file can use it.
Which issue does this PR close?
Rationale for this change
SQL (per PG and IEEE 754) treats
+0.0and-0.0as equal in=,IS DISTINCT FROM, DISTINCT, GROUP BY, UNION/INTERSECT/EXCEPT, equi-joins, andarray_*set ops. DataFusion treated them asdistinct because:
cmp::eq/gt/ltuse IEEE 754 totalOrder for floats —arrow-ord-58.3.0/src/cmp.rs:71-75explicitly says "please normalize zeros before calling this kernel".RowConverterrow-encodes floats with totalOrder; ±0 produce different bytes.to_bits()/to_ne_bytes(), so ±0 hashed to different buckets.What changes are included in this PR?
Helper in
datafusion/common/src/utils/mod.rs:normalize_float_zero(&ArrayRef) -> ArrayRef— rewrites-0.0 → +0.0for Float16/32/64 viaPrimitiveArray::unary;Arc::clonefor non-float. NaN payloads preserved (bits << 1 == 0matches only±0).
normalize_float_zero_scalar(ScalarValue) -> ScalarValue— symmetric for scalars.Applied at six boundary sites where DataFusion hands float data to Arrow:
physical-expr-common/src/datum.rs::apply_cmp=,<,>,IS DISTINCT FROMphysical-plan/src/joins/utils.rs::eq_dyn_nullphysical-plan/src/aggregates/group_values/row.rs::internfunctions-nested/src/set_ops.rs::general_array_distinctarray_distinctfunctions-nested/src/set_ops.rs::generic_set_listsarray_union,array_intersectfunctions-nested/src/except.rs::general_exceptarray_exceptFloat hash macros normalize for consistency:
datafusion/common/src/hash_utils.rs::hash_float_value!—create_hashes, hash joins, shuffle.datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs::hash_float!— single-column primitive GROUP BY fast path.Single-column primitive GROUP BY / DISTINCT (
GroupValuesPrimitive::intern,PrimitiveGroupValueBuilder::{append_val,vectorized_append,equal_to,vectorized_equal_to_*}) canonicalize the input viaa new default-identity
canonicalizemethod on the localHashValuetrait (float override only). Trait visibility lifted topubso the multi-column file can use it.