[flink] Expose scan.bucket for single-bucket manifest pruning #8117
wwj6591812 wants to merge 1 commit into
17c2722 to 7e8c5d8
The failed test is not related to my modifications.
The validation here still allows
Could we enforce that here, e.g. require the fixed bucket mode and a configured bucket count > 0, and also check primary-key-ness if the option is intended only for primary-key tables?
7e8c5d8 to 9b67691
Hi, thanks for your review. Please CC, Thx.
    if (limit != null) {
        readBuilder.withLimit(limit.intValue());
    }
    ScanBucketUtils.applyScanBucket(table, readBuilder, conf);
This applies scan.bucket only to the normal source read builder. Aggregate pushdown plans splits through AggregatePushDownUtils.planSplits(...) with a separate ReadBuilder, so queries such as SELECT COUNT(*) FROM T /*+ OPTIONS('scan.bucket' = '0') */ can still aggregate all buckets while non-aggregate reads scan only the requested bucket. Please either apply SCAN_BUCKET in the aggregate pushdown planning path as well, or disable aggregate pushdown when scan.bucket is set.
Thanks for the review, @JingsongLi.
I have applied ScanBucketUtils.applyScanBucket in AggregatePushDownUtils.planSplits(...) as well, so the aggregate pushdown planning path now respects the scan.bucket option too.
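The inconsistency raised in this thread can be illustrated with a minimal sketch. It assumes nothing about Paimon's real classes: `BucketPlanningSketch`, `Split`, `planSplits`, `splitsToRead`, and `pushedDownCount` are hypothetical stand-ins. The point is only that when both the normal read path and the aggregate pushdown path go through one bucket-aware planning step, a pushed-down COUNT sees the same buckets as a regular scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Paimon classes.
final class BucketPlanningSketch {

    /** A planned split, tagged with its bucket and a row count known from metadata. */
    static final class Split {
        final int bucket;
        final long rowCount;

        Split(int bucket, long rowCount) {
            this.bucket = bucket;
            this.rowCount = rowCount;
        }
    }

    /** The one place the bucket restriction is applied; null means "all buckets". */
    static List<Split> planSplits(List<Split> all, Integer scanBucket) {
        if (scanBucket == null) {
            return all;
        }
        List<Split> kept = new ArrayList<>();
        for (Split split : all) {
            if (split.bucket == scanBucket) {
                kept.add(split);
            }
        }
        return kept;
    }

    /** Normal read path: number of splits a reader would have to open. */
    static int splitsToRead(List<Split> all, Integer scanBucket) {
        return planSplits(all, scanBucket).size();
    }

    /** Aggregate pushdown path: a count answered from split metadata only. */
    static long pushedDownCount(List<Split> all, Integer scanBucket) {
        long total = 0;
        for (Split split : planSplits(all, scanBucket)) {
            total += split.rowCount;
        }
        return total;
    }

    /** Small fixture: two splits in bucket 0, one split in bucket 1. */
    static List<Split> sample() {
        return Arrays.asList(new Split(0, 10), new Split(1, 20), new Split(0, 5));
    }
}
```

If the pushdown path planned its splits separately and skipped the restriction, `pushedDownCount` would sum every bucket while the normal read opened only the requested one, which is the mismatch described in the review.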
        "Bucket scan is only supported for fixed-bucket tables, but got bucket mode %s.",
        fileStoreTable.bucketMode());
    checkArgument(
        !fileStoreTable.schema().primaryKeys().isEmpty(),
This makes the public core ReadBuilder.withBucket(...) API reject fixed-bucket append tables, even though bucket-level manifest pruning is useful and valid there too. If the primary-key restriction is only meant for the Flink scan.bucket option, could we keep ReadBuilder.withBucket(...) generic and enforce the Flink option restriction in the Flink scan.bucket path instead?
Thanks for the review, @JingsongLi.
I have removed the primary-key restriction from ReadBuilderImpl.validateSpecifiedBucket(...) to keep the core ReadBuilder.withBucket(...) API generic for fixed-bucket append tables. The primary-key check is now enforced inside ScanBucketUtils.applyScanBucket(...), which is the Flink scan.bucket option path. I also updated ReadBuilderImplTest accordingly.
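The layering agreed on in this thread can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `BucketValidationSketch`, `TableInfo`, the simplified `BucketMode` enum, and both method names are hypothetical, and the range check on the bucket index is an added assumption. The generic check accepts any fixed-bucket table, while the option-level check adds the primary-key requirement on top.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Paimon classes.
final class BucketValidationSketch {

    /** Simplified bucket modes (assumption: the real enum has more values). */
    enum BucketMode { FIXED, DYNAMIC, UNAWARE }

    /** Minimal table description used by both checks. */
    static final class TableInfo {
        final BucketMode mode;
        final int numBuckets;
        final List<String> primaryKeys;

        TableInfo(BucketMode mode, int numBuckets, List<String> primaryKeys) {
            this.mode = mode;
            this.numBuckets = numBuckets;
            this.primaryKeys = primaryKeys;
        }
    }

    /** Generic check: valid for fixed-bucket tables with or without primary keys. */
    static void validateCore(TableInfo table, int bucket) {
        if (table.mode != BucketMode.FIXED) {
            throw new IllegalArgumentException(
                    "Bucket scan is only supported for fixed-bucket tables, but got " + table.mode);
        }
        if (table.numBuckets <= 0) {
            throw new IllegalArgumentException("Configured bucket count must be > 0");
        }
        // Assumption: the requested bucket must fall inside [0, numBuckets).
        if (bucket < 0 || bucket >= table.numBuckets) {
            throw new IllegalArgumentException("Bucket " + bucket + " is out of range");
        }
    }

    /** Option-level check: the stricter primary-key rule lives only here. */
    static void validateScanBucketOption(TableInfo table, int bucket) {
        if (table.primaryKeys.isEmpty()) {
            throw new IllegalArgumentException("scan.bucket requires a primary-key table");
        }
        validateCore(table, bucket);
    }

    /** Fixture: fixed-bucket append table with 4 buckets and no primary key. */
    static TableInfo appendTable() {
        return new TableInfo(BucketMode.FIXED, 4, Collections.<String>emptyList());
    }

    /** Fixture: fixed-bucket primary-key table with 4 buckets. */
    static TableInfo primaryKeyTable() {
        return new TableInfo(BucketMode.FIXED, 4, Arrays.asList("id"));
    }
}
```

Keeping the stricter rule in the option path means a caller of the generic API can still prune to one bucket of an append table, while the SQL option stays limited to primary-key tables.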
db92906 to 3b01bc0
3b01bc0 to 9793995
Background
ReadBuilder.withBucket(int) and manifest scanning already support reading a single bucket, but Flink SQL had no connector option to expose it. Operators often need to debug or scan one bucket of a fixed-bucket primary-key table without reading all buckets.
Why this PR
Expose scan.bucket in Flink so users can run:
SELECT * FROM t /*+ OPTIONS('scan.bucket' = '0') */
and plan splits only for that bucket.
What changes
Stage optimized: scan / manifest planning — fewer manifest entries and splits before read. No change to merge or per-record logic.
Tests
Test plan