feat(core): add Substrait dialect support#861
Open
nielspardon wants to merge 4 commits into
Open
Conversation
Add a typed model in io.substrait.dialect for creating and consuming Substrait dialect YAML files, faithful to substrait/text/dialect_schema.yaml (introduced in substrait v0.76.0). The model mirrors the SimpleExtension pattern: a @Value.Enclosing Dialect holder with nested Immutables types and Jackson (de)serialization. The three polymorphic unions (supported_types/relations/expressions) use an enum-tag class per category with typed config fields, and custom (de)serializers that collapse config-free entries to bare enum strings and expand configured ones to objects. Config sub-enums are dialect-local to keep the dialect vocabulary decoupled from the algebra model. Schema validation is test-scope only (networknt json-schema-validator), so the published core jar gains no new runtime dependency. Tests cover a schema-validated round-trip, bare-string collapse, parsing the published spark_dialect.yaml, and the per-section dialect fixtures from the spec repo.
bestbeforetoday
left a comment
Member
There was a problem hiding this comment.
The programmatic building of a Dialect looks really nice. A whole load of inline comments; mostly very minor suggestions or queries.
Promote the nested dialect types out of the @Value.Enclosing Dialect holder into top-level classes in io.substrait.dialect, and rename DialectDocument to Dialect (now carrying the builder/load/toYaml factory methods). This removes the io.substrait.dialect.Dialect.DialectDocument repetition. Also addresses the remaining review comments: - share a single package-scoped ObjectMapper instead of creating one per call, and have the union (de)serializers use it directly rather than casting JsonParser#getCodec() - read directly from InputStream/File instead of slurping into a String; the InputStream overload no longer closes a caller-owned stream - accept a single scalar as a one-element list in readEnums/readStrings, consistent with ACCEPT_SINGLE_VALUE_AS_ARRAY - reject metadata on EXECUTION_CONTEXT_VARIABLE and forbid setting both writeTypes and ddlWriteTypes (they share the write_types field), so entries always round-trip - make the (de)serializer classes package-private - use InputStream.readAllBytes() in the spec-fixtures test 🤖 Generated with AI
The upstream main now runs `core:javadoc` with `-Xdoclint:all -Xwerror` (added in substrait-io#949), which fails the build on any missing-javadoc warning. Flattening the dialect model into top-level public types exposed their members to this check, so document every public enum constant and every public accessor / factory method (with @param/@return) following the existing convention (e.g. io.substrait.hint.Hint).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds a typed model in the
coreJava SDK for creating and consuming Substrait dialect YAML files, faithful tosubstrait/text/dialect_schema.yaml(introduced in substrait v0.76.0). Until now the only producer of a dialect in this repo was the ScalaDialectGeneratorin thesparkmodule, whose ad-hoc model isn't reusable.Usage
Build a dialect, serialize it to YAML, and parse it back:
The configuration-free entries serialize as bare enum strings and the configured ones as mappings:
A fuller example covering every union and configuration option lives in
DialectRoundTripTest.Design
io.substrait.dialectpackage with a@Value.EnclosingDialectholder and nested Immutables types, mirroring the existingSimpleExtensionpattern (Jackson +@Value.Immutable, staticload(...)/toYaml(...)helpers).supported_types/supported_relations/supported_expressions) areoneOf [bare-enum-string | config-object]in the schema. Each is modeled as one enum-tag class per category (SupportedType/SupportedRelation/SupportedExpression) carrying a dialect-local kind enum plus typed config fields. Custom Jackson (de)serializers collapse config-free entries to a bare enum string and expand configured ones to objects.JoinType,SetOperation, ...) are dialect-local with exactly the schema's constants, keeping the dialect vocabulary decoupled from the relational-algebra model (whoseJoin.JoinType/Set.SetOpcarry extraUNKNOWN/deprecated values).Type/Rel/Expressionhierarchies model full algebra instances — the wrong abstraction level for capability tags — so they are intentionally not reused for the kind enums.Validation & tests
networknt json-schema-validator); the publishedcorejar gains no new runtime dependency.processTestResourcescopies the dialect schema, the publishedspark_dialect.yaml, and the spec's per-section dialect fixtures onto the test classpath.types,relations,expressions,functions,execution_behavior).The
sparkmodule is left unchanged; migrating itsDialectGeneratoronto this model is a natural follow-up.🤖 Generated with AI