feat(core): add Substrait dialect support by nielspardon · Pull Request #861 · substrait-io/substrait-java

nielspardon · 2026-06-10T09:59:05Z

What

Adds a typed model in the core Java SDK for creating and consuming Substrait dialect YAML files, faithful to substrait/text/dialect_schema.yaml (introduced in substrait v0.76.0). Until now the only producer of a dialect in this repo was the Scala DialectGenerator in the spark module, whose ad-hoc model isn't reusable.

Usage

Build a dialect, serialize it to YAML, and parse it back:

import io.substrait.dialect.Dialect;
import io.substrait.dialect.Dialect.*;

DialectDocument dialect =
    DialectDocument.builder()
        .name("Example Dialect")
        .putDependencies("arithmetic", "extension:io.substrait:functions_arithmetic")
        // Types: a bare entry and a configured one.
        .addSupportedTypes(SupportedType.of(TypeKind.BOOL))
        .addSupportedTypes(
            SupportedType.builder().type(TypeKind.PRECISION_TIMESTAMP).maxPrecision(9).build())
        // Relations: a bare entry and one carrying configuration.
        .addSupportedRelations(SupportedRelation.of(RelationKind.FILTER))
        .addSupportedRelations(
            SupportedRelation.builder()
                .relation(RelationKind.JOIN)
                .addJoinTypes(JoinType.INNER, JoinType.LEFT)
                .build())
        // Functions reference a dependency alias.
        .addSupportedScalarFunctions(
            DialectFunction.builder()
                .source("arithmetic")
                .name("add")
                .systemMetadata(
                    SystemFunctionMetadata.builder().name("+").notation(Notation.INFIX).build())
                .addSupportedImpls("i32_i32", "i64_i64")
                .build())
        .build();

String yaml = Dialect.toYaml(dialect);          // create
DialectDocument parsed = Dialect.load(yaml);     // consume

The configuration-free entries serialize as bare enum strings and the configured ones as mappings:

---
name: "Example Dialect"
dependencies:
  arithmetic: "extension:io.substrait:functions_arithmetic"
supported_types:
- "BOOL"
- type: "PRECISION_TIMESTAMP"
  max_precision: 9
supported_relations:
- "FILTER"
- relation: "JOIN"
  join_types:
  - "INNER"
  - "LEFT"
supported_scalar_functions:
- source: "arithmetic"
  name: "add"
  system_metadata:
    name: "+"
    notation: "INFIX"
  supported_impls:
  - "i32_i32"
  - "i64_i64"

A fuller example covering every union and configuration option lives in DialectRoundTripTest.

Design

New io.substrait.dialect package with a @Value.Enclosing Dialect holder and nested Immutables types, mirroring the existing SimpleExtension pattern (Jackson + @Value.Immutable, static load(...)/toYaml(...) helpers).
The three polymorphic unions (supported_types / supported_relations / supported_expressions) are oneOf [bare-enum-string | config-object] in the schema. Each is modeled as one enum-tag class per category (SupportedType/SupportedRelation/SupportedExpression) carrying a dialect-local kind enum plus typed config fields. Custom Jackson (de)serializers collapse config-free entries to a bare enum string and expand configured ones to objects.
Config sub-enums (JoinType, SetOperation, ...) are dialect-local with exactly the schema's constants, keeping the dialect vocabulary decoupled from the relational-algebra model (whose Join.JoinType/Set.SetOp carry extra UNKNOWN/deprecated values).
The existing Type/Rel/Expression hierarchies model full algebra instances — the wrong abstraction level for capability tags — so they are intentionally not reused for the kind enums.
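The collapse/expand behavior of the union (de)serializers described above can be illustrated without Jackson. The following is a minimal sketch, not the PR's implementation: `SketchTypeKind`, `SketchSupportedType`, `toNode`, and `fromNode` are hypothetical names, and a plain `String` or `Map` stands in for the YAML node that the real custom (de)serializers would read and write.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a dialect-local kind enum. */
enum SketchTypeKind {
  BOOL,
  PRECISION_TIMESTAMP
}

/** Hypothetical enum-tag class: a kind plus one optional, typed config field. */
final class SketchSupportedType {
  final SketchTypeKind kind;
  final Optional<Integer> maxPrecision;

  SketchSupportedType(SketchTypeKind kind, Optional<Integer> maxPrecision) {
    this.kind = kind;
    this.maxPrecision = maxPrecision;
  }

  /** Collapse: a config-free entry becomes a bare string, a configured one a mapping. */
  Object toNode() {
    if (!maxPrecision.isPresent()) {
      return kind.name();
    }
    Map<String, Object> node = new LinkedHashMap<>();
    node.put("type", kind.name());
    node.put("max_precision", maxPrecision.get());
    return node;
  }

  /** Expand: accept either shape of the oneOf union. */
  static SketchSupportedType fromNode(Object node) {
    if (node instanceof String) {
      return new SketchSupportedType(SketchTypeKind.valueOf((String) node), Optional.empty());
    }
    if (node instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) node;
      SketchTypeKind kind = SketchTypeKind.valueOf((String) map.get("type"));
      Integer precision = (Integer) map.get("max_precision");
      return new SketchSupportedType(kind, Optional.ofNullable(precision));
    }
    throw new IllegalArgumentException("expected a string or a mapping, got: " + node);
  }
}
```

Keeping both shapes behind one class means callers only ever see the kind and its typed config fields, regardless of which form appeared in the file.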

Validation & tests

Schema validation is test-scope only (networknt json-schema-validator); the published core jar gains no new runtime dependency.
processTestResources copies the dialect schema, the published spark_dialect.yaml, and the spec's per-section dialect fixtures onto the test classpath.
Tests: a schema-validated build -> serialize -> validate -> parse -> assert-equal round-trip; bare-string collapse behavior; parsing/re-validating the real Spark dialect; and a parameterized round-trip over all five spec fixtures (types, relations, expressions, functions, execution_behavior).

The spark module is left unchanged; migrating its DialectGenerator onto this model is a natural follow-up.

🤖 Generated with AI

Add a typed model in io.substrait.dialect for creating and consuming Substrait dialect YAML files, faithful to substrait/text/dialect_schema.yaml (introduced in substrait v0.76.0). The model mirrors the SimpleExtension pattern: a @Value.Enclosing Dialect holder with nested Immutables types and Jackson (de)serialization. The three polymorphic unions (supported_types/relations/expressions) use an enum-tag class per category with typed config fields, and custom (de)serializers that collapse config-free entries to bare enum strings and expand configured ones to objects. Config sub-enums are dialect-local to keep the dialect vocabulary decoupled from the algebra model. Schema validation is test-scope only (networknt json-schema-validator), so the published core jar gains no new runtime dependency. Tests cover a schema-validated round-trip, bare-string collapse, parsing the published spark_dialect.yaml, and the per-section dialect fixtures from the spec repo.

bestbeforetoday

The programmatic building of a Dialect looks really nice. A whole load of inline comments; mostly very minor suggestions or queries.

Promote the nested dialect types out of the @Value.Enclosing Dialect holder into top-level classes in io.substrait.dialect, and rename DialectDocument to Dialect (now carrying the builder/load/toYaml factory methods). This removes the io.substrait.dialect.Dialect.DialectDocument repetition.

Also addresses the remaining review comments:
- share a single package-scoped ObjectMapper instead of creating one per call, and have the union (de)serializers use it directly rather than casting JsonParser#getCodec()
- read directly from InputStream/File instead of slurping into a String; the InputStream overload no longer closes a caller-owned stream
- accept a single scalar as a one-element list in readEnums/readStrings, consistent with ACCEPT_SINGLE_VALUE_AS_ARRAY
- reject metadata on EXECUTION_CONTEXT_VARIABLE and forbid setting both writeTypes and ddlWriteTypes (they share the write_types field), so entries always round-trip
- make the (de)serializer classes package-private
- use InputStream.readAllBytes() in the spec-fixtures test

🤖 Generated with AI
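The "single scalar as a one-element list" handling mentioned in the commit message above might look roughly like the sketch below. It is an illustration under assumptions: the real `readStrings` in the PR operates on Jackson nodes and its signature is not shown here, so `ScalarOrListSketch` and its `Object`-based parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; the PR's readStrings works on Jackson nodes instead. */
final class ScalarOrListSketch {
  private ScalarOrListSketch() {}

  /** Reads a list of strings, treating a lone scalar as a one-element list. */
  static List<String> readStrings(Object node) {
    if (node == null) {
      // An absent field is read as an empty list.
      return Collections.emptyList();
    }
    if (node instanceof String) {
      // Lenient case: `supported_impls: "i32_i32"` behaves like a one-element list.
      return Collections.singletonList((String) node);
    }
    if (node instanceof List) {
      List<String> out = new ArrayList<>();
      for (Object item : (List<?>) node) {
        if (!(item instanceof String)) {
          throw new IllegalArgumentException("expected a string element, got: " + item);
        }
        out.add((String) item);
      }
      return out;
    }
    throw new IllegalArgumentException("expected a string or a list, got: " + node);
  }
}
```

With this shape, a hand-written dialect file that gives a single value instead of a list parses the same way as the list form, which matches what Jackson's `ACCEPT_SINGLE_VALUE_AS_ARRAY` feature does for regular bean properties.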


The upstream main now runs `core:javadoc` with `-Xdoclint:all -Xwerror` (added in substrait-io#949), which fails the build on any missing-javadoc warning. Flattening the dialect model into top-level public types exposed their members to this check, so document every public enum constant and every public accessor / factory method (with @param/@return) following the existing convention (e.g. io.substrait.hint.Hint).
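As a rough illustration of the documentation style that the doclint check described above asks for, the sketch below documents every enum constant and gives a factory method both `@param` and `@return` tags. `SketchNotation`, its constants, and `fromName` are hypothetical and chosen only for illustration; the types in the PR are public, whereas this one is package-private so the snippet stays self-contained.

```java
/** Hypothetical dialect-local enum, used only to illustrate the javadoc style. */
enum SketchNotation {
  /** Operator written between its operands, for example {@code a + b}. */
  INFIX,
  /** Operator written before its operand, for example {@code -a}. */
  PREFIX;

  /**
   * Parses a notation from its serialized name.
   *
   * @param name the serialized constant name, for example {@code "INFIX"}
   * @return the matching notation
   * @throws IllegalArgumentException if {@code name} matches no constant
   */
  static SketchNotation fromName(String name) {
    return valueOf(name);
  }
}
```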

nielspardon requested review from andrew-coleman, benbellick, bestbeforetoday and vbarua June 26, 2026 07:42

bestbeforetoday reviewed Jun 26, 2026


nielspardon added 3 commits June 29, 2026 08:36

Merge remote-tracking branch 'upstream/main' into par-dialect

164ac1e

nielspardon wants to merge 4 commits into substrait-io:main from nielspardon:par-dialect
