LALR(1) parser from official MySQL grammar #429

Draft
JanJakes wants to merge 11 commits into trunk from lalr-parser

JanJakes (Member) commented Jun 10, 2026

Note

The changed-line counts are misleading: 115,000 of the added lines are just a test query corpus copied to the new mysql-parser package from mysql-on-sqlite.

This PR adds a new experimental package, packages/mysql-parser, that builds a MySQL parser directly from MySQL's own grammar. It compiles MySQL 8.4 LTS's sql_yacc.yy and lex.h (unchanged) with the Bison version MySQL uses into a compact parse table, run by a small deterministic LALR(1) parser. The accepted language tracks a real MySQL release exactly, with no hand-maintained grammar to drift.

What it does

  • Reproducible build (composer run build-grammar): fetch pinned, checksum-verified sources → Bison 3.8.2 in Docker (version-asserted) → generate the parse table and the grammar's token data. Re-running reproduces the committed artifacts byte-for-byte. Both artifacts are plain PHP arrays — no binary blobs; the parse table compacts the 4.6M-cell ACTION/GOTO matrix to ~7% via default reductions, shared rows, and patch-encoded near-duplicate rows (182 KB).
  • Copies the existing lexer and adapts it to the grammar natively: token ids are the grammar's own token numbers, and the keyword table is generated from MySQL's lex.h — no translation layer. Keyword synonyms, paren-gated function keywords, and dropped keywords all come from MySQL's own data, and the lexer mirrors MySQL's scanner-level quirks (WITH ROLLUP contraction, @ handling incl. empty host names, HIGH_NOT_PRECEDENCE).
  • Deterministic runtime (WP_MySQL_Parser) — the 8.4 grammar is unambiguous for LALR(1), so it's a plain shift-reduce loop: no GLR, backtracking, or conflict tables.
  • A PHPUnit suite with a CI job — token stream, runtime, token API, grammar-data invariants, and a corpus regression test pinning the exact acceptance tally — plus a corpus benchmark.
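The table compaction in the first bullet leans partly on default reductions. As a rough illustration only (the row layout, the action encoding, and the names compact_row and lookup_action are assumptions made for this sketch, not the package's actual format), a state's most common reduce action can be stored once and only the differing cells kept:

```php
<?php
// Hypothetical encoding for this sketch:
// positive n = shift to state n, negative n = reduce by rule n, 0 = error.

/** Compact one dense ACTION row into a default reduction plus exceptions. */
function compact_row(array $dense_row): array {
    // Count how often each reduce action occurs in the row.
    $counts = [];
    foreach ($dense_row as $action) {
        if ($action < 0) {
            $counts[$action] = ($counts[$action] ?? 0) + 1;
        }
    }

    // The most frequent reduce becomes the default (0 if the row has none).
    $default = 0;
    $best    = 0;
    foreach ($counts as $action => $count) {
        if ($count > $best) {
            $default = $action;
            $best    = $count;
        }
    }

    // Keep only the cells that are neither errors nor the default.
    $exceptions = [];
    foreach ($dense_row as $token => $action) {
        if ($action !== 0 && $action !== $default) {
            $exceptions[$token] = $action;
        }
    }
    return ['default' => $default, 'exceptions' => $exceptions];
}

/** Look up the action for a token in a compacted row. */
function lookup_action(array $row, int $token): int {
    return $row['exceptions'][$token] ?? $row['default'];
}

// Seven dense cells shrink to one default and two exceptions.
$row = compact_row([0 => 0, 1 => 5, 2 => -3, 3 => -3, 4 => -3, 5 => -7, 6 => 0]);
```

In this scheme an error cell falls through to the default reduction, so a syntax error surfaces a few reductions later than in the dense table, but still before the offending token is shifted.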

What it doesn't do

  • Doesn't replace the current parser — it's standalone and nothing depends on it. It reuses the current parser's class names, so the two can't be loaded in one process.
  • Single-version: tracks 8.4 exactly, so it rejects ~0.12% of the corpus (pre-8.4 / removed syntax, multi-statement input, non-default session SQL modes).
  • Builds a raw WP_Parser_Node tree, not a typed AST.
  • The build needs Docker; the runtime stays pure PHP (7.2+, no extensions).

Optional: unit-production inlining

MySQL's expression grammar nests a dozen single-child wrapper rules (expr → bool_pri → predicate → bit_expr → …), and those unit reductions account for over half of all reductions. An opt-in constructor flag (separate commit) collapses them while building the AST — the child replaces the would-be wrapper — roughly halving node allocations for +21% (no JIT) / +30% (JIT) throughput.

It's off by default because it changes the resulting tree: wrapper rule names are absent, so consumers (e.g. a future driver translation layer, which today navigates by names like expr) must match only meaningful, multi-child or token-bearing rule names. Keeping it a flag lets that trade-off be decided when an AST consumer is actually built.
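To make the trade-off concrete, here is a minimal sketch of the idea, not the package's code: Sketch_Node, reduce_rule, and build_chain are hypothetical names, tokens are plain strings, and the leaf rule is a stand-in for the real, longer chain.

```php
<?php
// Sketch only: a stand-in node class, not WP_Parser_Node.
final class Sketch_Node {
    public $rule_name;
    public $children;

    public function __construct(string $rule_name, array $children) {
        $this->rule_name = $rule_name;
        $this->children  = $children;
    }
}

/**
 * Build the value for one reduction. With inlining on, a rule whose only
 * child is already a node passes that child through instead of wrapping it.
 */
function reduce_rule(string $rule_name, array $children, bool $inline_units) {
    if ($inline_units && count($children) === 1 && $children[0] instanceof Sketch_Node) {
        return $children[0];
    }
    return new Sketch_Node($rule_name, $children);
}

/** Simulate the reductions for a literal nested in single-child wrappers. */
function build_chain(bool $inline_units) {
    // Token-bearing rule: its child is a token, so it is always kept.
    $node = reduce_rule('simple_expr', ['1'], $inline_units);
    foreach (['bit_expr', 'predicate', 'bool_pri', 'expr'] as $wrapper) {
        $node = reduce_rule($wrapper, [$node], $inline_units);
    }
    return $node;
}

$wrapped = build_chain(false); // root is 'expr', four wrappers deep
$inlined = build_chain(true);  // root is the token-bearing 'simple_expr'
```

With the flag on, a consumer looking for a node named expr at the root would find nothing, which is the behavior change described above.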

Numbers

Same machine, ~69.5k-query corpus, end-to-end (lex + parse). Winner in bold.

| Metric | LL (trunk) | LALR (this) | LALR + inlining |
|---|---|---|---|
| Throughput, no JIT | 10,910 QPS | 60,932 QPS | **72,734 QPS** |
| Throughput, warm JIT | 25,686 QPS | 117,383 QPS | **155,281 QPS** |
| Cold boot, no opcache | **~2.8 ms** | ~3.1 ms | ~3.1 ms |
| Warm boot, opcache+JIT | ~0.34 ms | **~0.30 ms** | **~0.30 ms** |
| Memory, no opcache | **~3.4 MB** | ~4.7 MB | ~4.7 MB |
| Memory, opcache worker | **~1.5 MB** | ~2.6 MB | ~2.6 MB |
| Generated parser/table file size | **65 KB** | 182 KB | 182 KB |
| Full size (lexer + parser + grammar) | **246 KB** | 287 KB | 287 KB |
| Parse rate | **99.99%** | 99.88% | 99.88% |

~5.6× faster steady-state parsing without the JIT and ~4.6× with it, rising to ~6.7× / ~6.0× with unit-production inlining. Boot is roughly a wash (the LL parser is slightly cheaper cold, this one slightly cheaper in a warm opcache worker). The trade-off for the speed is single-version scope and a raw parse tree rather than a typed AST.

github-actions (Bot, Contributor) commented Jun 10, 2026

🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

| Config | Base (QPS) | This PR (QPS) | Speedup |
|---|---|---|---|
| no JIT | 73,082 | 72,859 | 1.00× |
| tracing JIT | 155,119 | 154,228 | 0.99× |

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

@JanJakes JanJakes changed the title Add an experimental MySQL parser built from the official 8.4 grammar Experiment: LALR(1) parser from official MySQL grammar Jun 10, 2026
@JanJakes JanJakes force-pushed the lalr-parser branch 9 times, most recently from df8874b to 70d642d Compare June 11, 2026 15:43
@JanJakes JanJakes changed the title Experiment: LALR(1) parser from official MySQL grammar LALR(1) parser from official MySQL grammar Jun 12, 2026
@JanJakes JanJakes force-pushed the lalr-parser branch 4 times, most recently from b3c39da to 1f88932 Compare June 12, 2026 15:27
JanJakes added 2 commits June 12, 2026 21:05
Add a new monorepo package for a MySQL parser generated from the official
MySQL grammar. This commit sets up the package metadata and a README
describing the design and layout; source and tooling follow.
Bring the MySQL lexer and the token and node classes over from the
mysql-on-sqlite package unchanged, so the later adaptation to the official
grammar is reviewable as a focused diff, and register src/ as the package
Composer classmap (the WordPress-style file names rule out PSR-4).
@JanJakes JanJakes force-pushed the lalr-parser branch 2 times, most recently from a90ea51 to b8fb251 Compare June 12, 2026 19:19
JanJakes added 2 commits June 12, 2026 22:33
Compile the grammar from the official MySQL sources: fetch sql_yacc.yy and
lex.h at a pinned, checksum-verified mysql-server tag; run a pinned Bison
build (Docker, version-asserted) to produce the automaton; compact the
automaton into plain PHP ACTION/GOTO tables (about 7% of the dense cells);
and derive the keyword table and token constants from lex.h, failing the
build on any unresolved terminal. bin/build-grammar (composer run
build-grammar) runs the pipeline end to end.
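The two guards in this pipeline (checksum-verified sources, version-asserted Bison) could look roughly like the sketch below. The function names and the sample inputs are illustrative assumptions; the real bin/build-grammar may be structured quite differently.

```php
<?php
// Sketch only: names and sample inputs are assumptions, not the package's code.

/** Fail the build unless the fetched bytes match the pinned SHA-256 digest. */
function verify_checksum(string $bytes, string $pinned_sha256): void {
    $actual = hash('sha256', $bytes);
    if (!hash_equals($pinned_sha256, $actual)) {
        throw new RuntimeException("Checksum mismatch: expected $pinned_sha256, got $actual");
    }
}

/** Fail the build unless `bison --version` reports exactly the pinned version. */
function assert_bison_version(string $version_output, string $pinned_version): void {
    $first_line = explode("\n", $version_output, 2)[0];
    if (!preg_match('/^bison \(GNU Bison\) (\S+)/', $first_line, $m) || $m[1] !== $pinned_version) {
        throw new RuntimeException("Unexpected Bison version: $first_line");
    }
}

// Stand-in payload: 'abc' with its well-known SHA-256 digest.
verify_checksum('abc', 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad');

// Stand-in for the captured output of `bison --version`.
assert_bison_version("bison (GNU Bison) 3.8.2\nWritten by Robert Corbett and Richard Stallman.", '3.8.2');
```

Throwing on either mismatch keeps a drifted source file or a different Bison release from silently producing a different table.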
Commit the artifacts produced by bin/build-grammar, both plain PHP arrays:
the LALR(1) parse table and the token-level data (the keyword table, the
paren-gated function keywords, and the token constants). Regenerate with
composer run build-grammar.
Make the lexer emit the grammar's own token numbers, with the keyword table
generated from lex.h: keyword synonyms, paren-gated function keywords, and
dropped keywords all follow MySQL's own data. Diagnostic token names are
derived on demand instead of shipping a name map.

The lexer produces MySQL's grammar token stream directly, the way MySQL's own
lexer does, rather than scanning a different token model and reconciling it in
a separate pass: "@" is a standalone terminal followed by its name, "WITH
ROLLUP" is contracted via a one-token lookahead, NOT becomes NOT2 under
HIGH_NOT_PRECEDENCE, and the input ends with END_OF_INPUT and Bison's end
marker (omitted on invalid input). The pull iterator (next_token/get_token)
and remaining_tokens() both yield this single stream; the scanner's internal
sentinels stay private and never reach it.
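The "WITH ROLLUP" contraction can be pictured as a one-token lookahead over the stream. The sketch below is a simplification under stated assumptions: tokens are plain strings, the names WITH, ROLLUP, and WITH_ROLLUP are placeholders rather than the grammar's token numbers, and the real lexer does this while scanning rather than as a pass over an array.

```php
<?php
// Sketch only: string token names are placeholders for grammar token ids.

/** Merge each WITH that is directly followed by ROLLUP into one token. */
function contract_with_rollup(array $tokens): array {
    $out   = [];
    $count = count($tokens);
    for ($i = 0; $i < $count; $i++) {
        // Peek exactly one token ahead; null when WITH is the last token.
        $next = $tokens[$i + 1] ?? null;
        if ($tokens[$i] === 'WITH' && $next === 'ROLLUP') {
            $out[] = 'WITH_ROLLUP';
            $i++; // Consume the ROLLUP that was just merged.
            continue;
        }
        $out[] = $tokens[$i];
    }
    return $out;
}

$grouped = contract_with_rollup(['GROUP', 'BY', 'a', 'WITH', 'ROLLUP']);
$cte     = contract_with_rollup(['WITH', 'cte', 'AS']);
```

Resolving this in the token stream means the grammar sees a single terminal and never has to decide, after reading WITH, between a rollup and a common table expression.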
@JanJakes JanJakes force-pushed the lalr-parser branch 2 times, most recently from 24d4f21 to 296a9c5 Compare June 13, 2026 13:17
JanJakes added 3 commits June 13, 2026 16:09
A table-driven LALR(1) shift-reduce runtime over the generated ACTION/GOTO
tables, building a WP_Parser_Node AST. The grammar is unambiguous for
LALR(1), so the loop is deterministic, with no conflict handling or
backtracking. Streams missing the $end terminator (the lexer's invalid-input
output) are rejected.
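For readers unfamiliar with table-driven LR parsing, the loop described here has roughly the following shape. This is a toy sketch: the two-rule grammar, the hand-built tables, the action encoding, and the name sketch_parse are all assumptions for illustration and bear no relation to the generated MySQL tables or to WP_MySQL_Parser's API.

```php
<?php
// Toy grammar (an assumption for this sketch, unrelated to MySQL's):
//   rule 1: S -> '(' S ')'
//   rule 2: S -> 'x'
// Actions: ['s', n] = shift to state n, ['r', n] = reduce by rule n, ['a'] = accept.
const SKETCH_ACTION = [
    0 => ['(' => ['s', 2], 'x' => ['s', 3]],
    1 => ['$end' => ['a']],
    2 => ['(' => ['s', 2], 'x' => ['s', 3]],
    3 => [')' => ['r', 2], '$end' => ['r', 2]],
    4 => [')' => ['s', 5]],
    5 => [')' => ['r', 1], '$end' => ['r', 1]],
];
const SKETCH_GOTO  = [0 => ['S' => 1], 2 => ['S' => 4]];
const SKETCH_RULES = [1 => ['S', 3], 2 => ['S', 1]]; // rule => [lhs, rhs length]

/** Return a nested [rule name, children] tree, or null if the input is rejected. */
function sketch_parse(array $tokens): ?array {
    $states = [0]; // State stack.
    $values = [];  // Tokens and already-built nodes, parallel to the states.
    $pos    = 0;

    while (true) {
        $state = end($states);
        $token = $tokens[$pos] ?? null;
        if ($token === null) {
            return null; // Ran out of input without seeing '$end'.
        }
        $act = SKETCH_ACTION[$state][$token] ?? null;
        if ($act === null) {
            return null; // No table entry: syntax error.
        }
        if ($act[0] === 's') {
            $states[] = $act[1];
            $values[] = $token;
            $pos++;
        } elseif ($act[0] === 'r') {
            list($lhs, $length) = SKETCH_RULES[$act[1]];
            // Pop the right-hand side; guard against empty rules, where a
            // negative-zero offset would otherwise splice the whole stack.
            $children = $length > 0 ? array_splice($values, -$length) : [];
            if ($length > 0) {
                array_splice($states, -$length);
            }
            $states[] = SKETCH_GOTO[end($states)][$lhs];
            $values[] = [$lhs, $children];
        } else {
            return $values[0]; // Accept.
        }
    }
}

$tree = sketch_parse(['(', 'x', ')', '$end']);
```

Each step consults exactly one table cell, which is why no backtracking or conflict handling is needed, and a stream that ends without '$end' falls out as a plain rejection.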

Adapt the copied parse-tree primitives to the package: the runtime builds
each node in a single step, so the old recursive parser's merge_fragment() is
dropped, and the node and token docblocks no longer reference that parser.
An opt-in constructor flag that passes single-node unit productions through
instead of wrapping them. Such reductions are over half of the total, so
inlining adds 20-30% throughput; the wrapper rule names are then absent from
the tree, which is why the flag is off by default.
Bring the query corpus extracted from the MySQL server test suite, with the
tooling that generates it, into the package: data/mysql-server-query-corpus/
plus a bin/build-corpus orchestrator (composer run build-corpus) that
fetches the mysql-test directory at the pinned tag and extracts the queries.
The SQLite driver package keeps its own copy for now; it will be retired
when the driver is ported to this package.
JanJakes added 3 commits June 13, 2026 16:09
Measure the corpus parse rate and end-to-end (lex + parse) throughput, with
warmup and timed passes; --inline-units benchmarks the collapsed-AST mode.
The parser accepts 99.88% of the ~69.5k corpus queries.
Cover the token stream, the scanner (the exhaustive unit suite ported from
the SQLite driver), the parser runtime, token value and name resolution,
generated grammar-data invariants, and a corpus regression test pinning the
exact acceptance tally. Run the suite on the oldest and newest supported PHP
versions in CI.
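Pinning the exact tally, rather than a minimum rate, means the test fails when acceptance changes in either direction. A plain-PHP sketch of that idea follows; the function names, the tally shape, and the stand-in acceptor are assumptions, and the real suite uses PHPUnit with the actual parser.

```php
<?php
// Sketch only: a stand-in acceptor replaces the real parser.

/** Count how many corpus queries the acceptor takes and how many it rejects. */
function tally_corpus(array $queries, callable $accepts): array {
    $accepted = 0;
    foreach ($queries as $query) {
        if ($accepts($query)) {
            $accepted++;
        }
    }
    return ['accepted' => $accepted, 'rejected' => count($queries) - $accepted];
}

/** Fail on any drift from the pinned tally, whether up or down. */
function check_pinned_tally(array $actual, array $pinned): void {
    if ($actual['accepted'] !== $pinned['accepted'] || $actual['rejected'] !== $pinned['rejected']) {
        throw new RuntimeException(
            'Corpus tally changed: ' . json_encode($actual) . ' vs pinned ' . json_encode($pinned)
        );
    }
}

// Stand-in acceptor: rejects anything that looks like multi-statement input.
$accepts = function (string $query): bool {
    return strpos($query, ';') === false;
};

$tally = tally_corpus(['SELECT 1', 'SELECT 1; SELECT 2', 'SELECT 2'], $accepts);
check_pinned_tally($tally, ['accepted' => 2, 'rejected' => 1]);
```

A newly accepted query is flagged just like a newly rejected one, so an accidental widening of the accepted language is as visible as a regression.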
Cut the generated parse table from 231 KB to 182 KB, with no behavior
change: modal shift targets stored as bare token lists, GOTO exceptions
keyed by nonterminal instead of state, rule names indexed by the contiguous
lhs ids, and short array syntax in the emitted literals. The smaller file
also parses faster on a cold opcache.
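One way to read "GOTO exceptions keyed by nonterminal" is the classic per-nonterminal default: each nonterminal stores its most common target state once, plus the few states that differ. The sketch below illustrates that reading only; the layout and the names compact_goto and lookup_goto are assumptions, not the emitted format.

```php
<?php
// Sketch only: one possible per-nonterminal GOTO layout, not the emitted one.

/** Turn a dense [state][nonterminal] => target map into defaults plus exceptions. */
function compact_goto(array $dense_goto): array {
    // Regroup the cells by nonterminal.
    $by_nonterminal = [];
    foreach ($dense_goto as $state => $row) {
        foreach ($row as $nonterminal => $target) {
            $by_nonterminal[$nonterminal][$state] = $target;
        }
    }

    $compact = [];
    foreach ($by_nonterminal as $nonterminal => $targets) {
        // The most frequent target becomes the default for this nonterminal.
        $default = null;
        $best    = 0;
        foreach (array_count_values($targets) as $target => $count) {
            if ($count > $best) {
                $default = $target;
                $best    = $count;
            }
        }
        // Only the states that go elsewhere are stored explicitly.
        $exceptions = [];
        foreach ($targets as $state => $target) {
            if ($target !== $default) {
                $exceptions[$state] = $target;
            }
        }
        $compact[$nonterminal] = ['default' => $default, 'exceptions' => $exceptions];
    }
    return $compact;
}

/** Look up the state to enter after reducing to a nonterminal. */
function lookup_goto(array $compact, int $state, string $nonterminal): int {
    return $compact[$nonterminal]['exceptions'][$state] ?? $compact[$nonterminal]['default'];
}

$goto = compact_goto([
    0 => ['expr' => 4],
    2 => ['expr' => 4],
    5 => ['expr' => 9],
    7 => ['expr' => 4],
]);
```

The default is safe because GOTO is only consulted right after a reduction, for a state and nonterminal pair that the automaton guarantees to be defined.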