LALR(1) parser from official MySQL grammar #429
Draft
JanJakes wants to merge 11 commits into
🤖 Lexer benchmark: changes to lexer-related files were detected and triggered a benchmark.
Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally. To reproduce locally:
Add a new monorepo package for a MySQL parser generated from the official MySQL grammar. This commit sets up the package metadata and a README describing the design and layout; source and tooling follow.
Bring the MySQL lexer and the token and node classes over from the mysql-on-sqlite package unchanged, so the later adaptation to the official grammar is reviewable as a focused diff, and register src/ as the package Composer classmap (the WordPress-style file names rule out PSR-4).
Compile the grammar from the official MySQL sources: fetch sql_yacc.yy and lex.h at a pinned, checksum-verified mysql-server tag; run a pinned Bison build (Docker, version-asserted) to produce the automaton; compact the automaton into plain PHP ACTION/GOTO tables (about 7% of the dense cells); and derive the keyword table and token constants from lex.h, failing the build on any unresolved terminal. bin/build-grammar (composer run build-grammar) runs the pipeline end to end.
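The PR description attributes part of the table compaction to default reductions. A minimal sketch of that one idea follows, in Python rather than the package's PHP. The row layout and the names `compact_row` and `lookup` are assumptions for illustration, not the generator's actual format.

```python
from collections import Counter

def compact_row(row):
    """Compact one ACTION row into (default_rule, exceptions).

    `row` maps a token to ("s", state) or ("r", rule). The most frequent
    reduce becomes the row default; every other entry is kept explicitly.
    """
    reduces = Counter(action[1] for action in row.values() if action[0] == "r")
    if not reduces:
        return None, dict(row)
    default_rule = reduces.most_common(1)[0][0]
    exceptions = {tok: act for tok, act in row.items() if act != ("r", default_rule)}
    return default_rule, exceptions

def lookup(compact, token):
    """Resolve an action from a compacted row; None means a syntax error."""
    default_rule, exceptions = compact
    if token in exceptions:
        return exceptions[token]
    if default_rule is not None:
        return ("r", default_rule)
    return None

row = {"+": ("r", 2), ")": ("r", 2), "$end": ("r", 2), "*": ("s", 5)}
compact = compact_row(row)  # one default plus a single explicit entry
```

One consequence of this encoding, as in Bison's own tables, is that an unexpected token in such a row triggers the default reduce first, so the error surfaces a few reductions later, still before any further token is shifted.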
Commit the artifacts produced by bin/build-grammar, both plain PHP arrays: the LALR(1) parse table and the token-level data (the keyword table, the paren-gated function keywords, and the token constants). Regenerate with composer run build-grammar.
Make the lexer emit the grammar's own token numbers, with the keyword table generated from lex.h: keyword synonyms, paren-gated function keywords, and dropped keywords all follow MySQL's own data. Diagnostic token names are derived on demand instead of shipping a name map. The lexer produces MySQL's grammar token stream directly, the way MySQL's own lexer does, rather than scanning a different token model and reconciling it in a separate pass: "@" is a standalone terminal followed by its name, "WITH ROLLUP" is contracted via a one-token lookahead, NOT becomes NOT2 under HIGH_NOT_PRECEDENCE, and the input ends with END_OF_INPUT and Bison's end marker (omitted on invalid input). The pull iterator (next_token/get_token) and remaining_tokens() both yield this single stream; the scanner's internal sentinels stay private and never reach it.
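The "WITH ROLLUP" contraction can be sketched as a one-token lookahead over the token stream. This is an illustration in Python, not the package's PHP lexer. The tuple token model and the function name are hypothetical, and the token names (`WITH`, `ROLLUP_SYM`, `WITH_ROLLUP_SYM`) are assumed to match MySQL's grammar.

```python
from typing import Iterable, Iterator, Tuple

Token = Tuple[str, str]  # (token name, source text); hypothetical token model

def contract_with_rollup(tokens: Iterable[Token]) -> Iterator[Token]:
    """Merge WITH followed by ROLLUP_SYM into a single WITH_ROLLUP_SYM token."""
    it = iter(tokens)
    pending = next(it, None)
    while pending is not None:
        lookahead = next(it, None)  # peek exactly one token ahead
        if (
            pending[0] == "WITH"
            and lookahead is not None
            and lookahead[0] == "ROLLUP_SYM"
        ):
            yield ("WITH_ROLLUP_SYM", pending[1] + " " + lookahead[1])
            pending = next(it, None)  # both tokens were consumed
        else:
            yield pending
            pending = lookahead  # the peeked token becomes the current one
```

Any other use of WITH (for example, a common table expression) passes through unchanged because the lookahead token is handed back as the next current token.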
A table-driven LALR(1) shift-reduce runtime over the generated ACTION/GOTO tables, building a WP_Parser_Node AST. The grammar is unambiguous for LALR(1), so the loop is deterministic, with no conflict handling or backtracking. Streams missing the $end terminator (the lexer's invalid-input output) are rejected. Adapt the copied parse-tree primitives to the package: the runtime builds each node in a single step, so the old recursive parser's merge_fragment() is dropped, and the node and token docblocks no longer reference that parser.
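For readers unfamiliar with table-driven parsing, the shape of such a loop can be sketched on a toy grammar. This is an illustration in Python with hand-built tables for `E -> E '+' 'n' | 'n'`; the table layout, node shape, and names are assumptions and do not reflect the generated MySQL tables or `WP_Parser_Node`.

```python
# Hand-built LALR(1) tables for a toy grammar (not MySQL's):
#   rule 1: E -> E '+' 'n'
#   rule 2: E -> 'n'
ACTION = {
    0: {"n": ("shift", 2)},
    1: {"+": ("shift", 3), "$end": ("accept",)},
    2: {"+": ("reduce", 2), "$end": ("reduce", 2)},
    3: {"n": ("shift", 4)},
    4: {"+": ("reduce", 1), "$end": ("reduce", 1)},
}
GOTO = {0: {"E": 1}}
RULES = {1: ("E", 3), 2: ("E", 1)}  # rule id -> (lhs name, rhs length)

def parse(tokens):
    """Deterministic shift-reduce loop; returns a (rule name, children) tree."""
    states, nodes, pos = [0], [], 0
    while True:
        if pos >= len(tokens):
            raise SyntaxError("token stream is missing the $end terminator")
        token = tokens[pos]
        action = ACTION[states[-1]].get(token)
        if action is None:
            raise SyntaxError(f"unexpected token {token!r} at position {pos}")
        if action[0] == "shift":
            states.append(action[1])
            nodes.append(token)
            pos += 1
        elif action[0] == "reduce":
            lhs, length = RULES[action[1]]
            start = len(nodes) - length
            children = nodes[start:]
            del nodes[start:]
            del states[len(states) - length:]
            states.append(GOTO[states[-1]][lhs])
            nodes.append((lhs, children))  # each node is built in one step
        else:  # accept
            return nodes[0]
```

Because every (state, token) cell holds at most one action, the loop never needs to choose between alternatives, which is what makes backtracking and conflict handling unnecessary.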
An opt-in constructor flag that passes single-node unit productions through instead of wrapping them. Such reductions are over half of the total, so inlining adds 20-30% throughput; the wrapper rule names are then absent from the tree, which is why the flag is off by default.
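The effect of the flag on the tree can be sketched as follows. This is a Python illustration with a hypothetical `make_node` helper and tuple nodes, not the package's constructor or node class. The rule chain follows the one named in the PR description, with `simple_expr` added as an assumed leaf rule.

```python
def make_node(rule_name, children, inline_units=False):
    """Build a (rule_name, children) node.

    With inline_units, a production whose only child is itself a node is
    passed through instead of being wrapped. A production whose only child
    is a token (a plain string here) still gets its own node.
    """
    if inline_units and len(children) == 1 and isinstance(children[0], tuple):
        return children[0]
    return (rule_name, children)

def build_chain(inline_units):
    """Reduce the literal 1 through a chain of single-child wrapper rules."""
    node = make_node("simple_expr", ["1"], inline_units)
    for wrapper in ["bit_expr", "predicate", "bool_pri", "expr"]:
        node = make_node(wrapper, [node], inline_units)
    return node
```

With the flag off, `build_chain(False)` yields five nested nodes rooted at `expr`; with it on, only the token-bearing `simple_expr` node remains, which is why a consumer that looks for `expr` by name would no longer find it.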
Bring the query corpus extracted from the MySQL server test suite, with the tooling that generates it, into the package: data/mysql-server-query-corpus/ plus a bin/build-corpus orchestrator (composer run build-corpus) that fetches the mysql-test directory at the pinned tag and extracts the queries. The SQLite driver package keeps its own copy for now; it will be retired when the driver is ported to this package.
Measure the corpus parse rate and end-to-end (lex + parse) throughput, with warmup and timed passes; --inline-units benchmarks the collapsed-AST mode. The parser accepts 99.88% of the ~69.5k corpus queries.
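A measurement of this kind, with warmup and timed passes, might look roughly like the sketch below. It is written in Python with a hypothetical `parse` callable that raises `SyntaxError` on rejection; the function name, parameters, and the choice to report the fastest pass are assumptions, not the benchmark script's actual behavior.

```python
import time

def benchmark(parse, queries, warmup=1, passes=3):
    """Return the acceptance rate and best-pass throughput for `parse`."""
    if not queries:
        raise ValueError("the corpus is empty")

    def run_once():
        accepted = 0
        for query in queries:
            try:
                parse(query)
                accepted += 1
            except SyntaxError:
                pass  # a rejected query counts against the parse rate
        return accepted

    for _ in range(warmup):  # untimed passes to warm caches
        run_once()

    best = None
    accepted = 0
    for _ in range(passes):
        start = time.perf_counter()
        accepted = run_once()
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)

    throughput = len(queries) / best if best > 0 else float("inf")
    return {"accept_rate": accepted / len(queries), "queries_per_second": throughput}
```

Taking the fastest of several timed passes is one common way to reduce noise from a busy machine; averaging the passes is an equally reasonable choice.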
Cover the token stream, the scanner (the exhaustive unit suite ported from the SQLite driver), the parser runtime, token value and name resolution, generated grammar-data invariants, and a corpus regression test pinning the exact acceptance tally. Run the suite on the oldest and newest supported PHP versions in CI.
Cut the generated parse table from 231 KB to 182 KB, with no behavior change: modal shift targets stored as bare token lists, GOTO exceptions keyed by nonterminal instead of state, rule names indexed by the contiguous lhs ids, and short array syntax in the emitted literals. The smaller file also parses faster on a cold opcache.
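One plausible reading of "GOTO exceptions keyed by nonterminal" is the classic layout where each nonterminal stores its most common target state plus a small map of per-state exceptions. The Python sketch below illustrates that reading only; the function names and data shapes are assumptions, not the emitted PHP format.

```python
from collections import Counter

def compact_goto(dense):
    """Turn {(state, nonterminal): target} into {nonterminal: (default, exceptions)}."""
    by_nonterminal = {}
    for (state, nonterminal), target in dense.items():
        by_nonterminal.setdefault(nonterminal, {})[state] = target
    compact = {}
    for nonterminal, targets in by_nonterminal.items():
        default = Counter(targets.values()).most_common(1)[0][0]
        exceptions = {s: t for s, t in targets.items() if t != default}
        compact[nonterminal] = (default, exceptions)
    return compact

def goto(compact, state, nonterminal):
    """Look up the target state after reducing to `nonterminal` in `state`."""
    default, exceptions = compact[nonterminal]
    return exceptions.get(state, default)

dense = {(0, "E"): 1, (4, "E"): 8, (6, "E"): 1, (7, "E"): 1, (0, "T"): 2, (4, "T"): 2}
compact = compact_goto(dense)
```

The lookup only ever runs for (state, nonterminal) pairs the automaton can actually reach after a reduce, so returning the default for an unlisted state is safe in a correct table.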
Note
The changed line numbers are misleading, as the 115,000 added lines are just a testing query corpus copied to the new mysql-parser package from mysql-on-sqlite.
A new experimental package, packages/mysql-parser, that builds a MySQL parser directly from MySQL's own grammar. It compiles MySQL 8.4 LTS's sql_yacc.yy and lex.h (unchanged) with the Bison version MySQL uses into a compact parse table, run by a small deterministic LALR(1) parser. The accepted language tracks a real MySQL release exactly, with no hand-maintained grammar to drift.
What it does
composer run build-grammar: fetch pinned, checksum-verified sources → Bison 3.8.2 in Docker (version-asserted) → generate the parse table and the grammar's token data. Re-running reproduces the committed artifacts byte-for-byte. Both artifacts are plain PHP arrays, with no binary blobs; the parse table compacts the 4.6M-cell ACTION/GOTO matrix to ~7% via default reductions, shared rows, and patch-encoded near-duplicate rows (182 KB).
lex.h: no translation layer. Keyword synonyms, paren-gated function keywords, and dropped keywords all come from MySQL's own data, and the lexer mirrors MySQL's scanner-level quirks (WITH ROLLUP contraction, @ handling including empty host names, HIGH_NOT_PRECEDENCE).
WP_MySQL_Parser: the 8.4 grammar is unambiguous for LALR(1), so it's a plain shift-reduce loop, with no GLR, backtracking, or conflict tables.
What it doesn't do
WP_Parser_Node tree, not a typed AST.
Optional: unit-production inlining
MySQL's expression grammar nests a dozen single-child wrapper rules (expr → bool_pri → predicate → bit_expr → …), and those unit reductions account for over half of all reductions. An opt-in constructor flag (separate commit) collapses them while building the AST (the child replaces the would-be wrapper), roughly halving node allocations for +21% (no JIT) / +30% (JIT) throughput.
It's off by default because it changes the resulting tree: wrapper rule names are absent, so consumers (e.g. a future driver translation layer, which today navigates by names like expr) must match only meaningful, multi-child or token-bearing rule names. Keeping it a flag lets that trade-off be decided when an AST consumer is actually built.
Numbers
Same machine, ~69.5k-query corpus, end-to-end (lex + parse). Winner in bold.
~5.5× faster steady-state parsing without the JIT, ~4.6× with it — rising to ~6.6× / ~6.0× with unit-production inlining. Boot is roughly a wash (the LL parser is slightly cheaper cold, this one slightly cheaper in a warm opcache worker). The trade-off for the speed is single-version scope and a raw AST.