Skip to content

Emitted parser to 0.84x tsc, span-only CST, FN=0 vs tsc's parse surface, generated per-arm destructurers#35

Open
johnsoncodehk wants to merge 42 commits into
masterfrom
parser-exec-trace
Open

Emitted parser to 0.84x tsc, span-only CST, FN=0 vs tsc's parse surface, generated per-arm destructurers#35
johnsoncodehk wants to merge 42 commits into
masterfrom
parser-exec-trace

Conversation

@johnsoncodehk

@johnsoncodehk johnsoncodehk commented Jun 9, 2026

Copy link
Copy Markdown
Owner

Four workstreams on the #8 perf line, each gated end-to-end. Net state: the emitted parser parses the bench corpus at 0.84× tsc's parse-only time (0.64× vs setParentNodes: true), the CST contract is span-only minimal, acceptance vs tsc's real parse surface has zero false negatives corpus-wide, and consumers destructure CST nodes through generated, typed per-arm matchers instead of child-shape probing.

Performance: 4.41× → 0.84× vs tsc parse-only

file KB mono tsc(parse) tsc(parents) ratio ratio(parents)
parserharness.ts 82 3.93ms 3.78 4.95 1.04× 0.80×
fixSignatureCaching.ts 67 0.72ms 1.51 1.88 0.48× 0.38×
parserRealSource7.ts 38 1.11ms 1.51 2.11 0.73× 0.52×
parserindenter.ts 35 0.89ms 1.07 1.48 0.82× 0.60×
aggregate 6.65ms 7.87 10.41 0.84× 0.64×

Profiled with V8 inspector profiles (layer split, line ticks, a --no-turbo-inlining pass for honest attribution), one gated commit per lever. The line profile is now flat — top line 3.8% — i.e. the JS-op level is harvested; the remaining gap layers are GC (14%, the kept CST itself) and the lexer floor, both representation/substrate questions, not op-shaving.

Lever groups (full list in the commit log):

  • Lexer: per-charCode dispatch pruning; identifier char-loop; lexKwT baked keyword recognizer over source spans (no slice, no hash); tokens born final (k/t interned at creation); a dedicated emitted lexer (src/emit-lexer.ts).
  • SoA token stream: tokens are five parallel columns (tkK/tkT bytes, tkOff/tkEnd Int32, tkFl bits) written by the emitted tokenize; token text is never materialized — keyword matching reads source charCodes, peek() is gone, '>'-resplit is a copyWithin.
  • Dispatch: FIRST membership as byte tables; per-rule alt bitmasks with SECOND-token refinement; LED first-token gates open-coded; Pratt operator tables keyed by interned literal ints; multi-word masks — the 32-alt ceiling silently dropped a rule to serial guards when a grammar widening pushed it past 32 (both Type and Expr crossed it, costing ~25% whole-parse with a perfectly flat profile); masks now span words, so dispatch degrades smoothly with grammar growth.
  • Memo: per-rule parallel position arrays; advance-site maxPos.

Peers: the reference parser of each ecosystem

test/profile-vs-peers.mjs, same methodology. JavaScript vs acorn (the ESTree reference) on real-world files — acorn parsing itself, vue compiler-core, parse5's parser, the 8.9MB typescript.js bundle — lands at parity (0.85–1.1× per file across runs; the 9MB row swings ±10% run-to-run):

file KB mono acorn ratio
acorn.js 237 11.01ms 11.74 0.92×
compiler-core.cjs.js 200 9.45ms 10.84 0.87×
parse5 parser/index.js 105 4.78ms 4.37 1.09×
typescript.js 8,899 611ms 567 1.08×

HTML vs parse5 (the WHATWG reference) on synthesized well-formed fragments: ~2.2× — the named gap is that markup-mode lexing still runs the interpreted lexer (emit-lexer specialization covers the token-stream grammars, not markup mode yet). Output shapes differ by design (full CST with every token as a leaf vs ESTree AST vs DOM-shaped tree), so both tables are parse-to-tree wall time on identical inputs.

Benching acorn.js exposed a real grammar bug: break/continue labels missed the spec's restricted production (break [no LineTerminator here] Label) and the reserved-word guard, so breakcase "X": inside a switch ate case as a label and cascaded into a reject. Both ECMAScript grammars now guard the label with (sameLine, notReserved).

Measured-flat-or-negative and reverted, for the record: persistent memo arrays, string quote-loop, switch-direct dispatch, probe-then-build (rejected by design analysis — memo shares loser-arm subtrees with the winner, so the discard waste is only 3-5%).

CST contract: span-only, two shapes, nothing derivable

leaf: { tokenType, offset, end }            // text  ≡ source.slice(offset, end)
node: { rule, children, offset, end }       // leaf? ≡ 'tokenType' in n / 'children' in n

text and kind are gone (both derivable; both measured as pure overhead — dropping them was +7% and +5.5%). getText(node, source) is exported by both engines. test/cst-text-invariant.ts pins the exact shape over all seven grammars (generative corpora + a TS corpus stride).

Acceptance: FN = 0 against tsc's parse surface

Measured bidirectionally over the full conformance corpus (5,659 files, monogram verdict × tsc parseDiagnostics): both-accept 5,112 · both-reject 258 · false negatives 0 · over-accepts 289. Every construct tsc parses cleanly now parses — each residual FN was probed against tsc and widened to the proven surface: JSDoc types in normal TS positions (T?/T!/?T/!T/*/function(...)/Array.<number>, with tsc's exact isStartOfType disambiguation on postfix ?), decorator expressions (@x?.y, tagged templates, @new x), decorator placement (@dec var x, export @dec default class), object-literal and parameter modifier soups, ?/! after object member names, rest-binding ...r: n, nameless class declarations, string import/export specifiers, typeof import("m").Thing, and async arrows with a bare parameter (async err => … previously mis-split into two statements — a structural bug that accept-level metrics could not see).

Three more structure bugs surfaced by the new structural oracle and fixed: mixfix-LED precedence (grammar.ledPrecs, an orthogonal precedence field so seven highlighter consumers never see it), the ES2023 §14.5 expression-statement lookahead (function/async function/class), and bare for (x in y) (the exclude('in', …) no-in context).

The dominant remaining over-accept class is newline-blind ASI (same-line statement splits tsc rejects) — a future round of its own.

Generated consumer toolkit

npm run gen emits, per grammar, <g>.cst-types.ts (typed node unions) and <g>.cst-match.ts (per-arm destructurers): matchStmt(n, src) returns a tagged { arm, …named fields } union, so consumers write switch (m.arm) instead of probing children. Matcher semantics mirror the engine exactly (literal token kinds, sep trailing delimiters, template duals, op forms) and test/cst-match-totality.ts proves totality over real CSTs. Both are generated, not committed — CI regenerates before typecheck.

The dogfood consumer is test/ts-ast-lowering.ts, a tsc-shaped AST lowering verified node-by-node (kind + trivia-excluded spans) against the real tsc tree by test/ts-ast-verify.ts — the gate that caught the precedence and statement-merge bugs above.

Gates

29 in test/check.ts (was 26): + cst-text-invariant, ts-ast-structure, cst-match-totality; plus the engine-parity pins outside the runner: emit-parser-verify (18,805-file emit ≡ interp byte-identical), emit-reject-messages (436 both-reject files, exact farthest-pos message — an arm pruned by one engine but not the other skews error state), emit-lexer-verify (token-stream equality). The tree-sitter LR-conflict closure is complete for the widened grammars (test/collect-conflicts.ts fixpoint), and gen-tm keeps paren-gated type keywords (function( in type position) in the statement-start set so multi-line type regions still close at a real declaration.

A standalone, debug-only tracer: startTrace()/endTrace() markers (near-free
no-ops in production, src/trace-markers.ts) bound a region; test/exec-trace.ts
AST-instruments a copy of the target (TS compiler API — no new deps) so every
executed statement/initializer/return/condition records its line, source text,
runtime value and a cost tag {alloc|map.get|call}, gated on the markers. The
flattened executed source is printed with calls inlined and repeated call-frames
grouped by DISTINCT execution path (each path once + count + per-line value
distribution), so a hot function's N calls read as a few paths, not N×.

Scopes: the time-region markers, or --lines=A-B for a spatial slice. Reads the
actual run, so it reflects the real engine, not an approximation.
A negative lookahead over a literal/alternation of keyword literals (e.g. an
identifier that isn't a reserved word, `not('catch'|'delete'|…)`) was matched by
trying each literal in turn — O(N) matchLiteral calls plus an `out` allocation per
arm, at every identifier position (the hottest in the grammar). The keyword set is
static, so collapse it to one membership test: the not fails iff the token is an
ident-kind whose text is one of the keywords. Mirrored in createParser
(gen-parser.ts notKwSet) and emitParser (emit-parser.ts notKwKinds), emitting the
same check matchKwLit uses → byte-identical.

Emitted parser ~5% faster (parserharness 19.7→18.7 ms), interpreter ~9–12%; the
reserved-word scan for `a + b` drops from 12 matchLiteral calls to 0. Byte-identical
CST + accept/reject (createParser ≡ emitParser, run-conformance 5386/5659, 26/26 gates).
…urn directly

Two emit-only rewrites (the interpreter is the oracle; emit≡interp is gated
byte-identical on the full 18,805-file corpus):
- descMatchesTok's switch body is inlined into ruleMightStartDescs and
  canStartFT, removing the dead helper and the call boundary on the
  per-token dispatch loops.
- Matchers reducible to a single literal/token-ref/rule-ref emit
  'const v = match(); return [v]' directly instead of out=[]+push (96 of
  the TS grammar's matchers specialize).

Measured ~1-2% on the PR#4 bench files (parserharness +2.3%).
…loop

The longest-match alt guards and rule-ref guards tested FIRST-set
membership by looping a descriptor array per call (ruleMightStartDescs)
— the CPU profile showed it at ~13% self time. Each set now emits a
per-set fn over two Uint8Array tables indexed by the token's baked ints:
  _qN(tok) = !tok || (KT[tok.k] | TT[tok.t]) !== 0
two loads + an or, no loop.

Faithful to the loop because the keyword and punct int ranges are
disjoint (the loop's k!==K_PUNCT / k===K_PUNCT guards were redundant)
and the punct startsWith arm is enumerated over the closed punct
vocabulary at emit time. Tables and fns are deduped across sets
(typescript: 103 fns / 103 shared arrays).

Gates: full 18,805-file corpus byte-identical, conformance, 26/26 check.
Bench (PR#4 files, interleaved best-of-N): aggregate +10.6~12.0%,
every file positive (+5~14%).
The CPU profile (vs tsc's scanner) showed the lexer's cost is not the
token regexes but the data-driven dispatch around them. Three prunings,
all grammar-agnostic:

- punct literals: first-char group index. Only literals sharing the
  position's first char can startsWith-match, so the longest-first scan
  runs over that group instead of all literals (TS: 57 -> ~3).
- token matchers: the template/markup/indent exclusions are
  position-independent — filter the matcher list once at createLexer
  time instead of re-testing three Set.has per matcher per position.
- whitespace: consume the ASCII \s run ({9..13, 32}) with a char loop;
  only a non-ASCII candidate falls back to the \s regex (Unicode
  spaces), preserving exact \s semantics and the includes('\n') ->
  newlineBefore stamping.

Gates: token-stream equality vs the old lexer over the 5,695-file
conformance corpus (5,624 identical streams + 71 identical-message
throws, 0 diffs), conformance 5386/5659 unchanged, 26/26 check (all
grammars). Bench (PR#4 files): tokenize +87~95%, whole parse +35~46%.
canStartFT — the per-LED single-descriptor switch in the Pratt dispatch
loop — was the next profile hotspot (~7% self). A first-token gate is a
1-element FIRST set, so it reuses the membershipFn byte tables, open-
coded at the call site (tok is already known non-null there):
  (KT[tok.k] | TT[tok.t]) !== 0
no call, no switch. The canStartFT runtime helper and the descriptor
literals (keyDescLiteral/firstTokDescLiteral) are gone.

Gates: full 18,805-file corpus byte-identical, 26/26 check.
Bench: aggregate +7.8~10.3%, every file positive.
…terned fields

The no-inlining profile attributed ~14% of parse time to the token
interning loop: internTok ADDED k/t to already-shaped lexer tokens (two
hidden-class transitions per token, on top of the conditional
newlineBefore stamp making shapes polymorphic at every tok.k/tok.t site).

The tokenize wrapper now rebuilds each token as a single literal with
every field the parser reads (type/text/offset/k/t + the three stamp
flags normalized to booleans) — one shape from birth. The matchPuLit
'>'-split tokens go through the same mkPunct shape. internTok is gone.

Gates: full 18,805-file corpus byte-identical, 26/26 check.
Bench: aggregate +3.1~5.4% across three runs.
The LED loop did a string-keyed opTable.get(tok.text) for every token it
reached (and the prefix-op nud a prefixOps.get); tok.t is already
interned, so both become an array load (OP_BY_T / PREFIX_BY_T, null-
packed, length = the literal-int space). Equivalent because a token's
text can equal an operator value only for punct tokens and keyword-
shaped idents — exactly the classes tok.t indexes; operator values are
in the literal vocabulary by construction (asserted at emit).

Gates: full 18,805-file corpus byte-identical, 26/26 check.
Bench: aggregate +1.8~2.8% over three runs.
Every multi-alt site (non-rec alts, left-rec atoms, Pratt nuds) guarded
each alternative with its own membership-fn call — R_Stmt burned ~20 _q
calls per statement position. Each alt list (3..32 alts) now bakes one
bit per alt into two Int32Array mask tables over the token int spaces:
  mask = startTok ? KM[startTok.k] | TM[startTok.t] : ALL
and each alt's guard is one bit test. Bit i is set exactly where the old
altGuard was true (always-tried alts in every k slot; EOF admits all),
so the same alt subset runs in the same order — byte-identical by
construction. _q(startTok) guard calls: 162 -> 15 (the sub-3-alt lists).

Gates: full 18,805-file corpus byte-identical, 26/26 check.
Bench: aggregate +10.4~15.2%, every file positive in both runs.
…ation

The emitted parser rebuilt every lexer token (second allocation +
TYPE_KIND/LIT_KW/LIT_PU dictionary lookups per token) to attach the int
kinds — ~21% of parse time after the dispatch rounds, plus the GC share.

createLexer now takes an optional intern config (typeKind/kwLit/puLit
maps + punct/fallback kinds) and every creation site builds the FINAL
token through two single-allocation builders:
- mkNamed(type, text, offset, k): k is baked per site — token matchers
  and prefixed-ident entries carry their kind, fixed-name sites
  (template/markup/indent/newline) use consts computed once; t is the
  text's keyword int (one Map.get — the same lookup tsc's
  getIdentifierToken pays).
- mkPu(text, offset, t): punct t is precomputed per first-char-group
  entry — zero runtime lookup.
Stamp flags are real fields from birth (push writes existing fields, no
hidden-class transitions). The three post-push mutations re-intern
exactly (flow plain-fold merge, markup void retag, unicode ident
extension). Without an intern config (the interpreter) k/t are 0 via an
empty map; the interpreter's token shape is unchanged in behavior.

The emitted tokenize wrapper is gone — the lexer's array is the
parser's array; matchPuLit '>'-splits build the same shape.

Gates: token-stream equality 5,695 files (0 diffs), full 18,805-file
corpus byte-identical, 26/26 check.
Bench (strict A/B vs the full previous state): aggregate +12.8~13.0%,
every file positive (fixSignatureCaching +19~23%).
@johnsoncodehk johnsoncodehk changed the title Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 2.57x Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.97x Jun 9, 2026
emitLexer(grammar) specializes tokenize() for token-stream grammars: one
switch(charCode) whose cases hold exactly that char's candidate token
regexes (declaration order) and a longest-first punct compare-chain with
the literal's int baked at each leaf. Token regexes stay V8 regexes (the
dispatch was the measured cost, not the matching). Regex-vs-division
context, paren-head tracking and the template state machine are baked:
prevIsValue is four int-table loads, the '('/')'/'!' bookkeeping exists
only on those punct branches, scanTemplateSpan/identTextValid are
emitted with their config inlined. Grammars using markup/indent/newline
fall back to the createLexer import (emission unchanged for yaml/html).

New gate test/emit-lexer-verify.ts: emitted stream ≡ createLexer over
the 5,695-file conformance corpus — every field including k/t and the
stamp flags, plus identical error messages (5,624 same + 71 same-throw,
0 diffs).

Gates: full 18,805-file corpus byte-identical, conformance 5386/5659
unchanged, 26/26 check.
Bench: whole parse +57~77%; vs ts.createSourceFile the aggregate is now
1.24x (parse-only) / 0.94x (setParentNodes:true) — below parity on the
latter, with fixSignatureCaching at 0.69x even against parse-only.
@johnsoncodehk johnsoncodehk changed the title Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.97x Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.24x Jun 9, 2026
@johnsoncodehk

Copy link
Copy Markdown
Owner Author

Added the emitted lexer (f4ceb13): emitLexer(grammar) bakes the whole per-position dispatch — per-charCode switch over candidate token regexes + longest-first punct compare-chains with interned ints at the leaves, baked regex-context tables, emitted template/ident-escape machinery. Markup/indent/newline grammars fall back to the createLexer import unchanged. New gate test/emit-lexer-verify.ts (token streams + error messages ≡ createLexer, 5,695 files, 0 diffs).

Updated headline (PR #4 bench files, aggregate):

mono ms vs tsc parse-only vs tsc setParentNodes:true
session start 34.8 4.41× 3.33×
now 9.8 1.24× 0.94×

Per file vs parse-only: fixSignatureCaching 0.69×, parserindenter 1.23×, parserRealSource7 1.28×, parserharness 1.45×. Remaining hotspots: R_Expr_pratt ~15%, GC ~18% (CST/token allocation), lexMk ~5%, matchPuLit ~4.6%.

- The Pratt LED chain re-tested maxBp > minBp per LED and gated each LED
  with its own two-table load; the shared test is hoisted and the
  per-LED first-token gates collapse into one mask pair (ftMaskDispatch,
  same Int32Array machinery as the alt dispatch).
- matchKwLit/matchPuLit: the t int ranges are disjoint, so the
  k >= K_NAMED_MIN / k === K_PUNCT guards were redundant — one int
  compare each. The '>'-split moves to matchPuLitGT, emitted only at '>'
  call sites.
- Fixes a latent bug the gates could not reach: the generic matchLiteral
  fallback still indexed LIT_KW/LIT_PU as objects after they became Maps
  (dead on the TS grammar's fully-specialized sites, wrong if ever
  reached).

Gates: emit-lexer-verify 5,695 files 0 diff, full 18,805-file corpus
byte-identical, 26/26 check.
Bench: aggregate +7.5~14.5%, every file positive in both runs.
@johnsoncodehk

Copy link
Copy Markdown
Owner Author

Round 3 (commit ea3f74f): LED chain bitmasked + hoisted maxBp test; matchKwLit/matchPuLit to one int compare (disjoint t ranges make the k guards redundant; '>'-split isolated to matchPuLitGT); fixed a latent Map-indexing bug in the generic matchLiteral fallback. +7.5~14.5%.

Aggregate now 9.3ms — 1.20× vs tsc parse-only / 0.87× vs setParentNodes:true (per file vs parse-only: 0.66× / 1.11× / 1.21× / 1.45×).

- parseRuleEntry looked the memo up through a string-keyed outer Map per
  rule entry; each memoized rule now owns an array slot allocated at
  emit time (memo = new Array(MEMO_RULES), hit = memo[idx]).
- lexMk did a LIT_KW.get for every named token; a matcher (or template /
  prefixed-ident site) whose first-char set is disjoint from every
  keyword's first char provably interns t = 0 — those sites emit a
  lookup-free builder (lexMk0). The identifier matcher keeps the lookup
  (the same one tsc's getIdentifierToken pays).

Gates: emit-lexer-verify 5,695 files 0 diff (k/t compared), full
18,805-file corpus byte-identical, 26/26 check.
Bench: aggregate +1.4~5.8% across four runs (machine noisy; sign stable).
@johnsoncodehk

Copy link
Copy Markdown
Owner Author

Round 4 (34bb02d): memo keyed by emit-time rule index (array slot instead of string-keyed Map per rule entry); lexMk bakes t:0 for matchers whose first-char set provably can't start a keyword (lookup only remains on identifiers, mirroring tsc's getIdentifierToken). +1.4~5.8% across four runs. Cumulative: 34.8ms → ~9ms, 4.41× → ~1.2× vs tsc parse-only, ~0.85× vs setParentNodes:true.

…ness

- test/profile-vs-tsc.mjs: dual V8 inspector profiles (emitted parser /
  ts.createSourceFile) with the timing ratio table, lexer/parser/GC
  layer split and top self-time tables; run with --no-turbo-inlining
  for honest per-function attribution.
- test/profile-lines.mjs: positionTicks per (file,line) over a saved
  .cpuprofile, printed with source text — the op-level view.
- test/ab-emitted.mjs: the interleaved best-of-N A/B gate used to
  keep/revert every lever in PR #35.
tokenPatternCharLoop derives a charCode loop plan from seq(first, star(cont))-
shaped token IR: scan the plain continuation class with int compares, fall back
to the full regex only when the stop char could begin a complex alternative (an
escape opener). Exact by construction: plain chars and every complex
alternative's first chars are required disjoint, so the greedy star is
deterministic and a non-bail stop char provably ends the regex match too.

Kills the regex exec + match-array allocation on the hottest token class
(identifiers, ~40% of tokens); the escape-validation call is skipped on the
fast path since the loop proved no backslash. A/B on the PR#4 bench: +1.7% to
+7.8% aggregate across 4 runs, all positive; emit-lexer-verify (5,695 files,
all fields + error messages), 18,805-file byte-identical gate and 26/26 checks
pass.
peek() paid a parseLimit branch plus a ?? undefined-check on every call for a
cap that only two emitted mixfix re-parse sites ever set. Maintain
cap = min(parseLimit-or-infinity, tokens.length) at those sites, the '>'-splice
and parse() entry instead; peek is now one compare and an always-real token
load. (The ternary form measured NEGATIVE -2.5..-5.6% — V8 prefers the
early-return branch shape; the if-form measures +1.4..+5.4%, 5/6 runs
positive.) Byte-identical on the 18,805-file corpus, 26/26 gates.
lexMk paid a Map.get(text) per named token — a string hash over a fresh slice
every time (V8 can't reuse a cached hash on a new string). Emit lexKwT from the
keyword symtab instead: length window, first-charCode switch, then per-keyword
compare chains shortest-first. Returns exactly LIT_KW.get(text) ?? 0 by
complete enumeration; non-keyword identifiers exit at the length window or the
switch default in a couple of int compares. Also replaces the re-intern in the
unicode ident-extension path. A/B: +9.3..+10.5% aggregate over 3 runs
(parserharness +16% — identifier-dense); lexer-verify, 18,805-file
byte-identical and 26/26 gates pass.
parseRuleEntry paid a Map hash per entry and a {node, end} wrapper allocation
per store, at a measured 51.8% hit rate (memo is load-bearing — half the
entries are avoided longest-match re-parses). Replace each rule's Map with a
pair of arrays indexed by start pos, lazily sized to the token count: a lookup
is two undefined-sentinel array loads, a store allocates nothing.

The pair is captured together at entry and created together at store time —
a '>'-splice inside core() detaches both via fill(undefined) and the late store
lands in the detached pair (discarded), exactly the old Map's semantics; the
first cut captured only one side and crashed on memoNode[idx] === undefined
after an in-flight splice (caught by the 18,805-file gate, 318 divergences).

A/B: +6.8..+13.7% aggregate over 3 runs; byte-identical corpus and 26/26 gates
pass.
peek() updated the farthest-position high-water mark on every call — the
single hottest line in the profile (3.6% self) — for state that only error
messages read. Track it at pos-advance sites instead (one compare per consumed
token rather than per lookahead); restores only ever lower pos, and a memo-hit
restore needs no update since the stored end was recorded when first reached.

Error text is now pinned by a new gate, test/emit-reject-messages.ts: over the
full conformance corpus every both-reject file must throw the EXACT interpreter
message (466 files, 0 mismatches before and after). A/B: +1.1..+7.5% aggregate,
4/4 runs positive; byte-identical corpus and 26/26 gates pass.
…le bits

A token-name key gathered under a not(alt('if', 'var', ...)) guard immediately
before the first consuming element becomes a qualified key, emitted as TM[0]
plus every keyword t outside the guard class instead of the blanket k-bit — a
guarded-out keyword lookahead no longer admits alternatives it provably cannot
start (it would fail the not-guard before consuming). Pure keyword literals
inside their own guard class are dropped from the set entirely. Sharpens the
identifier-led bits of the Expr bare-ident nud, the object-literal shorthand
and class-name alternatives, and every _qN entry guard derived from them.
A/B: +2.1..+10.7%, 3/4 runs positive; byte-identical corpus, reject messages
identical, 26/26 gates.
… literals

exprFirst collapsed any rule whose body contains a [prefix, operand] form to
null/always-admit — FIRST(Expr) was unknown, so every expression-led
alternative and every Expr rule-entry guard admitted everything. A prefix item
consumes exactly one of the prefix-operator literals, so contribute those and
stop. The many(Stmt) loop guard now rejects '}' before entering R_Stmt at all
(1,790 dispatches with their arm runs gone per bench round; R_Stmt admitted
arms 16,052 → 14,260). A/B neutral on its own (-0.3/+0.1%) — kept as a strict
table sharpening and the prerequisite for second-token dispatch refinement.
Byte-identical corpus; 26/26 gates.
Compute per-alternative SECOND sets (the keys admissible as a match's second
token, plus whether a one-token match exists): an admitted alternative whose
SECOND set excludes the actual second token — and that cannot end after one —
provably fails, so its arm is skipped. Kills the labeled-statement arm at every
identifier-led statement without a ':' second token, the single-param arrow
head without '=>', and friends: R_Stmt admitted arms 14,260 → 9,560 and nud
arms 13,537 → 8,968 per bench round. The emitter folds it into the existing
alt-mask dispatch as a second pair of Int32 tables ANDed in (alts with
unknown/len1/nullable/empty SECOND keep their bit everywhere and in the
EOF-after-one mask); pratt op/prefix/postfix items contribute their operator
literal sets, and a '>' SECOND key admits every '>'-led punct so the
'>'-splice stays covered.

The pruning must be ENGINE-IDENTICAL: the first cut (emitted only) passed the
byte-identical corpus but tripped the reject-message gate — a skipped arm no
longer advanced the farthest-position error state that the interpreter's run
of the same arm did. So gen-parser gets the same analysis as altMightSecond
checks in its three dispatch loops, and the emit side computes SECOND from
PLAIN FIRST inputs (no reserved-qualified keys, prefix-to-top) so both engines
derive identical sets and identical prune decisions by construction.

A/B: +2.3..+15.6%, 6/7 runs positive; 18,805-file byte-identical, reject
messages identical (466 files), 26/26 gates.
@johnsoncodehk johnsoncodehk changed the title Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.24x Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 0.86x Jun 9, 2026
@johnsoncodehk

Copy link
Copy Markdown
Owner Author

Round 3 (post-compaction): aggregate 0.86× vs tsc parse-only / 0.64× vs setParentNodes=true (round start: 1.17× / 0.86×). parserharness — the expression-dense laggard — is now at 1.00×.

Levers this round, each A/B-gated (byte-identical 18,805-file corpus + 26/26 gates + new reject-message gate):

  • Char-loop fast path for ident-shaped tokens (tokenPatternCharLoop derives the loop plan from the token IR; escape stop-chars fall back to the regex): +1.7..+7.8%
  • peek() single-compare (fold parseLimit into a maintained cap bound): +2%
  • Baked keyword recognizer (lexKwT length-window + charCode trie replaces Map-hash on a fresh slice per token): +9.3..+10.5%
  • Memo as parallel position arrays (two array loads per lookup, zero-alloc stores; 51.8% measured hit rate — memo is load-bearing): +6.8..+13.7% (a persistent/generation-versioned variant measured NEGATIVE — old-gen write barriers — and was reverted)
  • maxPos moved from every peek() to the twelve advance sites, pinned by a new gate test/emit-reject-messages.ts (every both-reject corpus file must throw the exact interpreter message): +1.1..+7.5%
  • Reserved-aware FIRST keys (a not(keyword-class) guard narrows the dispatch bits; FIRST(Expr) became a real set via prefix-op literals): +2.1..+10.7%
  • SECOND-token dispatch refinement in both engines (per-alt second-token sets ANDed into the alt masks; R_Stmt admitted arms 14,260→9,560, nud arms 13,537→8,968; ported to gen-parser so prune decisions are engine-identical — the message gate caught the emitted-only version skewing farthest-position error state): +2.3..+15.6%

Inspired by the alien-signals / TSSLint performance notes (object shapes,
in-object property pressure, lazy materialization): the token stream becomes
five parallel columns — kind, literal id, span start/end, stamp bits — written
directly by the emitted tokenize. Token text is never materialized during
lexing; the baked keyword recognizer reads source charCodes over the span, and
a CST leaf slices the span only when it is built. peek() disappears entirely
(pos indexes the columns; matchers and the byte-table guards read tkK/tkT
directly), the '>'-split shifts the columns in place, and the column element
width is chosen at emit time (Uint8 when the id spaces fit a byte). Grammars
on the createLexer fallback (markup/indent/newline) convert the object stream
into the same columns at parse() entry with a text column, so the runtime has
a single form; tokenAt(i) reconstructs the object view for the lexer gate.

Measured: GC layer 17.6% -> 14.0%, the peek frame (6.2% self) and the token
constructors (lexMk/lexMkPu/push, ~4%) gone; wall time ~flat to +1.5% on the
bench (the earlier born-final/char-loop/lexKwT rounds had already removed most
lexer-side allocation cost, and per-matcher typed-array loads offset the rest).
Kept as the structural base: the remaining GC is parser-side CST building
(loser arms), which needs probe-then-build, not token shape. All gates green:
lexer stream identical (5,695 files), 18,805-file byte-identical corpus,
reject messages identical, 26/26.
classifyKey's 'tok' variant carries no t — narrow the mixfix separator (a
literal, so kw/punct in practice; -1 never-matches defensively), and
Emitter.a becomes readonly for emitRuntime's symtab access. Emitted output
byte-identical (tsc --noEmit clean, full corpus gate re-run).
…erge, gate the invariant

The ONE construct in the whole product where a leaf's text was not
source.slice(offset, end) was the yaml flow multi-line plain-scalar merge,
which rewrote the merged token's text to the FOLDED value ('multi line value')
— and whose end (offset + folded length) therefore landed mid-token, a span
that meant nothing. A concrete CST should carry the raw span; the
fold-to-one-space is the scalar's VALUE semantics and belongs to consumers
that resolve values. The merge now slices the raw source span across the run
(end becomes the true end); merge structure, type/key-ness, comment guards and
interning are unchanged. All yaml gates (scope-gap, depth witnesses, issue-12
regressions, generative) pass unchanged.

New gate test/cst-text-invariant.ts (in check.ts): every CST leaf across all
seven grammars must satisfy text === source.slice(offset, end) — the
generative corpus per grammar plus a TS conformance stride sample (43,999
leaves, 0 violations). This invariant licenses dropping the leaf text field
from the CST contract: text is derivable data.
A leaf is now {kind, tokenType, offset, end}. Its text was redundant data —
text === source.slice(offset, end) held everywhere once the yaml flow-merge
anomaly was fixed (previous commit), and the cst-text-invariant gate now pins
the span-only shape itself (no text property, sane spans; 43,999 leaves across
all seven grammars + a TS corpus sample). Consumers derive text from the
source they parsed: both engines export getText(node, source), and the
in-repo consumers (html-conformance tree extraction, the generative net's
leafRoles, gap-ledger probes) take the input alongside the CST.

Both engines change together: matchers and the pratt operator paths stop
materializing text (matchKwLit/matchPuLit lose their value parameter — the
leaf no longer needs it), the no-unary-LHS head check and the markup
open/close tag-name comparisons slice the source span, and the generated
*.cst-types.ts leaf interface drops the field. interp ≡ emit holds on the
full 18,805-file corpus; lexer streams and reject messages identical; 27/27
gates.

Side effect on the PR#4 bench: dropping ~10k leaf-text slices per parse and
shrinking every leaf by a field measures +2.1..+14.7% aggregate, 6/8 runs
positive (mean ~+7%).
@johnsoncodehk

Copy link
Copy Markdown
Owner Author

Round 5 — the CST contract change: leaves are now span-only ({kind, tokenType, offset, end}, no text field).

Sequence:

  1. The one anomaly fixed: the yaml flow multi-line plain-scalar merge rewrote the merged token's text to the FOLDED value, and its end (offset + folded length) landed mid-token — a span that meant nothing. A concrete CST carries the raw span; folding is value semantics and belongs to consumers. All yaml gates pass unchanged.
  2. Invariant gated: test/cst-text-invariant.ts (now in check.ts → 27 gates) — every leaf across all seven grammars satisfies the span-only shape (43,999 leaves: generative corpus per grammar + a TS conformance stride sample, 0 violations).
  3. Contract change: both engines stop materializing leaf text (matchKwLit/matchPuLit lose their value parameter; the no-unary-LHS head check and markup tag-name comparisons slice the source span); generated *.cst-types.ts drop the field; getText(node, source) exported by both engines; in-repo consumers (html-conformance tree extraction, the generative net's leafRoles, gap-ledger probes) take the source alongside the CST.

interp ≡ emit on the full 18,805-file corpus, lexer streams + reject messages identical, 27/27 gates. Bench side effect: ~10k leaf-text slices per parse gone and every leaf one field smaller — +2.1..+14.7% aggregate, 6/8 runs positive (mean ~+7%).

…rally

kind carried zero information — a leaf always has tokenType and a node always
has rule + children, two disjoint field sets, so the tag was derivable from
property presence. A leaf is now {tokenType, offset, end} and a node
{rule, children, offset, end}; consumers discriminate with 'tokenType' in n /
'children' in n (TypeScript property-presence narrowing covers the union), and
the generated *.cst-types.ts drop the field. The cst-text-invariant gate now
pins the full shape on both sides: no kind, no text, leaf/node field sets
exact, sane spans.

In-engine reads (the no-unary-LHS head check, markup tag-name extraction, the
html/coverage tree walkers, exec-trace) switch to structural checks; the
emitted pratt path simplifies (head.tokenType === '$operator' alone — a node's
tokenType is undefined and never matches).

interp ≡ emit on the full 18,805-file corpus, reject messages identical,
27/27 gates. Serialized CSTs shrink by one field per object; bench:
+3.4..+6.7% aggregate, 4/4 runs positive (one fewer slot per object and one
fewer store per build).
…l oracle

test/ts-ast-lowering.ts lowers the Monogram TypeScript CST into an AST whose
node kinds and spans mirror tsc's, written deliberately as a CONSUMER would
write it — every friction met is tagged PAIN(1..19) at the exact site.
test/ts-ast-verify.ts compares the result against the real tsc AST pre-order
(kind as SyntaxKind numbers, getStart/end), over a 30-snippet battery and any
real file. State: 30/30 snippets node-identical to tsc; parserindenter.ts
(35KB) matches on 3,142 nodes with exactly ONE divergence left — and that one
is not lowering pain, it is a parser structure bug the experiment uncovered:

  a == b ? c : d   parses as   a == (b ? c : d)     (tsc: (a == b) ? c : d)
  a + b as T       parses as   a + (b as T)         (tsc: (a + b) as T)
  …same for satisfies — every alternative-form mixfix LED binds maximally
  tight, because the LED loop gates them only on maxBp > minBp: they carry NO
  precedence, so they fire inside any operator's rhs. Invisible to all 27
  gates (accept/reject and token scopes are grouping-blind); the first
  structural oracle caught it immediately.

Not wired into check.ts yet: the gate must stay red until the mixfix-LED
precedence fix lands, then this becomes the parser↔tsc STRUCTURE conformance
gate the product was missing.
…ation

The LED loop gated rule-alternative LEDs only on maxBp > minBp — they carried
no precedence, so they fired inside ANY operator's right operand and bound
maximally tight: a == b ? c : d parsed as a == (b ? c : d), a + b as T as
a + (b as T), and in/instanceof additionally right-chained (a in b in c as
a in (b in c)) because their trailing self-operand re-entered the rule at bp 0.
Invisible to every accept/reject and token-scope gate; found by the AST
dogfood's structural oracle.

New grammar data (NOT a precs-ladder entry — the ladder has seven-plus
scoping/branch consumers across the highlighter generators that must not see
these): grammar.ledPrecs anchors a led's connector to a ladder operator —
sameAs borrows its lbp, below sits one notch under (levels are spaced 2
apart), chainRhs parses the trailing self-operand at that lbp (left-chaining)
instead of as a full expression. ecmaPrec ships ?:(below '??'), in/instanceof
(sameAs '<', chained); TypeScript adds as/satisfies (sameAs '<'). Both engines
resolve identical numbers: the interpreter gates the led and special-cases the
chain rhs via parsePratt(rule, rhsBp); the emitter bakes the lbp into the led
conds and emits a custom chain arm calling R_<rule>_pratt(rhsBp). The no-in
suppress machinery is untouched (the gate is additive on the same led path).

Verified: ts-ast-verify 30/30 snippets AND parserindenter.ts (3,168 nodes)
node-identical to tsc — now wired into check.ts as the parser↔tsc STRUCTURE
conformance gate (28 gates); 18,805-file interp≡emit; reject messages
identical; conformance accept-rate byte-equal to baseline (5386/5659 — the
fix is structure-only).
@johnsoncodehk

Copy link
Copy Markdown
Owner Author

Round 7 — the dogfood paid off twice over:

  1. Parser structure bug found & fixed: alternative-form Pratt LEDs (ternary ?:, as, satisfies, in, instanceof) carried no precedence — they fired inside any operator's rhs and bound maximally tight (a == b ? c : da == (b ? c : d); a in b in c right-chained). Invisible to every accept/reject and token-scope gate. Fix: grammar.ledPrecs anchors a led connector to a ladder operator (sameAs/below, optional chainRhs for left-chaining the trailing self-operand); both engines resolve identical bp numbers. Accept-rate is byte-equal to baseline — the fix is structure-only.
  2. New gate: ts-ast-structure (28 gates now) — a tsc-shaped AST lowering (test/ts-ast-lowering.ts, written as a real consumer with 19 PAIN points tagged in situ) compared node-by-node against the real tsc AST (kind + getStart/end, pre-order). 30-snippet battery + parserindenter.ts (3,168 nodes) are node-identical to tsc. This is the parser↔tsc STRUCTURE conformance measurement the product was missing.

Bug #2 - statement-position function merged with a following call: longest-
match let the expression arm win whenever a member/call tail made it LONGER
("function f(){}" + newline + "(g)()" became ONE IIFE-style expression
statement; tsc keeps a declaration + a separate statement). Fix: the ES2023
14.5 ExpressionStatement lookahead restriction - the expression arm may not
begin with function / async function. "class" is deliberately NOT guarded yet:
the class-DECLARATION arm is narrower than tsc's (extends-expression heritage,
bare ';' class elements, decorator placements), so 31 tsc-valid corpus files
still rely on the class-EXPRESSION fallback - widening the declaration arm is
the named prerequisite. The guard flips exactly 3 corpus files to reject, all
tsc-INVALID (template-typed params, "function* gen" without parens): FN=0
held. Rest parameters also gained their checker-not-parser tail
("...b = init", "...b?: T") so the declaration arm carries what the expression
fallback used to.

Bug #3 (pre-existing, all builds) - bare for-in heads: ForHead's
no-declaration arm parsed its target Expr WITHOUT the no-"in" exclusion, so
"for (key in obj)" swallowed the "in" inside an in-LED, the arm failed, and
the statement fell back to a CALL parse "for(...)" with "for" as an
identifier. Same exclude as binding initializers on the target.

The interpreter's maxPos moves to advance-based tracking (mirroring the
emitted engine's relocation) - the new reject inputs exposed that the two
engines' farthest-position semantics had silently diverged; reject messages
are engine-identical again (468 files).

ts-ast-verify gains the valid-only contract: files where tsc itself reports
parse errors are SKIPPED (error-recovery shapes are each parser's own policy,
not the grammar contract) - parserharness/parserRealSource7 fall under it; the
lowering still gained their missing shapes (NewTarget index form, old-style
type assertion). Gate corpus widened to fixSignatureCaching (2,952 nodes) +
parserindenter (3,168), both node-identical to tsc. 28/28 gates; 18,805-file
interp=emit; conformance 5383/5659 (baseline minus the 3 correct rejects).
src/gen-cst-match.ts is the VALUE-level sibling of gen-ast-types: for every
rule it emits a typed result union ({ arm: 'if_', expr, stmt, stmt2? } | ...)
and a match<Rule>(node, src) that re-derives which grammar alternative a node
matched and binds its children to named fields - the discrimination the parser
performed and the CST does not record. Each alternative compiles to a step
plan (lit / litAlt-capture / tok / node / opt / many / sep / branches) that
renders both the type and a cursor-based unifier; the unifier mirrors the
engine's matcher semantics exactly: tokenType-exact literal checks (the
$keyword-vs-Ident tie facts), the interpolated-template dual (a template token
ref accepts a Template leaf OR a '$template' node), pratt operator forms
(binaryOp/prefixOp/postfixOp synthesized beside led/nud arms), sep()'s
consumed trailing delimiter, and greedy no-backtracking quantifiers (children
always reflect the greedy success path, so local greedy decisions reproduce
the parse). Pure-literal alternations capture the matched text as a
string-literal-union field (alt('let','const','var') becomes
let_Kw: 'let' | 'const' | 'var').

Wired into the gen pipeline: npm run gen emits <grammar>.cst-match.ts for all
seven grammars. New gate test/cst-match-totality.ts (in check.ts, 29 gates):
every node of every generated-corpus CST plus a TS conformance stride sample
must destructure through its rule's matcher with full child consumption -
32,336 nodes, 0 misses on first run.

ts-ast-lowering's statement layer now CONSUMES the generated matcher
(switch on m.arm replaces the hand-probing that PAIN 3/5/7 documented), and
picked up break/continue labels and with-statements in the rewrite;
ts-ast-structure stays node-identical to tsc (32/32).

Cost today (maximal workload - destructuring EVERY node, naive ordered arm
try): +64% over the CST pass; the full tokens->CST->AST pipeline measures
0.96x vs tsc createSourceFile on the four bench files. Known v2 lever: the
dispatcher tries arms in declaration order, so late arms (pratt binaryOp) pay
~20 failed tries - a first-child discriminator switch will cut most nodes to
1-2 tries.
The expression-statement lookahead guard, the bare for-in exclude and the
rest-parameter checker-tail flow into the generated tree-sitter DSL; CI's
artifact-sync step caught that they were not regenerated with b814d96.
The v1 dispatcher tried arms in declaration order, so the late pratt op forms
paid ~20 failed unifications per expression node. v2 derives each arm's
FIRST-CHILD admission keys from its step plan (node rule / leaf tokenType /
literal first charCode, with nullable-first arms admitting everything) and
dispatches through nested switches: node child -> rule bucket, leaf ->
tokenType bucket ($keyword/$punct sub-switched on the first character). Big
node-rule buckets (a pratt rule's self bucket holds every led + op form)
sub-dispatch one level deeper on c[1] - the connector position. Buckets are
SUPERSET filters (every arm fn still verifies exactly) and preserve
declaration order internally, so tie semantics are unchanged.

Destructure-EVERY-node overhead over the bare CST pass: +64% (v1) -> +39%
(one level) -> +31% (two levels); totality unchanged (32,336 nodes, 0 misses
across all seven grammars + the TS corpus sample) and the tsc-structural
oracle stays node-identical (32/32).
…e truth

A longest-match TIE between one-token alternatives goes to the first-listed
one. With the bare-identifier nud listed first, "this"/"true"/"false"/"null"/
"undefined"/"super" in expression position were stamped as Ident leaves and
every consumer had to re-classify them by text (the dogfood's PAIN 19).
Listing the literal alternatives first flips the tie: the leaves arrive as
$keyword and the tree records what the word IS. Accepts are unchanged
(conformance 5383/5659, byte-equal), interp=emit holds on the full corpus,
reject messages identical, and the tsc-structural oracle stays node-identical
- the lowering's text re-classification block is deleted rather than moved.
Two generator gaps the dogfood surfaced:

1. A $keyword literal LEADING an optional group is now captured as <kw>Tok
   (CstLeaf) - the group's presence marker and position anchor in one field.
   The try-arm consumer drops its findText('catch') children scan: catchTok /
   finallyTok / elseTok arrive typed, with spans.

2. Mixed STRUCTURAL alternations no longer flatten to a soup of optionals:
   each branch compiles to a tagged mini-plan with its own captures, and the
   parent gets one field holding a per-branch sub-union -
   alt?: { branch: 'param'; param } | { branch: 'bindingPattern'; ... } -
   so consumers switch on m.alt.branch instead of probing presence. Branch
   capture slots are rename-prefixed per attempt (nested alternations keep
   their slots distinct); the result-object/type keys are frozen at plan time
   (Capture.field), so renaming never leaks into the API. Generated helper
   names moved to a __-prefix namespace after a grammar keyword ('is' in
   TypeScript's `x is T`) minted an isTok capture that shadowed the helper.

Totality holds across all seven grammars (0 misses); the tsc-structural
oracle stays node-identical with two new battery cases (try-finally-only,
catch with a destructuring pattern).
…ass'

The round-8 prerequisite, driven by the 29-file work spec: class DECLARATIONS
now cover what tsc parses cleanly (legality being checker territory), so the
ES2023 14.5 statement lookahead can finally include 'class' - statement-
position class merges (class C {}\n(g)() as one expression) are gone the same
way the function ones were.

Grammar (both languages, TS shown):
- Heritage clauses are REPEATABLE, order-free and multi-typed:
  many(alt(['extends', sep(elem, ',')], ['implements', sep(elem, ',')])) -
  tsc parses "extends A extends B", "implements A extends B", "extends A, B"
  and even a bare "extends {" (zero elements; sep() matches empty, leaving the
  brace for the body) with NO diagnostics.
- A heritage ELEMENT is guarded against the clause keywords (not(alt(
  'extends', 'implements'))): they are contextual words the bare-Ident base
  would otherwise swallow, breaking clause chaining.
- implements elements are heritage EXPRESSIONS, not Types - tsc parses
  "implements A?.B" cleanly; ClassHeritage also gained the '?.' member led
  and the non-constructor primaries (numbers, strings, true/false/null/
  undefined) that classExtendingNonConstructor exercises.
- ClassMember accepts the bare ';' element (tsc's SemicolonClassElement).
- Decorators may precede ANY declaration ([many1(DecoratorExpr), $] - many1
  keeps the self-ref non-left-recursive), covering "@dec export default
  class {}".

The widening also fixed 7 PRE-EXISTING false negatives: total tsc-clean
rejects drop 24 -> 17, conformance holds at 5383/5659, and all 29 of the
guard-caused regression set parse with the guard ON.

gen-tm's dormant optional-chain mint is deleted: its direct-ref shape test
never matched any grammar (the expression '?.' led targets an alt()), and
when the new heritage led first woke it, the minted entity.other.property
scope contradicted the ledger-pinned a?.b contract (variable.other, matching
the official grammar). First activation of dead code = its first review.

interp=emit on the full corpus, reject messages identical, ts-ast oracle
node-identical, 29/29 gates.
…> 0)

Every remaining FN was a construct tsc parses cleanly (validity deferred to
its checker) that the grammar rejected. Widen to the proven parse surface,
verified per construct against tsc parseDiagnostics:

- JSDoc types in normal TS type positions: postfix `T?` / `T!`, prefix `?T` /
  `!T`, bare `?`, `*`, `function(this: T, string): U`, and dotted type
  arguments `Array.<number>`. The postfix `?` mirrors tsc's isStartOfType
  disambiguation via `not(alt('new', $))` (conditional types and `as T ? a : b`
  ternaries keep parsing) and requires same-line (tsc rejects `T\n?`).
- Decorator expressions: optional-chain tails (`@x?.y`, `@x?.()`, `@x?.[i]`),
  tagged templates (`@x``...```), and `@new x` — the lexer maximal-munches
  `@new` into one Decorator token, so isKeywordLiteral now classes `@`-headed
  word literals as keyword-class (text-matched against named tokens; both
  engines share the mechanism, the emitted lexer's kw-intern tables included).
- Decorator placement: `@dec var/let/const x` and `@dec using x` (using
  requires a real binding: `using 1` stays a reject), `export @dec default
  class`.
- Object literals: full modifier soup (`{ static m() {} }`, `{ export p: 1 }`)
  and `?` / `!` after any member name (`{ a! }`, `{ a?() {} }`), per tsc's
  parseObjectLiteralElement; `const`/`default` stay out (tsc parse errors).
- Parameters: full modifier soup (`override`/`static`/`export`/`async`/...,
  the set probed parse-clean one keyword at a time).
- Object binding rest: `...r: name` and `...r = init`.
- Nameless class declarations (`class { }` at statement level).
- Import/export specifiers: `import { "str" as x }`; export side gets its own
  ExportSpecifier (ModuleExportName on both sides, bare strings allowed).
- `typeof import("m").Thing` (ImportTypeNode as a type-query target).
- Async arrows with a bare parameter: `async err => ...` was previously
  mis-split into two statements (`async` + ASI); new same-line arm in both
  grammars (ES2017), with AsyncKeyword-aware lowering and two new structural
  oracle cases (lowered AST stays node-by-node = tsc).

javascript.ts mirrors only the ES-true widenings (async bare arrow, ES2022
string specifiers); tsc-lenient recovery surfaces stay TS-only.

gen-tm: a type-context keyword whose every @type occurrence is followed by
`(` (the JSDoc function type) no longer leaves the statement-start set, so
multi-line type regions still close at a real `function f() {}` declaration
(fixes the issue #1043 regression the new arm introduced).

Conformance: FN 17 -> 0 (corpus-wide; tsc-parse-clean files we reject: none),
both-accept 5095 -> 5112, over-accept 288 -> 289 (the one new case is
exportModifier.2's `abstract @dec class` same-line split — the pre-existing
newline-blind-ASI class, now the dominant remaining over-accept dimension).
29/29 gates green.
…rface

CI's tree-sitter conflict gate (tree-sitter generate, not in the local gate
chain) was red on two counts: the class-declaration widening of the previous
commit (class_member) and this round's object-literal / export-specifier /
decorator / rest-binding / JSDoc-function-type arms. collect-conflicts
fixpoint adds 9 tuples; all seven derived grammars generate cleanly again.
The PR diff was 89% generated *.cst-match.ts (41k of 46k lines), drowning the
hand-written changes. Mark every npm-run-gen artifact linguist-generated so
GitHub collapses them, and add '*.cst-match.ts' to the CI artifact-sync gate —
it was the one committed artifact class the drift check didn't cover (the
totality gate only catches staleness that breaks unification, not silent
drift).
…d in CI

They are pure consumer artifacts derived from the grammar (41k of the PR's
46k added lines); the grammar sources are the truth. Gitignore them, drop
them from the artifact-sync diff and from .gitattributes, and move the gen
step ahead of Typecheck in CI — the typecheck and the gates import the
generated files, so they must exist before either runs.
altMaskDispatch/ftMaskDispatch returned null past 32 alternatives, silently
dropping a rule to serial per-alt membership guards. The FN=0 widening pushed
BOTH hottest rules over the edge (R_Type_lr to 33 alts, R_Expr_pratt's nud
list past 32) and the whole parse paid ~25-27% — a flat profile, because the
cost was diffuse admit-checks, not a hotspot. Masks now span words (alt i ->
word i>>5, bit i&31; one table pair + one local per word), so dispatch
degrades smoothly with grammar growth instead of cliffing; single-word rules
emit byte-identical code to before.

A/B: +37.8% aggregate over the cliffed build; net cost of the entire parse-
surface widening vs its pre-widening baseline is now -2.7%. Prune decisions
are unchanged (mask = the same FIRST/SECOND predicate as the serial guards):
18,805-file emit=interp byte-identical, 436 reject messages exact, 29/29.
@johnsoncodehk johnsoncodehk changed the title Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 0.86x Emitted parser to 0.84x tsc, span-only CST, FN=0 vs tsc's parse surface, generated per-arm destructurers Jun 10, 2026
… labels

test/profile-vs-peers.mjs benchmarks the emitted javascript.ts against acorn
(the ESTree reference) on real-world files from node_modules — acorn parsing
itself, vue compiler-core, parse5's parser, and the 8.9MB typescript.js — and
the emitted html.ts against parse5 (the WHATWG reference) on synthesized
well-formed fragments, same warmup/min-of-rounds methodology as
profile-vs-tsc.mjs. JS lands at parity (0.85-1.1x per file across runs); HTML
at ~2.2x, the named gap being that markup-mode lexing is still the interpreted
lexer (emit-lexer specialization doesn't cover markup yet).

Benching acorn.js exposed a real grammar bug in both ECMAScript grammars:
break/continue labels missed the spec's restricted production
(`break [no LineTerminator here] Label`) and the reserved-word guard, so
`break` newline `case "X":` inside a switch ate `case` as the label and the
whole switch cascaded into a reject. Both arms now read
`opt(sameLine, notReserved, Ident)`. Conformance matrix unchanged
(5112/258/0/289, FN still 0); 29/29 gates; tree-sitter closure complete.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant