Emitted parser to 0.84x tsc, span-only CST, FN=0 vs tsc's parse surface, generated per-arm destructurers#35
Emitted parser to 0.84x tsc, span-only CST, FN=0 vs tsc's parse surface, generated per-arm destructurers#35johnsoncodehk wants to merge 42 commits into
Conversation
A standalone, debug-only tracer: startTrace()/endTrace() markers (near-free
no-ops in production, src/trace-markers.ts) bound a region; test/exec-trace.ts
AST-instruments a copy of the target (TS compiler API — no new deps) so every
executed statement/initializer/return/condition records its line, source text,
runtime value and a cost tag {alloc|map.get|call}, gated on the markers. The
flattened executed source is printed with calls inlined and repeated call-frames
grouped by DISTINCT execution path (each path once + count + per-line value
distribution), so a hot function's N calls read as a few paths, not N×.
Scopes: the time-region markers, or --lines=A-B for a spatial slice. Reads the
actual run, so it reflects the real engine, not an approximation.
A negative lookahead over a literal/alternation of keyword literals (e.g. an
identifier that isn't a reserved word, `not('catch'|'delete'|…)`) was matched by
trying each literal in turn — O(N) matchLiteral calls plus an `out` allocation per
arm, at every identifier position (the hottest in the grammar). The keyword set is
static, so collapse it to one membership test: the not fails iff the token is an
ident-kind whose text is one of the keywords. Mirrored in createParser
(gen-parser.ts notKwSet) and emitParser (emit-parser.ts notKwKinds), emitting the
same check matchKwLit uses → byte-identical.
Emitted parser ~5% faster (parserharness 19.7→18.7 ms), interpreter ~9–12%; the
reserved-word scan for `a + b` drops from 12 matchLiteral calls to 0. Byte-identical
CST + accept/reject (createParser ≡ emitParser, run-conformance 5386/5659, 26/26 gates).
…urn directly Two emit-only rewrites (the interpreter is the oracle; emit≡interp is gated byte-identical on the full 18,805-file corpus): - descMatchesTok's switch body is inlined into ruleMightStartDescs and canStartFT, removing the dead helper and the call boundary on the per-token dispatch loops. - Matchers reducible to a single literal/token-ref/rule-ref emit 'const v = match(); return [v]' directly instead of out=[]+push (96 of the TS grammar's matchers specialize). Measured ~1-2% on the PR#4 bench files (parserharness +2.3%).
…loop The longest-match alt guards and rule-ref guards tested FIRST-set membership by looping a descriptor array per call (ruleMightStartDescs) — the CPU profile showed it at ~13% self time. Each set now emits a per-set fn over two Uint8Array tables indexed by the token's baked ints: _qN(tok) = !tok || (KT[tok.k] | TT[tok.t]) !== 0 two loads + an or, no loop. Faithful to the loop because the keyword and punct int ranges are disjoint (the loop's k!==K_PUNCT / k===K_PUNCT guards were redundant) and the punct startsWith arm is enumerated over the closed punct vocabulary at emit time. Tables and fns are deduped across sets (typescript: 103 fns / 103 shared arrays). Gates: full 18,805-file corpus byte-identical, conformance, 26/26 check. Bench (PR#4 files, interleaved best-of-N): aggregate +10.6~12.0%, every file positive (+5~14%).
The CPU profile (vs tsc's scanner) showed the lexer's cost is not the
token regexes but the data-driven dispatch around them. Three prunings,
all grammar-agnostic:
- punct literals: first-char group index. Only literals sharing the
position's first char can startsWith-match, so the longest-first scan
runs over that group instead of all literals (TS: 57 -> ~3).
- token matchers: the template/markup/indent exclusions are
position-independent — filter the matcher list once at createLexer
time instead of re-testing three Set.has per matcher per position.
- whitespace: consume the ASCII \s run ({9..13, 32}) with a char loop;
only a non-ASCII candidate falls back to the \s regex (Unicode
spaces), preserving exact \s semantics and the includes('\n') ->
newlineBefore stamping.
Gates: token-stream equality vs the old lexer over the 5,695-file
conformance corpus (5,624 identical streams + 71 identical-message
throws, 0 diffs), conformance 5386/5659 unchanged, 26/26 check (all
grammars). Bench (PR#4 files): tokenize +87~95%, whole parse +35~46%.
canStartFT — the per-LED single-descriptor switch in the Pratt dispatch loop — was the next profile hotspot (~7% self). A first-token gate is a 1-element FIRST set, so it reuses the membershipFn byte tables, open- coded at the call site (tok is already known non-null there): (KT[tok.k] | TT[tok.t]) !== 0 no call, no switch. The canStartFT runtime helper and the descriptor literals (keyDescLiteral/firstTokDescLiteral) are gone. Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +7.8~10.3%, every file positive.
…terned fields The no-inlining profile attributed ~14% of parse time to the token interning loop: internTok ADDED k/t to already-shaped lexer tokens (two hidden-class transitions per token, on top of the conditional newlineBefore stamp making shapes polymorphic at every tok.k/tok.t site). The tokenize wrapper now rebuilds each token as a single literal with every field the parser reads (type/text/offset/k/t + the three stamp flags normalized to booleans) — one shape from birth. The matchPuLit '>'-split tokens go through the same mkPunct shape. internTok is gone. Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +3.1~5.4% across three runs.
The LED loop did a string-keyed opTable.get(tok.text) for every token it reached (and the prefix-op nud a prefixOps.get); tok.t is already interned, so both become an array load (OP_BY_T / PREFIX_BY_T, null- packed, length = the literal-int space). Equivalent because a token's text can equal an operator value only for punct tokens and keyword- shaped idents — exactly the classes tok.t indexes; operator values are in the literal vocabulary by construction (asserted at emit). Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +1.8~2.8% over three runs.
Every multi-alt site (non-rec alts, left-rec atoms, Pratt nuds) guarded each alternative with its own membership-fn call — R_Stmt burned ~20 _q calls per statement position. Each alt list (3..32 alts) now bakes one bit per alt into two Int32Array mask tables over the token int spaces: mask = startTok ? KM[startTok.k] | TM[startTok.t] : ALL and each alt's guard is one bit test. Bit i is set exactly where the old altGuard was true (always-tried alts in every k slot; EOF admits all), so the same alt subset runs in the same order — byte-identical by construction. _q(startTok) guard calls: 162 -> 15 (the sub-3-alt lists). Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +10.4~15.2%, every file positive in both runs.
…ation The emitted parser rebuilt every lexer token (second allocation + TYPE_KIND/LIT_KW/LIT_PU dictionary lookups per token) to attach the int kinds — ~21% of parse time after the dispatch rounds, plus the GC share. createLexer now takes an optional intern config (typeKind/kwLit/puLit maps + punct/fallback kinds) and every creation site builds the FINAL token through two single-allocation builders: - mkNamed(type, text, offset, k): k is baked per site — token matchers and prefixed-ident entries carry their kind, fixed-name sites (template/markup/indent/newline) use consts computed once; t is the text's keyword int (one Map.get — the same lookup tsc's getIdentifierToken pays). - mkPu(text, offset, t): punct t is precomputed per first-char-group entry — zero runtime lookup. Stamp flags are real fields from birth (push writes existing fields, no hidden-class transitions). The three post-push mutations re-intern exactly (flow plain-fold merge, markup void retag, unicode ident extension). Without an intern config (the interpreter) k/t are 0 via an empty map; the interpreter's token shape is unchanged in behavior. The emitted tokenize wrapper is gone — the lexer's array is the parser's array; matchPuLit '>'-splits build the same shape. Gates: token-stream equality 5,695 files (0 diffs), full 18,805-file corpus byte-identical, 26/26 check. Bench (strict A/B vs the full previous state): aggregate +12.8~13.0%, every file positive (fixSignatureCaching +19~23%).
emitLexer(grammar) specializes tokenize() for token-stream grammars: one
switch(charCode) whose cases hold exactly that char's candidate token
regexes (declaration order) and a longest-first punct compare-chain with
the literal's int baked at each leaf. Token regexes stay V8 regexes (the
dispatch was the measured cost, not the matching). Regex-vs-division
context, paren-head tracking and the template state machine are baked:
prevIsValue is four int-table loads, the '('/')'/'!' bookkeeping exists
only on those punct branches, scanTemplateSpan/identTextValid are
emitted with their config inlined. Grammars using markup/indent/newline
fall back to the createLexer import (emission unchanged for yaml/html).
New gate test/emit-lexer-verify.ts: emitted stream ≡ createLexer over
the 5,695-file conformance corpus — every field including k/t and the
stamp flags, plus identical error messages (5,624 same + 71 same-throw,
0 diffs).
Gates: full 18,805-file corpus byte-identical, conformance 5386/5659
unchanged, 26/26 check.
Bench: whole parse +57~77%; vs ts.createSourceFile the aggregate is now
1.24x (parse-only) / 0.94x (setParentNodes:true) — below parity on the
latter, with fixSignatureCaching at 0.69x even against parse-only.
|
Added the emitted lexer (f4ceb13): Updated headline (PR #4 bench files, aggregate):
Per file vs parse-only: fixSignatureCaching 0.69×, parserindenter 1.23×, parserRealSource7 1.28×, parserharness 1.45×. Remaining hotspots: R_Expr_pratt ~15%, GC ~18% (CST/token allocation), lexMk ~5%, matchPuLit ~4.6%. |
- The Pratt LED chain re-tested maxBp > minBp per LED and gated each LED with its own two-table load; the shared test is hoisted and the per-LED first-token gates collapse into one mask pair (ftMaskDispatch, same Int32Array machinery as the alt dispatch). - matchKwLit/matchPuLit: the t int ranges are disjoint, so the k >= K_NAMED_MIN / k === K_PUNCT guards were redundant — one int compare each. The '>'-split moves to matchPuLitGT, emitted only at '>' call sites. - Fixes a latent bug the gates could not reach: the generic matchLiteral fallback still indexed LIT_KW/LIT_PU as objects after they became Maps (dead on the TS grammar's fully-specialized sites, wrong if ever reached). Gates: emit-lexer-verify 5,695 files 0 diff, full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +7.5~14.5%, every file positive in both runs.
|
Round 3 (commit ea3f74f): LED chain bitmasked + hoisted maxBp test; matchKwLit/matchPuLit to one int compare (disjoint t ranges make the k guards redundant; '>'-split isolated to matchPuLitGT); fixed a latent Map-indexing bug in the generic matchLiteral fallback. +7.5~14.5%. Aggregate now 9.3ms — 1.20× vs tsc parse-only / 0.87× vs setParentNodes:true (per file vs parse-only: 0.66× / 1.11× / 1.21× / 1.45×). |
- parseRuleEntry looked the memo up through a string-keyed outer Map per rule entry; each memoized rule now owns an array slot allocated at emit time (memo = new Array(MEMO_RULES), hit = memo[idx]). - lexMk did a LIT_KW.get for every named token; a matcher (or template / prefixed-ident site) whose first-char set is disjoint from every keyword's first char provably interns t = 0 — those sites emit a lookup-free builder (lexMk0). The identifier matcher keeps the lookup (the same one tsc's getIdentifierToken pays). Gates: emit-lexer-verify 5,695 files 0 diff (k/t compared), full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +1.4~5.8% across four runs (machine noisy; sign stable).
|
Round 4 (34bb02d): memo keyed by emit-time rule index (array slot instead of string-keyed Map per rule entry); lexMk bakes t:0 for matchers whose first-char set provably can't start a keyword (lookup only remains on identifiers, mirroring tsc's getIdentifierToken). +1.4~5.8% across four runs. Cumulative: 34.8ms → ~9ms, 4.41× → ~1.2× vs tsc parse-only, ~0.85× vs setParentNodes:true. |
…ness - test/profile-vs-tsc.mjs: dual V8 inspector profiles (emitted parser / ts.createSourceFile) with the timing ratio table, lexer/parser/GC layer split and top self-time tables; run with --no-turbo-inlining for honest per-function attribution. - test/profile-lines.mjs: positionTicks per (file,line) over a saved .cpuprofile, printed with source text — the op-level view. - test/ab-emitted.mjs: the interleaved best-of-N A/B gate used to keep/revert every lever in PR #35.
tokenPatternCharLoop derives a charCode loop plan from seq(first, star(cont))- shaped token IR: scan the plain continuation class with int compares, fall back to the full regex only when the stop char could begin a complex alternative (an escape opener). Exact by construction: plain chars and every complex alternative's first chars are required disjoint, so the greedy star is deterministic and a non-bail stop char provably ends the regex match too. Kills the regex exec + match-array allocation on the hottest token class (identifiers, ~40% of tokens); the escape-validation call is skipped on the fast path since the loop proved no backslash. A/B on the PR#4 bench: +1.7% to +7.8% aggregate across 4 runs, all positive; emit-lexer-verify (5,695 files, all fields + error messages), 18,805-file byte-identical gate and 26/26 checks pass.
peek() paid a parseLimit branch plus a ?? undefined-check on every call for a cap that only two emitted mixfix re-parse sites ever set. Maintain cap = min(parseLimit-or-infinity, tokens.length) at those sites, the '>'-splice and parse() entry instead; peek is now one compare and an always-real token load. (The ternary form measured NEGATIVE -2.5..-5.6% — V8 prefers the early-return branch shape; the if-form measures +1.4..+5.4%, 5/6 runs positive.) Byte-identical on the 18,805-file corpus, 26/26 gates.
lexMk paid a Map.get(text) per named token — a string hash over a fresh slice every time (V8 can't reuse a cached hash on a new string). Emit lexKwT from the keyword symtab instead: length window, first-charCode switch, then per-keyword compare chains shortest-first. Returns exactly LIT_KW.get(text) ?? 0 by complete enumeration; non-keyword identifiers exit at the length window or the switch default in a couple of int compares. Also replaces the re-intern in the unicode ident-extension path. A/B: +9.3..+10.5% aggregate over 3 runs (parserharness +16% — identifier-dense); lexer-verify, 18,805-file byte-identical and 26/26 gates pass.
parseRuleEntry paid a Map hash per entry and a {node, end} wrapper allocation
per store, at a measured 51.8% hit rate (memo is load-bearing — half the
entries are avoided longest-match re-parses). Replace each rule's Map with a
pair of arrays indexed by start pos, lazily sized to the token count: a lookup
is two undefined-sentinel array loads, a store allocates nothing.
The pair is captured together at entry and created together at store time —
a '>'-splice inside core() detaches both via fill(undefined) and the late store
lands in the detached pair (discarded), exactly the old Map's semantics; the
first cut captured only one side and crashed on memoNode[idx] === undefined
after an in-flight splice (caught by the 18,805-file gate, 318 divergences).
A/B: +6.8..+13.7% aggregate over 3 runs; byte-identical corpus and 26/26 gates
pass.
peek() updated the farthest-position high-water mark on every call — the single hottest line in the profile (3.6% self) — for state that only error messages read. Track it at pos-advance sites instead (one compare per consumed token rather than per lookahead); restores only ever lower pos, and a memo-hit restore needs no update since the stored end was recorded when first reached. Error text is now pinned by a new gate, test/emit-reject-messages.ts: over the full conformance corpus every both-reject file must throw the EXACT interpreter message (466 files, 0 mismatches before and after). A/B: +1.1..+7.5% aggregate, 4/4 runs positive; byte-identical corpus and 26/26 gates pass.
…le bits
A token-name key gathered under a not(alt('if', 'var', ...)) guard immediately
before the first consuming element becomes a qualified key, emitted as TM[0]
plus every keyword t outside the guard class instead of the blanket k-bit — a
guarded-out keyword lookahead no longer admits alternatives it provably cannot
start (it would fail the not-guard before consuming). Pure keyword literals
inside their own guard class are dropped from the set entirely. Sharpens the
identifier-led bits of the Expr bare-ident nud, the object-literal shorthand
and class-name alternatives, and every _qN entry guard derived from them.
A/B: +2.1..+10.7%, 3/4 runs positive; byte-identical corpus, reject messages
identical, 26/26 gates.
… literals exprFirst collapsed any rule whose body contains a [prefix, operand] form to null/always-admit — FIRST(Expr) was unknown, so every expression-led alternative and every Expr rule-entry guard admitted everything. A prefix item consumes exactly one of the prefix-operator literals, so contribute those and stop. The many(Stmt) loop guard now rejects '}' before entering R_Stmt at all (1,790 dispatches with their arm runs gone per bench round; R_Stmt admitted arms 16,052 → 14,260). A/B neutral on its own (-0.3/+0.1%) — kept as a strict table sharpening and the prerequisite for second-token dispatch refinement. Byte-identical corpus; 26/26 gates.
Compute per-alternative SECOND sets (the keys admissible as a match's second token, plus whether a one-token match exists): an admitted alternative whose SECOND set excludes the actual second token — and that cannot end after one — provably fails, so its arm is skipped. Kills the labeled-statement arm at every identifier-led statement without a ':' second token, the single-param arrow head without '=>', and friends: R_Stmt admitted arms 14,260 → 9,560 and nud arms 13,537 → 8,968 per bench round. The emitter folds it into the existing alt-mask dispatch as a second pair of Int32 tables ANDed in (alts with unknown/len1/nullable/empty SECOND keep their bit everywhere and in the EOF-after-one mask); pratt op/prefix/postfix items contribute their operator literal sets, and a '>' SECOND key admits every '>'-led punct so the '>'-splice stays covered. The pruning must be ENGINE-IDENTICAL: the first cut (emitted only) passed the byte-identical corpus but tripped the reject-message gate — a skipped arm no longer advanced the farthest-position error state that the interpreter's run of the same arm did. So gen-parser gets the same analysis as altMightSecond checks in its three dispatch loops, and the emit side computes SECOND from PLAIN FIRST inputs (no reserved-qualified keys, prefix-to-top) so both engines derive identical sets and identical prune decisions by construction. A/B: +2.3..+15.6%, 6/7 runs positive; 18,805-file byte-identical, reject messages identical (466 files), 26/26 gates.
|
Round 3 (post-compaction): aggregate 0.86× vs tsc parse-only / 0.64× vs setParentNodes=true (round start: 1.17× / 0.86×). parserharness — the expression-dense laggard — is now at 1.00×. Levers this round, each A/B-gated (byte-identical 18,805-file corpus + 26/26 gates + new reject-message gate):
|
Inspired by the alien-signals / TSSLint performance notes (object shapes, in-object property pressure, lazy materialization): the token stream becomes five parallel columns — kind, literal id, span start/end, stamp bits — written directly by the emitted tokenize. Token text is never materialized during lexing; the baked keyword recognizer reads source charCodes over the span, and a CST leaf slices the span only when it is built. peek() disappears entirely (pos indexes the columns; matchers and the byte-table guards read tkK/tkT directly), the '>'-split shifts the columns in place, and the column element width is chosen at emit time (Uint8 when the id spaces fit a byte). Grammars on the createLexer fallback (markup/indent/newline) convert the object stream into the same columns at parse() entry with a text column, so the runtime has a single form; tokenAt(i) reconstructs the object view for the lexer gate. Measured: GC layer 17.6% -> 14.0%, the peek frame (6.2% self) and the token constructors (lexMk/lexMkPu/push, ~4%) gone; wall time ~flat to +1.5% on the bench (the earlier born-final/char-loop/lexKwT rounds had already removed most lexer-side allocation cost, and per-matcher typed-array loads offset the rest). Kept as the structural base: the remaining GC is parser-side CST building (loser arms), which needs probe-then-build, not token shape. All gates green: lexer stream identical (5,695 files), 18,805-file byte-identical corpus, reject messages identical, 26/26.
classifyKey's 'tok' variant carries no t — narrow the mixfix separator (a literal, so kw/punct in practice; -1 never-matches defensively), and Emitter.a becomes readonly for emitRuntime's symtab access. Emitted output byte-identical (tsc --noEmit clean, full corpus gate re-run).
…erge, gate the invariant
The ONE construct in the whole product where a leaf's text was not
source.slice(offset, end) was the yaml flow multi-line plain-scalar merge,
which rewrote the merged token's text to the FOLDED value ('multi line value')
— and whose end (offset + folded length) therefore landed mid-token, a span
that meant nothing. A concrete CST should carry the raw span; the
fold-to-one-space is the scalar's VALUE semantics and belongs to consumers
that resolve values. The merge now slices the raw source span across the run
(end becomes the true end); merge structure, type/key-ness, comment guards and
interning are unchanged. All yaml gates (scope-gap, depth witnesses, issue-12
regressions, generative) pass unchanged.
New gate test/cst-text-invariant.ts (in check.ts): every CST leaf across all
seven grammars must satisfy text === source.slice(offset, end) — the
generative corpus per grammar plus a TS conformance stride sample (43,999
leaves, 0 violations). This invariant licenses dropping the leaf text field
from the CST contract: text is derivable data.
A leaf is now {kind, tokenType, offset, end}. Its text was redundant data —
text === source.slice(offset, end) held everywhere once the yaml flow-merge
anomaly was fixed (previous commit), and the cst-text-invariant gate now pins
the span-only shape itself (no text property, sane spans; 43,999 leaves across
all seven grammars + a TS corpus sample). Consumers derive text from the
source they parsed: both engines export getText(node, source), and the
in-repo consumers (html-conformance tree extraction, the generative net's
leafRoles, gap-ledger probes) take the input alongside the CST.
Both engines change together: matchers and the pratt operator paths stop
materializing text (matchKwLit/matchPuLit lose their value parameter — the
leaf no longer needs it), the no-unary-LHS head check and the markup
open/close tag-name comparisons slice the source span, and the generated
*.cst-types.ts leaf interface drops the field. interp ≡ emit holds on the
full 18,805-file corpus; lexer streams and reject messages identical; 27/27
gates.
Side effect on the PR#4 bench: dropping ~10k leaf-text slices per parse and
shrinking every leaf by a field measures +2.1..+14.7% aggregate, 6/8 runs
positive (mean ~+7%).
|
Round 5 — the CST contract change: leaves are now span-only ( Sequence:
interp ≡ emit on the full 18,805-file corpus, lexer streams + reject messages identical, 27/27 gates. Bench side effect: ~10k leaf-text slices per parse gone and every leaf one field smaller — +2.1..+14.7% aggregate, 6/8 runs positive (mean ~+7%). |
…rally
kind carried zero information — a leaf always has tokenType and a node always
has rule + children, two disjoint field sets, so the tag was derivable from
property presence. A leaf is now {tokenType, offset, end} and a node
{rule, children, offset, end}; consumers discriminate with 'tokenType' in n /
'children' in n (TypeScript property-presence narrowing covers the union), and
the generated *.cst-types.ts drop the field. The cst-text-invariant gate now
pins the full shape on both sides: no kind, no text, leaf/node field sets
exact, sane spans.
In-engine reads (the no-unary-LHS head check, markup tag-name extraction, the
html/coverage tree walkers, exec-trace) switch to structural checks; the
emitted pratt path simplifies (head.tokenType === '$operator' alone — a node's
tokenType is undefined and never matches).
interp ≡ emit on the full 18,805-file corpus, reject messages identical,
27/27 gates. Serialized CSTs shrink by one field per object; bench:
+3.4..+6.7% aggregate, 4/4 runs positive (one fewer slot per object and one
fewer store per build).
…l oracle test/ts-ast-lowering.ts lowers the Monogram TypeScript CST into an AST whose node kinds and spans mirror tsc's, written deliberately as a CONSUMER would write it — every friction met is tagged PAIN(1..19) at the exact site. test/ts-ast-verify.ts compares the result against the real tsc AST pre-order (kind as SyntaxKind numbers, getStart/end), over a 30-snippet battery and any real file. State: 30/30 snippets node-identical to tsc; parserindenter.ts (35KB) matches on 3,142 nodes with exactly ONE divergence left — and that one is not lowering pain, it is a parser structure bug the experiment uncovered: a == b ? c : d parses as a == (b ? c : d) (tsc: (a == b) ? c : d) a + b as T parses as a + (b as T) (tsc: (a + b) as T) …same for satisfies — every alternative-form mixfix LED binds maximally tight, because the LED loop gates them only on maxBp > minBp: they carry NO precedence, so they fire inside any operator's rhs. Invisible to all 27 gates (accept/reject and token scopes are grouping-blind); the first structural oracle caught it immediately. Not wired into check.ts yet: the gate must stay red until the mixfix-LED precedence fix lands, then this becomes the parser↔tsc STRUCTURE conformance gate the product was missing.
…ation The LED loop gated rule-alternative LEDs only on maxBp > minBp — they carried no precedence, so they fired inside ANY operator's right operand and bound maximally tight: a == b ? c : d parsed as a == (b ? c : d), a + b as T as a + (b as T), and in/instanceof additionally right-chained (a in b in c as a in (b in c)) because their trailing self-operand re-entered the rule at bp 0. Invisible to every accept/reject and token-scope gate; found by the AST dogfood's structural oracle. New grammar data (NOT a precs-ladder entry — the ladder has seven-plus scoping/branch consumers across the highlighter generators that must not see these): grammar.ledPrecs anchors a led's connector to a ladder operator — sameAs borrows its lbp, below sits one notch under (levels are spaced 2 apart), chainRhs parses the trailing self-operand at that lbp (left-chaining) instead of as a full expression. ecmaPrec ships ?:(below '??'), in/instanceof (sameAs '<', chained); TypeScript adds as/satisfies (sameAs '<'). Both engines resolve identical numbers: the interpreter gates the led and special-cases the chain rhs via parsePratt(rule, rhsBp); the emitter bakes the lbp into the led conds and emits a custom chain arm calling R_<rule>_pratt(rhsBp). The no-in suppress machinery is untouched (the gate is additive on the same led path). Verified: ts-ast-verify 30/30 snippets AND parserindenter.ts (3,168 nodes) node-identical to tsc — now wired into check.ts as the parser↔tsc STRUCTURE conformance gate (28 gates); 18,805-file interp≡emit; reject messages identical; conformance accept-rate byte-equal to baseline (5386/5659 — the fix is structure-only).
|
Round 7 — the dogfood paid off twice over:
|
Bug #2 - statement-position function merged with a following call: longest- match let the expression arm win whenever a member/call tail made it LONGER ("function f(){}" + newline + "(g)()" became ONE IIFE-style expression statement; tsc keeps a declaration + a separate statement). Fix: the ES2023 14.5 ExpressionStatement lookahead restriction - the expression arm may not begin with function / async function. "class" is deliberately NOT guarded yet: the class-DECLARATION arm is narrower than tsc's (extends-expression heritage, bare ';' class elements, decorator placements), so 31 tsc-valid corpus files still rely on the class-EXPRESSION fallback - widening the declaration arm is the named prerequisite. The guard flips exactly 3 corpus files to reject, all tsc-INVALID (template-typed params, "function* gen" without parens): FN=0 held. Rest parameters also gained their checker-not-parser tail ("...b = init", "...b?: T") so the declaration arm carries what the expression fallback used to. Bug #3 (pre-existing, all builds) - bare for-in heads: ForHead's no-declaration arm parsed its target Expr WITHOUT the no-"in" exclusion, so "for (key in obj)" swallowed the "in" inside an in-LED, the arm failed, and the statement fell back to a CALL parse "for(...)" with "for" as an identifier. Same exclude as binding initializers on the target. The interpreter's maxPos moves to advance-based tracking (mirroring the emitted engine's relocation) - the new reject inputs exposed that the two engines' farthest-position semantics had silently diverged; reject messages are engine-identical again (468 files). ts-ast-verify gains the valid-only contract: files where tsc itself reports parse errors are SKIPPED (error-recovery shapes are each parser's own policy, not the grammar contract) - parserharness/parserRealSource7 fall under it; the lowering still gained their missing shapes (NewTarget index form, old-style type assertion). Gate corpus widened to fixSignatureCaching (2,952 nodes) + parserindenter (3,168), both node-identical to tsc. 28/28 gates; 18,805-file interp=emit; conformance 5383/5659 (baseline minus the 3 correct rejects).
src/gen-cst-match.ts is the VALUE-level sibling of gen-ast-types: for every
rule it emits a typed result union ({ arm: 'if_', expr, stmt, stmt2? } | ...)
and a match<Rule>(node, src) that re-derives which grammar alternative a node
matched and binds its children to named fields - the discrimination the parser
performed and the CST does not record. Each alternative compiles to a step
plan (lit / litAlt-capture / tok / node / opt / many / sep / branches) that
renders both the type and a cursor-based unifier; the unifier mirrors the
engine's matcher semantics exactly: tokenType-exact literal checks (the
$keyword-vs-Ident tie facts), the interpolated-template dual (a template token
ref accepts a Template leaf OR a '$template' node), pratt operator forms
(binaryOp/prefixOp/postfixOp synthesized beside led/nud arms), sep()'s
consumed trailing delimiter, and greedy no-backtracking quantifiers (children
always reflect the greedy success path, so local greedy decisions reproduce
the parse). Pure-literal alternations capture the matched text as a
string-literal-union field (alt('let','const','var') becomes
let_Kw: 'let' | 'const' | 'var').
Wired into the gen pipeline: npm run gen emits <grammar>.cst-match.ts for all
seven grammars. New gate test/cst-match-totality.ts (in check.ts, 29 gates):
every node of every generated-corpus CST plus a TS conformance stride sample
must destructure through its rule's matcher with full child consumption -
32,336 nodes, 0 misses on first run.
ts-ast-lowering's statement layer now CONSUMES the generated matcher
(switch on m.arm replaces the hand-probing that PAIN 3/5/7 documented), and
picked up break/continue labels and with-statements in the rewrite;
ts-ast-structure stays node-identical to tsc (32/32).
Cost today (maximal workload - destructuring EVERY node, naive ordered arm
try): +64% over the CST pass; the full tokens->CST->AST pipeline measures
0.96x vs tsc createSourceFile on the four bench files. Known v2 lever: the
dispatcher tries arms in declaration order, so late arms (pratt binaryOp) pay
~20 failed tries - a first-child discriminator switch will cut most nodes to
1-2 tries.
The expression-statement lookahead guard, the bare for-in exclude and the rest-parameter checker-tail flow into the generated tree-sitter DSL; CI's artifact-sync step caught that they were not regenerated with b814d96.
The v1 dispatcher tried arms in declaration order, so the late pratt op forms paid ~20 failed unifications per expression node. v2 derives each arm's FIRST-CHILD admission keys from its step plan (node rule / leaf tokenType / literal first charCode, with nullable-first arms admitting everything) and dispatches through nested switches: node child -> rule bucket, leaf -> tokenType bucket ($keyword/$punct sub-switched on the first character). Big node-rule buckets (a pratt rule's self bucket holds every led + op form) sub-dispatch one level deeper on c[1] - the connector position. Buckets are SUPERSET filters (every arm fn still verifies exactly) and preserve declaration order internally, so tie semantics are unchanged. Destructure-EVERY-node overhead over the bare CST pass: +64% (v1) -> +39% (one level) -> +31% (two levels); totality unchanged (32,336 nodes, 0 misses across all seven grammars + the TS corpus sample) and the tsc-structural oracle stays node-identical (32/32).
…e truth A longest-match TIE between one-token alternatives goes to the first-listed one. With the bare-identifier nud listed first, "this"/"true"/"false"/"null"/ "undefined"/"super" in expression position were stamped as Ident leaves and every consumer had to re-classify them by text (the dogfood's PAIN 19). Listing the literal alternatives first flips the tie: the leaves arrive as $keyword and the tree records what the word IS. Accepts are unchanged (conformance 5383/5659, byte-equal), interp=emit holds on the full corpus, reject messages identical, and the tsc-structural oracle stays node-identical - the lowering's text re-classification block is deleted rather than moved.
Two generator gaps the dogfood surfaced:
1. A $keyword literal LEADING an optional group is now captured as <kw>Tok
(CstLeaf) - the group's presence marker and position anchor in one field.
The try-arm consumer drops its findText('catch') children scan: catchTok /
finallyTok / elseTok arrive typed, with spans.
2. Mixed STRUCTURAL alternations no longer flatten to a soup of optionals:
each branch compiles to a tagged mini-plan with its own captures, and the
parent gets one field holding a per-branch sub-union -
alt?: { branch: 'param'; param } | { branch: 'bindingPattern'; ... } -
so consumers switch on m.alt.branch instead of probing presence. Branch
capture slots are rename-prefixed per attempt (nested alternations keep
their slots distinct); the result-object/type keys are frozen at plan time
(Capture.field), so renaming never leaks into the API. Generated helper
names moved to a __-prefix namespace after a grammar keyword ('is' in
TypeScript's `x is T`) minted an isTok capture that shadowed the helper.
Totality holds across all seven grammars (0 misses); the tsc-structural
oracle stays node-identical with two new battery cases (try-finally-only,
catch with a destructuring pattern).
…ass'
The round-8 prerequisite, driven by the 29-file work spec: class DECLARATIONS
now cover what tsc parses cleanly (legality being checker territory), so the
ES2023 14.5 statement lookahead can finally include 'class' - statement-
position class merges (class C {}\n(g)() as one expression) are gone the same
way the function ones were.
Grammar (both languages, TS shown):
- Heritage clauses are REPEATABLE, order-free and multi-typed:
many(alt(['extends', sep(elem, ',')], ['implements', sep(elem, ',')])) -
tsc parses "extends A extends B", "implements A extends B", "extends A, B"
and even a bare "extends {" (zero elements; sep() matches empty, leaving the
brace for the body) with NO diagnostics.
- A heritage ELEMENT is guarded against the clause keywords (not(alt(
'extends', 'implements'))): they are contextual words the bare-Ident base
would otherwise swallow, breaking clause chaining.
- implements elements are heritage EXPRESSIONS, not Types - tsc parses
"implements A?.B" cleanly; ClassHeritage also gained the '?.' member led
and the non-constructor primaries (numbers, strings, true/false/null/
undefined) that classExtendingNonConstructor exercises.
- ClassMember accepts the bare ';' element (tsc's SemicolonClassElement).
- Decorators may precede ANY declaration ([many1(DecoratorExpr), $] - many1
keeps the self-ref non-left-recursive), covering "@dec export default
class {}".
The widening also fixed 7 PRE-EXISTING false negatives: total tsc-clean
rejects drop 24 -> 17, conformance holds at 5383/5659, and all 29 of the
guard-caused regression set parse with the guard ON.
gen-tm's dormant optional-chain mint is deleted: its direct-ref shape test
never matched any grammar (the expression '?.' led targets an alt()), and
when the new heritage led first woke it, the minted entity.other.property
scope contradicted the ledger-pinned a?.b contract (variable.other, matching
the official grammar). First activation of dead code = its first review.
interp=emit on the full corpus, reject messages identical, ts-ast oracle
node-identical, 29/29 gates.
…> 0)
Every remaining FN was a construct tsc parses cleanly (validity deferred to
its checker) that the grammar rejected. Widen to the proven parse surface,
verified per construct against tsc parseDiagnostics:
- JSDoc types in normal TS type positions: postfix `T?` / `T!`, prefix `?T` /
`!T`, bare `?`, `*`, `function(this: T, string): U`, and dotted type
arguments `Array.<number>`. The postfix `?` mirrors tsc's isStartOfType
disambiguation via `not(alt('new', $))` (conditional types and `as T ? a : b`
ternaries keep parsing) and requires same-line (tsc rejects `T\n?`).
- Decorator expressions: optional-chain tails (`@x?.y`, `@x?.()`, `@x?.[i]`),
tagged templates (`@x``...```), and `@new x` — the lexer maximal-munches
`@new` into one Decorator token, so isKeywordLiteral now classes `@`-headed
word literals as keyword-class (text-matched against named tokens; both
engines share the mechanism, the emitted lexer's kw-intern tables included).
- Decorator placement: `@dec var/let/const x` and `@dec using x` (using
requires a real binding: `using 1` stays a reject), `export @dec default
class`.
- Object literals: full modifier soup (`{ static m() {} }`, `{ export p: 1 }`)
and `?` / `!` after any member name (`{ a! }`, `{ a?() {} }`), per tsc's
parseObjectLiteralElement; `const`/`default` stay out (tsc parse errors).
- Parameters: full modifier soup (`override`/`static`/`export`/`async`/...,
the set probed parse-clean one keyword at a time).
- Object binding rest: `...r: name` and `...r = init`.
- Nameless class declarations (`class { }` at statement level).
- Import/export specifiers: `import { "str" as x }`; export side gets its own
ExportSpecifier (ModuleExportName on both sides, bare strings allowed).
- `typeof import("m").Thing` (ImportTypeNode as a type-query target).
- Async arrows with a bare parameter: `async err => ...` was previously
mis-split into two statements (`async` + ASI); new same-line arm in both
grammars (ES2017), with AsyncKeyword-aware lowering and two new structural
oracle cases (lowered AST stays node-by-node = tsc).
javascript.ts mirrors only the ES-true widenings (async bare arrow, ES2022
string specifiers); tsc-lenient recovery surfaces stay TS-only.
gen-tm: a type-context keyword whose every @type occurrence is followed by
`(` (the JSDoc function type) no longer leaves the statement-start set, so
multi-line type regions still close at a real `function f() {}` declaration
(fixes the issue #1043 regression the new arm introduced).
Conformance: FN 17 -> 0 (corpus-wide; tsc-parse-clean files we reject: none),
both-accept 5095 -> 5112, over-accept 288 -> 289 (the one new case is
exportModifier.2's `abstract @dec class` same-line split — the pre-existing
newline-blind-ASI class, now the dominant remaining over-accept dimension).
29/29 gates green.
…rface CI's tree-sitter conflict gate (tree-sitter generate, not in the local gate chain) was red on two counts: the class-declaration widening of the previous commit (class_member) and this round's object-literal / export-specifier / decorator / rest-binding / JSDoc-function-type arms. collect-conflicts fixpoint adds 9 tuples; all seven derived grammars generate cleanly again.
The PR diff was 89% generated *.cst-match.ts (41k of 46k lines), drowning the hand-written changes. Mark every npm-run-gen artifact linguist-generated so GitHub collapses them, and add '*.cst-match.ts' to the CI artifact-sync gate — it was the one committed artifact class the drift check didn't cover (the totality gate only catches staleness that breaks unification, not silent drift).
…d in CI They are pure consumer artifacts derived from the grammar (41k of the PR's 46k added lines); the grammar sources are the truth. Gitignore them, drop them from the artifact-sync diff and from .gitattributes, and move the gen step ahead of Typecheck in CI — the typecheck and the gates import the generated files, so they must exist before either runs.
altMaskDispatch/ftMaskDispatch returned null past 32 alternatives, silently dropping a rule to serial per-alt membership guards. The FN=0 widening pushed BOTH hottest rules over the edge (R_Type_lr to 33 alts, R_Expr_pratt's nud list past 32) and the whole parse paid ~25-27% — a flat profile, because the cost was diffuse admit-checks, not a hotspot. Masks now span words (alt i -> word i>>5, bit i&31; one table pair + one local per word), so dispatch degrades smoothly with grammar growth instead of cliffing; single-word rules emit byte-identical code to before. A/B: +37.8% aggregate over the cliffed build; net cost of the entire parse- surface widening vs its pre-widening baseline is now -2.7%. Prune decisions are unchanged (mask = the same FIRST/SECOND predicate as the serial guards): 18,805-file emit=interp byte-identical, 436 reject messages exact, 29/29.
… labels test/profile-vs-peers.mjs benchmarks the emitted javascript.ts against acorn (the ESTree reference) on real-world files from node_modules — acorn parsing itself, vue compiler-core, parse5's parser, and the 8.9MB typescript.js — and the emitted html.ts against parse5 (the WHATWG reference) on synthesized well-formed fragments, same warmup/min-of-rounds methodology as profile-vs-tsc.mjs. JS lands at parity (0.85-1.1x per file across runs); HTML at ~2.2x, the named gap being that markup-mode lexing is still the interpreted lexer (emit-lexer specialization doesn't cover markup yet). Benching acorn.js exposed a real grammar bug in both ECMAScript grammars: break/continue labels missed the spec's restricted production (`break [no LineTerminator here] Label`) and the reserved-word guard, so `break` newline `case "X":` inside a switch ate `case` as the label and the whole switch cascaded into a reject. Both arms now read `opt(sameLine, notReserved, Ident)`. Conformance matrix unchanged (5112/258/0/289, FN still 0); 29/29 gates; tree-sitter closure complete.
Four workstreams on the #8 perf line, each gated end-to-end. Net state: the emitted parser parses the bench corpus at 0.84× tsc's parse-only time (0.64× vs
setParentNodes: true), the CST contract is span-only minimal, acceptance vs tsc's real parse surface has zero false negatives corpus-wide, and consumers destructure CST nodes through generated, typed per-arm matchers instead of child-shape probing.Performance: 4.41× → 0.84× vs tsc parse-only
Profiled with V8 inspector profiles (layer split, line ticks, a
--no-turbo-inliningpass for honest attribution), one gated commit per lever. The line profile is now flat — top line 3.8% — i.e. the JS-op level is harvested; the remaining gap layers are GC (14%, the kept CST itself) and the lexer floor, both representation/substrate questions, not op-shaving.Lever groups (full list in the commit log):
lexKwTbaked keyword recognizer over source spans (no slice, no hash); tokens born final (k/t interned at creation); a dedicated emitted lexer (src/emit-lexer.ts).tkK/tkTbytes,tkOff/tkEndInt32,tkFlbits) written by the emitted tokenize; token text is never materialized — keyword matching reads source charCodes,peek()is gone,'>'-resplit is acopyWithin.TypeandExprcrossed it, costing ~25% whole-parse with a perfectly flat profile); masks now span words, so dispatch degrades smoothly with grammar growth.maxPos.Peers: the reference parser of each ecosystem
test/profile-vs-peers.mjs, same methodology. JavaScript vs acorn (the ESTree reference) on real-world files — acorn parsing itself, vue compiler-core, parse5's parser, the 8.9MBtypescript.jsbundle — lands at parity (0.85–1.1× per file across runs; the 9MB row swings ±10% run-to-run):HTML vs parse5 (the WHATWG reference) on synthesized well-formed fragments: ~2.2× — the named gap is that markup-mode lexing still runs the interpreted lexer (emit-lexer specialization covers the token-stream grammars, not markup mode yet). Output shapes differ by design (full CST with every token as a leaf vs ESTree AST vs DOM-shaped tree), so both tables are parse-to-tree wall time on identical inputs.
Benching acorn.js exposed a real grammar bug:
break/continuelabels missed the spec's restricted production (break [no LineTerminator here] Label) and the reserved-word guard, sobreak⏎case "X":inside a switch atecaseas a label and cascaded into a reject. Both ECMAScript grammars now guard the label with(sameLine, notReserved).Measured-flat-or-negative and reverted, for the record: persistent memo arrays, string quote-loop, switch-direct dispatch, probe-then-build (rejected by design analysis — memo shares loser-arm subtrees with the winner, so the discard waste is only 3-5%).
CST contract: span-only, two shapes, nothing derivable
textandkindare gone (both derivable; both measured as pure overhead — dropping them was +7% and +5.5%).getText(node, source)is exported by both engines.test/cst-text-invariant.tspins the exact shape over all seven grammars (generative corpora + a TS corpus stride).Acceptance: FN = 0 against tsc's parse surface
Measured bidirectionally over the full conformance corpus (5,659 files, monogram verdict × tsc
parseDiagnostics): both-accept 5,112 · both-reject 258 · false negatives 0 · over-accepts 289. Every construct tsc parses cleanly now parses — each residual FN was probed against tsc and widened to the proven surface: JSDoc types in normal TS positions (T?/T!/?T/!T/*/function(...)/Array.<number>, with tsc's exactisStartOfTypedisambiguation on postfix?), decorator expressions (@x?.y, tagged templates,@new x), decorator placement (@dec var x,export @dec default class), object-literal and parameter modifier soups,?/!after object member names, rest-binding...r: n, nameless class declarations, string import/export specifiers,typeof import("m").Thing, and async arrows with a bare parameter (async err => …previously mis-split into two statements — a structural bug that accept-level metrics could not see).Three more structure bugs surfaced by the new structural oracle and fixed: mixfix-LED precedence (
grammar.ledPrecs, an orthogonal precedence field so seven highlighter consumers never see it), the ES2023 §14.5 expression-statement lookahead (function/async function/class), and barefor (x in y)(theexclude('in', …)no-incontext).The dominant remaining over-accept class is newline-blind ASI (same-line statement splits tsc rejects) — a future round of its own.
Generated consumer toolkit
npm run genemits, per grammar,<g>.cst-types.ts(typed node unions) and<g>.cst-match.ts(per-arm destructurers):matchStmt(n, src)returns a tagged{ arm, …named fields }union, so consumers writeswitch (m.arm)instead of probing children. Matcher semantics mirror the engine exactly (literal token kinds, sep trailing delimiters, template duals, op forms) andtest/cst-match-totality.tsproves totality over real CSTs. Both are generated, not committed — CI regenerates before typecheck.The dogfood consumer is
test/ts-ast-lowering.ts, a tsc-shaped AST lowering verified node-by-node (kind + trivia-excluded spans) against the real tsc tree bytest/ts-ast-verify.ts— the gate that caught the precedence and statement-merge bugs above.Gates
29 in
test/check.ts(was 26): +cst-text-invariant,ts-ast-structure,cst-match-totality; plus the engine-parity pins outside the runner:emit-parser-verify(18,805-file emit ≡ interp byte-identical),emit-reject-messages(436 both-reject files, exact farthest-pos message — an arm pruned by one engine but not the other skews error state),emit-lexer-verify(token-stream equality). The tree-sitter LR-conflict closure is complete for the widened grammars (test/collect-conflicts.tsfixpoint), and gen-tm keeps paren-gated type keywords (function(in type position) in the statement-start set so multi-line type regions still close at a real declaration.