Emitted parser to 0.84x tsc, span-only CST, FN=0 vs tsc's parse surface, generated per-arm destructurers by johnsoncodehk · Pull Request #35 · johnsoncodehk/monogram

johnsoncodehk · 2026-06-09T18:12:08Z

Four workstreams on the #8 perf line, each gated end-to-end. Net state: the emitted parser parses the bench corpus at 0.84× tsc's parse-only time (0.64× vs setParentNodes: true), the CST contract is span-only minimal, acceptance vs tsc's real parse surface has zero false negatives corpus-wide, and consumers destructure CST nodes through generated, typed per-arm matchers instead of child-shape probing.

Performance: 4.41× → 0.84× vs tsc parse-only

file	KB	mono (ms)	tsc parse (ms)	tsc setParentNodes (ms)	ratio	ratio (parents)
parserharness.ts	82	3.93	3.78	4.95	1.04×	0.80×
fixSignatureCaching.ts	67	0.72	1.51	1.88	0.48×	0.38×
parserRealSource7.ts	38	1.11	1.51	2.11	0.73×	0.52×
parserindenter.ts	35	0.89	1.07	1.48	0.82×	0.60×
aggregate		6.65	7.87	10.41	0.84×	0.64×

Profiled with V8 inspector profiles (layer split, line ticks, a --no-turbo-inlining pass for honest attribution), one gated commit per lever. The line profile is now flat — top line 3.8% — i.e. the JS-op level is harvested; the remaining gap layers are GC (14%, the kept CST itself) and the lexer floor, both representation/substrate questions, not op-shaving.

Lever groups (full list in the commit log):

Lexer: per-charCode dispatch pruning; identifier char-loop; lexKwT baked keyword recognizer over source spans (no slice, no hash); tokens born final (k/t interned at creation); a dedicated emitted lexer (src/emit-lexer.ts).
SoA token stream: tokens are five parallel columns (tkK/tkT bytes, tkOff/tkEnd Int32, tkFl bits) written by the emitted tokenize; token text is never materialized — keyword matching reads source charCodes, peek() is gone, '>'-resplit is a copyWithin.
Dispatch: FIRST membership as byte tables; per-rule alt bitmasks with SECOND-token refinement; LED first-token gates open-coded; Pratt operator tables keyed by interned literal ints; multi-word masks. The old 32-alt ceiling silently dropped a rule to serial guards when a grammar widening pushed it past 32 alternatives (both Type and Expr crossed it, costing ~25% whole-parse with a perfectly flat profile). Masks now span words, so dispatch degrades smoothly with grammar growth.
Memo: per-rule parallel position arrays; advance-site maxPos.
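The multi-word mask point in the Dispatch entry above can be sketched as follows. This is a minimal illustration, not the emitted code: the helper names (makeMask, setAlt, hasAlt) are hypothetical, and the real emitter bakes such masks into per-token tables rather than building them at runtime.

```typescript
// One bit per alternative. Bit i lives in word (i >>> 5) at bit position (i & 31),
// so a rule with more than 32 alternatives simply gets a second word.
function makeMask(altCount: number): Int32Array {
  return new Int32Array((altCount + 31) >>> 5);
}

function setAlt(mask: Int32Array, alt: number): void {
  mask[alt >>> 5] |= 1 << (alt & 31);
}

function hasAlt(mask: Int32Array, alt: number): boolean {
  return (mask[alt >>> 5] & (1 << (alt & 31))) !== 0;
}

// 40 alternatives need two 32-bit words.
const wide = makeMask(40);
for (const alt of [0, 31, 32, 39]) setAlt(wide, alt);
```

With a single 32-bit word, alternatives 32 and 39 would have no bit to live in; here they land in the second word and the guard stays one load plus one bit test.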

Peers: the reference parser of each ecosystem

test/profile-vs-peers.mjs, same methodology. JavaScript vs acorn (the ESTree reference) on real-world files — acorn parsing itself, vue compiler-core, parse5's parser, the 8.9MB typescript.js bundle — lands at parity (0.85–1.1× per file across runs; the 9MB row swings ±10% run-to-run):

file	KB	mono (ms)	acorn (ms)	ratio
acorn.js	237	11.01	11.74	0.92×
compiler-core.cjs.js	200	9.45	10.84	0.87×
parse5 parser/index.js	105	4.78	4.37	1.09×
typescript.js	8,899	611	567	1.08×

HTML vs parse5 (the WHATWG reference) on synthesized well-formed fragments: ~2.2× — the named gap is that markup-mode lexing still runs the interpreted lexer (emit-lexer specialization covers the token-stream grammars, not markup mode yet). Output shapes differ by design (full CST with every token as a leaf vs ESTree AST vs DOM-shaped tree), so both tables are parse-to-tree wall time on identical inputs.

Benching acorn.js exposed a real grammar bug: break/continue labels missed the spec's restricted production (break [no LineTerminator here] Label) and the reserved-word guard, so break ⏎ case "X": inside a switch ate case as a label and cascaded into a reject. Both ECMAScript grammars now guard the label with (sameLine, notReserved).

Measured-flat-or-negative and reverted, for the record: persistent memo arrays, string quote-loop, switch-direct dispatch, probe-then-build (rejected by design analysis — memo shares loser-arm subtrees with the winner, so the discard waste is only 3-5%).

CST contract: span-only, two shapes, nothing derivable

leaf: { tokenType, offset, end }            // text  ≡ source.slice(offset, end)
node: { rule, children, offset, end }       // leaf? ≡ 'tokenType' in n / 'children' in n

text and kind are gone (both derivable; both measured as pure overhead — dropping them was +7% and +5.5%). getText(node, source) is exported by both engines. test/cst-text-invariant.ts pins the exact shape over all seven grammars (generative corpora + a TS corpus stride).
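A minimal sketch of what consuming the span-only contract might look like. The type names (CstLeaf, CstNode, Cst), the rule and token names in the sample tree, and this body of getText are assumptions for illustration; the PR only fixes the two field sets and the getText(node, source) signature.

```typescript
// Hypothetical type names; the field sets are the ones stated in the contract.
interface CstLeaf { tokenType: string; offset: number; end: number }
interface CstNode { rule: string; children: Cst[]; offset: number; end: number }
type Cst = CstLeaf | CstNode;

// Text is derived from the source span, never stored on the tree.
function getText(n: Cst, source: string): string {
  return source.slice(n.offset, n.end);
}

// Property presence is the only discriminator between the two shapes.
function collectLeaves(n: Cst, out: CstLeaf[] = []): CstLeaf[] {
  if ('tokenType' in n) out.push(n);
  else for (const c of n.children) collectLeaves(c, out);
  return out;
}

const source = 'a + b';
const tree: CstNode = {
  rule: 'Expr', offset: 0, end: 5,
  children: [
    { tokenType: 'Ident', offset: 0, end: 1 },
    { tokenType: '$operator', offset: 2, end: 3 },
    { tokenType: 'Ident', offset: 4, end: 5 },
  ],
};
const leafTexts = collectLeaves(tree).map(l => getText(l, source));
```

The consumer has to carry the source string alongside the tree, which is the trade the contract makes for smaller leaves.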

Acceptance: FN = 0 against tsc's parse surface

Measured bidirectionally over the full conformance corpus (5,659 files, monogram verdict × tsc parseDiagnostics): both-accept 5,112 · both-reject 258 · false negatives 0 · over-accepts 289. Every construct tsc parses cleanly now parses — each residual FN was probed against tsc and widened to the proven surface: JSDoc types in normal TS positions (T?/T!/?T/!T/*/function(...)/Array.<number>, with tsc's exact isStartOfType disambiguation on postfix ?), decorator expressions (@x?.y, tagged templates, @new x), decorator placement (@dec var x, export @dec default class), object-literal and parameter modifier soups, ?/! after object member names, rest-binding ...r: n, nameless class declarations, string import/export specifiers, typeof import("m").Thing, and async arrows with a bare parameter (async err => … previously mis-split into two statements — a structural bug that accept-level metrics could not see).

Three more structure bugs surfaced by the new structural oracle and fixed: mixfix-LED precedence (grammar.ledPrecs, an orthogonal precedence field so seven highlighter consumers never see it), the ES2023 §14.5 expression-statement lookahead (function/async function/class), and bare for (x in y) (the exclude('in', …) no-in context).

The dominant remaining over-accept class is newline-blind ASI (same-line statement splits tsc rejects) — a future round of its own.

Generated consumer toolkit

npm run gen emits, per grammar, <g>.cst-types.ts (typed node unions) and <g>.cst-match.ts (per-arm destructurers): matchStmt(n, src) returns a tagged { arm, …named fields } union, so consumers write switch (m.arm) instead of probing children. Matcher semantics mirror the engine exactly (literal token kinds, sep trailing delimiters, template duals, op forms) and test/cst-match-totality.ts proves totality over real CSTs. Both are generated, not committed — CI regenerates before typecheck.
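To make the tagged-union idea concrete, here is a hand-written stand-in for one generated matcher over a toy statement rule. Everything in it is hypothetical: the arm names, the field names, and the matching logic are invented for the sketch, and the real <g>.cst-match.ts output is derived from the grammar rather than written by hand.

```typescript
interface CstLeaf { tokenType: string; offset: number; end: number }
interface CstNode { rule: string; children: Cst[]; offset: number; end: number }
type Cst = CstLeaf | CstNode;

// Hypothetical per-arm result type: one tagged member per grammar alternative.
type StmtMatch =
  | { arm: 'return'; keyword: CstLeaf; value: Cst | undefined }
  | { arm: 'expr'; expr: Cst };

// Stand-in for a generated matcher: all child-shape probing is kept in here.
function matchStmt(n: CstNode, src: string): StmtMatch {
  const head = n.children[0];
  if (head === undefined) throw new Error('empty Stmt node');
  if ('tokenType' in head && src.slice(head.offset, head.end) === 'return') {
    return { arm: 'return', keyword: head, value: n.children[1] };
  }
  return { arm: 'expr', expr: head };
}

// A consumer switches on the tag and reads named fields only.
function describeStmt(n: CstNode, src: string): string {
  const m = matchStmt(n, src);
  if (m.arm === 'return') {
    return m.value ? 'return ' + src.slice(m.value.offset, m.value.end) : 'return';
  }
  return 'expr ' + src.slice(m.expr.offset, m.expr.end);
}

const retSrc = 'return x';
const retNode: CstNode = {
  rule: 'Stmt', offset: 0, end: 8,
  children: [
    { tokenType: 'Keyword', offset: 0, end: 6 },
    { tokenType: 'Ident', offset: 7, end: 8 },
  ],
};
const callSrc = 'f()';
const callNode: CstNode = {
  rule: 'Stmt', offset: 0, end: 3,
  children: [{ rule: 'Call', children: [], offset: 0, end: 3 }],
};
```

If the grammar reorders or widens an arm, only the matcher changes; the consumer keeps reading m.value and m.expr.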

The dogfood consumer is test/ts-ast-lowering.ts, a tsc-shaped AST lowering verified node-by-node (kind + trivia-excluded spans) against the real tsc tree by test/ts-ast-verify.ts — the gate that caught the precedence and statement-merge bugs above.

Gates

29 in test/check.ts (was 26): + cst-text-invariant, ts-ast-structure, cst-match-totality; plus the engine-parity pins outside the runner: emit-parser-verify (18,805-file emit ≡ interp byte-identical), emit-reject-messages (436 both-reject files, exact farthest-pos message — an arm pruned by one engine but not the other skews error state), emit-lexer-verify (token-stream equality). The tree-sitter LR-conflict closure is complete for the widened grammars (test/collect-conflicts.ts fixpoint), and gen-tm keeps paren-gated type keywords (function( in type position) in the statement-start set so multi-line type regions still close at a real declaration.

A standalone, debug-only tracer: startTrace()/endTrace() markers (near-free no-ops in production, src/trace-markers.ts) bound a region; test/exec-trace.ts AST-instruments a copy of the target (TS compiler API — no new deps) so every executed statement/initializer/return/condition records its line, source text, runtime value and a cost tag {alloc|map.get|call}, gated on the markers. The flattened executed source is printed with calls inlined and repeated call-frames grouped by DISTINCT execution path (each path once + count + per-line value distribution), so a hot function's N calls read as a few paths, not N×. Scopes: the time-region markers, or --lines=A-B for a spatial slice. Reads the actual run, so it reflects the real engine, not an approximation.

A negative lookahead over a literal/alternation of keyword literals (e.g. an identifier that isn't a reserved word, `not('catch'|'delete'|…)`) was matched by trying each literal in turn — O(N) matchLiteral calls plus an `out` allocation per arm, at every identifier position (the hottest in the grammar). The keyword set is static, so collapse it to one membership test: the not fails iff the token is an ident-kind whose text is one of the keywords. Mirrored in createParser (gen-parser.ts notKwSet) and emitParser (emit-parser.ts notKwKinds), emitting the same check matchKwLit uses → byte-identical. Emitted parser ~5% faster (parserharness 19.7→18.7 ms), interpreter ~9–12%; the reserved-word scan for `a + b` drops from 12 matchLiteral calls to 0. Byte-identical CST + accept/reject (createParser ≡ emitParser, run-conformance 5386/5659, 26/26 gates).
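The collapse described above can be illustrated with a small before/after pair. This is a sketch under simplifying assumptions: tokens carry their text (the real engines compare interned ints), the keyword list is a made-up subset, and notReservedSlow/notReservedFast are hypothetical names rather than the notKwSet/notKwKinds code itself.

```typescript
interface Tok { kind: 'ident' | 'punct' | 'number'; text: string }

// A made-up subset of the reserved words; the real set comes from the grammar.
const RESERVED = ['catch', 'delete', 'if', 'var'];

// Before: one literal probe per keyword at every identifier position, O(N).
function notReservedSlow(tok: Tok): boolean {
  for (const kw of RESERVED) {
    if (tok.kind === 'ident' && tok.text === kw) return false;
  }
  return true;
}

// After: the keyword set is static, so it is built once and the guard becomes
// a single membership test. The not fails iff the token is ident-kind and its
// text is one of the keywords.
const RESERVED_SET = new Set<string>(RESERVED);
function notReservedFast(tok: Tok): boolean {
  return !(tok.kind === 'ident' && RESERVED_SET.has(tok.text));
}
```

Both functions accept and reject exactly the same tokens, which is the property the byte-identical gate checks on the real engines.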

…urn directly

Two emit-only rewrites (the interpreter is the oracle; emit≡interp is gated byte-identical on the full 18,805-file corpus):
- descMatchesTok's switch body is inlined into ruleMightStartDescs and canStartFT, removing the dead helper and the call boundary on the per-token dispatch loops.
- Matchers reducible to a single literal/token-ref/rule-ref emit 'const v = match(); return [v]' directly instead of out=[]+push (96 of the TS grammar's matchers specialize).

Measured ~1-2% on the PR#4 bench files (parserharness +2.3%).

…loop

The longest-match alt guards and rule-ref guards tested FIRST-set membership by looping a descriptor array per call (ruleMightStartDescs); the CPU profile showed it at ~13% self time. Each set now emits a per-set fn over two Uint8Array tables indexed by the token's baked ints:

    _qN(tok) = !tok || (KT[tok.k] | TT[tok.t]) !== 0

Two loads + an or, no loop. Faithful to the loop because the keyword and punct int ranges are disjoint (the loop's k!==K_PUNCT / k===K_PUNCT guards were redundant) and the punct startsWith arm is enumerated over the closed punct vocabulary at emit time. Tables and fns are deduped across sets (typescript: 103 fns / 103 shared arrays).

Gates: full 18,805-file corpus byte-identical, conformance, 26/26 check. Bench (PR#4 files, interleaved best-of-N): aggregate +10.6~12.0%, every file positive (+5~14%).
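A runnable sketch of the two-table membership test described above. The int assignments (K_IDENT, T_LPAREN, and so on), the table sizes, and the makeMembership helper are assumptions for the example; the emitted parser bakes the tables as literals instead of filling them at runtime.

```typescript
interface Tok { k: number; t: number }

// Hypothetical int spaces: k is the token kind, t the interned literal id (0 = none).
const K_IDENT = 1, K_NUMBER = 2, K_PUNCT = 3;
const T_NONE = 0, T_LPAREN = 1, T_MINUS = 2, T_RBRACE = 3;

// Build one membership fn per FIRST set: a byte table over kinds, one over literals.
function makeMembership(kinds: number[], lits: number[], kSize: number, tSize: number) {
  const KT = new Uint8Array(kSize);
  const TT = new Uint8Array(tSize);
  for (const k of kinds) KT[k] = 1;
  for (const t of lits) TT[t] = 1;
  // Two loads and an or; a missing token (EOF) is admitted, as in the formula above.
  return (tok: Tok | undefined): boolean => !tok || (KT[tok.k] | TT[tok.t]) !== 0;
}

// Toy FIRST(Expr): any identifier, any number, '(' or '-'.
const canStartExpr = makeMembership([K_IDENT, K_NUMBER], [T_LPAREN, T_MINUS], 4, 4);

const identTok: Tok = { k: K_IDENT, t: T_NONE };
const lparenTok: Tok = { k: K_PUNCT, t: T_LPAREN };
const rbraceTok: Tok = { k: K_PUNCT, t: T_RBRACE };
```

The punct kind never sets a KT byte and non-literal tokens carry t = 0 with TT[0] clear, so the two tables cannot admit a token by accident.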

The CPU profile (vs tsc's scanner) showed the lexer's cost is not the token regexes but the data-driven dispatch around them. Three prunings, all grammar-agnostic:
- punct literals: first-char group index. Only literals sharing the position's first char can startsWith-match, so the longest-first scan runs over that group instead of all literals (TS: 57 -> ~3).
- token matchers: the template/markup/indent exclusions are position-independent, so the matcher list is filtered once at createLexer time instead of re-testing three Set.has per matcher per position.
- whitespace: consume the ASCII \s run ({9..13, 32}) with a char loop; only a non-ASCII candidate falls back to the \s regex (Unicode spaces), preserving exact \s semantics and the includes('\n') -> newlineBefore stamping.

Gates: token-stream equality vs the old lexer over the 5,695-file conformance corpus (5,624 identical streams + 71 identical-message throws, 0 diffs), conformance 5386/5659 unchanged, 26/26 check (all grammars). Bench (PR#4 files): tokenize +87~95%, whole parse +35~46%.
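The first pruning, the first-char group index for punct literals, might look like the following. The literal list is a small made-up subset and matchPunct is a hypothetical name; the point is only that the longest-first scan runs over one group instead of every literal.

```typescript
// A made-up subset of punct literals, in no particular order.
const PUNCTS = ['>>>=', '>>>', '>>=', '>>', '>=', '>', '=>', '===', '==', '=', '+', '++', '+='];

// Group literals by first charCode; inside a group, longest first.
const groups = new Map<number, string[]>();
for (const p of PUNCTS) {
  const c = p.charCodeAt(0);
  const g = groups.get(c);
  if (g) g.push(p); else groups.set(c, [p]);
}
groups.forEach(g => g.sort((a, b) => b.length - a.length));

// Only literals sharing the position's first char can match, so scan just that group.
function matchPunct(src: string, pos: number): string | undefined {
  const g = groups.get(src.charCodeAt(pos));
  if (!g) return undefined;
  for (const p of g) if (src.startsWith(p, pos)) return p;
  return undefined;
}
```

For a position holding '+', the scan touches three candidates here rather than all thirteen.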

canStartFT, the per-LED single-descriptor switch in the Pratt dispatch loop, was the next profile hotspot (~7% self). A first-token gate is a 1-element FIRST set, so it reuses the membershipFn byte tables, open-coded at the call site (tok is already known non-null there):

    (KT[tok.k] | TT[tok.t]) !== 0

No call, no switch. The canStartFT runtime helper and the descriptor literals (keyDescLiteral/firstTokDescLiteral) are gone. Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +7.8~10.3%, every file positive.

…terned fields

The no-inlining profile attributed ~14% of parse time to the token interning loop: internTok ADDED k/t to already-shaped lexer tokens (two hidden-class transitions per token, on top of the conditional newlineBefore stamp making shapes polymorphic at every tok.k/tok.t site). The tokenize wrapper now rebuilds each token as a single literal with every field the parser reads (type/text/offset/k/t + the three stamp flags normalized to booleans), so every token has one shape from birth. The matchPuLit '>'-split tokens go through the same mkPunct shape. internTok is gone. Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +3.1~5.4% across three runs.

The LED loop did a string-keyed opTable.get(tok.text) for every token it reached (and the prefix-op nud a prefixOps.get); tok.t is already interned, so both become an array load (OP_BY_T / PREFIX_BY_T, null-packed, length = the literal-int space). Equivalent because a token's text can equal an operator value only for punct tokens and keyword-shaped idents, which are exactly the classes tok.t indexes; operator values are in the literal vocabulary by construction (asserted at emit). Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +1.8~2.8% over three runs.

Every multi-alt site (non-rec alts, left-rec atoms, Pratt nuds) guarded each alternative with its own membership-fn call; R_Stmt burned ~20 _q calls per statement position. Each alt list (3..32 alts) now bakes one bit per alt into two Int32Array mask tables over the token int spaces:

    mask = startTok ? KM[startTok.k] | TM[startTok.t] : ALL

and each alt's guard is one bit test. Bit i is set exactly where the old altGuard was true (always-tried alts in every k slot; EOF admits all), so the same alt subset runs in the same order, which makes it byte-identical by construction. _q(startTok) guard calls: 162 -> 15 (the sub-3-alt lists). Gates: full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +10.4~15.2%, every file positive in both runs.
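The mask lookup can be tried out on a toy alternative list. The alternative names, the int assignments, and the admittedAlts helper are assumptions made for this sketch; only the shape of the mask expression follows the description above.

```typescript
interface Tok { k: number; t: number }

// Hypothetical int spaces for a three-alternative statement rule.
const K_IDENT = 1, K_PUNCT = 2;
const T_NONE = 0, T_LBRACE = 1, T_LPAREN = 2;

// Alternatives in grammar order; bit i belongs to ALTS[i].
const ALTS = ['Block', 'Labeled', 'ExprStmt'];
const ALL = (1 << ALTS.length) - 1;

const KM = new Int32Array(3); // indexed by tok.k
const TM = new Int32Array(3); // indexed by tok.t
TM[T_LBRACE] |= 1 << 0;              // Block starts with '{'
KM[K_IDENT] |= (1 << 1) | (1 << 2);  // Labeled and ExprStmt can start with an identifier
TM[T_LPAREN] |= 1 << 2;              // ExprStmt can also start with '('

// One mask computation per position, then one bit test per alternative.
function admittedAlts(startTok: Tok | undefined): string[] {
  const mask = startTok ? KM[startTok.k] | TM[startTok.t] : ALL;
  const out: string[] = [];
  for (let i = 0; i < ALTS.length; i++) {
    if (mask & (1 << i)) out.push(ALTS[i]);
  }
  return out;
}

const onIdent = admittedAlts({ k: K_IDENT, t: T_NONE }).join(',');
const onBrace = admittedAlts({ k: K_PUNCT, t: T_LBRACE }).join(',');
const onParen = admittedAlts({ k: K_PUNCT, t: T_LPAREN }).join(',');
const onEof = admittedAlts(undefined).join(',');
```

Because the loop walks bits in alternative order, the admitted subset is tried in the same order as the original per-alt guards would have allowed.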

…ation

The emitted parser rebuilt every lexer token (second allocation + TYPE_KIND/LIT_KW/LIT_PU dictionary lookups per token) to attach the int kinds: ~21% of parse time after the dispatch rounds, plus the GC share. createLexer now takes an optional intern config (typeKind/kwLit/puLit maps + punct/fallback kinds) and every creation site builds the FINAL token through two single-allocation builders:
- mkNamed(type, text, offset, k): k is baked per site. Token matchers and prefixed-ident entries carry their kind, fixed-name sites (template/markup/indent/newline) use consts computed once; t is the text's keyword int (one Map.get, the same lookup tsc's getIdentifierToken pays).
- mkPu(text, offset, t): punct t is precomputed per first-char-group entry, so there is zero runtime lookup.

Stamp flags are real fields from birth (push writes existing fields, no hidden-class transitions). The three post-push mutations re-intern exactly (flow plain-fold merge, markup void retag, unicode ident extension). Without an intern config (the interpreter) k/t are 0 via an empty map; the interpreter's token shape is unchanged in behavior. The emitted tokenize wrapper is gone: the lexer's array is the parser's array, and matchPuLit '>'-splits build the same shape.

Gates: token-stream equality 5,695 files (0 diffs), full 18,805-file corpus byte-identical, 26/26 check. Bench (strict A/B vs the full previous state): aggregate +12.8~13.0%, every file positive (fixSignatureCaching +19~23%).

emitLexer(grammar) specializes tokenize() for token-stream grammars: one switch(charCode) whose cases hold exactly that char's candidate token regexes (declaration order) and a longest-first punct compare-chain with the literal's int baked at each leaf. Token regexes stay V8 regexes (the dispatch was the measured cost, not the matching). Regex-vs-division context, paren-head tracking and the template state machine are baked: prevIsValue is four int-table loads, the '('/')'/'!' bookkeeping exists only on those punct branches, scanTemplateSpan/identTextValid are emitted with their config inlined. Grammars using markup/indent/newline fall back to the createLexer import (emission unchanged for yaml/html). New gate test/emit-lexer-verify.ts: emitted stream ≡ createLexer over the 5,695-file conformance corpus — every field including k/t and the stamp flags, plus identical error messages (5,624 same + 71 same-throw, 0 diffs). Gates: full 18,805-file corpus byte-identical, conformance 5386/5659 unchanged, 26/26 check. Bench: whole parse +57~77%; vs ts.createSourceFile the aggregate is now 1.24x (parse-only) / 0.94x (setParentNodes:true) — below parity on the latter, with fixSignatureCaching at 0.69x even against parse-only.

johnsoncodehk · 2026-06-09T18:49:13Z

Added the emitted lexer (f4ceb13): emitLexer(grammar) bakes the whole per-position dispatch — per-charCode switch over candidate token regexes + longest-first punct compare-chains with interned ints at the leaves, baked regex-context tables, emitted template/ident-escape machinery. Markup/indent/newline grammars fall back to the createLexer import unchanged. New gate test/emit-lexer-verify.ts (token streams + error messages ≡ createLexer, 5,695 files, 0 diffs).

Updated headline (PR #4 bench files, aggregate):

	mono ms	vs tsc parse-only	vs tsc setParentNodes:true
session start	34.8	4.41×	3.33×
now	9.8	1.24×	0.94×

Per file vs parse-only: fixSignatureCaching 0.69×, parserindenter 1.23×, parserRealSource7 1.28×, parserharness 1.45×. Remaining hotspots: R_Expr_pratt ~15%, GC ~18% (CST/token allocation), lexMk ~5%, matchPuLit ~4.6%.

- The Pratt LED chain re-tested maxBp > minBp per LED and gated each LED with its own two-table load; the shared test is hoisted and the per-LED first-token gates collapse into one mask pair (ftMaskDispatch, same Int32Array machinery as the alt dispatch).
- matchKwLit/matchPuLit: the t int ranges are disjoint, so the k >= K_NAMED_MIN / k === K_PUNCT guards were redundant; each is now one int compare. The '>'-split moves to matchPuLitGT, emitted only at '>' call sites.
- Fixes a latent bug the gates could not reach: the generic matchLiteral fallback still indexed LIT_KW/LIT_PU as objects after they became Maps (dead on the TS grammar's fully-specialized sites, wrong if ever reached).

Gates: emit-lexer-verify 5,695 files 0 diff, full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +7.5~14.5%, every file positive in both runs.

johnsoncodehk · 2026-06-09T18:59:06Z

Round 3 (commit ea3f74f): LED chain bitmasked + hoisted maxBp test; matchKwLit/matchPuLit to one int compare (disjoint t ranges make the k guards redundant; '>'-split isolated to matchPuLitGT); fixed a latent Map-indexing bug in the generic matchLiteral fallback. +7.5~14.5%.

Aggregate now 9.3ms — 1.20× vs tsc parse-only / 0.87× vs setParentNodes:true (per file vs parse-only: 0.66× / 1.11× / 1.21× / 1.45×).

- parseRuleEntry looked the memo up through a string-keyed outer Map per rule entry; each memoized rule now owns an array slot allocated at emit time (memo = new Array(MEMO_RULES), hit = memo[idx]).
- lexMk did a LIT_KW.get for every named token; a matcher (or template / prefixed-ident site) whose first-char set is disjoint from every keyword's first char provably interns t = 0, so those sites emit a lookup-free builder (lexMk0). The identifier matcher keeps the lookup (the same one tsc's getIdentifierToken pays).

Gates: emit-lexer-verify 5,695 files 0 diff (k/t compared), full 18,805-file corpus byte-identical, 26/26 check. Bench: aggregate +1.4~5.8% across four runs (machine noisy; sign stable).

johnsoncodehk · 2026-06-09T19:08:49Z

Round 4 (34bb02d): memo keyed by emit-time rule index (array slot instead of string-keyed Map per rule entry); lexMk bakes t:0 for matchers whose first-char set provably can't start a keyword (lookup only remains on identifiers, mirroring tsc's getIdentifierToken). +1.4~5.8% across four runs. Cumulative: 34.8ms → ~9ms, 4.41× → ~1.2× vs tsc parse-only, ~0.85× vs setParentNodes:true.

…ness

- test/profile-vs-tsc.mjs: dual V8 inspector profiles (emitted parser / ts.createSourceFile) with the timing ratio table, lexer/parser/GC layer split and top self-time tables; run with --no-turbo-inlining for honest per-function attribution.
- test/profile-lines.mjs: positionTicks per (file,line) over a saved .cpuprofile, printed with source text. This is the op-level view.
- test/ab-emitted.mjs: the interleaved best-of-N A/B gate used to keep/revert every lever in PR #35.

tokenPatternCharLoop derives a charCode loop plan from seq(first, star(cont))-shaped token IR: scan the plain continuation class with int compares, fall back to the full regex only when the stop char could begin a complex alternative (an escape opener). Exact by construction: plain chars and every complex alternative's first chars are required disjoint, so the greedy star is deterministic and a non-bail stop char provably ends the regex match too. Kills the regex exec + match-array allocation on the hottest token class (identifiers, ~40% of tokens); the escape-validation call is skipped on the fast path since the loop proved no backslash. A/B on the PR#4 bench: +1.7% to +7.8% aggregate across 4 runs, all positive; emit-lexer-verify (5,695 files, all fields + error messages), 18,805-file byte-identical gate and 26/26 checks pass.
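A hand-written approximation of the loop plan for an ASCII identifier token. The character classes, the regex, and the scanIdent name are assumptions for the sketch (the real plan is derived from the token IR and covers the grammar's actual identifier definition); the structure is the part that mirrors the description: plain chars by int compare, bail to the regex only at an escape opener.

```typescript
function isIdentStart(c: number): boolean {
  return (c >= 97 && c <= 122) || (c >= 65 && c <= 90) || c === 95 || c === 36;
}
function isIdentPart(c: number): boolean {
  return isIdentStart(c) || (c >= 48 && c <= 57);
}

// Full token regex (plain chars or \uXXXX escapes), used only on the bail path.
// Sticky, so it matches exactly at lastIndex.
const IDENT_RE = /(?:[A-Za-z_$]|\\u[0-9a-fA-F]{4})(?:[A-Za-z0-9_$]|\\u[0-9a-fA-F]{4})*/y;

// Returns the end offset of the identifier starting at pos, or -1 if none.
function scanIdent(src: string, pos: number): number {
  const c0 = src.charCodeAt(pos);
  if (isIdentStart(c0)) {
    let i = pos + 1;
    while (i < src.length && isIdentPart(src.charCodeAt(i))) i++;
    // A stop char that cannot open an escape ends the regex match too.
    if (i >= src.length || src.charCodeAt(i) !== 92) return i;
  } else if (c0 !== 92) {
    return -1; // neither a plain start char nor a backslash
  }
  // Bail path: a backslash is involved, so let the regex decide.
  IDENT_RE.lastIndex = pos;
  return IDENT_RE.test(src) ? IDENT_RE.lastIndex : -1;
}
```

The fast path never allocates a match array, and because the plain class and the escape opener are disjoint, it returns the same end offset the regex would.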

peek() paid a parseLimit branch plus a ?? undefined-check on every call for a cap that only two emitted mixfix re-parse sites ever set. Maintain cap = min(parseLimit-or-infinity, tokens.length) at those sites, the '>'-splice and parse() entry instead; peek is now one compare and an always-real token load. (The ternary form measured NEGATIVE -2.5..-5.6% — V8 prefers the early-return branch shape; the if-form measures +1.4..+5.4%, 5/6 runs positive.) Byte-identical on the 18,805-file corpus, 26/26 gates.

lexMk paid a Map.get(text) per named token — a string hash over a fresh slice every time (V8 can't reuse a cached hash on a new string). Emit lexKwT from the keyword symtab instead: length window, first-charCode switch, then per-keyword compare chains shortest-first. Returns exactly LIT_KW.get(text) ?? 0 by complete enumeration; non-keyword identifiers exit at the length window or the switch default in a couple of int compares. Also replaces the re-intern in the unicode ident-extension path. A/B: +9.3..+10.5% aggregate over 3 runs (parserharness +16% — identifier-dense); lexer-verify, 18,805-file byte-identical and 26/26 gates pass.
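What an emitted recognizer of this shape might look like for a four-keyword vocabulary. The keyword list and its ints are made up, and the function body is written by hand here; the real lexKwT is generated from the keyword symtab by complete enumeration.

```typescript
// Reference table the recognizer must agree with (hypothetical ints).
const LIT_KW = new Map<string, number>([['if', 1], ['in', 2], ['var', 3], ['void', 4]]);

// Reads charCodes over the span [off, end); never slices, never hashes.
function lexKwT(src: string, off: number, end: number): number {
  const len = end - off;
  if (len < 2 || len > 4) return 0; // length window
  switch (src.charCodeAt(off)) {
    case 105: // 'i'
      if (len === 2) {
        const c1 = src.charCodeAt(off + 1);
        if (c1 === 102) return 1; // "if"
        if (c1 === 110) return 2; // "in"
      }
      return 0;
    case 118: // 'v', compare chains shortest-first
      if (len === 3 && src.charCodeAt(off + 1) === 97 && src.charCodeAt(off + 2) === 114) {
        return 3; // "var"
      }
      if (len === 4 && src.charCodeAt(off + 1) === 111 &&
          src.charCodeAt(off + 2) === 105 && src.charCodeAt(off + 3) === 100) {
        return 4; // "void"
      }
      return 0;
    default:
      return 0; // no keyword starts with this char
  }
}
```

A non-keyword identifier such as `x` or `value` leaves at the length window or the switch default after a couple of int compares.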

parseRuleEntry paid a Map hash per entry and a {node, end} wrapper allocation per store, at a measured 51.8% hit rate (memo is load-bearing — half the entries are avoided longest-match re-parses). Replace each rule's Map with a pair of arrays indexed by start pos, lazily sized to the token count: a lookup is two undefined-sentinel array loads, a store allocates nothing. The pair is captured together at entry and created together at store time — a '>'-splice inside core() detaches both via fill(undefined) and the late store lands in the detached pair (discarded), exactly the old Map's semantics; the first cut captured only one side and crashed on memoNode[idx] === undefined after an in-flight splice (caught by the 18,805-file gate, 318 divergences). A/B: +6.8..+13.7% aggregate over 3 runs; byte-identical corpus and 26/26 gates pass.
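The paired-array memo can be sketched for a single rule as below. This is a simplification under stated assumptions: the token list and the rule body (core) are toys, the "node" is just a string, and the '>'-splice detach logic described above is left out entirely.

```typescript
const tokens = ['1', '+', '2'];
let pos = 0;      // parser cursor (token index)
let coreRuns = 0; // counts real parses, to show the memo working

// Toy rule body: consumes one token and yields its text as the "node".
function core(): string | null {
  coreRuns++;
  if (pos >= tokens.length) return null;
  return tokens[pos++];
}

// One pair of arrays per rule, indexed by start position, created lazily.
let memoNode: (string | null | undefined)[] | undefined;
let memoEnd: (number | undefined)[] | undefined;

function parseRuleEntry(): string | null {
  const start = pos;
  if (memoNode !== undefined && memoEnd !== undefined) {
    const end = memoEnd[start];
    if (end !== undefined) {
      // Hit: two array loads, no hashing.
      pos = end;
      return memoNode[start] ?? null;
    }
  }
  const node = core();
  if (memoNode === undefined || memoEnd === undefined) {
    // Sized to the token count on first store.
    memoNode = new Array<string | null | undefined>(tokens.length + 1).fill(undefined);
    memoEnd = new Array<number | undefined>(tokens.length + 1).fill(undefined);
  }
  memoNode[start] = node; // the store allocates no wrapper object
  memoEnd[start] = pos;
  return node;
}

const firstParse = parseRuleEntry();  // miss: runs core
pos = 0;                              // backtrack, as a losing alternative would
const secondParse = parseRuleEntry(); // hit: core is not run again
```

The end array doubles as the "is there an entry" sentinel, so a failed parse (null node) is still remembered and not retried.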

peek() updated the farthest-position high-water mark on every call — the single hottest line in the profile (3.6% self) — for state that only error messages read. Track it at pos-advance sites instead (one compare per consumed token rather than per lookahead); restores only ever lower pos, and a memo-hit restore needs no update since the stored end was recorded when first reached. Error text is now pinned by a new gate, test/emit-reject-messages.ts: over the full conformance corpus every both-reject file must throw the EXACT interpreter message (466 files, 0 mismatches before and after). A/B: +1.1..+7.5% aggregate, 4/4 runs positive; byte-identical corpus and 26/26 gates pass.

…le bits

A token-name key gathered under a not(alt('if', 'var', ...)) guard immediately before the first consuming element becomes a qualified key, emitted as TM[0] plus every keyword t outside the guard class instead of the blanket k-bit. A guarded-out keyword lookahead no longer admits alternatives it provably cannot start (it would fail the not-guard before consuming). Pure keyword literals inside their own guard class are dropped from the set entirely. Sharpens the identifier-led bits of the Expr bare-ident nud, the object-literal shorthand and class-name alternatives, and every _qN entry guard derived from them. A/B: +2.1..+10.7%, 3/4 runs positive; byte-identical corpus, reject messages identical, 26/26 gates.

… literals

exprFirst collapsed any rule whose body contains a [prefix, operand] form to null/always-admit. FIRST(Expr) was unknown, so every expression-led alternative and every Expr rule-entry guard admitted everything. A prefix item consumes exactly one of the prefix-operator literals, so contribute those and stop. The many(Stmt) loop guard now rejects '}' before entering R_Stmt at all (1,790 dispatches with their arm runs gone per bench round; R_Stmt admitted arms 16,052 → 14,260). A/B neutral on its own (-0.3/+0.1%); kept as a strict table sharpening and the prerequisite for second-token dispatch refinement. Byte-identical corpus; 26/26 gates.

Compute per-alternative SECOND sets (the keys admissible as a match's second token, plus whether a one-token match exists): an admitted alternative whose SECOND set excludes the actual second token — and that cannot end after one — provably fails, so its arm is skipped. Kills the labeled-statement arm at every identifier-led statement without a ':' second token, the single-param arrow head without '=>', and friends: R_Stmt admitted arms 14,260 → 9,560 and nud arms 13,537 → 8,968 per bench round. The emitter folds it into the existing alt-mask dispatch as a second pair of Int32 tables ANDed in (alts with unknown/len1/nullable/empty SECOND keep their bit everywhere and in the EOF-after-one mask); pratt op/prefix/postfix items contribute their operator literal sets, and a '>' SECOND key admits every '>'-led punct so the '>'-splice stays covered. The pruning must be ENGINE-IDENTICAL: the first cut (emitted only) passed the byte-identical corpus but tripped the reject-message gate — a skipped arm no longer advanced the farthest-position error state that the interpreter's run of the same arm did. So gen-parser gets the same analysis as altMightSecond checks in its three dispatch loops, and the emit side computes SECOND from PLAIN FIRST inputs (no reserved-qualified keys, prefix-to-top) so both engines derive identical sets and identical prune decisions by construction. A/B: +2.3..+15.6%, 6/7 runs positive; 18,805-file byte-identical, reject messages identical (466 files), 26/26 gates.
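The pruning rule can be shown with sets of token names standing in for the Int32 mask tables. The alternative names, token names, and the Alt record are hypothetical; a null SECOND set models the "unknown" case that keeps its bit everywhere.

```typescript
interface Alt {
  name: string;
  first: Set<string>;          // tokens that can start the alternative
  second: Set<string> | null;  // tokens admissible second; null = unknown
  canEndAfterOne: boolean;     // does a one-token match exist?
}

// Toy identifier-led statement alternatives.
const stmtAlts: Alt[] = [
  { name: 'Labeled', first: new Set(['ident']), second: new Set([':']), canEndAfterOne: false },
  { name: 'Arrow', first: new Set(['ident']), second: new Set(['=>']), canEndAfterOne: false },
  { name: 'ExprStmt', first: new Set(['ident', '(']), second: null, canEndAfterOne: true },
];

// t2 is undefined at EOF-after-one.
function admittedAlts(alts: Alt[], t1: string, t2: string | undefined): string[] {
  const out: string[] = [];
  for (const a of alts) {
    if (!a.first.has(t1)) continue;
    // Skip only when failure is proven: SECOND is known, excludes t2,
    // and the alternative cannot end after one token.
    const secondOk =
      a.second === null || a.canEndAfterOne || (t2 !== undefined && a.second.has(t2));
    if (secondOk) out.push(a.name);
  }
  return out;
}

const onAssign = admittedAlts(stmtAlts, 'ident', '=').join(',');
const onColon = admittedAlts(stmtAlts, 'ident', ':').join(',');
const onArrow = admittedAlts(stmtAlts, 'ident', '=>').join(',');
const onEofAfterOne = admittedAlts(stmtAlts, 'ident', undefined).join(',');
```

An identifier followed by '=' now reaches only the expression-statement arm, which is the labeled-statement and arrow-head saving described above in miniature.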

johnsoncodehk · 2026-06-09T20:33:58Z

Round 3 (post-compaction): aggregate 0.86× vs tsc parse-only / 0.64× vs setParentNodes=true (round start: 1.17× / 0.86×). parserharness — the expression-dense laggard — is now at 1.00×.

Levers this round, each A/B-gated (byte-identical 18,805-file corpus + 26/26 gates + new reject-message gate):

Char-loop fast path for ident-shaped tokens (tokenPatternCharLoop derives the loop plan from the token IR; escape stop-chars fall back to the regex): +1.7..+7.8%
peek() single-compare (fold parseLimit into a maintained cap bound): +2%
Baked keyword recognizer (lexKwT length-window + charCode trie replaces Map-hash on a fresh slice per token): +9.3..+10.5%
Memo as parallel position arrays (two array loads per lookup, zero-alloc stores; 51.8% measured hit rate — memo is load-bearing): +6.8..+13.7% (a persistent/generation-versioned variant measured NEGATIVE — old-gen write barriers — and was reverted)
maxPos moved from every peek() to the twelve advance sites, pinned by a new gate test/emit-reject-messages.ts (every both-reject corpus file must throw the exact interpreter message): +1.1..+7.5%
Reserved-aware FIRST keys (a not(keyword-class) guard narrows the dispatch bits; FIRST(Expr) became a real set via prefix-op literals): +2.1..+10.7%
SECOND-token dispatch refinement in both engines (per-alt second-token sets ANDed into the alt masks; R_Stmt admitted arms 14,260→9,560, nud arms 13,537→8,968; ported to gen-parser so prune decisions are engine-identical — the message gate caught the emitted-only version skewing farthest-position error state): +2.3..+15.6%

Inspired by the alien-signals / TSSLint performance notes (object shapes, in-object property pressure, lazy materialization): the token stream becomes five parallel columns — kind, literal id, span start/end, stamp bits — written directly by the emitted tokenize. Token text is never materialized during lexing; the baked keyword recognizer reads source charCodes over the span, and a CST leaf slices the span only when it is built. peek() disappears entirely (pos indexes the columns; matchers and the byte-table guards read tkK/tkT directly), the '>'-split shifts the columns in place, and the column element width is chosen at emit time (Uint8 when the id spaces fit a byte). Grammars on the createLexer fallback (markup/indent/newline) convert the object stream into the same columns at parse() entry with a text column, so the runtime has a single form; tokenAt(i) reconstructs the object view for the lexer gate. Measured: GC layer 17.6% -> 14.0%, the peek frame (6.2% self) and the token constructors (lexMk/lexMkPu/push, ~4%) gone; wall time ~flat to +1.5% on the bench (the earlier born-final/char-loop/lexKwT rounds had already removed most lexer-side allocation cost, and per-matcher typed-array loads offset the rest). Kept as the structural base: the remaining GC is parser-side CST building (loser arms), which needs probe-then-build, not token shape. All gates green: lexer stream identical (5,695 files), 18,805-file byte-identical corpus, reject messages identical, 26/26.
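A reduced sketch of the column layout and the in-place '>'-split. It keeps three of the five columns (kind, span start, span end) and uses a toy tokenizer; the kind ints and the names tokenizeColumns, splitFirstChar and tokenTexts are assumptions, not the emitted code.

```typescript
const K_IDENT = 1, K_NUMBER = 2, K_PUNCT = 3; // hypothetical kind ints

interface TokenColumns { tkK: Uint8Array; tkOff: Int32Array; tkEnd: Int32Array; count: number }

function isDigit(c: number): boolean { return c >= 48 && c <= 57; }
function isAlpha(c: number): boolean { return (c >= 97 && c <= 122) || (c >= 65 && c <= 90); }

// Toy tokenizer that writes straight into the columns; no token objects, no text.
function tokenizeColumns(src: string): TokenColumns {
  const cap = src.length + 1;
  const s: TokenColumns = {
    tkK: new Uint8Array(cap), tkOff: new Int32Array(cap), tkEnd: new Int32Array(cap), count: 0,
  };
  let i = 0;
  while (i < src.length) {
    const c = src.charCodeAt(i);
    if (c === 32) { i++; continue; }
    const start = i;
    let k = K_PUNCT;
    if (isDigit(c)) { k = K_NUMBER; while (i < src.length && isDigit(src.charCodeAt(i))) i++; }
    else if (isAlpha(c)) { k = K_IDENT; while (i < src.length && isAlpha(src.charCodeAt(i))) i++; }
    else { while (i < src.length && src.charCodeAt(i) === c) i++; } // run of one punct char, e.g. ">>"
    s.tkK[s.count] = k; s.tkOff[s.count] = start; s.tkEnd[s.count] = i; s.count++;
  }
  return s;
}

// Split the first char off token i by shifting every column one slot right in place.
function splitFirstChar(s: TokenColumns, i: number): void {
  s.tkK.copyWithin(i + 2, i + 1, s.count);
  s.tkOff.copyWithin(i + 2, i + 1, s.count);
  s.tkEnd.copyWithin(i + 2, i + 1, s.count);
  s.tkK[i + 1] = s.tkK[i];
  s.tkOff[i + 1] = s.tkOff[i] + 1;
  s.tkEnd[i + 1] = s.tkEnd[i];
  s.tkEnd[i] = s.tkOff[i] + 1;
  s.count++;
}

// Text is sliced from the source only when someone asks for it.
function tokenTexts(s: TokenColumns, src: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < s.count; i++) out.push(src.slice(s.tkOff[i], s.tkEnd[i]));
  return out;
}

const demoSrc = 'A<B<C>> x';
const cols = tokenizeColumns(demoSrc);
const beforeSplit = tokenTexts(cols, demoSrc).join(' ');
splitFirstChar(cols, 5); // the ">>" token closes two type-argument lists
const afterSplit = tokenTexts(cols, demoSrc).join(' ');
```

Each column is a flat typed array, so the token stream adds a fixed number of allocations per parse instead of one object per token.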

classifyKey's 'tok' variant carries no t — narrow the mixfix separator (a literal, so kw/punct in practice; -1 never-matches defensively), and Emitter.a becomes readonly for emitRuntime's symtab access. Emitted output byte-identical (tsc --noEmit clean, full corpus gate re-run).

…erge, gate the invariant

The ONE construct in the whole product where a leaf's text was not source.slice(offset, end) was the yaml flow multi-line plain-scalar merge, which rewrote the merged token's text to the FOLDED value ('multi line value'), and whose end (offset + folded length) therefore landed mid-token, a span that meant nothing. A concrete CST should carry the raw span; the fold-to-one-space is the scalar's VALUE semantics and belongs to consumers that resolve values. The merge now slices the raw source span across the run (end becomes the true end); merge structure, type/key-ness, comment guards and interning are unchanged. All yaml gates (scope-gap, depth witnesses, issue-12 regressions, generative) pass unchanged.

New gate test/cst-text-invariant.ts (in check.ts): every CST leaf across all seven grammars must satisfy text === source.slice(offset, end), checked over the generative corpus per grammar plus a TS conformance stride sample (43,999 leaves, 0 violations). This invariant licenses dropping the leaf text field from the CST contract: text is derivable data.

A leaf is now {kind, tokenType, offset, end}. Its text was redundant data — text === source.slice(offset, end) held everywhere once the yaml flow-merge anomaly was fixed (previous commit), and the cst-text-invariant gate now pins the span-only shape itself (no text property, sane spans; 43,999 leaves across all seven grammars + a TS corpus sample). Consumers derive text from the source they parsed: both engines export getText(node, source), and the in-repo consumers (html-conformance tree extraction, the generative net's leafRoles, gap-ledger probes) take the input alongside the CST. Both engines change together: matchers and the pratt operator paths stop materializing text (matchKwLit/matchPuLit lose their value parameter — the leaf no longer needs it), the no-unary-LHS head check and the markup open/close tag-name comparisons slice the source span, and the generated *.cst-types.ts leaf interface drops the field. interp ≡ emit holds on the full 18,805-file corpus; lexer streams and reject messages identical; 27/27 gates. Side effect on the PR#4 bench: dropping ~10k leaf-text slices per parse and shrinking every leaf by a field measures +2.1..+14.7% aggregate, 6/8 runs positive (mean ~+7%).

johnsoncodehk · 2026-06-09T21:56:58Z

Round 5 — the CST contract change: leaves are now span-only ({kind, tokenType, offset, end}, no text field).

Sequence:

The one anomaly fixed: the yaml flow multi-line plain-scalar merge rewrote the merged token's text to the FOLDED value, and its end (offset + folded length) landed mid-token — a span that meant nothing. A concrete CST carries the raw span; folding is value semantics and belongs to consumers. All yaml gates pass unchanged.
Invariant gated: test/cst-text-invariant.ts (now in check.ts → 27 gates) — every leaf across all seven grammars satisfies the span-only shape (43,999 leaves: generative corpus per grammar + a TS conformance stride sample, 0 violations).
Contract change: both engines stop materializing leaf text (matchKwLit/matchPuLit lose their value parameter; the no-unary-LHS head check and markup tag-name comparisons slice the source span); generated *.cst-types.ts drop the field; getText(node, source) exported by both engines; in-repo consumers (html-conformance tree extraction, the generative net's leafRoles, gap-ledger probes) take the source alongside the CST.

interp ≡ emit on the full 18,805-file corpus, lexer streams + reject messages identical, 27/27 gates. Bench side effect: ~10k leaf-text slices per parse gone and every leaf one field smaller — +2.1..+14.7% aggregate, 6/8 runs positive (mean ~+7%).

…rally

kind carried zero information: a leaf always has tokenType and a node always has rule + children, two disjoint field sets, so the tag was derivable from property presence. A leaf is now {tokenType, offset, end} and a node {rule, children, offset, end}; consumers discriminate with 'tokenType' in n / 'children' in n (TypeScript property-presence narrowing covers the union), and the generated *.cst-types.ts drop the field. The cst-text-invariant gate now pins the full shape on both sides: no kind, no text, leaf/node field sets exact, sane spans.

In-engine reads (the no-unary-LHS head check, markup tag-name extraction, the html/coverage tree walkers, exec-trace) switch to structural checks; the emitted pratt path simplifies (head.tokenType === '$operator' alone, since a node's tokenType is undefined and never matches).

interp ≡ emit on the full 18,805-file corpus, reject messages identical, 27/27 gates. Serialized CSTs shrink by one field per object; bench: +3.4..+6.7% aggregate, 4/4 runs positive (one fewer slot per object and one fewer store per build).
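As an illustration of the property-presence discrimination described above, here is a small walker over the two shapes. The type names `CstLeaf`, `CstNode` and `Cst`, the `leafTexts` helper and the sample tree are assumptions made for this sketch; only the field sets come from the commit message.

```typescript
// Shapes as described in the commit message; the type names are illustrative.
interface CstLeaf { tokenType: string; offset: number; end: number }
interface CstNode { rule: string; children: Cst[]; offset: number; end: number }
type Cst = CstLeaf | CstNode;

// Collect the text of every leaf, discriminating by property presence only.
function leafTexts(n: Cst, source: string, out: string[] = []): string[] {
  if ("children" in n) {
    // Narrowed to CstNode: recurse into the children in order.
    for (const child of n.children) leafTexts(child, source, out);
  } else {
    // Narrowed to CstLeaf: derive the text from the span.
    out.push(source.slice(n.offset, n.end));
  }
  return out;
}

const src = "a + b";
const tree: Cst = {
  rule: "Expr", offset: 0, end: 5,
  children: [
    { tokenType: "Ident", offset: 0, end: 1 },
    { tokenType: "$operator", offset: 2, end: 3 },
    { tokenType: "Ident", offset: 4, end: 5 },
  ],
};
const texts = leafTexts(tree, src);
```

No tag field is read anywhere: the `in` check alone selects the branch and narrows the type.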

…l oracle

test/ts-ast-lowering.ts lowers the Monogram TypeScript CST into an AST whose node kinds and spans mirror tsc's, written deliberately as a CONSUMER would write it: every friction met is tagged PAIN(1..19) at the exact site. test/ts-ast-verify.ts compares the result against the real tsc AST pre-order (kind as SyntaxKind numbers, getStart/end), over a 30-snippet battery and any real file.

State: 30/30 snippets node-identical to tsc; parserindenter.ts (35KB) matches on 3,142 nodes with exactly ONE divergence left. That one is not lowering pain; it is a parser structure bug the experiment uncovered:

- a == b ? c : d parses as a == (b ? c : d) (tsc: (a == b) ? c : d)
- a + b as T parses as a + (b as T) (tsc: (a + b) as T)
- …same for satisfies

Every alternative-form mixfix LED binds maximally tight, because the LED loop gates them only on maxBp > minBp: they carry NO precedence, so they fire inside any operator's rhs. Invisible to all 27 gates (accept/reject and token scopes are grouping-blind); the first structural oracle caught it immediately.

Not wired into check.ts yet: the gate must stay red until the mixfix-LED precedence fix lands, then this becomes the parser↔tsc STRUCTURE conformance gate the product was missing.

…ation

The LED loop gated rule-alternative LEDs only on maxBp > minBp. They carried no precedence, so they fired inside ANY operator's right operand and bound maximally tight: a == b ? c : d parsed as a == (b ? c : d), a + b as T as a + (b as T), and in/instanceof additionally right-chained (a in b in c as a in (b in c)) because their trailing self-operand re-entered the rule at bp 0. Invisible to every accept/reject and token-scope gate; found by the AST dogfood's structural oracle.

New grammar data (NOT a precs-ladder entry: the ladder has seven-plus scoping/branch consumers across the highlighter generators that must not see these): grammar.ledPrecs anchors a led's connector to a ladder operator. sameAs borrows its lbp, below sits one notch under (levels are spaced 2 apart), and chainRhs parses the trailing self-operand at that lbp (left-chaining) instead of as a full expression. ecmaPrec ships ?: (below '??') and in/instanceof (sameAs '<', chained); TypeScript adds as/satisfies (sameAs '<').

Both engines resolve identical numbers: the interpreter gates the led and special-cases the chain rhs via parsePratt(rule, rhsBp); the emitter bakes the lbp into the led conds and emits a custom chain arm calling R_<rule>_pratt(rhsBp). The no-in suppress machinery is untouched (the gate is additive on the same led path).

Verified: ts-ast-verify 30/30 snippets AND parserindenter.ts (3,168 nodes) node-identical to tsc, now wired into check.ts as the parser↔tsc STRUCTURE conformance gate (28 gates); 18,805-file interp≡emit; reject messages identical; conformance accept-rate byte-equal to baseline (5386/5659; the fix is structure-only).
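The gating idea can be sketched with a toy Pratt loop. This is not Monogram's engine: the `ladder` numbers, the `LedPrec` shape, and the `ledLbp` and `parse` functions are assumptions chosen to mirror the sameAs / below / chainRhs description, and the tokenizer simply splits on whitespace.

```typescript
// Toy expression tree: a leaf is a string, an inner node is [operator, ...operands].
type Tree = string | [string, ...Tree[]];

// Assumed binary ladder, levels spaced 2 apart so a "below" LED fits in between.
const ladder: Record<string, number> = { "??": 4, "==": 10, "<": 12, "+": 14 };

interface LedPrec { sameAs?: string; below?: string; chainRhs?: boolean }
const ledPrecs: Record<string, LedPrec> = {
  "?": { below: "??" },
  "in": { sameAs: "<", chainRhs: true },
  "as": { sameAs: "<" },
};

// Resolve a LED's left binding power from its ladder anchor.
function ledLbp(p: LedPrec): number {
  if (p.sameAs !== undefined) return ladder[p.sameAs];
  if (p.below !== undefined) return ladder[p.below] - 1;
  throw new Error("ledPrec needs sameAs or below");
}

function parse(src: string): Tree {
  const toks = src.split(/\s+/).filter(t => t.length > 0);
  let pos = 0;
  function expr(minBp: number): Tree {
    let left: Tree = toks[pos++];
    for (;;) {
      const op = toks[pos];
      if (op === undefined) break;
      const led = ledPrecs[op];
      const lbp = led !== undefined ? ledLbp(led) : ladder[op];
      // The gate: a LED fires only if it binds tighter than the caller's minBp.
      if (lbp === undefined || lbp <= minBp) break;
      pos++;
      if (op === "?") {
        const thenArm = expr(0);
        pos++; // skip ':' (valid input assumed)
        left = ["?:", left, thenArm, expr(lbp - 1)];
      } else if (op === "as") {
        left = ["as", left, toks[pos++]];
      } else {
        // Ladder operators and chainRhs LEDs parse the right operand at their own lbp.
        const rhsBp = led === undefined || led.chainRhs ? lbp : 0;
        left = [op, left, expr(rhsBp)];
      }
    }
    return left;
  }
  return expr(0);
}

const ternary = JSON.stringify(parse("a == b ? c : d"));
const asExpr = JSON.stringify(parse("a + b as T"));
const inChain = JSON.stringify(parse("a in b in c"));
```

With the gate in place the three inputs group as (a == b) ? c : d, (a + b) as T and (a in b) in c; dropping the `lbp <= minBp` check for LEDs reproduces the maximally tight binding the commit describes.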

johnsoncodehk · 2026-06-09T23:06:13Z

Round 7 — the dogfood paid off twice over:

Parser structure bug found & fixed: alternative-form Pratt LEDs (ternary ?:, as, satisfies, in, instanceof) carried no precedence — they fired inside any operator's rhs and bound maximally tight (a == b ? c : d → a == (b ? c : d); a in b in c right-chained). Invisible to every accept/reject and token-scope gate. Fix: grammar.ledPrecs anchors a led connector to a ladder operator (sameAs/below, optional chainRhs for left-chaining the trailing self-operand); both engines resolve identical bp numbers. Accept-rate is byte-equal to baseline — the fix is structure-only.
New gate: ts-ast-structure (28 gates now) — a tsc-shaped AST lowering (test/ts-ast-lowering.ts, written as a real consumer with 19 PAIN points tagged in situ) compared node-by-node against the real tsc AST (kind + getStart/end, pre-order). 30-snippet battery + parserindenter.ts (3,168 nodes) are node-identical to tsc. This is the parser↔tsc STRUCTURE conformance measurement the product was missing.

Bug #2 - statement-position function merged with a following call: longest-match let the expression arm win whenever a member/call tail made it LONGER ("function f(){}" + newline + "(g)()" became ONE IIFE-style expression statement; tsc keeps a declaration + a separate statement). Fix: the ES2023 14.5 ExpressionStatement lookahead restriction - the expression arm may not begin with function / async function. "class" is deliberately NOT guarded yet: the class-DECLARATION arm is narrower than tsc's (extends-expression heritage, bare ';' class elements, decorator placements), so 31 tsc-valid corpus files still rely on the class-EXPRESSION fallback - widening the declaration arm is the named prerequisite. The guard flips exactly 3 corpus files to reject, all tsc-INVALID (template-typed params, "function* gen" without parens): FN=0 held. Rest parameters also gained their checker-not-parser tail ("...b = init", "...b?: T") so the declaration arm carries what the expression fallback used to.

Bug #3 (pre-existing, all builds) - bare for-in heads: ForHead's no-declaration arm parsed its target Expr WITHOUT the no-"in" exclusion, so "for (key in obj)" swallowed the "in" inside an in-LED, the arm failed, and the statement fell back to a CALL parse "for(...)" with "for" as an identifier. Same exclude as binding initializers on the target.

The interpreter's maxPos moves to advance-based tracking (mirroring the emitted engine's relocation) - the new reject inputs exposed that the two engines' farthest-position semantics had silently diverged; reject messages are engine-identical again (468 files).

ts-ast-verify gains the valid-only contract: files where tsc itself reports parse errors are SKIPPED (error-recovery shapes are each parser's own policy, not the grammar contract) - parserharness/parserRealSource7 fall under it; the lowering still gained their missing shapes (NewTarget index form, old-style type assertion).

Gate corpus widened to fixSignatureCaching (2,952 nodes) + parserindenter (3,168), both node-identical to tsc. 28/28 gates; 18,805-file interp=emit; conformance 5383/5659 (baseline minus the 3 correct rejects).

src/gen-cst-match.ts is the VALUE-level sibling of gen-ast-types: for every rule it emits a typed result union ({ arm: 'if_', expr, stmt, stmt2? } | ...) and a match<Rule>(node, src) that re-derives which grammar alternative a node matched and binds its children to named fields - the discrimination the parser performed and the CST does not record.

Each alternative compiles to a step plan (lit / litAlt-capture / tok / node / opt / many / sep / branches) that renders both the type and a cursor-based unifier; the unifier mirrors the engine's matcher semantics exactly: tokenType-exact literal checks (the $keyword-vs-Ident tie facts), the interpolated-template dual (a template token ref accepts a Template leaf OR a '$template' node), pratt operator forms (binaryOp/prefixOp/postfixOp synthesized beside led/nud arms), sep()'s consumed trailing delimiter, and greedy no-backtracking quantifiers (children always reflect the greedy success path, so local greedy decisions reproduce the parse). Pure-literal alternations capture the matched text as a string-literal-union field (alt('let','const','var') becomes let_Kw: 'let' | 'const' | 'var').

Wired into the gen pipeline: npm run gen emits <grammar>.cst-match.ts for all seven grammars. New gate test/cst-match-totality.ts (in check.ts, 29 gates): every node of every generated-corpus CST plus a TS conformance stride sample must destructure through its rule's matcher with full child consumption - 32,336 nodes, 0 misses on first run.

ts-ast-lowering's statement layer now CONSUMES the generated matcher (switch on m.arm replaces the hand-probing that PAIN 3/5/7 documented), and picked up break/continue labels and with-statements in the rewrite; ts-ast-structure stays node-identical to tsc (32/32).

Cost today (maximal workload - destructuring EVERY node, naive ordered arm try): +64% over the CST pass; the full tokens->CST->AST pipeline measures 0.96x vs tsc createSourceFile on the four bench files. Known v2 lever: the dispatcher tries arms in declaration order, so late arms (pratt binaryOp) pay ~20 failed tries - a first-child discriminator switch will cut most nodes to 1-2 tries.
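To make the idea of a typed result union plus a cursor-based unifier concrete, here is a hand-written sketch for a hypothetical two-alternative Stmt rule. Nothing below is generated output: the rule shape, the `exprStmt` arm, the `isLit` and `asNode` helpers and the sample tree are assumptions; only the `{ arm: 'if_', expr, stmt, stmt2? }` result shape is taken from the commit message.

```typescript
interface CstLeaf { tokenType: string; offset: number; end: number }
interface CstNode { rule: string; children: Cst[]; offset: number; end: number }
type Cst = CstLeaf | CstNode;

// Result union for the hypothetical Stmt rule.
type StmtMatch =
  | { arm: "if_"; expr: CstNode; stmt: CstNode; stmt2?: CstNode }
  | { arm: "exprStmt"; expr: CstNode };

// Literal step: the child must be a leaf with the exact tokenType and source text.
function isLit(c: Cst | undefined, tokenType: string, text: string, src: string): boolean {
  return c !== undefined && "tokenType" in c && c.tokenType === tokenType
    && src.slice(c.offset, c.end) === text;
}
// Node step: the child must be a node of the named rule.
function asNode(c: Cst | undefined, rule: string): CstNode | undefined {
  return c !== undefined && "children" in c && c.rule === rule ? c : undefined;
}

function matchStmt(node: CstNode, src: string): StmtMatch | undefined {
  const c = node.children;
  // Arm if_: 'if' '(' Expr ')' Stmt opt('else' Stmt)
  if (isLit(c[0], "$keyword", "if", src) && isLit(c[1], "$punct", "(", src)
      && isLit(c[3], "$punct", ")", src)) {
    const expr = asNode(c[2], "Expr");
    const stmt = asNode(c[4], "Stmt");
    if (expr && stmt) {
      let i = 5; // cursor after the mandatory prefix
      let stmt2: CstNode | undefined;
      if (isLit(c[i], "$keyword", "else", src)) {
        stmt2 = asNode(c[i + 1], "Stmt");
        if (!stmt2) return undefined;
        i += 2;
      }
      // Full child consumption is required for the arm to match.
      if (i === c.length) return stmt2 ? { arm: "if_", expr, stmt, stmt2 } : { arm: "if_", expr, stmt };
    }
  }
  // Arm exprStmt: Expr ';'
  const e = asNode(c[0], "Expr");
  if (e && isLit(c[1], "$punct", ";", src) && c.length === 2) return { arm: "exprStmt", expr: e };
  return undefined;
}

const src = "if (x) y;";
const leaf = (tokenType: string, offset: number, end: number): CstLeaf => ({ tokenType, offset, end });
const node = (rule: string, children: Cst[]): CstNode => ({
  rule, children, offset: children[0].offset, end: children[children.length - 1].end,
});
const inner = node("Stmt", [node("Expr", [leaf("Ident", 7, 8)]), leaf("$punct", 8, 9)]);
const outer = node("Stmt", [
  leaf("$keyword", 0, 2), leaf("$punct", 3, 4),
  node("Expr", [leaf("Ident", 4, 5)]), leaf("$punct", 5, 6), inner,
]);
const outerMatch = matchStmt(outer, src);
const innerMatch = matchStmt(inner, src);
```

A consumer then switches on `arm` and reads named fields instead of probing child positions itself.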

The expression-statement lookahead guard, the bare for-in exclude and the rest-parameter checker-tail flow into the generated tree-sitter DSL; CI's artifact-sync step caught that they were not regenerated with b814d96.

The v1 dispatcher tried arms in declaration order, so the late pratt op forms paid ~20 failed unifications per expression node. v2 derives each arm's FIRST-CHILD admission keys from its step plan (node rule / leaf tokenType / literal first charCode, with nullable-first arms admitting everything) and dispatches through nested switches: node child -> rule bucket, leaf -> tokenType bucket ($keyword/$punct sub-switched on the first character). Big node-rule buckets (a pratt rule's self bucket holds every led + op form) sub-dispatch one level deeper on c[1] - the connector position. Buckets are SUPERSET filters (every arm fn still verifies exactly) and preserve declaration order internally, so tie semantics are unchanged. Destructure-EVERY-node overhead over the bare CST pass: +64% (v1) -> +39% (one level) -> +31% (two levels); totality unchanged (32,336 nodes, 0 misses across all seven grammars + the TS corpus sample) and the tsc-structural oracle stays node-identical (32/32).
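A rough sketch of the first-child bucketing, under simplifying assumptions: one dispatch level only, string keys instead of nested switches, and hand-written `Arm` records in place of generated arm functions. The names `Arm`, `firstKey`, `buildBuckets`, `tryInOrder` and `dispatch` are hypothetical.

```typescript
interface CstLeaf { tokenType: string; offset: number; end: number }
interface CstNode { rule: string; children: Cst[]; offset: number; end: number }
type Cst = CstLeaf | CstNode;

// An arm: its first-child admission keys ("*" = nullable first, admits everything)
// and an exact verifier that still runs after the bucket filter.
interface Arm { name: string; firstKeys: string[]; verify: (n: CstNode) => boolean }

function firstKey(n: CstNode): string {
  const c = n.children[0];
  if (c === undefined) return "empty";
  return "children" in c ? "node:" + c.rule : "leaf:" + c.tokenType;
}

// Buckets are superset filters; filter() keeps declaration order inside each bucket.
function buildBuckets(arms: Arm[]): Map<string, Arm[]> {
  const buckets = new Map<string, Arm[]>();
  for (const a of arms) for (const k of a.firstKeys) {
    if (!buckets.has(k)) {
      buckets.set(k, arms.filter(x => x.firstKeys.indexOf(k) >= 0 || x.firstKeys.indexOf("*") >= 0));
    }
  }
  return buckets;
}

function tryInOrder(n: CstNode, candidates: Arm[]): { arm?: string; tries: number } {
  let tries = 0;
  for (const a of candidates) { tries++; if (a.verify(n)) return { arm: a.name, tries }; }
  return { tries };
}

function dispatch(n: CstNode, buckets: Map<string, Arm[]>): { arm?: string; tries: number } {
  return tryInOrder(n, buckets.get(firstKey(n)) ?? buckets.get("*") ?? []);
}

const leafIs = (n: CstNode, i: number, tokenType: string): boolean => {
  const c = n.children[i];
  return c !== undefined && "tokenType" in c && c.tokenType === tokenType;
};
const arms: Arm[] = [
  { name: "if_", firstKeys: ["leaf:$keyword"], verify: n => leafIs(n, 0, "$keyword") },
  { name: "block", firstKeys: ["leaf:$punct"], verify: n => leafIs(n, 0, "$punct") },
  { name: "call", firstKeys: ["node:Expr"], verify: n => leafIs(n, 1, "$punct") },
  { name: "binaryOp", firstKeys: ["node:Expr"], verify: n => leafIs(n, 1, "$operator") },
];
const operand = (at: number): CstNode =>
  ({ rule: "Expr", offset: at, end: at + 1, children: [{ tokenType: "Ident", offset: at, end: at + 1 }] });
const sum: CstNode = {
  rule: "Expr", offset: 0, end: 5,
  children: [operand(0), { tokenType: "$operator", offset: 2, end: 3 }, operand(4)],
};
const v1 = tryInOrder(sum, arms);              // declaration order over all arms
const v2 = dispatch(sum, buildBuckets(arms));  // only the node:Expr bucket is tried
```

In this toy case the late binaryOp arm costs four tries in declaration order and two through the bucket; the second dispatch level on c[1] that the commit describes would cut the remaining failed try.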

…e truth

A longest-match TIE between one-token alternatives goes to the first-listed one. With the bare-identifier nud listed first, "this"/"true"/"false"/"null"/"undefined"/"super" in expression position were stamped as Ident leaves and every consumer had to re-classify them by text (the dogfood's PAIN 19). Listing the literal alternatives first flips the tie: the leaves arrive as $keyword and the tree records what the word IS.

Accepts are unchanged (conformance 5383/5659, byte-equal), interp=emit holds on the full corpus, reject messages identical, and the tsc-structural oracle stays node-identical - the lowering's text re-classification block is deleted rather than moved.
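The tie rule is easy to see in a toy longest-match chooser. The `Alt` interface, the two regex-based matchers and the `pick` function are assumptions for illustration, not the engine's matcher.

```typescript
// One alternative: a name and a matcher returning the consumed length (0 = no match).
interface Alt { name: string; match: (s: string) => number }

const ident: Alt = { name: "Ident", match: s => /^[A-Za-z_]\w*/.exec(s)?.[0].length ?? 0 };
const thisKw: Alt = { name: "$keyword", match: s => (/^this\b/.test(s) ? 4 : 0) };

// Longest match wins; on a tie the strict '>' keeps the first-listed alternative.
function pick(alts: Alt[], s: string): string | undefined {
  let best: Alt | undefined;
  let bestLen = 0;
  for (const a of alts) {
    const len = a.match(s);
    if (len > bestLen) { best = a; bestLen = len; }
  }
  return best?.name;
}

const identFirst = pick([ident, thisKw], "this");   // tie at length 4, Ident listed first
const keywordFirst = pick([thisKw, ident], "this"); // same tie, keyword listed first
const longer = pick([thisKw, ident], "thisValue");  // no tie: only Ident matches
```

Reordering changes the winner only when lengths tie, which is consistent with the accept set staying unchanged.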

Two generator gaps the dogfood surfaced:

1. A $keyword literal LEADING an optional group is now captured as <kw>Tok (CstLeaf) - the group's presence marker and position anchor in one field. The try-arm consumer drops its findText('catch') children scan: catchTok / finallyTok / elseTok arrive typed, with spans.
2. Mixed STRUCTURAL alternations no longer flatten to a soup of optionals: each branch compiles to a tagged mini-plan with its own captures, and the parent gets one field holding a per-branch sub-union - alt?: { branch: 'param'; param } | { branch: 'bindingPattern'; ... } - so consumers switch on m.alt.branch instead of probing presence.

Branch capture slots are rename-prefixed per attempt (nested alternations keep their slots distinct); the result-object/type keys are frozen at plan time (Capture.field), so renaming never leaks into the API. Generated helper names moved to a __-prefix namespace after a grammar keyword ('is' in TypeScript's `x is T`) minted an isTok capture that shadowed the helper.

Totality holds across all seven grammars (0 misses); the tsc-structural oracle stays node-identical with two new battery cases (try-finally-only, catch with a destructuring pattern).
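A consumer-side sketch of both captures, assuming a result shape along the lines quoted above. The `TryMatch` interface, its `block` and `pattern` fields, and `describeCatch` are hypothetical; only catchTok, finallyTok, alt and branch are names from the commit message.

```typescript
interface CstLeaf { tokenType: string; offset: number; end: number }
interface CstNode { rule: string; children: (CstLeaf | CstNode)[]; offset: number; end: number }

// Hypothetical shape of a try-arm result; not copied from generated output.
interface TryMatch {
  arm: "try_";
  block: CstNode;
  catchTok?: CstLeaf; // presence marker and position anchor of the catch group
  alt?: { branch: "param"; param: CstLeaf } | { branch: "bindingPattern"; pattern: CstNode };
  finallyTok?: CstLeaf;
}

function describeCatch(m: TryMatch, src: string): string {
  if (!m.catchTok) return m.finallyTok ? "finally only" : "bare try";
  const alt = m.alt;
  if (!alt) return "catch without binding";
  // One switch on the tag replaces probing which optional fields happen to be set.
  switch (alt.branch) {
    case "param": return "catch " + src.slice(alt.param.offset, alt.param.end);
    case "bindingPattern": return "catch pattern at " + alt.pattern.offset;
  }
}

const src = "try {} catch (e) {}";
const block: CstNode = { rule: "Block", children: [], offset: 4, end: 6 };
const withParam: TryMatch = {
  arm: "try_", block,
  catchTok: { tokenType: "$keyword", offset: 7, end: 12 },
  alt: { branch: "param", param: { tokenType: "Ident", offset: 14, end: 15 } },
};
const described = describeCatch(withParam, src);
const finallyOnly = describeCatch(
  { arm: "try_", block, finallyTok: { tokenType: "$keyword", offset: 7, end: 14 } }, src);
```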


…ass'

The round-8 prerequisite, driven by the 29-file work spec: class DECLARATIONS now cover what tsc parses cleanly (legality being checker territory), so the ES2023 14.5 statement lookahead can finally include 'class' - statement-position class merges (class C {}\n(g)() as one expression) are gone the same way the function ones were.

Grammar (both languages, TS shown):

- Heritage clauses are REPEATABLE, order-free and multi-typed: many(alt(['extends', sep(elem, ',')], ['implements', sep(elem, ',')])) - tsc parses "extends A extends B", "implements A extends B", "extends A, B" and even a bare "extends {" (zero elements; sep() matches empty, leaving the brace for the body) with NO diagnostics.
- A heritage ELEMENT is guarded against the clause keywords (not(alt('extends', 'implements'))): they are contextual words the bare-Ident base would otherwise swallow, breaking clause chaining.
- implements elements are heritage EXPRESSIONS, not Types - tsc parses "implements A?.B" cleanly; ClassHeritage also gained the '?.' member led and the non-constructor primaries (numbers, strings, true/false/null/undefined) that classExtendingNonConstructor exercises.
- ClassMember accepts the bare ';' element (tsc's SemicolonClassElement).
- Decorators may precede ANY declaration ([many1(DecoratorExpr), $] - many1 keeps the self-ref non-left-recursive), covering "@dec export default class {}".

The widening also fixed 7 PRE-EXISTING false negatives: total tsc-clean rejects drop 24 -> 17, conformance holds at 5383/5659, and all 29 of the guard-caused regression set parse with the guard ON.

gen-tm's dormant optional-chain mint is deleted: its direct-ref shape test never matched any grammar (the expression '?.' led targets an alt()), and when the new heritage led first woke it, the minted entity.other.property scope contradicted the ledger-pinned a?.b contract (variable.other, matching the official grammar). First activation of dead code = its first review.

interp=emit on the full corpus, reject messages identical, ts-ast oracle node-identical, 29/29 gates.


…> 0)

Every remaining FN was a construct tsc parses cleanly (validity deferred to its checker) that the grammar rejected. Widen to the proven parse surface, verified per construct against tsc parseDiagnostics:

- JSDoc types in normal TS type positions: postfix `T?` / `T!`, prefix `?T` / `!T`, bare `?`, `*`, `function(this: T, string): U`, and dotted type arguments `Array.<number>`. The postfix `?` mirrors tsc's isStartOfType disambiguation via `not(alt('new', $))` (conditional types and `as T ? a : b` ternaries keep parsing) and requires same-line (tsc rejects `T\n?`).
- Decorator expressions: optional-chain tails (`@x?.y`, `@x?.()`, `@x?.[i]`), tagged templates (`@x``...```), and `@new x`. The lexer maximal-munches `@new` into one Decorator token, so isKeywordLiteral now classes `@`-headed word literals as keyword-class (text-matched against named tokens; both engines share the mechanism, the emitted lexer's kw-intern tables included).
- Decorator placement: `@dec var/let/const x` and `@dec using x` (using requires a real binding: `using 1` stays a reject), `export @dec default class`.
- Object literals: full modifier soup (`{ static m() {} }`, `{ export p: 1 }`) and `?` / `!` after any member name (`{ a! }`, `{ a?() {} }`), per tsc's parseObjectLiteralElement; `const`/`default` stay out (tsc parse errors).
- Parameters: full modifier soup (`override`/`static`/`export`/`async`/..., the set probed parse-clean one keyword at a time).
- Object binding rest: `...r: name` and `...r = init`.
- Nameless class declarations (`class { }` at statement level).
- Import/export specifiers: `import { "str" as x }`; export side gets its own ExportSpecifier (ModuleExportName on both sides, bare strings allowed).
- `typeof import("m").Thing` (ImportTypeNode as a type-query target).
- Async arrows with a bare parameter: `async err => ...` was previously mis-split into two statements (`async` + ASI); new same-line arm in both grammars (ES2017), with AsyncKeyword-aware lowering and two new structural oracle cases (lowered AST stays node-by-node = tsc).

javascript.ts mirrors only the ES-true widenings (async bare arrow, ES2022 string specifiers); tsc-lenient recovery surfaces stay TS-only.

gen-tm: a type-context keyword whose every @type occurrence is followed by `(` (the JSDoc function type) no longer leaves the statement-start set, so multi-line type regions still close at a real `function f() {}` declaration (fixes the issue #1043 regression the new arm introduced).

Conformance: FN 17 -> 0 (corpus-wide; tsc-parse-clean files we reject: none), both-accept 5095 -> 5112, over-accept 288 -> 289 (the one new case is exportModifier.2's `abstract @dec class` same-line split, which belongs to the pre-existing newline-blind-ASI class, now the dominant remaining over-accept dimension). 29/29 gates green.

…rface

CI's tree-sitter conflict gate (tree-sitter generate, not in the local gate chain) was red on two counts: the class-declaration widening of the previous commit (class_member) and this round's object-literal / export-specifier / decorator / rest-binding / JSDoc-function-type arms. collect-conflicts fixpoint adds 9 tuples; all seven derived grammars generate cleanly again.

The PR diff was 89% generated *.cst-match.ts (41k of 46k lines), drowning the hand-written changes. Mark every npm-run-gen artifact linguist-generated so GitHub collapses them, and add '*.cst-match.ts' to the CI artifact-sync gate — it was the one committed artifact class the drift check didn't cover (the totality gate only catches staleness that breaks unification, not silent drift).

…d in CI

They are pure consumer artifacts derived from the grammar (41k of the PR's 46k added lines); the grammar sources are the truth. Gitignore them, drop them from the artifact-sync diff and from .gitattributes, and move the gen step ahead of Typecheck in CI: the typecheck and the gates import the generated files, so they must exist before either runs.

altMaskDispatch/ftMaskDispatch returned null past 32 alternatives, silently dropping a rule to serial per-alt membership guards. The FN=0 widening pushed BOTH hottest rules over the edge (R_Type_lr to 33 alts, R_Expr_pratt's nud list past 32) and the whole parse paid ~25-27%. The profile stayed flat, because the cost was diffuse admit-checks, not a hotspot.

Masks now span words (alt i -> word i>>5, bit i&31; one table pair + one local per word), so dispatch degrades smoothly with grammar growth instead of cliffing; single-word rules emit byte-identical code to before.

A/B: +37.8% aggregate over the cliffed build; net cost of the entire parse-surface widening vs its pre-widening baseline is now -2.7%. Prune decisions are unchanged (mask = the same FIRST/SECOND predicate as the serial guards): 18,805-file emit=interp byte-identical, 436 reject messages exact, 29/29.
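The word/bit indexing can be sketched as follows. The `makeMask` and `admits` helpers and the Int32Array storage are assumptions for illustration; the emitted code uses per-word tables and locals rather than a helper like this.

```typescript
// Build a membership mask over alternative indices, one 32-bit word per 32 alts.
function makeMask(altCount: number, admitted: number[]): Int32Array {
  const words = new Int32Array((altCount + 31) >> 5);
  // alt i lives in word i>>5 at bit i&31
  for (const i of admitted) words[i >> 5] |= 1 << (i & 31);
  return words;
}

// Test whether alternative `alt` is admitted by the mask.
function admits(mask: Int32Array, alt: number): boolean {
  return (mask[alt >> 5] & (1 << (alt & 31))) !== 0;
}

// 40 alternatives no longer fit one 32-bit word; two words cover them.
const mask = makeMask(40, [0, 31, 32, 39]);
const wordCount = mask.length;
const hits = [0, 31, 32, 39].every(i => admits(mask, i));
const miss = admits(mask, 33);
```

A rule with at most 32 alternatives gets a single word, which matches the note that single-word rules are unaffected.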

… labels

test/profile-vs-peers.mjs benchmarks the emitted javascript.ts against acorn (the ESTree reference) on real-world files from node_modules (acorn parsing itself, vue compiler-core, parse5's parser, and the 8.9MB typescript.js) and the emitted html.ts against parse5 (the WHATWG reference) on synthesized well-formed fragments, same warmup/min-of-rounds methodology as profile-vs-tsc.mjs. JS lands at parity (0.85-1.1x per file across runs); HTML at ~2.2x, the named gap being that markup-mode lexing is still the interpreted lexer (emit-lexer specialization doesn't cover markup yet).

Benching acorn.js exposed a real grammar bug in both ECMAScript grammars: break/continue labels missed the spec's restricted production (`break [no LineTerminator here] Label`) and the reserved-word guard, so `break` newline `case "X":` inside a switch ate `case` as the label and the whole switch cascaded into a reject. Both arms now read `opt(sameLine, notReserved, Ident)`.

Conformance matrix unchanged (5112/258/0/289, FN still 0); 29/29 gates; tree-sitter closure complete.
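The two guards on a break/continue label can be illustrated with a small standalone check. This is a sketch, not the grammar's `opt(sameLine, notReserved, Ident)` combinator: the `breakLabel` function is hypothetical and the reserved-word list is deliberately abbreviated.

```typescript
// Abbreviated reserved-word list for the sketch; a real grammar would use the full set.
const RESERVED = new Set(["case", "default", "if", "else", "for", "while", "return", "break", "continue"]);

// After `break` ending at `pos`, return the label if one is admitted, else undefined.
function breakLabel(source: string, pos: number): string | undefined {
  // Only spaces and tabs may separate `break` from its label (no line terminator).
  const m = /^([ \t]*)([A-Za-z_$][\w$]*)/.exec(source.slice(pos));
  if (m === null) return undefined; // a newline or a non-identifier comes first
  const word = m[2];
  return RESERVED.has(word) ? undefined : word; // reserved words are never labels
}

const labelled = breakLabel("break outer;", 5);
const beforeCase = breakLabel('break\ncase "X":', 5);
const sameLineReserved = breakLabel("break case", 5);
```

With either guard in place, `break` followed by a newline and `case` leaves `case` for the enclosing switch instead of consuming it as a label.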

johnsoncodehk added 10 commits June 10, 2026 00:10

johnsoncodehk changed the title ~~Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 2.57x~~ Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.97x Jun 9, 2026

johnsoncodehk changed the title ~~Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.97x~~ Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.24x Jun 9, 2026

johnsoncodehk added 9 commits June 10, 2026 03:17

johnsoncodehk changed the title ~~Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 1.24x~~ Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 0.86x Jun 9, 2026

johnsoncodehk added 3 commits June 10, 2026 05:14

johnsoncodehk added 3 commits June 10, 2026 06:22

johnsoncodehk added 12 commits June 10, 2026 07:35

johnsoncodehk changed the title ~~Profile the emitted parser against tsc and remove the dispatch it exposed: 4.41x -> 0.86x~~ Emitted parser to 0.84x tsc, span-only CST, FN=0 vs tsc's parse surface, generated per-arm destructurers Jun 10, 2026

johnsoncodehk wants to merge 42 commits into master from parser-exec-trace


Performance: 4.41× → 0.84× vs tsc parse-only

Peers: the reference parser of each ecosystem

CST contract: span-only, two shapes, nothing derivable

Acceptance: FN = 0 against tsc's parse surface

Generated consumer toolkit

Gates
