Enumerate constant strings from bounded regex quantifications (a?, a{n}, a{n,m}) #5860
Merged
staabm merged 1 commit on Jun 13, 2026
…`a{n}`, `a{n,m}`)
- In `RegexGroupParser::walkGroupAst()`, a `#quantification` node previously
always discarded the accumulated literals (`onlyLiterals(null)`), so any
quantified atom degraded a constant-string group to `non-falsy-string`.
- Add `getQuantifiedLiterals()`: for a bounded quantification over constant
literals it walks the quantified atom standalone, enumerates the repetition
via `repeatLiterals()` (`a?` => `'a'|''`, `a{2}` => `'aa'`, `a{1,2}` =>
`'a'|'aa'`), and cross-combines the result with the literals accumulated so
far. Unbounded quantifiers (`*`, `+`, `{n,}`) and non-literal atoms keep
returning `null` so the group stays non-constant.
- Reset `inOptionalQuantification` while walking the quantified atom so a
multi-token concatenation inside an optional group (e.g. `(a(bc)?d)`) keeps
accumulating its literals instead of being nulled.
- Add a named `LITERALS_LIMIT` constant to bail out to a plain string type
instead of exploding on deeply nested optional/bounded quantifications.
- Update `preg_match_shapes.php` expectations that now infer precise constant
unions (`(a|bc?)` => `'a'|'b'|'bc'`, `(a(b)?)` => `'a'|'ab'`, etc.).
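
For illustration, the enumeration described in the bullets above can be sketched roughly as follows. This is a standalone approximation, not the code in `RegexGroupParser`: the `sketch*` function names, the signatures, and the limit value of `16` are assumptions made for the example (the commit only names `repeatLiterals()` and `LITERALS_LIMIT`, not their exact shape or value).

```php
<?php
declare(strict_types=1);

// Hypothetical cap; the real LITERALS_LIMIT value is not stated in this PR text.
const SKETCH_LITERALS_LIMIT = 16;

/**
 * Concatenate every prefix with every suffix (cross product).
 *
 * @param list<string> $prefixes
 * @param list<string> $suffixes
 * @return list<string>|null null when the product would exceed the cap
 */
function sketchCrossCombine(array $prefixes, array $suffixes): ?array
{
    if (count($prefixes) * count($suffixes) > SKETCH_LITERALS_LIMIT) {
        return null;
    }
    $combined = [];
    foreach ($prefixes as $prefix) {
        foreach ($suffixes as $suffix) {
            $combined[] = $prefix . $suffix;
        }
    }
    return array_values(array_unique($combined));
}

/**
 * Enumerate every min..max-fold repetition of the given literals.
 *
 * @param list<string> $literals
 * @return list<string>|null null when the set would exceed the cap
 */
function sketchRepeatLiterals(array $literals, int $min, int $max): ?array
{
    $result = [];
    $current = ['']; // zero repetitions yield only the empty string
    for ($count = 0; $count <= $max; $count++) {
        if ($count > 0) {
            $current = sketchCrossCombine($current, $literals);
            if ($current === null) {
                return null;
            }
        }
        if ($count >= $min) {
            $result = array_values(array_unique(array_merge($result, $current)));
            if (count($result) > SKETCH_LITERALS_LIMIT) {
                return null; // bail out instead of enumerating further
            }
        }
    }
    return $result;
}

// `b?` contributes the empty string and one repetition.
$optionalB = sketchRepeatLiterals(['b'], 0, 1);   // ['', 'b']
// Prefix `a` combined with `b?`, as in `(ab?)`.
$group = sketchCrossCombine(['a'], $optionalB);   // ['a', 'ab']
```

Unbounded quantifiers have no finite `max`, so a sketch like this would simply not be called for them, which matches the "keep returning `null`" behavior described above.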
Summary
When PHPStan infers `preg_match()` group shapes, a single constant token followed by the `?` quantifier (e.g. `b?`) used to collapse the whole group to `non-falsy-string`, losing the constant strings. This PR makes bounded quantifications over constant literals enumerate their possible values so they can combine with the surrounding literals: `~(ab?)~` now infers `'a'|'ab'`, `~(colou?r)~` infers `'color'|'colour'`, etc.

Changes
- `src/Type/Regex/RegexGroupParser.php`
  - The `#quantification` branch of `walkGroupAst()` no longer unconditionally discards the accumulated literals. It now calls the new `getQuantifiedLiterals()` before nulling them, and restores the enumerated literals after walking the children.
  - `getQuantifiedLiterals()`: walks the quantified atom standalone, enumerates its repetitions, and cross-combines them with the literals accumulated before the quantifier. Returns `null` (non-constant) for unbounded quantifiers, non-literal atoms, or when the combination count exceeds the limit.
  - `repeatLiterals()`: builds the literal set for a `[min, max]` repetition (`a?` → `'a'|''`, `a{2}` → `'aa'`, `a{1,2}` → `'a'|'aa'`).
  - `LITERALS_LIMIT` constant guards against combinatorial explosion (deeply nested optionals fall back to a plain string type instead of enumerating).
  - `inOptionalQuantification` is reset so multi-token concatenations inside an optional (e.g. `(a(bc)?d)`) still accumulate their literals.
- `tests/PHPStan/Analyser/nsrt/preg_match_shapes.php`: updated expectations that now infer precise constant unions instead of `non-falsy-string` / `non-empty-string`.
- `tests/PHPStan/Analyser/nsrt/bug-14820.php`: new regression test.

Root cause
`walkGroupAst()` tracks the exact set of constant strings a group can match in `onlyLiterals`. Any `#quantification` node reset this to `null`, because the original constant-string support only handled plain concatenations and alternations. So the reported `?` case, and the analogous bounded quantifiers, all fell back to accessory string types.

The fix treats a bounded quantifier as the finite set of repetitions of its atom (`a?` = `''|'a'`, `a{n}` = the n-fold concatenation, `a{n,m}` = the union over `n..m`) and combines it with the prefix literals, using the same cross-product logic already used for alternations. Unbounded quantifiers remain non-enumerable and keep the previous behavior.

Analogous cases also handled (same quantifier axis)
- `?` / `{0,1}` (the reported case): `~(ab?)~` → `'a'|'ab'`.
- `{n}` exact repetition: `~(ab{2}c)~` → `'abbc'`.
- `{n,m}` bounded range: `~(ab{1,2}c)~` → `'abc'|'abbc'`.
- `*`, `+`, `{n,}` (unbounded): deliberately left non-constant (`~(ab*c)~` → `non-falsy-string`).
- Optional over a group or an alternation: `~(a(bc)?d)~` → `'abcd'|'ad'`, `~(a(b|c)?d)~` → `'abd'|'acd'|'ad'` (also required resetting the optional-quantification flag so multi-token atoms keep their literals).

Probed but intentionally out of scope: non-capturing groups (`(?:...)`) still null literals even without a quantifier. That is a separate pre-existing limitation, and accumulating their literals safely would require honoring inline modifiers like `(?i:...)`.

Test
`tests/PHPStan/Analyser/nsrt/bug-14820.php` covers the reported `?` case plus the analogous bounded quantifiers, optional-over-group, optional-over-alternation, and the unbounded `*` negative case. Verified the test fails without the source change. Existing `preg_match_shapes.php` assertions were tightened to the new, more precise constant unions.

Fixes phpstan/phpstan#14820
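
As an independent sanity check on the unions listed above (not part of this PR, and not a substitute for the nsrt test), the captured values can be brute-forced at runtime with plain `preg_match()`. The helper name `sketchObservedCaptures()` and the choice of alphabet and maximum length are assumptions made for the example. It only confirms what PCRE captures for short inputs; it does not exercise PHPStan's inference.

```php
<?php
declare(strict_types=1);

/**
 * Collect every distinct value of capture group 1 over all strings
 * built from $alphabet with length 0..$maxLength.
 *
 * @param list<string> $alphabet single-character strings
 * @return list<string> distinct captures, sorted as strings
 */
function sketchObservedCaptures(string $pattern, array $alphabet, int $maxLength): array
{
    $seen = [];
    $layer = ['']; // all candidate strings of the current length
    for ($length = 0; $length <= $maxLength; $length++) {
        foreach ($layer as $candidate) {
            if (preg_match($pattern, $candidate, $matches) === 1) {
                $seen[$matches[1]] = true;
            }
        }
        // Extend every candidate by one character for the next round.
        $next = [];
        foreach ($layer as $candidate) {
            foreach ($alphabet as $char) {
                $next[] = $candidate . $char;
            }
        }
        $layer = $next;
    }
    // Array keys may have been cast to int for numeric strings; cast back.
    $values = array_map('strval', array_keys($seen));
    sort($values, SORT_STRING);
    return $values;
}

$reported = sketchObservedCaptures('~(ab?)~', ['a', 'b'], 3);                     // ['a', 'ab']
$optionalGroup = sketchObservedCaptures('~(a(bc)?d)~', ['a', 'b', 'c', 'd'], 4);  // ['abcd', 'ad']
```

A check like this cannot say anything about the unbounded `*` case, since the set of captures keeps growing with the maximum length, which is consistent with leaving that case non-constant.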