Enumerate constant strings from bounded regex quantifications (a?, a{n}, a{n,m}) #5860

Merged
staabm merged 1 commit into phpstan:2.2.x from phpstan-bot:create-pull-request/patch-r88frxm
Jun 13, 2026

@phpstan-bot

Collaborator

Summary

When PHPStan infers preg_match() group shapes, a single constant token followed by the ? quantifier (e.g. b?) used to collapse the whole group to non-falsy-string, losing the constant strings. This PR makes bounded quantifications over constant literals enumerate their possible values so they can combine with the surrounding literals — ~(ab?)~ now infers 'a'|'ab', ~(colou?r)~ infers 'color'|'colour', etc.

Changes

  • src/Type/Regex/RegexGroupParser.php
    • The #quantification branch of walkGroupAst() no longer unconditionally discards the accumulated literals. It now calls the new getQuantifiedLiterals() before nulling them, and restores the enumerated literals after walking the children.
    • New getQuantifiedLiterals(): walks the quantified atom standalone, enumerates its repetitions, and cross-combines them with the literals accumulated before the quantifier. Returns null (non-constant) for unbounded quantifiers, non-literal atoms, or when the combination count exceeds the limit.
    • New repeatLiterals(): builds the literal set for a [min, max] repetition (a? => 'a'|'', a{2} => 'aa', a{1,2} => 'a'|'aa').
    • New named LITERALS_LIMIT constant guards against combinatorial explosion (deeply nested optionals fall back to a plain string type instead of enumerating).
    • While walking the quantified atom, inOptionalQuantification is reset so multi-token concatenations inside an optional (e.g. (a(bc)?d)) still accumulate their literals.
  • tests/PHPStan/Analyser/nsrt/preg_match_shapes.php — updated expectations that now infer precise constant unions instead of non-falsy-string / non-empty-string.
  • tests/PHPStan/Analyser/nsrt/bug-14820.php — new regression test.
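
The enumeration described in the list above can be sketched in plain PHP. This is a minimal, standalone illustration and not the code from RegexGroupParser: the function names (sketchCombine, sketchRepeat, sketchQuantified) and the cap value 128 are assumptions made for this example, since the description does not state the real value of LITERALS_LIMIT.

```php
<?php

declare(strict_types=1);

// Hypothetical cap for this sketch; the PR description does not state
// the real value of LITERALS_LIMIT.
const SKETCH_LITERALS_LIMIT = 128;

/**
 * Cross product: append every suffix to every prefix.
 *
 * @param list<string> $prefixes
 * @param list<string> $suffixes
 * @return list<string>|null null once the result exceeds the cap
 */
function sketchCombine(array $prefixes, array $suffixes): ?array
{
    $seen = [];
    foreach ($prefixes as $prefix) {
        foreach ($suffixes as $suffix) {
            $seen[$prefix . $suffix] = true; // array keys de-duplicate
            if (count($seen) > SKETCH_LITERALS_LIMIT) {
                return null;
            }
        }
    }

    // Numeric-looking keys come back as ints, so cast them to strings again.
    return array_map('strval', array_keys($seen));
}

/**
 * Enumerate the strings produced by repeating $atom between $min and $max times.
 *
 * @param list<string> $atom literals the quantified atom can match
 * @return list<string>|null null for unbounded quantifiers or oversized sets
 */
function sketchRepeat(array $atom, int $min, ?int $max): ?array
{
    if ($max === null || $min < 0 || $min > $max || $atom === []) {
        return null; // unbounded (*, +, {n,}) or nothing to enumerate
    }

    $current = ['']; // the atom repeated zero times
    $seen = [];
    for ($count = 0; $count <= $max; $count++) {
        if ($count > 0) {
            $current = sketchCombine($current, $atom);
            if ($current === null) {
                return null;
            }
        }
        if ($count < $min) {
            continue;
        }
        foreach ($current as $literal) {
            $seen[$literal] = true;
        }
        if (count($seen) > SKETCH_LITERALS_LIMIT) {
            return null;
        }
    }

    return array_map('strval', array_keys($seen));
}

/**
 * Combine the literals seen before the quantifier with the enumerated repetitions.
 *
 * @param list<string> $prefix
 * @param list<string> $atom
 * @return list<string>|null
 */
function sketchQuantified(array $prefix, array $atom, int $min, ?int $max): ?array
{
    $repeated = sketchRepeat($atom, $min, $max);

    return $repeated === null ? null : sketchCombine($prefix, $repeated);
}

echo json_encode(sketchQuantified(['a'], ['b'], 0, 1)), PHP_EOL;        // (ab?)     => ["a","ab"]
echo json_encode(sketchQuantified(['a'], ['b'], 1, 2)), PHP_EOL;        // (ab{1,2}) => ["ab","abb"]
echo json_encode(sketchQuantified(['a'], ['b', 'c'], 0, 1)), PHP_EOL;   // (a(b|c)?) => ["a","ab","ac"]
echo json_encode(sketchQuantified(['a'], ['b'], 0, null)), PHP_EOL;     // (ab*)     => null
```

Returning null instead of a partial list mirrors the behavior described for the limit: once the set grows past the cap, the caller falls back to a non-constant string type rather than reporting an incomplete union.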

Root cause

walkGroupAst() tracks the exact set of constant strings a group can match in onlyLiterals. Any #quantification node reset this to null, because the original constant-string support only handled plain concatenations and alternations. So the reported ? case, and the analogous bounded quantifiers, all fell back to accessory string types.

The fix treats a bounded quantifier as the finite set of repetitions of its atom (a? = ''|'a', a{n} = the n-fold concatenation, a{n,m} = the union over n..m) and combines it with the prefix literals — the same cross-product logic already used for alternations. Unbounded quantifiers remain non-enumerable and keep the previous behavior.
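
Written as sets, the same idea looks as follows. The notation is introduced here for illustration and does not appear in the PR: L(x) is the finite set of strings a literal-only atom x can match, P is the set of prefix literals accumulated before the quantifier, and ε is the empty string.

```math
L(x\{n,m\}) = \bigcup_{k=n}^{m} L(x)^{k},
\qquad L(x)^{0} = \{\varepsilon\},
\qquad L(x)^{k+1} = \{\, uv : u \in L(x)^{k},\ v \in L(x) \,\}

P \cdot L(x\{n,m\}) = \{\, pu : p \in P,\ u \in L(x\{n,m\}) \,\},
\qquad \lvert P \cdot L(x\{n,m\}) \rvert \le \lvert P \rvert \sum_{k=n}^{m} \lvert L(x) \rvert^{k}

\{\texttt{a}\} \cdot L(\texttt{b?}) = \{\texttt{a}\} \cdot \{\varepsilon, \texttt{b}\} = \{\texttt{a}, \texttt{ab}\}
```

The size bound is what makes a cap necessary: each additional optional or bounded atom can multiply the number of candidate literals.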

Analogous cases also handled (same quantifier axis)

  • ? / {0,1} (the reported case): ~(ab?)~ => 'a'|'ab'.
  • {n} exact repetition: ~(ab{2}c)~ => 'abbc'.
  • {n,m} bounded range: ~(ab{1,2}c)~ => 'abc'|'abbc'.
  • *, +, {n,} (unbounded): deliberately left non-constant (~(ab*c)~ => non-falsy-string).
  • Optional over a sub-group / alternation: ~(a(bc)?d)~ => 'abcd'|'ad', ~(a(b|c)?d)~ => 'abd'|'acd'|'ad' (also required resetting the optional-quantification flag so multi-token atoms keep their literals).
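
The unions listed above can be sanity-checked at runtime with preg_match() itself, independent of PHPStan. The sketch below is an illustration only: the helper name sketchCapturedValues and the sample subjects are made up for this example, and it checks observed captures against the listed unions rather than exercising PHPStan's inference.

```php
<?php

declare(strict_types=1);

/**
 * Collect the distinct values captured by group 1 across the given subjects.
 *
 * @param list<string> $subjects
 * @return list<string> sorted, de-duplicated captures
 */
function sketchCapturedValues(string $pattern, array $subjects): array
{
    $seen = [];
    foreach ($subjects as $subject) {
        if (preg_match($pattern, $subject, $matches) === 1) {
            $seen[$matches[1]] = true;
        }
    }

    // Numeric-looking keys come back as ints, so cast them to strings again.
    $values = array_map('strval', array_keys($seen));
    sort($values, SORT_STRING);

    return $values;
}

// Sample subjects chosen for illustration only.
$subjects = ['a', 'ab', 'abb', 'abc', 'abbc', 'abbbc', 'ad', 'abcd', 'abd', 'acd', 'xyz'];

// Pattern => union listed in the PR description for capture group 1.
$expected = [
    '~(ab?)~'      => ['a', 'ab'],
    '~(ab{2}c)~'   => ['abbc'],
    '~(ab{1,2}c)~' => ['abc', 'abbc'],
    '~(a(bc)?d)~'  => ['abcd', 'ad'],
    '~(a(b|c)?d)~' => ['abd', 'acd', 'ad'],
];

foreach ($expected as $pattern => $union) {
    $observed = sketchCapturedValues($pattern, $subjects);
    // Every observed capture must be a member of the listed union.
    $outside = array_diff($observed, $union);
    echo $pattern, ': ', $outside === [] ? 'within union' : 'outside union', PHP_EOL;
}

// The unbounded case has no fixed finite union: longer inputs keep adding values.
echo json_encode(sketchCapturedValues('~(ab*c)~', $subjects)), PHP_EOL;
```

A check like this can only confirm that observed captures stay inside a union for the inputs tried; the static guarantee comes from the enumeration in the analyser.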

Probed but intentionally out of scope: non-capturing groups ((?:...)) still reset the accumulated literals to null even without a quantifier. That is a separate pre-existing limitation, and accumulating their literals safely would require honoring inline modifiers like (?i:...).

Test

tests/PHPStan/Analyser/nsrt/bug-14820.php covers the reported ? case plus the analogous bounded quantifiers, optional-over-group, optional-over-alternation, and the unbounded * negative case. Verified the test fails without the source change. Existing preg_match_shapes.php assertions were tightened to the new, more precise constant unions.

Fixes phpstan/phpstan#14820

@staabm left a comment

Contributor

I love it

@staabm requested a review from VincentLanglet on June 13, 2026 07:34
…`a{n}`, `a{n,m}`)

- In `RegexGroupParser::walkGroupAst()`, a `#quantification` node previously
  always discarded the accumulated literals (`onlyLiterals(null)`), so any
  quantified atom degraded a constant-string group to `non-falsy-string`.
- Add `getQuantifiedLiterals()`: for a bounded quantification over constant
  literals it walks the quantified atom standalone, enumerates the repetition
  via `repeatLiterals()` (`a?` => `'a'|''`, `a{2}` => `'aa'`, `a{1,2}` =>
  `'a'|'aa'`), and cross-combines the result with the literals accumulated so
  far. Unbounded quantifiers (`*`, `+`, `{n,}`) and non-literal atoms keep
  returning `null` so the group stays non-constant.
- Reset `inOptionalQuantification` while walking the quantified atom so a
  multi-token concatenation inside an optional group (e.g. `(a(bc)?d)`) keeps
  accumulating its literals instead of being nulled.
- Add a named `LITERALS_LIMIT` constant to bail out to a plain string type
  instead of exploding on deeply nested optional/bounded quantifications.
- Update `preg_match_shapes.php` expectations that now infer precise constant
  unions (`(a|bc?)` => `'a'|'b'|'bc'`, `(a(b)?)` => `'a'|'ab'`, etc.).

@staabm force-pushed the create-pull-request/patch-r88frxm branch from 0e6b1ec to d3f4d3b on June 13, 2026 08:05
@staabm merged commit 5f35fb0 into phpstan:2.2.x on Jun 13, 2026
667 of 671 checks passed
@staabm deleted the create-pull-request/patch-r88frxm branch on June 13, 2026 08:26

Development

Successfully merging this pull request may close these issues.

regex-matching: infer constant strings from zero-or-one quantification
