mb_check_encoding doesn't respect UTF-8 RFC3629 #22279

@LLyaudet

Description

Hello,

RFC3629 says:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.

But neither mb_check_encoding nor preg_match('//u', $input) detects such characters when they occur.
I found this whilst validating my code here:
Seldaek/jsonlint#91
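
To make the input in question concrete, here is a minimal sketch in Python (not PHP, and not taken from the linked jsonlint code) of the byte sequences that RFC 3629 excludes. The helper names `encode_code_point` and `is_rfc3629_utf8` are hypothetical, and Python's built-in strict decoder is used as a stand-in for a conforming validator. It does not reproduce or confirm the PHP behaviour reported above.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point with the generic UTF-8 bit layout.

    'surrogatepass' lets U+D800..U+DFFF through, which is exactly
    the kind of input RFC 3629 says must not appear in UTF-8.
    """
    return chr(cp).encode("utf-8", errors="surrogatepass")


def is_rfc3629_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder.

    Assumption: the strict decoder is used here as a reference check;
    it rejects encoded surrogates, overlong forms, and truncated
    sequences.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Boundaries of the surrogate range and its valid neighbours.
SAMPLES = {
    "U+D7FF (valid)": encode_code_point(0xD7FF),
    "U+D800 (surrogate)": encode_code_point(0xD800),
    "U+DFFF (surrogate)": encode_code_point(0xDFFF),
    "U+E000 (valid)": encode_code_point(0xE000),
}

if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        print(f"{label}: bytes={raw.hex(' ')} strict_ok={is_rfc3629_utf8(raw)}")
```

An encoded surrogate is always the lead byte `ED` followed by a byte in `A0..BF`, so a reproduction case for a validator only needs the raw bytes `ED A0 80` (U+D800) or `ED BF BF` (U+DFFF). Note that a JSON escape such as `\ud800` is plain ASCII text and is a different input from these raw bytes.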

I had to comment out the fast path relying on mb_check_encoding and preg_match.

Maybe there are related security issues.

Have a nice day, best regards,
Laurent Lyaudet

PHP Version

PHP 8.5.4 (cli) (built: May 25 2026 12:19:37) (NTS)
Copyright (c) The PHP Group
Built by Ubuntu
Zend Engine v4.5.4, Copyright (c) Zend Technologies
    with Zend OPcache v8.5.4, Copyright (c), by Zend Technologies

Operating System

Ubuntu 26.04
