PCRE \p regular expression syntax

\p{LI} raises a warning in PHP 5.2.3

Marino

10:58 am on Dec 19, 2007 (gmt 0)

Hello,

According to the "reference.pcre.pattern.syntax" page:

-----------------------------
Unicode character properties

Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}
a character with the xx property
-----------------------------

... (provided you use the "u" modifier).

I've tried this rather minimal regexp:

preg_match_all("/(\p{LI}+)/u",$line,$words);

... which should match runs of lowercase letters, but I got:

Warning: preg_match_all() [function.preg-match-all]: Compilation failed: unknown property name after \P or \p at offset 6 in
/var/www/test/engine/engineGetWords.php on line 74

The '{' seems not to be supported in PHP 5.2.3. It's unexpected, as this regexp works fine, without warning:

preg_match_all("/(\pL+)/u",$line,$words);

Any explanation welcome.

Regards,

Marino

PHP_Chimp

6:28 pm on Dec 19, 2007 (gmt 0)

From the user comments on the pattern modifiers page -

For example, suppose we would like to search for the Japanese-standard circled numbers 1-9 (Unicode code points 0x2460-0x2468). To do this with hex codes, the following call should be used:
preg_match('/[\x{2460}-\x{2468}]/u', $str);

This should help to explain the use of the /u modifier.
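
A minimal sketch of that call in use, assuming PHP 7 or later (for the "\u{...}" string escape) and a UTF-8 subject string; the sample text and variable names are only for illustration:

```php
<?php
// Hypothetical sample input containing the circled numbers 2 and 9,
// written with PHP 7+ Unicode escapes so the bytes are valid UTF-8.
$str = "Step \u{2461} of \u{2468}";

// With /u, pattern and subject are both treated as UTF-8, so
// \x{2460}-\x{2468} is a range of code points, not of single bytes.
$count = preg_match_all('/[\x{2460}-\x{2468}]/u', $str, $matches);

// $matches[0] holds every circled number that was found.
print_r($matches[0]);
```

The character class works on whole code points here, which is what the /u modifier buys you.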

Marino

9:42 am on Dec 20, 2007 (gmt 0)

The use of the "u" modifier is fine, as it stands for 'utf-8'.

Oddly, I've tried your regexp, and I get no warning message. So why is \x{2460} legit on my system and \p{LI} is not?

Regards,

Marino

4:26 pm on Dec 20, 2007 (gmt 0)

...Damn...

After a cut/paste, I saw that it is "Ll", and not "LI".

preg_match_all("/(\p{Ll}+)/u",$line,$words);

-> a lowercase "l" (for "lowercase"), and not an uppercase "i"...
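
For anyone hitting the same warning, here is a small sketch of the difference, assuming a UTF-8 subject string and a script file saved as UTF-8 (the sample text and variable names are made up). The "@" is only there to silence the compilation warning so the return value can be checked:

```php
<?php
// Hypothetical sample line, UTF-8 encoded.
$line = "Été 2007: café OUVERT";

// \p{Ll}: uppercase "L" followed by lowercase "l", the Unicode
// "lowercase letter" property. It also matches accented lowercase letters.
preg_match_all('/(\p{Ll}+)/u', $line, $words);
print_r($words[1]);

// \p{LI}: uppercase "L" followed by uppercase "I" is not a known property
// name, so the pattern fails to compile and preg_match_all() returns false.
$bad = @preg_match_all('/(\p{LI}+)/u', $line, $ignored);
var_dump($bad);
```

Note that the "+" sits inside the parentheses, so each captured group is a whole run of lowercase letters rather than just its last character.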

Marino

PHP_Chimp

6:57 pm on Dec 21, 2007 (gmt 0)

I don't understand what you are using \p{Ll} for. If you are after matching the letter L in both upper and lowercase, then you don't need UTF-8 mode, as these characters are in the 'usual' ISO encoding.
The only time I use UTF-8 is when dealing with languages whose characters ISO-8859-1 doesn't cover. So you use UTF-8 and the hex encoding to allow those ranges of characters in the search. You don't need to use hex, as any of the other encodings work; UTF-8 just gives you access to all of those other characters that the UK/America don't need.

To be honest, I'm not sure if this is the only way it can be used, but that's the only way I've used it.