|splitting UTF-8 strings|
| 9:11 pm on Feb 21, 2006 (gmt 0)|
I'm using UTF-8 everywhere on my site, PHP 4.4.1 with mbstring, and everything is ok.
The only thing I can't get to work is "preg_match_all()" with the Unicode modifier "/u".
$text = "été" ;
preg_match_all('/[\w]+/u', $text, $matchs);
$matchs = $matchs[0];
$matchs would only contain "t" here, as if "é" were not alphanumeric!
In fact, what I'm trying to do is write a function that splits a UTF-8 string into words.
$mots = mb_split("\W", $text) ;
would do the job, but it seems that this function is not aware that some extended characters are not alphanumeric (e.g. "´").
Any help will be really appreciated.
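One way to sketch the word-splitting function described above is to avoid "\w" altogether and name the Unicode categories explicitly. This is only a sketch, not a confirmed fix for the setup in this thread: the helper name split_utf8_words() is made up for illustration, and it assumes the PCRE library behind PHP was built with UTF-8 and Unicode property support ("\p{...}"). The sample string is written with byte escapes so it does not depend on the editor's encoding.

```php
<?php
// Hypothetical helper (not from the thread): split a UTF-8 string into words.
// Assumption: PCRE was compiled with UTF-8 and Unicode property support.
function split_utf8_words($text)
{
    // \p{L} = any letter, \p{N} = any number; every other character
    // (spaces, punctuation, symbols such as U+00B4) acts as a delimiter.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $words === false ? array() : $words;
}

// "\xC3\xA9" is "é" and "\xC2\xB4" is "´", both as raw UTF-8 bytes.
print_r(split_utf8_words("\xC3\xA9t\xC3\xA9 aaa\xC2\xB4bbb"));
```

Unlike "\w", this character class does not treat the underscore as part of a word; add "_" inside the brackets if that behaviour is wanted.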
| 10:03 pm on Feb 21, 2006 (gmt 0)|
You have to set the mb_regex_encoding. It defaults to ISO-8859-1.
mb_regex_encoding('UTF-8');
$mots = mb_split('\W', $text);
| 3:45 am on Feb 22, 2006 (gmt 0)|
Thanks for the idea, madmac, but it still doesn't work. I didn't think I had to set mb_regex_encoding to UTF-8, because I have "mbstring.internal_encoding = UTF-8" in my php.ini.
But even if I use "mb_regex_encoding('UTF-8');" it doesn't work.
mb_split() still thinks that "´" is alphanumeric, even though I think this character should act as a delimiter.
$words = mb_split("\W", "aaa´bbb") ;
Or, if the editor you use doesn't fully support UTF-8 (the two characters "Â´" are how a Latin-1 editor displays the UTF-8 bytes of "´"):
$words = mb_split("\W", "aaaÂ´bbb") ;
I would be SO happy to find a solution for that problem...
I can't believe it's so hard, I'm only trying to get the words of a UTF-8 string. :-(
The only approach I found that seemed to work was:
preg_match_all('/[\w]+/u', "aaaÂ´bbb", $words);
on my Windows XP box.
On my Linux machine (my main web server) it doesn't work and I have no idea why!
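When the same pattern behaves differently on two machines, it can help to check what each PCRE build actually supports before changing any code. The following is a rough diagnostic sketch, not a definitive test: the function name pcre_unicode_report() is made up for illustration, and it assumes nothing about why the Linux box differs. Run it on both servers and compare the results.

```php
<?php
// Hypothetical diagnostic helper (not from the thread).
function pcre_unicode_report()
{
    $e_acute = "\xC3\xA9"; // "é" as raw UTF-8 bytes

    return array(
        // PCRE_VERSION is not defined on older PHP versions.
        'pcre_version'       => defined('PCRE_VERSION') ? PCRE_VERSION : 'unknown',
        // True if "/u" is accepted and "." consumes the two bytes as one character.
        'utf8_mode'          => @preg_match('/^.$/u', $e_acute) === 1,
        // True if Unicode property escapes such as \p{L} are available.
        'unicode_properties' => @preg_match('/^\p{L}$/u', $e_acute) === 1,
        // True if \w itself recognises the accented letter under "/u".
        'w_matches_accented' => @preg_match('/^\w$/u', $e_acute) === 1,
        // Encoding used by mb_split() and the other mb_ereg functions.
        'mb_regex_encoding'  => function_exists('mb_regex_encoding') ? mb_regex_encoding() : null,
    );
}

var_dump(pcre_unicode_report());
```

If 'utf8_mode' is false, the "/u" modifier is not usable on that build at all. If it is true but 'w_matches_accented' is false, "\w" is being interpreted without Unicode awareness there, and an explicit class such as "\p{L}" may be the safer choice where 'unicode_properties' is true.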
| 8:46 pm on Feb 22, 2006 (gmt 0)|
Do you think I should start a new topic with a more precise title?
"splitting UTF-8 strings" maybe...
| 5:48 pm on Feb 23, 2006 (gmt 0)|
Thanks for the title change!