splitting UTF-8 strings

tata668

9:11 pm on Feb 21, 2006 (gmt 0)

I'm using UTF-8 everywhere on my site, PHP 4.4.1 with mbstring, and everything is ok.

The only thing I can't get working is "preg_match_all()" with the Unicode modifier "/u". It just doesn't work.

$text = "été" ;
preg_match_all('/[\w]+/u', $text, $matchs);
$matchs = $matchs[0];

$matchs only contains "t" here, as if "é" were not alphanumeric!

In fact, what I'm trying to write is a function that splits a UTF-8 string into words.

I thought:

$mots = mb_split("\W", $text) ;

would do the job, but this function doesn't seem to be aware that some extended characters are not alphanumeric (e.g. "´").

Any help will be really appreciated.
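
A possible workaround for the "\w" behaviour described above, as a minimal sketch only: Unicode property classes such as \p{L} (letters) and \p{N} (numbers) do not depend on how \w is defined, so they can stand in for it under the "/u" modifier. This assumes PCRE was built with UTF-8 and Unicode property support, and the function name utf8_words is made up for illustration.

```php
<?php
// Hypothetical helper (not from the thread): return the words of a UTF-8 string.
// Assumes PCRE was compiled with UTF-8 and Unicode property support.
function utf8_words($text)
{
    // \p{L} = any letter, \p{N} = any number; "_" is added to mirror \w.
    $count = preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);
    if ($count === false) {
        // Invalid UTF-8 input, or a PCRE build without Unicode support.
        return array();
    }
    return $matches[0];
}

// "été aaa´bbb" written as raw UTF-8 bytes, so the file encoding does not matter.
$text  = "\xC3\xA9t\xC3\xA9 aaa\xC2\xB4bbb";
$words = utf8_words($text);
print_r($words);
```

Here "´" (U+00B4) is neither a letter nor a number, so it acts as a delimiter, while "é" is kept as part of the word.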

madmac

10:03 pm on Feb 21, 2006 (gmt 0)

You have to set the mb_regex_encoding. It defaults to ISO-8859-1.

mb_regex_encoding('UTF-8');
$mots = mb_split('\W', $text);
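
To see which encoding the mb_ereg family is actually using, rather than relying on a default, mb_regex_encoding() called without arguments returns the current value. A small sketch, assuming the mbstring extension was built with multibyte regex support (the sample string is made up):

```php
<?php
// Sketch: inspect and set the encoding used by mb_split()/mb_ereg().
// Assumes mbstring is loaded with multibyte regex support.
$before = mb_regex_encoding();   // current setting, as a string
mb_regex_encoding('UTF-8');      // switch the mb regex functions to UTF-8
$after = mb_regex_encoding();    // read the setting back

echo "before: $before, after: $after\n";

// Splitting on whitespace sidesteps any question about what \W covers.
// "\xC3\xA9" is "é" as raw UTF-8 bytes.
$mots = mb_split('\s+', "un \xC3\xA9t\xC3\xA9 chaud");
print_r($mots);
```

Printing the value before and after makes it easy to compare two servers that behave differently.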

tata668

3:45 am on Feb 22, 2006 (gmt 0)

Thanks for the idea madmac but it still doesn't work. I don't think I have to set mb_regex_encoding to UTF-8 because I have "mbstring.internal_encoding = UTF-8" in my php.ini.

But even if I use "mb_regex_encoding('UTF-8');" it doesn't work.

mb_split() still thinks that "´" is alphanumeric, even though I think this character should be treated as a delimiter.

Try it:

mb_regex_encoding('UTF-8');
$words = mb_split("\W", "aaa´bbb") ;

Or, if the editor you use doesn't fully support UTF-8:

mb_regex_encoding('UTF-8');
$words = mb_split("\W", "aaaÂ´bbb") ;

I would be SO happy to find a solution for that problem...

I can't believe it's so hard, I'm only trying to get the words of a UTF-8 string. :-(

The only way I found that seemed to work was:

preg_match_all('/[\w]+/u', "aaaÂ´bbb", $words);

on my WINDOWS XP box.

On my Linux machine (my main web server) it doesn't work and I have no idea why!
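
One possible explanation for the Windows/Linux difference, offered as a guess rather than a diagnosis, is that the two PCRE builds differ in their Unicode support. A quick probe, sketched below with a made-up function name, is to try a \p{L} match under "/u" on each machine and see whether it succeeds:

```php
<?php
// Hypothetical probe (not from the thread): does this PCRE build handle
// UTF-8 patterns and Unicode properties?
function pcre_has_unicode_support()
{
    // "\xC3\xA9" is "é" in UTF-8. The @ hides the compile warning that an
    // unsupported build would raise; preg_match() then returns false.
    return @preg_match('/^\p{L}$/u', "\xC3\xA9") === 1;
}

var_dump(pcre_has_unicode_support());

// PCRE's own version string, useful when comparing two machines.
// The constant only exists in newer PHP versions, hence the guard.
if (defined('PCRE_VERSION')) {
    echo PCRE_VERSION, "\n";
}
```

If the probe reports false on one machine only, the difference lies in that PHP/PCRE build rather than in the script.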

tata668

8:46 pm on Feb 22, 2006 (gmt 0)

Do you think I should start a new topic with a more precise title?

"splitting UTF-8 strings" maybe..

tata668

5:48 pm on Feb 23, 2006 (gmt 0)

Thanks for the title change!