

PHP Server Side Scripting Forum

    
splitting UTF-8 strings
tata668




msg:1278916
 9:11 pm on Feb 21, 2006 (gmt 0)

I'm using UTF-8 everywhere on my site (PHP 4.4.1 with mbstring), and everything works.

The only thing I can't get working is "preg_match_all()" with the Unicode modifier "/u".


$text = "été" ;
preg_match_all('/[\w]+/u', $text, $matchs);
$matchs = $matchs[0];

$matchs contains only "t" here, as if "é" were not alphanumeric!

What I'm actually trying to write is a function that splits a UTF-8 string into words.

I thought:


$mots = mb_split("\W", $text) ;

would do the job, but this function doesn't seem to know that some extended characters are not alphanumeric (e.g. "´").

Any help will be really appreciated.

 

madmac




msg:1278917
 10:03 pm on Feb 21, 2006 (gmt 0)

You have to set the mb_regex_encoding. It defaults to ISO-8859-1.

mb_regex_encoding('UTF-8');
$mots = mb_split('\W', $text);
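
To make this suggestion easy to verify, here is a minimal sketch that sets the regex encoding, splits a string, and reads the setting back. The helper name `split_words_mb` and the sample string are illustrative assumptions, not from the thread, and whether `\W` treats a particular character as a delimiter depends on the regex library bundled with your mbstring build.

```php
<?php
// Hypothetical helper (name not from the thread): split a string on
// non-word characters using mbstring's regex functions.
function split_words_mb($text)
{
    // Make the mb_* regex functions interpret pattern and input as UTF-8.
    mb_regex_encoding('UTF-8');

    // A leading or trailing delimiter produces an empty string in the
    // result, so drop empty entries and reindex the array.
    $parts = mb_split('\W+', $text);
    return array_values(array_filter($parts, 'strlen'));
}

$words = split_words_mb("été, hiver");

// Called without arguments, mb_regex_encoding() returns the current setting.
echo mb_regex_encoding(), "\n";
print_r($words);
```

Reading the encoding back with `mb_regex_encoding()` is a quick way to check whether a php.ini setting has already taken effect before you change anything in code.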

tata668




msg:1278918
 3:45 am on Feb 22, 2006 (gmt 0)

Thanks for the idea, madmac, but it still doesn't work. I didn't think I had to set mb_regex_encoding to UTF-8, because I have "mbstring.internal_encoding = UTF-8" in my php.ini.

But even when I call "mb_regex_encoding('UTF-8');" it doesn't work.

mb_split() still treats "´" as alphanumeric, even though I think this character should act as a delimiter.

Try it:


mb_regex_encoding('UTF-8');
$words = mb_split("\W", "aaa´bbb") ;

Or, if the editor you use doesn't fully support UTF-8, with the character written as escaped bytes:


mb_regex_encoding('UTF-8');
$words = mb_split("\W", "aaa\xC2\xB4bbb") ;

I would be SO happy to find a solution for that problem...

I can't believe it's so hard; I'm only trying to get the words of a UTF-8 string. :-(

The only approach I found that seemed to work was:


preg_match_all('/[\w]+/u', "aaa´bbb", $words);

on my Windows XP box.

On my Linux machine (my main web server) it doesn't work, and I have no idea why!
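
One possible explanation for the platform difference is that `\w` in PCRE has historically been tied to ASCII or locale tables rather than to Unicode. That is an assumption about the setup, not something confirmed in this thread. The sketch below sidesteps `\w` and uses Unicode property classes instead (`\p{L}` for letters, `\p{N}` for numbers). It requires a PCRE build with UTF-8 and Unicode property support, and the function name `utf8_words` is hypothetical.

```php
<?php
// Hypothetical helper (name not from the thread): extract the words of a
// UTF-8 string without relying on how \w is defined on this platform.
function utf8_words($text)
{
    // \p{L} matches any Unicode letter, \p{N} any Unicode number.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/[\p{L}\p{N}_]+/u', $text, $matches);

    // preg_match_all() returns false on error, for example when the input
    // is not valid UTF-8 or the pattern cannot be compiled.
    if ($count === false) {
        return array();
    }
    return $matches[0];
}

// "\xC2\xB4" is the UTF-8 byte sequence for the acute accent character.
print_r(utf8_words("aaa\xC2\xB4bbb été"));
```

If the text may contain decomposed accents (a base letter followed by a combining mark), adding `\p{M}` to the character class is one option to keep such words in one piece.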

tata668




msg:1278919
 8:46 pm on Feb 22, 2006 (gmt 0)

Do you think I should start a new topic with a more precise title?

"splitting UTF-8 strings", maybe?

tata668




msg:1278920
 5:48 pm on Feb 23, 2006 (gmt 0)

Thanks for the title change!


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved