Forum Moderators: coopster


How to access a single utf-8 character inside a string

         

Grimmjow

1:05 pm on Apr 28, 2010 (gmt 0)

10+ Year Member



edit: I apologize for the typos in the title. It should be "How to access"

I've never come across this problem before, and I cannot believe this is not working as expected. I hope I'm missing something.

<?php
mb_internal_encoding("UTF-8");

$s = "Seà ut perspiciatis.";

for($i=0;$i<mb_strlen($s);$i++) {
    echo $s[$i];
}
?>


This piece of code displays this:

Seà ut perspiciatis


It misses the final ".", I guess because mb_strlen counts the correct number of multibyte characters (20) but the [] operator does not. The characters still read correctly on screen, but the loop stops before reaching the end of the string.

How can I display all 20 mb chars inside the loop? Obviously in my application I need them separated to do some stuff, I cannot do a simple echo $s :)

Thank you.

jatar_k

1:27 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



increase the iterations by 1?

for($i=0;$i<=mb_strlen($s);$i++) {

andrewsmd

1:32 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Or do <=. If you use less than, the last character will never be displayed the way you are outputting it.

Grimmjow

1:52 pm on Apr 28, 2010 (gmt 0)

10+ Year Member



The "increase by 1" depends on how many multibyte characters are in the string. à is 2 bytes, so the total byte count is the number of visible characters + 1, but if I had àè it would be +2, and other utf-8 chars are 3 or 4 bytes, so it cannot be handled this way.

<= is similar to the above: it is not always +1, it could be +5 or +10, depending on how many and which multibyte characters are in the string.

For now I found this function:

mb_substr($s,$position,1) grabs one character (a, à or whatever) at position $position, so the for loop will work as expected this way:

<?php
for($i=0;$i<mb_strlen($s);$i++) {
    echo mb_substr($s,$i,1);
}
?>


I think this is the only way ...
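
For what it's worth, another way to get at the separate characters is to split the string once and loop over the result. This is only a minimal sketch, not something posted in this thread: it assumes the string is valid UTF-8, and it writes à as the byte escape "\xC3\xA0" so the example does not depend on how the file itself is saved.

```php
<?php
// "\xC3\xA0" is the two-byte UTF-8 encoding of "à".
$s = "Se\xC3\xA0 ut perspiciatis.";

// An empty pattern with the /u modifier matches between every UTF-8 character.
// PREG_SPLIT_NO_EMPTY drops the empty strings at the start and end.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

// Each array element is one whole character, however many bytes it takes.
foreach ($chars as $i => $char) {
    echo $i, ':', $char, "\n";
}
```

The array is built in one pass, so each character is then available by index without calling mb_substr() on every iteration.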

Readie

1:54 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



for($i=0;$i<=mb_strlen($s);$i++) { 

Very inefficient coding. You are calling the mb_strlen() function on every iteration of the loop. Do this instead:

// Compute the length once, outside the loop
$count = mb_strlen($s);
for($i = 0; $i < $count; $i++) {
    echo mb_substr($s, $i, 1);
}

jatar_k

2:37 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



>> The "increase by 1" depends by how many multibyte characters are in the string.

I am not sure this is correct. Your example above suggests this isn't true, or the loop wouldn't be able to correctly output the multibyte char; it would have tried to output each byte separately

echo $s[$i]; seems to properly output the char, not the byte

though I guess it could be tested

Grimmjow

2:43 pm on Apr 28, 2010 (gmt 0)

10+ Year Member



@Readie: thanks for the suggestion :)

@jatar_k: I had the same doubt. I suppose that, in this test case, the browser does the magic: when it finds the 2 bytes that form the à, one right after the other, it combines them and shows me the single character, based on the utf-8 content type declaration...

I think in the end, [] always reads a single byte, and mb_substr(x,y,1) a single character. Otherwise there wouldn't be a way to read a single byte, if anyone had the need to do so.
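
The byte versus character difference can be checked directly. The snippet below is a small sketch under the assumption that the string holds a single space after the à (as it is displayed in this thread) and that the mbstring extension is loaded. The variable names are only for illustration.

```php
<?php
// "\xC3\xA0" is the two-byte UTF-8 encoding of "à".
$s = "Se\xC3\xA0 ut perspiciatis.";

$byteCount = strlen($s);              // counts bytes
$charCount = mb_strlen($s, 'UTF-8');  // counts characters

echo "bytes: $byteCount, characters: $charCount\n";

// [] indexes bytes: offsets 2 and 3 hold the two halves of "à".
echo bin2hex($s[2]), ' ', bin2hex($s[3]), "\n";

// mb_substr indexes characters: position 2 is the whole "à".
$third = mb_substr($s, 2, 1, 'UTF-8');
echo bin2hex($third), "\n";

// Reproduce the original loop: one iteration per character,
// but each iteration reads only one byte.
$seen = '';
for ($i = 0; $i < $charCount; $i++) {
    $seen .= $s[$i];
}
echo $seen, "\n";
```

With one 2-byte character in the string, the loop runs one iteration short of the byte count, which is why only the final "." goes missing.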

jatar_k

2:48 pm on Apr 28, 2010 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



quick test using this code

mb_internal_encoding("UTF-8");
// i used a double space like you did in the following string but the forum software keeps cutting it to 1
$s = "Seà ut perspiciatis.";

$counter = 0;
while (isset($s[$counter])) {
    echo '<br />',$counter,':',$s[$counter];
    $counter++;
}


output

0:S
1:e
2:&#65533;
3:
4:
5:u
6:t
7:
8:p
9:e
10:r
11:s
12:p
13:i
14:c
15:i
16:a
17:t
18:i
19:s
20:.

ok the output switching and the space stripping are difficult but do the same test yourself and see
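
Since the browser hides or mangles the lone bytes, a variant of the same test that prints each byte as hex may be easier to read. This is a hedged sketch, again assuming a single space after the à; under that assumption, offsets 2 and 3 would be the two bytes of à and offset 4 the space.

```php
<?php
// "\xC3\xA0" is the two-byte UTF-8 encoding of "à".
$s = "Se\xC3\xA0 ut perspiciatis.";

// Walk the string byte by byte and record each offset with its hex value,
// so bytes that are not printable on their own still show up.
$lines = [];
for ($i = 0, $n = strlen($s); $i < $n; $i++) {
    $lines[] = $i . ':' . bin2hex($s[$i]);
}

echo implode("\n", $lines), "\n";
```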

Grimmjow

12:13 am on Apr 29, 2010 (gmt 0)

10+ Year Member



Ok thanks I got it