Forum Moderators: coopster & jatar k

Cyrillic strlen

     

fm86

2:05 pm on Jul 14, 2011 (gmt 0)


Good day everybody!

I have a problem validating a form whose textarea contains Cyrillic characters. I want any text longer than 450 characters to be rejected.
In JavaScript the string has length 440, so it passes the pre-submit check. But in PHP, strlen(stripslashes($_POST['text'])) returns 792, so the text is rejected. How can I solve this?

Thanks a lot!

This is my test string:
ет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им открет им откр

penders

3:07 pm on Jul 14, 2011 (gmt 0)


It sounds like your $_POST['text'] is a multi-byte string, i.e. it contains Unicode characters that are encoded with more than one byte each. strlen() returns the number of bytes, not the number of characters, so the figure is too high. You probably need to call mb_strlen() [uk3.php.net] instead, which counts the number of characters in a given encoding.
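
A minimal sketch of the difference, assuming the form data and this source file are both UTF-8 and that the mbstring extension is installed. The variable names and the repeated ten-character phrase are illustrative; the 450 limit is the one from the question.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal below holds UTF-8 bytes.
// 44 repetitions of a 10-character phrase (8 Cyrillic letters + 2 spaces).
$text = str_repeat('ет им откр', 44);

$byteCount = strlen($text);              // counts bytes
$charCount = mb_strlen($text, 'UTF-8');  // counts characters in the given encoding

// Limit taken from the question; applied to characters, not bytes.
$maxLength = 450;
$accepted  = $charCount <= $maxLength;

printf(
    "bytes: %d, characters: %d, accepted: %s\n",
    $byteCount,
    $charCount,
    $accepted ? 'yes' : 'no'
);
// prints "bytes: 792, characters: 440, accepted: yes"
```

Passing the encoding explicitly avoids depending on the server's default internal encoding.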

fm86

3:32 pm on Jul 14, 2011 (gmt 0)


You are right, it works. Thanks a lot!

lucy24

7:59 pm on Jul 14, 2011 (gmt 0)


It sounds like your $_POST['text'] is a multi-byte string, i.e. it contains Unicode characters that are encoded with more than one byte each. strlen() returns the number of bytes, not the number of characters, so the figure is too high.

Many string-length functions in many languages (computer languages, not human ones) run into the same problem. In UTF-8, the more common non-Roman scripts, along with the non-ASCII half of Latin-1, all fall in the two-byte range and therefore get counted double, but only the letters, not spaces or plain punctuation. ("Vanilla" punctuation is in the one-byte range, but if they've used anything fancy like curly quotes you are in still deeper trouble, because now you're in the three-byte range.)
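
To make the byte ranges concrete, here is a small sketch that prints how many bytes strlen() reports for a few single characters. It assumes the source file is saved as UTF-8; the sample characters and labels are illustrative choices, not taken from the thread.

```php
<?php
// Assumption: this file is saved as UTF-8, so each key below holds UTF-8 bytes.
$samples = [
    'a' => 'ASCII letter',
    ' ' => 'space',
    ',' => 'plain punctuation',
    'é' => 'Latin-1 non-ASCII letter',
    'т' => 'Cyrillic letter',
    '’' => 'curly quote',
];

$widths = [];
foreach ($samples as $char => $label) {
    // strlen() on a single character gives its encoded width in bytes.
    $widths[$char] = strlen($char);
    printf("%-26s %d byte(s)\n", $label, $widths[$char]);
}
```

This matches the numbers in the question if the test string is 352 Cyrillic letters and 88 spaces: 352 × 2 + 88 = 792 bytes for 440 characters.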

When necessary you can take advantage of the fact that certain bytes only occur as the first byte of a multi-byte character: in the case of Cyrillic, D0 - D4 (two-byte characters), or E2 for fancy punctuation such as curly quotes (three-byte characters).
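
A rough sketch of that lead-byte idea, for cases where mb_strlen() is unavailable. The function name cyrillic_text_length is hypothetical, and the sketch assumes valid UTF-8 input containing only ASCII, Cyrillic letters, and E2-led punctuation such as curly quotes; anything else would be miscounted.

```php
<?php
// Hypothetical helper, not a built-in. Assumes valid UTF-8 that contains only
// ASCII, Cyrillic (lead bytes D0-D4) and E2-led punctuation such as curly quotes.
function cyrillic_text_length(string $s): int
{
    // Without the /u modifier, PCRE matches byte by byte.
    // Each two-byte character contributes one byte beyond its character count.
    $twoByteLeads = preg_match_all('/[\xD0-\xD4]/', $s);
    // Each three-byte character contributes two extra bytes.
    $threeByteLeads = preg_match_all('/\xE2/', $s);

    return strlen($s) - $twoByteLeads - 2 * $threeByteLeads;
}

$plain  = cyrillic_text_length(str_repeat('ет им откр', 44)); // 792 bytes in
$quoted = cyrillic_text_length('“да”');                       // 10 bytes in

printf("%d %d\n", $plain, $quoted); // prints "440 4"
```

Because continuation bytes always lie in the range 80 - BF, a lead byte such as D0 or E2 can never be mistaken for the middle of another character.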
 
