form validation for chinese - problem

         

kbts

4:19 am on May 6, 2008 (gmt 0)

10+ Year Member



Hi,

I'm trying to validate a text field on my web form that accepts English letters, numbers, the hyphen, the underscore and Chinese characters. As I understand it, Chinese characters are encoded as something like: &#(some numbers);

--------------------------------
Thus, I try to validate by:
if(mb_eregi("^[[:alnum:]_-&#;]{2,240}$", stripslashes(trim($_POST['input'])))){
$input = escape_data($_POST['input']);
} else {
$u = FALSE;
echo '<p><font color="red" size="1">Error!</font></p>';
}

It goes into the 'else' and gives me this error: mb_eregi() [function.mb-eregi]: mbregex compile err: empty range in char class

-----------------------------------
Then, I try:
if(eregi("^[a-zA-Z0-9_-#&;]{2,240}$", stripslashes(trim($_POST['input'])))){
$input = escape_data($_POST['input']);
} else {
$input = FALSE;
echo '<p><font color="red" size="1">Error!</font></p>';
}

It goes into the 'else' and I get this error: eregi() [function.eregi]: REG_ERANGE

--------------------------------------
Then, I try:

if(mb_eregi("^[a-zA-Z0-9_-#&;]{2,240}$", stripslashes(trim($_POST['input'])))){
$input = escape_data($_POST['input']);
} else {
$input = FALSE;
echo '<p><font color="red" size="1">Error!</font></p>';
}

It goes into the 'else' and I get this error: mb_eregi() [function.mb-eregi]: mbregex compile err: empty range in char class

Any help in this matter (on validating for both chinese and english for a single textfield) would be very much appreciated.

Thank you,
kbts

PHP_Chimp

12:44 pm on May 6, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try escaping the - within the character class. As written, the - between _ and & is read as a range, which is probably not what you expect.

("^[[:alnum:]_\-&#;]
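To show why the compiler reports an empty range, here is a minimal sketch. It uses preg_match() rather than ereg (my substitution, since the ereg functions are gone from newer PHP versions), and is_plain_field() is a made-up helper name, not something from your code.

```php
<?php
// In "_-&" the hyphen asks for the range from "_" to "&".
// ord() shows that the start of that range sits above its end,
// so there is nothing in between: an empty range.
$start = ord('_');
$end   = ord('&');
$rangeIsReversed = $start > $end;

// Hypothetical helper: with the hyphen moved to the end of the class
// it is taken as a literal character instead of a range operator.
function is_plain_field(string $value): bool
{
    return preg_match('/^[A-Za-z0-9_&#;-]{2,240}$/', $value) === 1;
}
```

Putting the hyphen last (or first) avoids having to remember whether a given regex flavour honours the backslash inside a class.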

Would you not be better looking for English OR Chinese?

"^[A-Za-z0-9_-]|&#[0-9]{1,5};$"
The first alternative matches the English characters and the second matches any encoded character. You could alter the pattern to suit the encoding that you are using, as you may or may not need a-f to cope with it.
You may also need to increase the number of digits allowed in the Chinese pattern, as I can't remember how many you need (I think it was 5, but I may well be wrong).

kbts

2:20 am on May 7, 2008 (gmt 0)

10+ Year Member



Yes! Thanks PHP_Chimp! I tried "^[A-Za-z0-9_-]|&#[0-9]{1,5};$" with some simple tests and it's working great. (Yes, you're right about it being 5 digits.)

I'm a beginner at php and this string pattern topic. :p

I have a question though. I tried "^[[:alnum:]-_]|&#[0-9]{1,5};$"
and I get the following results. (In case anyone doesn't know, &#39295; is a Chinese character.)

This string gives me an error: !-_-_@we&#39295;!ic-_
but this string doesn't: -_-_@we&#39295;!ic-_

Similarly, this string gives me an error: @-_-_@we&#39295;!ic-_
while this string doesn't: -_-_@we&#39295;!ic-_

Why may that be? I thought 'alnum' allowed letters and digits only, but it seems to allow the @ and ! also as long as they're not the first character of the input.

Thanks!
kbts

PHP_Chimp

5:42 am on May 7, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The pattern is only a single character long. So you need to stick a + in there to make this 1 or more, or use {} if you want to limit the number of characters this pattern will match, maybe the {2,240} from your first pattern.

"^(?:[[:alnum:]_-]|&#[0-9]{5};)+$"

If all Chinese characters are 5 digits long then this should work, as now the 'Chinese' pattern is a fixed length.
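The reason @ and ! slipped through after the first character is that, without a group, the alternation splits the anchors: the pattern reads as "starts with one allowed character" OR "ends with an encoded character". A small sketch of the difference, using preg_match() and a-z in place of [:alnum:] (both my assumptions, not what you are running):

```php
<?php
// Ungrouped: parsed as (^[A-Za-z0-9_-]) OR (&#[0-9]{5};$).
$ungrouped = '/^[A-Za-z0-9_-]|&#[0-9]{5};$/';
// Grouped and repeated: every unit from start to end has to match.
$grouped = '/^(?:[A-Za-z0-9_-]|&#[0-9]{5};)+$/';

$sample = '-_-_@we&#39295;!ic-_';

// The first alternative is satisfied by the leading "-" alone,
// so the "@" and "!" further in are never looked at.
$ungroupedResult = preg_match($ungrouped, $sample);
// Here the "@" and "!" make the whole match fail.
$groupedResult = preg_match($grouped, $sample);
// A clean string still passes the grouped pattern.
$cleanResult = preg_match($grouped, 'we&#39295;ic-_');
```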

Within a character class the - specifies a range of characters, so unless it is the first or last character you should escape it. As I don't usually use ereg, I am not sure whether having - after [:alnum:] will produce some sort of weird range of characters. So unless you are sure about the effect of leaving the - unescaped, either put it at the end or escape it.

[edited by: eelixduppy at 7:00 am (utc) on May 9, 2008]
[edit reason] disabled smileys [/edit]

kbts

6:16 am on May 9, 2008 (gmt 0)

10+ Year Member



Thanks for your response PHP_Chimp. I've been trying to test this myself before I post back.

I tried this pattern:
"^([[:alnum:]_-])|(&#[[:digit:]]{5};){2,16}$"

and it lets a@a pass but errors on @aa. This looks like what was happening before, so I thought maybe my {2,16} isn't applied to both sides of the OR.

So, I tried putting a bracket around it:
"^(([[:alnum:]_-])|(&#[[:digit:]]{5};)){2,16}$"

When I tried the input &#25105;b&#25105;_- it errors, but every one of those characters is either alphanumeric, an underscore, a hyphen or a Chinese character.

What do you see wrong in this?

Thanks very much for your help. I wouldn't have thought of the OR or the chinese pattern :p

kbts

[edited by: eelixduppy at 7:00 am (utc) on May 9, 2008]
[edit reason] disabled smileys [/edit]

kbts

6:22 am on May 9, 2008 (gmt 0)

10+ Year Member



Oh, by the way, for your pattern: "^(?:[[:alnum:]_-]|&#[0-9]{5};)+$"

Is the ? being applied to the [[:alnum:]_-], thus the : after the ??

Thanks,
kbts

[edited by: eelixduppy at 7:01 am (utc) on May 9, 2008]
[edit reason] disabled smileys [/edit]

PHP_Chimp

7:22 pm on May 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The (?: means that the group is not captured. I was using the ()s only to group the 'English' or 'Chinese' alternatives. Since you were not using the captured value in your first example, not capturing should make this run a bit faster (probably only a thousandth of a second or so...so not really a huge issue).
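A quick sketch of what "not captured" means in practice. The patterns and variable names here are just for illustration:

```php
<?php
// Capturing group: the last text matched by the group is stored
// as an extra element of the matches array.
preg_match('/^(ab|cd)+$/', 'abcd', $captured);
// Non-capturing group: same grouping, nothing extra is stored.
preg_match('/^(?:ab|cd)+$/', 'abcd', $plain);

$capturedCount = count($captured); // whole match plus group 1
$plainCount    = count($plain);    // whole match only
```

So the ?: belongs to the opening bracket, not to anything before it.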

if (preg_match('%^(?:[\w-]|&#[0-9]{5};){2,16}$%', $_POST['input']) === 1) {
$input = escape_data($_POST['input']);
}
else {
$u = FALSE;
echo '<p><font color="red" size="1">Error!</font></p>';
}

Using \w (word character) may help with the Chinese problem, as it is locale-specific. Assuming a Chinese locale on the server, Chinese characters would fall into the word character class, so you may then be able to get rid of the alternate pattern.

I have changed to preg, as that is what I usually use. So hopefully that works for you.
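On the \w idea: as far as I know the locale only affects single-byte characters, so it may not help with multibyte input. If the form data arrives as UTF-8 (an assumption on my part), a sketch along these lines might be closer. The /u modifier and \p{Han} are standard PCRE features; is_valid_name() is a made-up name.

```php
<?php
// Hypothetical helper: accepts 2 to 16 characters drawn from
// ASCII letters, digits, "_", "-" and Han (Chinese) characters.
function is_valid_name(string $value): bool
{
    // /u makes PCRE treat both pattern and subject as UTF-8,
    // so one Chinese character counts as one character, not three bytes.
    return preg_match('/^[A-Za-z0-9_\p{Han}-]{2,16}$/u', $value) === 1;
}
```

This only matches real characters in the input, not the typed-out &#12345; form, so it depends on what the browser actually submits.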

[edited by: eelixduppy at 8:20 pm (utc) on May 9, 2008]
[edit reason] disabled smileys [/edit]

kbts

9:02 pm on May 10, 2008 (gmt 0)

10+ Year Member



Hi, at this point, I don't think it's a pattern problem anymore. I broke up the problem into a smaller piece, so now, I'm only validating that the input is a Chinese character.

I tried this pattern:
if(eregi("^&#[0-9]{5};$", $_POST['input'])){
$u = escape_data($_POST['input']);
}
else {
$u = FALSE;
echo '<p class="err">Error</p>';
}

If $input is '&#(5 digits);' (actually typing out the 8 characters into the HTML form), and that string is a valid Chinese character, it doesn't error.
However, if $input is a Chinese character (typing a Chinese character into the HTML form), it errors.
The same thing happens if I use mb_eregi() (which gives multibyte character support) instead of eregi().

So, I think this problem might have something to do with the encoding.

I found a list of PHP functions that deal with multibyte characters: [ca.php.net...] which may help in this matter. I'm exploring this list, but any ideas are very welcome.
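To make the difference between the two inputs concrete, here is a rough sketch. It assumes the form data is UTF-8 and uses preg_match() in place of eregi(); the variable names are mine.

```php
<?php
$typedEntity = '&#25105;';  // eight ASCII characters typed by hand
$rawChar     = "\u{6211}";  // the same character as UTF-8 bytes

// strlen() counts bytes, which shows how different the two inputs are.
$entityLength = strlen($typedEntity);
$rawLength    = strlen($rawChar);

// The entity pattern can only ever match the typed-out form.
$pattern = '/^&#[0-9]{5};$/';
$entityMatches = preg_match($pattern, $typedEntity);
$rawMatches    = preg_match($pattern, $rawChar);

// html_entity_decode() turns the typed-out form into the real character.
$decoded = html_entity_decode($typedEntity, ENT_QUOTES, 'UTF-8');
$same = ($decoded === $rawChar);
```

So a pattern written for the &#...; form and a pattern written for real characters are checking two different things.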

Thanks,
kbts

kbts

12:54 am on May 11, 2008 (gmt 0)

10+ Year Member



And PHP_Chimp, I did try with your preg_match pattern, but it's erroring if I have Chinese characters in the input. (It errors on @ and ! just fine though :)) It also errors when the $input is '&#(5 digits);'

Thank you for your continual responses,
kbts

kbts

6:34 am on May 11, 2008 (gmt 0)

10+ Year Member



I think I finally got it.

Here's the solution:
- Make sure you're using the same encoding throughout. (I'm using utf-8)

To check your current encodings:

echo "current mb_internal_encoding: ".mb_internal_encoding()."<br />";
echo "current mb_regex_encoding: ".mb_regex_encoding()."<br />";

To change encodings to utf-8:
mb_internal_encoding("UTF-8"); 
mb_regex_encoding("UTF-8");

if(mb_eregi("^[[:alnum:]_-]*$", $_POST['input'])){
if ((mb_strlen($_POST['input']) < 2) || (mb_strlen($_POST['input']) > 16)) {
echo '<p><font color="red" size="1">Error: Input must be between 2-16 characters.</font></p>';
}
else {
$u = $_POST['input'];
}
} else {
$u = FALSE;
echo '<p><font color="red" size="1">Error</font></p>';
}
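
The same check could also be wrapped in a function, roughly like the sketch below. It assumes the mbstring extension is loaded and that the input is UTF-8. username_error() is a name I made up, I used \A and \z instead of ^ and $ so the anchors always refer to the whole string, and mb_ereg() instead of mb_eregi() since [:alnum:] already covers both cases.

```php
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// Hypothetical helper: returns null when the value is acceptable,
// otherwise a short error message.
function username_error(string $value): ?string
{
    // Character check first: letters, digits, "_" and "-" only.
    if (!mb_ereg('\A[[:alnum:]_-]*\z', $value)) {
        return 'Only letters, digits, "_" and "-" are allowed.';
    }
    // Length is counted in characters, not bytes.
    $length = mb_strlen($value);
    if ($length < 2 || $length > 16) {
        return 'Input must be between 2 and 16 characters.';
    }
    return null;
}
```

Keeping the two checks separate is what lets the length error have its own message.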

I hope this helps someone else on the internet. :)

Cheers,
kbts