Regex on simple intl' text

Forum Moderators: coopster

Text, but nothing but text

Notawiz

1:26 pm on Dec 23, 2004 (gmt 0)

Like on many websites, I use some textarea inputs on forms.
I've read many articles and tutorials through a Google search, but so far I haven't found a simple regex to apply to my textareas.

What I want is to let users write ordinary text in English, French, German or Spanish, with all the normal text features such as punctuation, spaces, line feeds and carriage returns.

And validate that it is text, but nothing but text (no commands or html etc.).

So I tried something like:
$valid_comment = eregi("^([a-zA-Zà-üÀ-Ü0-9 \.\!\?\'\-]+)$", $comment);
but it returns false.
Even [[:alnum:][:space:][:punct:]] did not do the trick.
I don't know whether it is because of the accents, the line feeds or the punctuation, but whatever French text I enter, it says that it is not valid.

Most of the tutorials I've read focus on <input> field validation, but not on what I'm looking for.

Could someone supply a link to useful information?

Regards. (and happy christmas, by the way).

RonPK

4:15 pm on Dec 23, 2004 (gmt 0)

> $valid_comment = eregi("^([a-zA-Zà-üÀ-Ü0-9 \.\!\?\'\-]+)$", $comment);

I don't think that à-ü and À-Ü are valid ranges for regexp usage. Maybe you'll simply have to name all possible characters...

(by the way, eregi() is case-insensitive, so you don't need a-z AND A-Z in your pattern)
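
A minimal sketch of the "name all possible characters" idea, assuming the input (and the script file itself) is UTF-8 and using preg_match() with the i and u modifiers in place of eregi(). The helper name is_valid_name() and the exact list of accented letters are illustrative assumptions, not something from this thread:

```php
<?php
// Hypothetical helper: validates a single-line field such as a company name.
// Assumes $text is UTF-8; the accented letters listed are only an example set.
function is_valid_name(string $text): bool
{
    // i = case-insensitive, so only lowercase letters need to be listed.
    // u = treat pattern and subject as UTF-8, so "é" counts as one character.
    // \z anchors at the very end of the string (no trailing newline allowed).
    $pattern = '/^[a-z0-9àâäçéèêëîïôöùûüñáíóúß .!?\'-]+\z/iu';
    return preg_match($pattern, $text) === 1;
}
```

Because every character is listed explicitly, nothing depends on how a range such as à-ü is interpreted in a given encoding.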

Notawiz

5:38 pm on Dec 23, 2004 (gmt 0)

Hello RonPK,

You are right about the eregi being case insensitive.
But it doesn't matter apparently.

As for the bizarre range of characters, I saw these in some posts.
And they do their job in a simple text field (for instance the company name, which in France can have accents).

It was only when I tried to apply the regex to a textarea that the expression failed (even the one with generic classes such as [:alnum:]).

So I'm still stuck with that bizarre problem.

Salsa

6:10 pm on Dec 23, 2004 (gmt 0)

Notawiz, I think that a simpler solution to get the results you want is the strip_tags() function [us2.php.net]. It'll strip out everything contained within <> that the user might attempt to submit. One of the other nice things about the function is that you can also allow some harmless tags, like <i>, if you wish. Also take a look at htmlentities() [us2.php.net].

I hope this helps.
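
A short sketch of the two functions mentioned above, with made-up sample strings; the variable names are only for illustration:

```php
<?php
// Sample input, invented for this example.
$input = 'Hello <b>world</b>, <i>nice</i>';

// Remove every tag except <i>; the text between removed tags is kept.
$stripped = strip_tags($input, '<i>');

// Alternatively, leave the text intact but escape it for display in HTML.
$escaped = htmlentities('<b>bold</b> & more', ENT_QUOTES, 'UTF-8');
```

Here $stripped keeps the <i> element and drops the <b> tags, while $escaped turns the angle brackets and the ampersand into entities. Both change or escape the input rather than validate it.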

Notawiz

6:29 pm on Dec 23, 2004 (gmt 0)

Salsa,

Perhaps my explanation is not clear.
What I want to do is check that there is text and nothing but text in the textarea input.
I thought I could perform this check with a regular expression.
If the input in the <textarea> validates, I insert it in the database.
Otherwise, I send the user back to his keyboard to correct his submission himself.
I'm not going to correct things in his place, or load all kinds of useless stuff into my DB.
Especially since it's a collaborative project, and other users will have to work with what has been submitted.
So it really has to MATCH the requirements. Letters with accents or not, digits, spaces and punctuation and line feeds, nothing else.

Does that make a bit more sense?
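
A minimal sketch of such an allowlist check for a multi-line textarea, assuming UTF-8 input and preg_match() rather than eregi(). The helper name is_plain_text() and the punctuation list are assumptions to adjust. One plausible reason the pattern from the first post rejects multi-line French text is that its character class contains neither line breaks nor commas:

```php
<?php
// Hypothetical helper: true when $text holds only letters, digits, whitespace
// (including line feeds and carriage returns) and a chosen set of punctuation.
// Assumes UTF-8 input; the punctuation list is an assumption, not a standard.
function is_plain_text(string $text): bool
{
    // \p{L} = any letter (accented or not), \p{N} = any digit,
    // \s covers spaces, tabs, \r and \n; \z anchors at the very end.
    return preg_match('/^[\p{L}\p{N}\s.,;:!?\'"()-]+\z/u', $text) === 1;
}
```

If the check fails, the form can be sent back to the user unchanged, which matches the "match or reject" requirement described above.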

claus

7:09 pm on Dec 23, 2004 (gmt 0)

I'm used to Perl regexps, not the php ones, but everybody tells me they're similar.

As far as i can see, the regexp you typed will catch the stuff you want it to. Have you checked the other parts of your script for possible errors?

Here's a regexp tester btw: [regexlib.com...]

Notawiz

7:36 pm on Dec 23, 2004 (gmt 0)

claus,

Bad luck, it gave "no matches" when I entered a long French text string of several lines, ending with an exclamation mark and containing some accented letters.

I'm really pulling my hair out on this one.

Salsa

10:51 pm on Dec 23, 2004 (gmt 0)

Okay, I see what you're wanting to do now. I think it might be easier, however, to match the characters that you want to disallow, rather than those that you want to allow. That way you might also tell the culprit what error to look for in his text, like:

<?php
// Reject the input if it contains any disallowed character.
if (preg_match('!([#-&(-+/<->@\^_{-}])!', $content, $match)) {
    $error_message = "You used a '{$match[1]}' in the text field, which is not allowed.";
    // return to sender.
} else {
    // $content validates.
}
?>

I used ASCII ranges here. If that doesn't work for you, maybe just list the characters you want to disallow in the [] rather than using ranges. The only individual characters you'd need to escape are \^-] and your delimiter (the ! in this pattern).

I hope this helps.
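
A sketch of the "list the characters instead of using ranges" variant, wrapped in a small function. The name find_disallowed_char() is hypothetical, and the set is simply the ranges above written out one character at a time:

```php
<?php
// Hypothetical helper: returns the first disallowed character, or null if none.
function find_disallowed_char(string $content): ?string
{
    // Same set as the ranges #-& (-+ / <-> @ ^ _ {-}, listed individually.
    if (preg_match('!([#$%&()*+/<=>@\^_{|}])!', $content, $match)) {
        return $match[1];
    }
    return null;
}
```

A null result means the text validates; any other result is the character to report back to the user.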

claus

9:06 am on Dec 24, 2004 (gmt 0)

>> it gave "no matches"

That's really odd - i could make it match perfectly fine (even some special Danish characters that you don't have in French :)). Perhaps it's something with the eregi() function (i'm not very well versed in PHP) - is that one working multiline or does it require a single line?

---

Added: Perhaps your form has URL-encoded the characters before they reach your script; if so, your string should be decoded before you test it. This would decode it in Perl, i'm sure there's something similar for PHP:

$string =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

Just a thought, perhaps it's totally off.
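
For reference, a sketch of the PHP counterparts, rawurldecode() and urldecode(), using made-up sample strings. PHP normally decodes submitted form values before they reach $_GET or $_POST, so this would only matter for a string that arrives still encoded:

```php
<?php
// Made-up sample: "café!" percent-encoded as UTF-8.
$encoded = 'caf%C3%A9%21';

// rawurldecode() turns each %XX sequence back into the byte it stands for.
$decoded = rawurldecode($encoded);

// urldecode() does the same and additionally turns "+" into a space.
$decoded_plus = urldecode('bonne+ann%C3%A9e');
```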

Notawiz

9:29 am on Dec 24, 2004 (gmt 0)

To Salsa and Claus,

Thank you for the leads you gave me.
Due to Christmas I'm off, so I'll check your solutions only next week.
Will let you know how it turned out.
Season's greetings.

coopster

3:09 pm on Dec 24, 2004 (gmt 0)

Note that there is a difference between the POSIX Extended [php.net] (ereg*) and Perl-Compatible [php.net] Regular Expression Functions.

Not that it will make a difference here (don't have time to check right now), but just so you are aware.

Notawiz

4:23 pm on Dec 24, 2004 (gmt 0)

Well, was not able to wait till next week, had to know right away if I had a solution here.

I implemented the suggestion of Salsa, and it worked out of the box. Thx, man!

But since I was a bit confused about that ASCII-range thing, I went for a search and found out that the complete set is like this:

!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~

Just in case other users of the fabulistic Webmasterworld might find this piece of info useful.

Merry Christmas to all.
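
As a cross-check on that list, here is a small sketch that builds the printable ASCII punctuation characters from their code ranges; the variable names are illustrative:

```php
<?php
// Build the ASCII punctuation characters from their code ranges:
// 33-47, 58-64, 91-96 and 123-126.
$punct = '';
foreach ([[33, 47], [58, 64], [91, 96], [123, 126]] as [$from, $to]) {
    $punct .= implode('', array_map('chr', range($from, $to)));
}
```

The result holds 32 characters. The backslashes before " and \ in the list above are PHP string escapes, not extra characters.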