Regular expression with foreign characters

Making "e" match "é" and "è"...

         

trafficms

10:31 am on Apr 20, 2007 (gmt 0)

10+ Year Member



I am working on a search engine handling utf-8 encoded text in any language.
Everything is working so far: the search term is received from the user, passed on to the database, and matching rows are returned to the browser - all in utf-8 all the way.

Typing certain foreign characters such as ñ, é and ô also matches plain n, e and o, and vice versa (using MySQL's LIKE operator).

The problem appears when I try to highlight the search terms in the resulting page.
This is done using PHP's preg_replace function, and in this case ñ only matches ñ, not n; likewise é matches é but not e, and so on. The result is that some of the rows found have nothing highlighted.

Is there a way to make the regex insensitive to these differences (similar to how the i modifier makes it case-insensitive, i.e. n also matches N)?
I have tried using the u modifier (for utf-8) but it did not seem to have any effect.

eelixduppy

2:23 pm on Apr 20, 2007 (gmt 0)



Welcome to WebmasterWorld!

I haven't had to solve this problem before, but it would seem that you are going to need character classes for this one. For instance, something like this should match all 'e' characters:


$pattern = '/[eE\x{C8}-\x{CB}\x{E8}-\x{EB}]/u';

It would be similar for all 'n's:


$pattern = '/[nN\x{D1}\x{F1}]/u';

and 'o's:


$pattern = '/[oO\x{D2}-\x{D6}\x{F2}-\x{F6}]/u';

etc...

I'm using the hex code points for the different characters to match all of them, and '-' to specify a range of characters. Since your text is utf-8, the code points are written as \x{...} and the pattern needs the u modifier; a plain \xC8 would only match a single byte, not the encoded character. The single quotes keep PHP from touching the backslash escapes before the regex engine sees them.

I believe the above code should work, although I do not have the means to test it right now. For more information, refer to pattern syntax [php.net].
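
A quick way to check a class like the 'e' one above is to replace every match with a marker and look at the result. This is only a sketch under the assumption that the subject string is utf-8: it writes the code points as \x{...} and adds the u modifier, and the sample word is made up for illustration.

```php
<?php
// Code-point ranges for È-Ë and è-ë; the u modifier makes PCRE read UTF-8.
// Single quotes pass \x{...} through to the regex engine untouched.
$pattern = '/[eE\x{C8}-\x{CB}\x{E8}-\x{EB}]/u';

$text = 'élève';                       // sample input, assumed to be UTF-8
$masked = preg_replace($pattern, '#', $text);

echo $masked, "\n";                    // prints "#l#v#"
```

If the marker shows up for every plain and accented 'e', the class is doing its job; the same check works for the 'n' and 'o' classes.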

Good luck!

trafficms

3:45 pm on Apr 20, 2007 (gmt 0)

10+ Year Member



Thanks!

Your solution seems reasonable (though I had hoped there was an inbuilt switch for this situation, as it must be quite trivial...)

So first I declare the character classes:
$class = "[...]"; // Whatever I want to be equivalent
Then I replace any matching character in my search term with the class:
$term = preg_replace("/$class/iu", "$class", $term); // Repeat for each class
And at last I match that against the text that was found, and replace it with highlighting markup.
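
The steps just described could be put together roughly as follows. This is a minimal sketch, not a finished implementation: the function name highlight_term, the list of classes and the <strong> markup are all assumptions made for illustration, and the text is assumed to be utf-8 that has already been made safe for HTML output.

```php
<?php
// Hypothetical helper: swaps each letter of the term for its equivalence
// class, then wraps every match in the text with highlighting markup.
function highlight_term($term, $text, array $classes)
{
    // Escape regex metacharacters the user may have typed.
    $pattern = preg_quote($term, '/');

    foreach ($classes as $class) {
        // Any member of the class found in the term becomes the whole class.
        $pattern = preg_replace('/' . $class . '/iu', $class, $pattern);
    }

    // $0 is the full match, so the original spelling in the text is kept.
    return preg_replace('/' . $pattern . '/iu', '<strong>$0</strong>', $text);
}

$classes = array('[eéèêë]', '[nñ]', '[oòóôõö]', '[aàáâãäå]');
$result = highlight_term('espanol', 'Hablo Español aquí', $classes);

echo $result, "\n";   // prints "Hablo <strong>Español</strong> aquí"
```

The loop only stays correct while no class contains a letter that belongs to another class, since a later pass would otherwise rewrite the inside of an earlier class.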

Is there any reason not to type the character classes like this:
$class = "[eéèê]";
Instead of your hex codes?

Another thing: What would be the difference between using character classes "[eéèê]" and groups "(e|é|è|ê)"?

Again thanks for the help - I hope I've finally come to the right place - and I hope I am not asking too much.

eelixduppy

8:31 pm on Apr 20, 2007 (gmt 0)



>>Is there any reason not to type the character classes like this:

Actually, no, you can do it any way. Typing in the characters would be a little easier to read, though, as you have demonstrated. ;)


$pat = array('e' => '[eéèêë]','n' => '[nñ]','o' => '[oòóôõö]','a' => '[aàáâãäå]','i' => '[iìíîï]','u' => '[uùúûü]','y' => '[yýÿ]');

Now, with this array, you should be able to look for any of the characters you mentioned. But instead of using 'e', for example, in the pattern, you would use $pat['e']. Just make sure you concatenate the strings correctly so that you don't throw any errors. An example:

preg_replace("/".$pat['e']."/iu",'#',$string);

Notice how I used the 'i' and 'u' modifiers in the above example. With 'u' the pattern and the subject are treated as utf-8, so each accented letter counts as one character, and 'i' then works for these special characters just as it does with the English alphabet.
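
To turn a whole search term into a pattern with such an array, one option is to build a lookup from every member of every class to its class and expand the term in a single pass. The sketch below is an assumption about how that might look, not something from this thread: term_to_pattern is a hypothetical name, the array is shortened, and the term is assumed to be lowercase utf-8.

```php
<?php
$pat = array('e' => '[eéèêë]', 'n' => '[nñ]', 'o' => '[oòóôõö]');

// Hypothetical helper: expands each letter of $term into its class.
function term_to_pattern($term, array $pat)
{
    $map = array();
    foreach ($pat as $class) {
        // Split the class body into single UTF-8 characters.
        $chars = preg_split('//u', trim($class, '[]'), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($chars as $ch) {
            $map[$ch] = $class;   // "e" and "é" both expand to the e class
        }
    }
    // strtr with an array never rescans text it has already replaced.
    return '/' . strtr(preg_quote($term, '/'), $map) . '/iu';
}

$pattern = term_to_pattern('senor', $pat);
$masked = preg_replace($pattern, '#', 'El Señor');

echo $pattern, "\n";   // prints "/s[eéèêë][nñ][oòóôõö]r/iu"
echo $masked, "\n";    // prints "El #"
```

Because strtr does all replacements in one pass, the order of the classes does not matter here.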

>> What would be the difference between using character classes "[eéèê]" and groups "(e|é|è|ê)"

The effect would essentially be the same; however, depending on the function you are using, grouping the characters has an additional effect. An example of this would be preg_replace(): the first group in the pattern is $1 or \\1, the second group is $2 or \\2, etc., and these can be used in the replacement string. Also, a class allows you to specify the characters without the 'OR' (|), which makes it a little easier to read. If you need to group something you can always put parentheses around the class:

([eéèêë])
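
As a small illustration of the group reference, the sketch below wraps each matched character in markup while keeping the exact variant that was found. The sample text and the <b> tag are assumptions chosen for the example.

```php
<?php
$text = 'crème brûlée';   // sample input, assumed to be UTF-8

// The parentheses capture whichever variant matched;
// $1 in the replacement puts that same character back.
$out = preg_replace('/([eéèêë])/iu', '<b>$1</b>', $text);

echo $out, "\n";   // prints "cr<b>è</b>m<b>e</b> brûl<b>é</b><b>e</b>"
```

Without the parentheses the class would still match, but the replacement would have to use $0 (the whole match) instead of $1.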

Best of luck!

trafficms

2:27 pm on Apr 21, 2007 (gmt 0)

10+ Year Member



I added [cç], and with all the characters now listed I think the search / highlighting is very complete.

Thanks again, you have been very helpful!