removing all accents

build a search index in multilanguage site

         

SweepeRpl

7:31 pm on Aug 29, 2007 (gmt 0)




Hello,

I have a multilanguage website (i.e. users from around the world post in their own languages) and I want to build a search index. Ideally I would remove all accents so that every word consists only of the characters [a-z] before it goes into the index. Then it wouldn't matter whether someone searches for 'Countêr-Strike' or 'counterstrike': the results would be the same.

Now, I have two problems:

a) MySQL gives me this error: Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='
The query is: SELECT id, word FROM search_words WHERE word IN ('£od¼')

b) Problem a) would probably go away if I could remove the accents. I'm submitting the form through AJAX (using jQuery) and I have this function to 'flatten' all the submitted content:

// clean the text to make it easy to index,
// i.e. make lowercase, remove all special characters, etc.
public function cleanText($text) {
    // turn accented letters into named entities, e.g. "ê" becomes "&ecirc;"
    $text = htmlentities($text, ENT_COMPAT, "UTF-8");
    // keep only the base letter of each accent entity
    $text = preg_replace('/&([a-zA-Z])(uml|acute|grave|circ|tilde|cedil|ring);/', '$1', $text); // remove accents
    // decode the remaining entities as UTF-8 so they match the encoding used above
    $text = html_entity_decode($text, ENT_COMPAT, "UTF-8");
    // mb_strtolower() (mbstring extension) is safe for multibyte UTF-8 text
    $text = strip_tags(mb_strtolower($text, "UTF-8"));
    $text = preg_replace('#[\n\r]#is', ' ', $text); // new line to space
    $text = preg_replace('#\b&[a-z]+;\b#', ' ', $text); // remove html entities

    // filter out strange characters like ^, $, &, change "it's" to "its"
    $chars_match = array('^', '$', '&', '(', ')', '<', '>', '`', '\'', '"', '|', ',', '?', '%', '~', '+', 'www.', 'http://', '[', ']', '{', '}', ':', '\\', '/', '=', '#', '\'', ';', '!', '*', '.', '');
    $chars_replace = array(' ', ' ', ' ', ' ', ' ', ' ', ' ', '', '', ' ', ' ', ' ', ' ', '', ' ', ' ', '', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ' , ' ', ' ', ' ', ' ', ' ', ' ', '', '');
    for ($i = 0; $i < count($chars_match); $i++) {
        $text = str_replace($chars_match[$i], $chars_replace[$i], $text);
    }

    // the text should be now very clean ;)
    return $text;
}

(that's what I've come up with by looking through the net)

The website's encoding is UTF-8.

Yes, I could just build a table of accented characters and their 'flat' versions, but I expect users from around the world (well, excluding Cyrillic, Japanese, Arabic, etc.), so it would be really hard to find all the accents. I'm looking for something universal ;)

I would appreciate any help. ;)

PS. I've browsed through various topics here on similar matters, and many of them link to another thread where the discussion is supposed to be, but when I click the link I get a strange 404 error (saying something like 'sometimes 404 is just a 404').

Thx.

[edited by: SweepeRpl at 7:33 pm (utc) on Aug. 29, 2007]

eelixduppy

2:23 pm on Sep 4, 2007 (gmt 0)



Welcome to WebmasterWorld, SweepeRpl!

Does this line work?


$text = preg_replace('/&([a-zA-Z])(uml|acute|grave|circ|tilde|cedil|ring);/', '$1', $text);

I don't see how it would, but I can't test PHP code right now. I'd suggest a straight character-for-character replacement. It shouldn't be too hard: create an array that maps each accented character to its "flattened" counterpart and run the replacement. I'm not sure there is an easier way to get the same effect.
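
A minimal sketch of the character-for-character replacement described above, assuming PHP 7 or later and UTF-8 input. The function name flattenAccents and the contents of the table are illustrative and do not come from the thread; the table is deliberately partial and would need extending for the languages the site actually sees.

```php
<?php
// Hypothetical helper: name and table contents are assumptions for illustration.
// Maps accented UTF-8 characters to plain ASCII, then reduces the text to [a-z0-9 ].
function flattenAccents(string $text): string
{
    // Partial lookup table: accented character => "flat" counterpart.
    static $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ø' => 'o',
        'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
        'ç' => 'c', 'Ç' => 'C', 'ñ' => 'n', 'Ñ' => 'N', 'ý' => 'y', 'ÿ' => 'y',
        'ß' => 'ss', 'æ' => 'ae', 'Æ' => 'AE',
        // A few Polish letters, since the query in the thread appears to be Polish.
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];

    // strtr() with an array replaces all keys in a single pass.
    $text = strtr($text, $map);

    // Drop every byte that is not an ASCII letter, digit or whitespace.
    // Characters missing from the table are removed here rather than kept.
    $text = preg_replace('/[^A-Za-z0-9 \t\r\n]+/', '', $text);

    // Collapse runs of whitespace, trim, and lowercase the now pure-ASCII text.
    $text = trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
    return strtolower($text);
}

echo flattenAccents('Countêr-Strike'), "\n"; // prints "counterstrike"
echo flattenAccents('Łódź'), "\n";           // prints "lodz"
```

Running the same function over both the text being indexed and the search term means the two sides are compared as plain ASCII. The cost is the one raised in the original question: any character missing from the table is silently dropped, so the table has to grow with the languages in use.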