with a mysterious algo....that's a great spell checker.
An example:
Handily, if you are on Unix you should already have a word dictionary on the system. If not, you can download
a copy somewhere if need be. All in all, the dictionary has 250,000 words in it, all correctly spelled, to compare
queries against ;o)
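For anyone who wants to try this, here is a rough sketch of reading such a word list into PHP. The `load_words()` helper and the `/usr/share/dict/words` path are my own assumptions for illustration (many Unix systems keep a list there, but yours may differ); it only assumes one word per line.

```php
<?php
// Hypothetical helper: read a one-word-per-line list into a lowercase array.
function load_words(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    // Lowercase everything and drop duplicates such as "Fly" / "fly"
    return array_values(array_unique(array_map('strtolower', $lines)));
}

// Assumed location; adjust to wherever the word list lives on your system.
$path = '/usr/share/dict/words';
$words = is_readable($path) ? load_words($path) : [];
echo count($words), " words loaded\n";
```

From there the words can be inserted into a database table or compared against directly.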
With the two, I made a script that takes a word (the query) and compares it to words in the
dictionary... here is a snippet, with comments to save me going through it:
// Query to lowercase
$word = strtolower($_POST['word']);
// Hex equivalent of word
$wordhex = bin2hex($word);
// Metaphone & Soundex of word
$wordmetaphone = metaphone($word);
$wordsoundex = soundex($word);
// First 3 chars & query length
$l1 = substr($word,0,3);
$llen = strlen($word);
// If query is a blob, would do it word by word (increment)
$a = 0;
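// Db query to compare query with db words (prefix escaped, since it comes from user input)
// Note: ext/mysql was removed in PHP 7; on current versions use mysqli or PDO instead
$q = mysql_query("SELECT id,word FROM keywords WHERE word LIKE '" . mysql_real_escape_string($l1) . "%'");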
while($r = mysql_fetch_array($q))
{
// DB words to lowercase
$r['word'] = strtolower($r['word']);
// Put DB details into array and create soundex etc for them
$array[$a][0] = $r['word'];
$array[$a][1] = bin2hex($array[$a][0]);
$array[$a][2] = $metaword = metaphone($r['word']);
$array[$a][3] = soundex($r['word']);
$array[$a][4] = levenshtein($wordhex,$array[$a][1]);
// Soften effect of levenshtein (left empty for an exact match, where the distance is 0)
$array[$a][5] = $array[$a][4] > 0 ? (100 / ($array[$a][4] / 2)) : null;
// Differences in metaphone between query and db word
$templev1 = levenshtein($array[$a][2],$wordmetaphone);
// Differences in soundex between query and db word
$templev2 = levenshtein($array[$a][3],$wordsoundex) * .75;
// soundex and metaphone differences combined
$lev = $templev1 + $templev2;
$len = strlen($array[$a][0]);
// Amplify db words that are same in length but not a match to query
if($llen == $len && $lev < 2)
{
$lev = $lev * 1.5;
}
// Add the spelling distance, dampened by 0.8, so that mis-spellings come into "range"..its all guesswork
if($array[$a][4] != 0)
{
$lev = $lev + ($array[$a][4]) * 0.8;
}
$array[$a][6] = $lev;
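To make the arithmetic easier to follow, here is the same scoring boiled down to a single function for one query/candidate pair. This is only a sketch: the name `spell_score()` is made up for illustration, and it leaves out the database lookup and the unused [5] field.

```php
<?php
// Hypothetical helper: recomputes the final score ([6]) for one pair of words.
function spell_score(string $query, string $candidate): float
{
    $query = strtolower($query);
    $candidate = strtolower($candidate);

    // Spelling distance, measured on the hex strings as in the snippet
    $hexDistance = levenshtein(bin2hex($query), bin2hex($candidate));

    // Sound distance: metaphone difference plus a down-weighted soundex difference
    $score = levenshtein(metaphone($candidate), metaphone($query))
           + levenshtein(soundex($candidate), soundex($query)) * 0.75;

    // Same length and nearly the same sound: amplify
    if (strlen($query) === strlen($candidate) && $score < 2) {
        $score *= 1.5;
    }

    // Add the dampened spelling distance (adds nothing for an exact match)
    return $score + $hexDistance * 0.8;
}

echo spell_score('flies', 'flick'), "\n";
```

For "flies" against "flick" the metaphone codes differ by 1, the soundex codes are identical, the lengths match (so 1 becomes 1.5) and the hex distance is 3, giving 1.5 + 3 * 0.8 = 3.9.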
That probably doesn't look like anything useful until you type something in and see what happens. Soooo, here are a few examples. Anything with a score of 0 is an exact match, anything below 5 is close, and anything at 5 or above is most likely unrelated (and not shown).
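(Score is in last field. The fields are: [0] word, [1] hex, [2] metaphone, [3] soundex, [4] levenshtein on the hex, [5] softened levenshtein, [6] score. Each "Searched for" line shows the query, its metaphone and its soundex.)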
Searched for fly¦FL¦F400
Array ( [0] => flype [1] => 666c797065 [2] => FLP [3] => F410 [4] => 4 [5] => 50 [6] => 4.95 )
Array ( [0] => flynn [1] => 666c796e6e [2] => FLN [3] => F450 [4] => 4 [5] => 50 [6] => 4.95 )
Array ( [0] => fly [1] => 666c79 [2] => FL [3] => F400 [4] => 0 [5] => [6] => 0 )
Searched for flights¦FLFTS¦F423
Array ( [0] => flighty [1] => 666c6967687479 [2] => FLFT [3] => F423 [4] => 1 [5] => 200 [6] => 2.3 )
Array ( [0] => flighter [1] => 666c696768746572 [2] => FLFTR [3] => F423 [4] => 3 [5] => 66.666666666667 [6] => 3.4 )
Array ( [0] => flighted [1] => 666c696768746564 [2] => FLFTT [3] => F423 [4] => 4 [5] => 50 [6] => 4.2 )
Array ( [0] => flight [1] => 666c69676874 [2] => FLFT [3] => F423 [4] => 2 [5] => 100 [6] => 2.6 )
Searched for flighs¦FLFS¦F422
Array ( [0] => flix [1] => 666c6978 [2] => FLKS [3] => F420 [4] => 4 [5] => 50 [6] => 4.95 )
Array ( [0] => flighty [1] => 666c6967687479 [2] => FLFT [3] => F423 [4] => 3 [5] => 66.666666666667 [6] => 4.15 )
Array ( [0] => flight [1] => 666c69676874 [2] => FLFT [3] => F423 [4] => 1 [5] => 200 [6] => 3.425 )
Array ( [0] => flies [1] => 666c696573 [2] => FLS [3] => F420 [4] => 3 [5] => 66.666666666667 [6] => 4.15 )
Searched for flies¦FLS¦F420
Array ( [0] => flix [1] => 666c6978 [2] => FLKS [3] => F420 [4] => 3 [5] => 66.666666666667 [6] => 3.4 )
Array ( [0] => flit [1] => 666c6974 [2] => FLT [3] => F430 [4] => 3 [5] => 66.666666666667 [6] => 4.15 )
Array ( [0] => flisky [1] => 666c69736b79 [2] => FLSK [3] => F420 [4] => 4 [5] => 50 [6] => 4.2 )
Array ( [0] => flisk [1] => 666c69736b [2] => FLSK [3] => F420 [4] => 4 [5] => 50 [6] => 4.7 )
Array ( [0] => flip [1] => 666c6970 [2] => FLP [3] => F410 [4] => 3 [5] => 66.666666666667 [6] => 4.15 )
Array ( [0] => flimsy [1] => 666c696d7379 [2] => FLMS [3] => F452 [4] => 3 [5] => 66.666666666667 [6] => 4.9 )
Array ( [0] => flies [1] => 666c696573 [2] => FLS [3] => F420 [4] => 0 [5] => [6] => 0 )
Array ( [0] => fliers [1] => 666c69657273 [2] => FLRS [3] => F462 [4] => 2 [5] => 100 [6] => 4.1 )
Array ( [0] => flier [1] => 666c696572 [2] => FLR [3] => F460 [4] => 1 [5] => 200 [6] => 3.425 )
Array ( [0] => flicky [1] => 666c69636b79 [2] => FLK [3] => F420 [4] => 4 [5] => 50 [6] => 4.2 )
Array ( [0] => flick [1] => 666c69636b [2] => FLK [3] => F420 [4] => 3 [5] => 66.666666666667 [6] => 3.9 )
So the "equations" from the script have quite a good idea of what a "close" match to the word is using the PHP functions.
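If you want to turn those scores into suggestions, one way (just a sketch, not what my script does) is to drop everything at or above the cutoff and sort what is left. The `close_matches()` helper and its `$cutoff` parameter are made-up names; the sample data is the word and score from the "flights" results above.

```php
<?php
// Hypothetical helper: keep candidates scoring below $cutoff, best match first.
// $scores maps word => score, i.e. fields [0] and [6] of each row.
function close_matches(array $scores, float $cutoff = 5.0): array
{
    $close = array_filter($scores, function ($score) use ($cutoff) {
        return $score < $cutoff;
    });
    asort($close); // lowest score first, keys (words) preserved
    return $close;
}

$scores = ['flighty' => 2.3, 'flighter' => 3.4, 'flighted' => 4.2, 'flight' => 2.6];
print_r(array_keys(close_matches($scores)));
```

Worth noticing: for "flights" this puts "flighty" (2.3) ahead of "flight" (2.6), so the weights could probably do with more tuning.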
I have not looked any further into the workings of metaphone() or soundex(); I thought I would come here and ask if anyone is making use of them at WW :)
The above script is not perfect, but it's nearer to perfect than a pure 1:1 text match. By weighing what a word "sounds" like and how the word is spelled, these functions seem to add relevance to the query.
Anyone using these functions for this sort of thing?