Forum Moderators: coopster

Message Too Old, No Replies

Faster levenshtein implementation.

Anybody know a way?

         

jecasc

8:04 pm on Mar 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I run an online shop and wanted to improve the search with some fuzzy logic. To do this I created an index of all words in all product names and created a Levenshtein MySQL stored function in my database.

This is the code I found and used:

[codejanitor.com...]

So when someone types in "wodget" and this returns no results I do a MySQL Query that returns the most plausible alternative, for example widget:

$sql = "SELECT * FROM table ORDER BY LEVENSHTEIN(table.word, '$keyword') ASC LIMIT 1";

I then do a new search with "widget".

However this is quite slow. Depending on the number of letters it can take 2-4 seconds until I get a result. This is ok for a normal search, however I planned to use it in my Ajax "search suggestion", and for this it is too slow.

Anybody have an idea how I could improve the speed? Is the PHP levenshtein function faster? Maybe I could put all words into an array in PHP and then sort them by levenshtein ratio?

Matthew1980

8:20 pm on Mar 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi there Jecasc,

I had never heard of this function/algorithm until I read your post - and upon reading the manuals for both mysql (which isn't really very helpful) and php, it turns out that the php version [uk.php.net] seems to be able to handle this better.

I suppose that you would need to make a custom function for this, but I am now intrigued for the answer to this question, but from what I read, it seems that php is definitely the route to go.

However, if there is anyone out there who can shed light on this, that would be great; I'm going to research this a little more to see if I can find a decent comprehensive answer.

Have fun & good luck,

Cheers,
MRb

jecasc

9:33 pm on Mar 6, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I spent the last hour in writing a short script to try the PHP levenshtein function. It was faster then the MySql solution - but the quality of the results was very dissapointing compared to the MYSQL function in the link above. The algorithm in the link above seems to be far superior to the PHP levenshtein function - were the results seemed almost random. I thought all the algorithms worked the same and would return the same levenshtein distance and I should get the same results regardless of the function. However this seems not to be the case.

Problem is I am not very good with algorithms. I can use them but I am not very good at understanding the math involved.

What I found out - it seems the PHP function does no transposition checking. For example:

levenshtein('teusday', 'tuesday')
will return a levenshtein distance of 2 and
levenshtein('teusday', 'thursday')
will also return a distance of 2.

Since many typos are transpositions of letters this would of course limit its usefulness. I don't understand the mysql code good enough to know if this code does, but this would explain the better results.

It seems however that PHP is the way to go when I want a faster function. Now all I need is a better function than the one that comes with PHP.

jecasc

9:34 am on Mar 7, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, I think I found a solution. It seems the MYSQL code is not levenshtein but Damareau-Levenshtein

[en.wikipedia.org...]

I found a PHP implementation of Damerau Levenshtein here in case someone is interested. The results are much better than the levenshtein function that comes with PHP and it is much faster than the MYSQL solution. I did some benchmarking and for longer keywords it was up to 6 times faster.

[worldlingo.com...]

The function as it is implemented in the link above is case sensitive. I got the best results when converting all input strings to lowercase.