Forum Moderators: open


Does Google use Metaphone or similar for fast index lookup?


killroy

8:01 am on May 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've recently improved the fuzzy, spelling-mistake-friendly search on my site, and got into ENDLESS research on linguistics and IR systems. There is a lot more to it than I could have imagined. I've now implemented a few phonetic systems and stemming algorithms, and have a highly satisfactory multi-word, relevance-ranked search engine that can do indexed fuzzy lookups in a scalable fashion.
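To give a flavour of what a phonetic system does: here's a minimal Python sketch of classic Soundex (Metaphone is a refinement of the same idea, with better rules). Each word collapses to a coarse four-character key, so misspellings that sound alike land in the same index bucket:

```python
def soundex(word: str) -> str:
    """Classic Soundex: one coarse phonetic key per word, so that
    e.g. 'Robert' and 'Rupert' hash to the same index bucket."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    word = word.lower()
    key = word[0].upper()          # first letter is kept verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")   # vowels/h/w/y get no code
        if code and code != prev:  # collapse runs of the same code
            key += code
        if ch not in "hw":         # h/w don't break a run of duplicates
            prev = code
    return (key + "000")[:4]       # pad/truncate to 4 characters
```

Index words under their Soundex key and a lookup for a misspelled query still finds the right bucket; the trade-off is extra false matches you then have to re-rank.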

I was wondering if anybody here has insight into whether Google uses some sort of phonetic lookup to reduce the index size, or whether it does strict word matching only, plus a spellchecker (I implemented aspell for my system) for speed and efficiency.

I guess with a database as large as Google's you can rely on pure exact matching, since you'll have enough results for even the most obscure misspellings?
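The spellchecker fallback boils down to edit distance: try the exact match first, and only if it misses, suggest the closest-spelled dictionary word. A toy Python sketch (aspell itself is far more sophisticated, combining this with phonetic codes):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming table,
    keeping only the previous row to save memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(query: str, vocabulary: list[str]) -> str:
    """Exact match wins; otherwise return the closest-spelled word."""
    if query in vocabulary:
        return query
    return min(vocabulary, key=lambda w: edit_distance(query, w))
```

Scanning the whole vocabulary per query is the naive part; real systems prune candidates first (e.g. by phonetic key or n-grams) so the distance computation only runs on a handful of words.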

Anybody got any insights?

SN

jeremy goodrich

8:27 pm on May 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They have a hash (afaik) that they use for lookups based on the query (after a bit of preprocessing, like removing stop words).
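To illustrate what I mean (pure guesswork as to what they actually run; this is just the textbook version): a hash-table inverted index, with stop words dropped at both index and query time, looks like this in Python:

```python
STOP_WORDS = {"the", "a", "of", "and", "in"}  # toy list for the example

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Hash-table inverted index: term -> set of doc ids containing it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index.setdefault(term, set()).add(doc_id)
    return index

def lookup(index: dict[str, set[int]], query: str) -> set[int]:
    """Drop stop words, then intersect the posting sets (AND query)."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

Each lookup is a constant-time hash probe per term plus a set intersection, which is why it's so fast; ranking the resulting doc ids is a separate step.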

For misspellings, I imagine they have another db that gets queried for the 'did you mean...' part of the page.

That way, they get real data, real fast - as well as a 'suggested alternative' if it seems likely (based on historical data) that the searcher intended something else.

What you said about 'fuzzy' is great stuff though - amazing how much an abstruse branch of math can improve even the most mundane, routine tasks.

Fuzzy thinking gets you there quicker, but as far as I know, you can implement something similar using probabilistic methods :) which, you can tell, was the thinking of the original Google designers.

However, the current Stanford research uses many concepts / mathematical techniques that are 'fuzzy', so perhaps they will lean more toward this over time?

The big improvement is really the speed - a fuzzy system can cope better with unknowns & dynamic elements.

killroy

9:08 pm on May 19, 2003 (gmt 0)




Well, the area I was really interested in when I started my directory is using visitor data to improve results. Somewhat like Amazon does it.

With the kind of traffic that Google has, this should be peanuts, and truly valid, not just experimental.

i.e. if on a search for keyword1 60% go to page 2 and another 30% go further, and only 10% stay on page 1, then the page-one results are not very valid, or at the very least not very useful. Time to rethink the theming/ranking of those pages under that keyword.

This way you should be able to create a continually improving system, which is unspammable, since there are always a gazillion more real users than a single webmaster who wants to spam. Theoretically you should get a Google where the "Do you feel lucky" button actually works, or better even, the homepage has a link to the page you were looking for before you type in any search at all... or at least so goes the theory of statistical analysis ;)
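To make that concrete, here's a toy sketch of the idea (the function name, log format, and the 50% threshold are all invented for the example):

```python
from collections import defaultdict

def flag_weak_rankings(click_log: list[tuple[str, int]],
                       threshold: float = 0.5) -> set[str]:
    """click_log holds (keyword, results_page_clicked) events.
    Flag keywords where fewer than `threshold` of searchers found
    what they wanted on page 1 -- a signal to rethink that ranking."""
    totals = defaultdict(int)
    page_one = defaultdict(int)
    for keyword, page in click_log:
        totals[keyword] += 1
        if page == 1:
            page_one[keyword] += 1
    return {kw for kw in totals
            if page_one[kw] / totals[kw] < threshold}
```

Feed the flagged keywords back into the ranker and re-measure: that's the continually improving loop, and it's hard to spam because real-user volume swamps any single webmaster.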

SN

jeremy goodrich

9:11 pm on May 19, 2003 (gmt 0)




Ya, but limiting it to 'statistical analysis' means that you have to know something -> before you can skip to the good part.

Using a fuzzy method, the system develops its own coefficients and doesn't need human input for those metrics, thus achieving efficiency much faster.

Continually monitoring, iterating, testing, and quantifying is so much work - even for an automated system.

If you have a more organic architecture -> the system responds faster to change.

It's why, for example, many manufacturing facilities use fuzzy controllers instead of expert systems: same results, but less processing power / computational time to reach the end goal.
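For a taste of what such a controller does, here's a two-rule toy in Python (the rules and numbers are invented; real controllers have many rules and inputs). Inputs get graded memberships in overlapping sets like "cool" and "hot", and the output is a membership-weighted average of the rule outputs:

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function: 0 at a and c, peak of 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fan_speed(temp: float) -> float:
    """Two-rule fuzzy controller: 'cool -> slow' and 'hot -> fast'.
    Defuzzify by taking the membership-weighted average of the
    rule outputs (fan speed in percent)."""
    rules = [(tri(temp, 10, 20, 30), 20.0),   # cool -> slow (20%)
             (tri(temp, 20, 30, 40), 90.0)]   # hot  -> fast (90%)
    total = sum(weight for weight, _ in rules)
    return (sum(weight * out for weight, out in rules) / total
            if total else 0.0)
```

Because memberships overlap, the output changes smoothly as the input drifts - no hand-tuned thresholds or exhaustive rule tables like an expert system needs.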