justageek - 4:38 pm on May 14, 2007 (gmt 0)
Ahh...but both methods are the same when you look at what Google has in the patents. We do differ on scope, however, because I used more than one index (Google, MSN and Yahoo!). What they've done is collect all the words and phrases from their index, gathered from web pages by their own crawlers, into a collection of phrases to play with. Either way you look at it, we both refer to billions of documents to make groups and decisions. I just chose to use several SERP indexes instead of storing one locally, since they all have way bigger machines than I do. And I would discount single words to nearly zero for any kind of scoring. They obviously have some value, so they cannot be discounted completely.

This is absolutely true...now you know how I found out how to get your IP banned if you smack the engines too hard and fast! I also had to change how I did things because brute force was slow. I did end up changing to a more methodical way of relating seemingly unrelated documents, which sped up the process and made it even more reliable.

But there was a downside to relating documents in general. You have to know when to stop! Stupid me forgot that. The first time I started looking at the relationships I was amazed. I then pushed the limits on how far you can go and realized there is a point when documents are still related but the relationship is so distant that I couldn't use it anymore. Getting the distance between documents right, without calling the relationship too close (100% related) or not close enough (100% unrelated), drove me nuts!

I gave a shortened version of the entire process so as not to give away everything I did. I'm just saying that what Google has in their patents is a good start for them, and I can confirm from real-life applications I built that the process they describe does indeed work. What I don't know is how to make it work to get people better ranking :-/ Not yet anyway. JAG
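To make the "know when to stop" problem concrete, here is a minimal sketch. It is not JAG's actual process, which he deliberately left out. It assumes each document has already been reduced to a set of phrases, uses Jaccard overlap as a stand-in relatedness score, and the two cutoff values are hypothetical numbers you would have to tune yourself.

```python
def jaccard(phrases_a, phrases_b):
    """Share of phrases two documents have in common (0.0 to 1.0)."""
    a, b = set(phrases_a), set(phrases_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_relationship(phrases_a, phrases_b, too_close=0.9, too_far=0.1):
    """Label a document pair as 'too close', 'too far', or 'usable'.

    too_close and too_far are made-up thresholds for illustration only.
    """
    score = jaccard(phrases_a, phrases_b)
    if score >= too_close:
        return "too close", score   # effectively the same document
    if score <= too_far:
        return "too far", score     # relationship too weak to use
    return "usable", score


doc_a = {"spec sheet", "technical specifications", "data sheet"}
doc_b = {"spec sheet", "technical specifications", "user manual"}
doc_c = {"pasta recipe"}

print(classify_relationship(doc_a, doc_b))
print(classify_relationship(doc_a, doc_a))
print(classify_relationship(doc_a, doc_c))
```

Only pairs that land in the middle band are worth keeping. The hard part described above is choosing where those two cutoffs sit.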
This is different in both method and scope, though. They're not using individual words, and they aren't using web pages - they're using co-occurrence of phrases (phrases that appear together - and related phrases) throughout the entire document collection altogether, which is billions of pages. However, if enough documents were to include technical specifications AND spec sheets on the page, then a connection could be made based on co-occurrence of the phrases.
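As a rough illustration of phrase co-occurrence, the sketch below counts how many documents contain both phrases of a pair. This is only a simple document-level count under my own assumptions (plain substring matching, a tiny hand-picked phrase list, three made-up documents), not the scoring described in the patents.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, phrases):
    """Count, for each pair of phrases, the documents containing both."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Phrases found in this document, sorted so pair keys are stable.
        present = sorted(p for p in phrases if p in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


documents = [
    "Download the technical specifications and spec sheets here.",
    "Spec sheets and technical specifications for all models.",
    "Our technical specifications are listed below.",
]
phrases = ["technical specifications", "spec sheets"]

counts = cooccurrence_counts(documents, phrases)
print(counts[("spec sheets", "technical specifications")])
```

The more documents that carry both phrases, the higher the count, which is the kind of signal that lets "technical specifications" and "spec sheets" be connected.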