superscript - 4:53 pm on Dec 14, 2003 (gmt 0)
Is Google broken? - not exactly
----------------------------
Why the change of heart? It's not exactly a change of heart, because it's not Google that is faulty, it's the filter.

Google has applied a new and sophisticated filter. It may be of the Bayesian type. What all these advanced filters have in common is that they need training. Humans need to feed the filter with data - with examples of good practice and bad practice. If it is a spam filter, it needs to be given examples of spam, and also examples of non-spam. As in all data analysis, whatever the quality of the algo, rubbish in = rubbish out.

Advanced filters fail when they are given too small a data set - 3+ billion pages require a very large sample data set. And what is fed into the filter still requires human judgement. (Make no mistake - Google doesn't like direct human input into the SERPs, but it is clear from their statement about spam reports that human input is used to modify algos.)

Bayesian filters are also in danger of failing (recording false positives) if they are not biased 'against' false positives, rather than towards them. For example, my own Bayesian e-mail spam filter is biased against deleting an e-mail which could potentially be important.

But the initial data feed is crucial. The sample has to be statistically significant, and the human judgement that goes into it must be unbiased. It is not difficult for a single user to decide fairly unambiguously what he/she regards as e-mail spam and what is not. But if such judgements have been made by a team regarding the vast content of the Internet - with its spam sites, sex sites, academic sites and commercial sites - the quality of these judgements would have to be monitored extremely carefully. Indeed, such a judgemental task may be impossible!
A poorly fed and unintentionally biased Bayesian filter could explain a great deal (and a bias towards academic sites would be understandable given the intellectual quality of Google's employees). Commercial sites are not spam sites per se, but the two have certain attributes in common. They are likely to share characteristics such as word repetition - some of it a consequence of SEO, but some unavoidable if you sell many versions of the same product! A poorly trained Bayesian filter could easily mistake one for the other - a commercial site for a spam site - because of unintentional bias in its training and a small data sample. If such a filter is in place then, dispassionately speaking (as someone who has lost his top positions), it probably hasn't done a bad job statistically as a filter. But it has ruined the SERPs for many significant search terms. So although it might work on paper, it appears to have failed.
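To make the argument concrete, here is a rough sketch of a toy naive Bayes text filter in Python. It is only an illustration of the general technique, not a claim about what Google actually runs. The four training "pages", the sample pages and the function names (train, spam_probability, is_spam) are all made up for this example. It shows how a tiny training set with no commercial examples in the "ok" pile can score an ordinary shop page as spam, and how a stricter threshold biases the filter against false positives.

```python
import math
from collections import Counter

def train(docs):
    """Count words per label. docs is a list of (text, label), label 'spam' or 'ok'."""
    word_counts = {"spam": Counter(), "ok": Counter()}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def spam_probability(text, word_counts, doc_counts):
    """Return P(spam | text) under a naive Bayes model with add-one smoothing."""
    vocab = set(word_counts["spam"]) | set(word_counts["ok"])
    # Words never seen in training carry no evidence, so skip them.
    words = [w for w in text.lower().split() if w in vocab]
    log_score = {}
    for label in ("spam", "ok"):
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))  # prior
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        log_score[label] = score
    # Turn the two log scores into a probability for the 'spam' label.
    return 1.0 / (1.0 + math.exp(log_score["ok"] - log_score["spam"]))

def is_spam(text, model, threshold=0.5):
    """A higher threshold biases the filter against false positives."""
    return spam_probability(text, *model) > threshold

# Hypothetical, deliberately tiny and skewed training set:
# every 'ok' example is academic, none is commercial.
training = [
    ("cheap widgets cheap widgets buy cheap widgets now", "spam"),
    ("buy cheap pills buy now cheap", "spam"),
    ("widget design research paper on widget history", "ok"),
    ("university lecture notes on search algorithms", "ok"),
]
model = train(training)

commercial_page = "blue widgets red widgets green widgets buy online"
academic_page = "research paper on search algorithms"

p_commercial = spam_probability(commercial_page, *model)
p_academic = spam_probability(academic_page, *model)
print("commercial page:", round(p_commercial, 3))
print("academic page:  ", round(p_academic, 3))
print("flagged at 0.5:  ", is_spam(commercial_page, model))
print("flagged at 0.999:", is_spam(commercial_page, model, threshold=0.999))
```

The shop page repeats "widgets" because it sells several kinds of widget, and the only place the filter ever saw that word was in spam, so it gets flagged at the default threshold. Raising the threshold rescues it here, but the real fix is a bigger and fairer training sample.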
I called it a fresh look. Here's a fresher look:
Is a filter in place? - yes
Is it a commercial filter? - surprisingly, no!
Is it a faulty filter? - yes