jmccormac - 9:22 pm on Jul 18, 2013 (gmt 0)
I think that a lot of web-scale attempts at building search engines are theorised by people who really don't understand the fabric of the web when it comes to the numbers of domain names. I would include some of the Google people in that, because I consider that they don't understand the web and are merely applying sticking plasters to a rather large, overfilled string bag in an attempt to shore up an algorithm with more holes than a lump of activated carbon.
Not all domain names have working websites. Depending on the TLD, most will be a combination of holding pages (or soft 404s as some people call them) and PPC parking. Then there are brand protection registrations which may redirect via 301s and 302s to the primary site. Then there are clones and mutants (sites that are not 30xed to the main site and are either complete copies or so slightly different that they appear to be different sites to ordinary search engines). Then there are the compromised websites. Fairly quickly, you get to a core dataset which may be around 9 to 23% of that TLD. And then the real work starts in sorting the active from the abandoned websites.
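To make that winnowing a bit more concrete, here is a rough Python sketch of a first-pass filter over fetched index pages. Everything in it is my own assumption for illustration: the marker phrases, the 200-character threshold, the category names and the `classify` function itself are hypothetical, not anybody's production rules. It only catches exact copies by hashing; the "mutants" would need some fuzzy matching on top, and it does nothing about compromised sites.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical marker phrases; a real list would be far longer and TLD-specific.
PARKING_MARKERS = ("this domain is for sale", "domain is parked", "sponsored listings")
HOLDING_MARKERS = ("coming soon", "under construction", "default web page")
MIN_TEXT_LENGTH = 200  # assumed cut-off below which a page is treated as a holding page


def bare_host(host):
    """Lower-case a hostname and drop a leading 'www.' so www/non-www compare equal."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify(domain, status, location, body, seen_hashes):
    """Return a coarse category for one fetched index page.

    seen_hashes is a set shared across calls so exact copies can be spotted.
    """
    # Brand protection registrations: a 301/302 pointing at a different host.
    if status in (301, 302) and location:
        target = bare_host(urlparse(location).hostname)
        if target and target != bare_host(domain):
            return "redirect"

    text = " ".join(body.lower().split())  # collapse whitespace before matching

    if any(marker in text for marker in PARKING_MARKERS):
        return "parked"
    if len(text) < MIN_TEXT_LENGTH or any(marker in text for marker in HOLDING_MARKERS):
        return "holding"

    # Complete copies only: identical normalised text seen on another domain.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return "clone"
    seen_hashes.add(digest)

    # Whatever survives is only a candidate; active versus abandoned comes later.
    return "candidate"
```

Run it over a TLD's index pages with one shared `seen_hashes` set and count the categories. The "candidate" pile is the starting point for the core dataset, not the finished article.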
The technology is easy. The algorithms are tougher but Google can be beaten because it is now in its infinite monkeys mode (the Shakespeare thing) of spidering the web and hoping that its blind crawling will give all that steaming lump of data some relevance. The real genius is in knowing what not to spider.
And the real kicker is that the cargo-cult SEOs and their FUD buddies in Google are destroying the link structure of the web by scaring webmasters into not linking to other sites. That puts any new SE on very shaky ground because they will miss a major section of the web. New websites rarely link to others. Index page outbound links are becoming rarer.