|New search engines go deep|
(This is a story from the New York Times and all credits are given to them. WebMasterWorld takes no credit for this story. -G)
Jan 31 2001: The New York Times reports that the limits of current search engines’ indexing ability mean that they have access to less than one percent of all the pages on the Web.
Up to 500 billion pieces of content are hidden from the search engines, according to search specialist BrightPlanet.com. This un-indexed region of the Web is being dubbed the “deep Web” and BrightPlanet.com estimates that it may be 500 times larger than the surface Web that search engines try to cover.
Bright Planet has published a very interesting white paper about this. Worth reading:
I glanced at the summary and it looks fascinating. I think I might print it out and curl up with a highlighter this weekend.
This was once in Breaking News, but I moved this to Research Topics. It seems like it is better placed in this forum.
There was some kerfuffle some time ago triggered by a press release from Bright Planet on the very same subject. Mmmm, I thought, maybe these people are onto something. So I visited their site and downloaded their brand new search program, which turned out to be a spruced-up version of Mata Hari. <sigh> Oh well.
The story has also been picked up recently by NUA, but it is old news, I am afraid.
That being said, I believe they are right about the amount of information available on the net that we never see. Major SEs have no hope of indexing it all. I believe the future is in smaller, very focused search engines set up in a cross referenced network. Others from various parts of the world seem to have the same belief. We will see who is the first to surface.
Woz, I second your motion. I've surmised for about a year now that the future of Web search will be in smaller, focused databases designed for a specific audience. There is no real revenue model now for databases that search the whole net, though maybe a few with limited overheads or some whizzbang strategy may make it. Google comes to mind.
Smaller specialist engines suit the architecture of the Net better.
What the Net does better than anything else is provide information to small interest groups that are otherwise physically disparate.
While some see it as a mass marketing medium, and many sites are designed on this premise, it was never going to last long. If you look at the Dot Com Morgue the great majority of these failed ventures were based on the "mass marketing" premise. As a result they were poorly positioned, badly focused and targeted. Sites targeting specific focused groups, with targeted focused advertising and revenue models may still be doing OK - of course the dedication and low overheads of these small webs helps too.
Remember the Web is just an interconnected set of nodes, and indexing the whole web may have been almost possible 6 years ago, but now the principle and architecture of the Net is asserting itself.
I call it the Disaggregated Web and hail its comeback!
Specialist search engines, funded by various models such as PPC, advertising, subscription, volunteerism, and government funding may well be the next trend on the Net. (And I use "the Net" nomenclature deliberately rather than "the Web")
Going further, small sites may thrive while big ones die. Yahoo, which gets over its bigness by determined efforts to target different groups, will probably survive, but they have to keep on positioning and targeting even better...
I think it was the economist E. F. Schumacher who said "Small is beautiful".
..Yep, AV did get it wrong with their old slogan!
Here is one section of the pdf document that I found interesting:
|Incomplete Indexing of Surface Web Sites |
Clearly, the engines themselves are imposing decision rules with respect to either depth or breadth of surface pages indexed for a given site. There was also broad variability in the timeliness of results from these engines. Specialized surface sources or engines should therefore be considered when truly deep searching is desired.
First, I've been trying to formulate a query on Google to get a handle on how many ODP pages they have indexed. Searching on: dmoz site:dmoz.org returns 637,000 pages. Since ODP reports 248,706 total category pages, something seems off.
Second, this underlines for me the importance of submitting directory pages to the spiders at search engines. With the possible exception of Google, you cannot simply assume that they will find a given directory entry, even in the ODP.
Third, the deep web seems to present a very real need and opportunity to develop a different kind of search resource, but my guess is that it will need to start in academia -- the way Google did -- and not through commercial concerns.
The size of dmoz is interesting, since you brought it up, tedster. I can't seem to get the number you got, but rather something in the
|Results 1 - 10 of about 545,000. Search took 0.26 seconds|
range. Interesting, no?
I've seen, when browsing through Google's version of dmoz, some sites that still work with 0 PageRank. I think it's because they were booted from the Google db, so even the link from dmoz doesn't count.
The other aspect of this shrinkage is linkrot.
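To put a rough number on linkrot, one approach is to sample a directory's listed URLs, fetch each one (e.g. with urllib), and classify the results. As a minimal sketch (the function name, the sample data, and the dead/alive cutoffs are my own assumptions, not anything from the BrightPlanet paper):

```python
# Hypothetical sketch: estimate the linkrot rate of a directory sample.
# Assumes each listed URL has already been fetched and reduced to an
# HTTP status code, with None meaning the host was unreachable.

def linkrot_rate(statuses):
    """Fraction of links that are dead (4xx/5xx) or unreachable (None)."""
    if not statuses:
        return 0.0
    dead = sum(1 for s in statuses if s is None or s >= 400)
    return dead / len(statuses)

# Made-up sample: one 404 and one unreachable host out of five listings.
sample = [200, 404, 200, None, 301]
print(f"linkrot: {linkrot_rate(sample):.0%}")  # prints "linkrot: 40%"
```

Run periodically against the same sample, a rising rate would show how much of the "indexed" web is actually gone.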
Have to say, I love this thread! When I've had more coffee, I'll probably dive in.
The deep web is fascinating; you could use this site as a perfect example. Where else could you find this many seo/webmaster experts, all chatting and growing the size of the knowledge database? And how hard is it to find this url in the engines?
Just to clarify, the ODP number is from the pdf version of the actual paper from Bright Planet (rencke's link). It represents the number of actual category and sub-category pages in the ODP when BP did their research.
At the bottom of the ODP home page ( [dmoz.org] ) is their more or less real-time stat on sites, editors, and categories. Currently it's:
2,347,914 sites - 34,017 editors - 339,288 categories
I'm sure that "Sites" means listings, not the number of unique websites.