Forum Moderators: open
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
original article [salon.com]
I think this article is referring to the uncrawlable pages: pages on secure servers, pages that require a login, pages crawlers are told not to crawl (robots.txt), pages with excessive session IDs, and so on. If you think about all the pages you have to log in to in your daily work as a webmaster, you can see where they're getting those sorts of figures from.
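As a concrete illustration of the "told not to crawl" case, a site's robots.txt can fence off whole sections from every crawler. This is a generic made-up example (the paths are hypothetical, not from any real site):

```
User-agent: *
Disallow: /account/
Disallow: /checkout/
Disallow: /search
```

Everything under those paths stays out of the index even though the pages are perfectly reachable by a logged-in human, which is exactly the sort of content the article counts as "deep Web".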
Actually, those pages are the ones that Tim (Yahoo) said SiteMatch is aimed at. Uncrawlable.
If they are uncrawlable, there must be a reason. If the reason is that the authors do not want them crawled, there is no sense in crawling them against their will, no matter how much you would like Google to index those pages. That would violate the authors' rights.
If the reason is that the pages employ cloaking or other spamming techniques, intentionally or not, it is up to the site owners to get a clue and fix their pages.
I just do not think that more is better. Google explicitly said they are not going to crawl sites that will only consume resources and pollute their index. The results are clear, their index is the cleanest.
The indexes of competitors are full of garbage and badly sorted results, a consequence of the clueless mentality that more is better. While that mentality persists, Google does not need to do anything to keep its large market lead until its competitors finally see how far they are from what matters.
As others have said, there are good reasons why thousands of pages are uncrawlable - I'm not sure I want my online bank statements showing up in the SERPS.
Can you imagine if Google tried to index all the flight info for every airline in the world? It changes constantly and is usually built in response to a query... even if the airlines tried to produce pages of the most popular flights and times, they would be constantly changing.
And classified ads... do you really want those coming back in your search results? That's what ebay is for.. ;)
The web changes every second. Imagine if XYZ Airline put its "ON TIME / DELAYED / CANCELLED" notification system on the web and updated it every 10 seconds... for every flight and every airport they flew to.
Then there is deep but somewhat static data (archived databases, daily updates, etc.). It's up to the owners of the data to publish it to the web (i.e., generate static HTML pages from the data, put them into a public web directory, and allow them to be indexed) if and when they want to.
10:01:00 ON TIME
10:01:10 ON TIME
10:01:20 ON TIME
10:01:30 ON TIME
10:01:40 ON TIME
10:01:50 ON TIME
10:02:00 ON TIME
10:02:10 DELAYED
10:02:20 ON TIME
10:02:30 DELAYED
10:02:40 ON TIME
10:02:50 DELAYED
10:03:00 ON TIME
10:03:10 DELAYED
10:03:20 DELAYED
10:03:30 DELAYED
10:03:40 DELAYED
10:03:50 DELAYED
10:04:00 DELAYED
10:04:10 DELAYED
10:04:20 DELAYED
10:04:30 CANCELLED
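The "generate static HTML from your data" approach mentioned above can be sketched in a few lines of Python. The database schema, table, and file names here are all hypothetical, just to show the shape of the idea: one crawlable page per database row, dropped into a public directory.

```python
import sqlite3
from pathlib import Path

# Hypothetical catalog: a "products" table with id, name, description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, description TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'Widget', 'A fine widget')")

outdir = Path("public")
outdir.mkdir(exist_ok=True)

# One static, crawlable HTML page per database row.
for pid, name, desc in conn.execute("SELECT id, name, description FROM products"):
    page = (
        f"<html><head><title>{name}</title></head>"
        f"<body><h1>{name}</h1><p>{desc}</p></body></html>"
    )
    (outdir / f"product-{pid}.html").write_text(page)
```

The resulting files sit in an ordinary public directory with plain URLs, so any crawler can pick them up without ever touching the database.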
Actually, the brain works in a similar way - most of us retain the definitions of thousands of words somewhere in our memories. But I read somewhere that the average adult's monthly vocabulary consists of only a few hundred words. Or was that my wife just describing me at a party the other night. Don't remember right off hand. :-)
Let's make it simpler, Google does NOT crawl dynamic (non-session ID) pages of PR3 or below.
Well, I have 3,000 pages indexed with PR2 from a MySQL database using php?prod='1' with no session ID, so I think you're a little mistaken there. Google will crawl anything that is in the main public directory, is not blocked by robots.txt, and has short URLs.
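The session-ID distinction matters because a crawler has to collapse session-tracked URLs down to one canonical form or it will see the same page under endless different addresses. A rough sketch of that canonicalization (the parameter names are hypothetical examples of session trackers, not an actual crawler's list):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session-tracking parameters a crawler might discard.
SESSION_PARAMS = {"sid", "sessionid", "PHPSESSID"}

def canonicalize(url: str) -> str:
    """Drop session-ID query parameters, keeping the rest in order."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

print(canonicalize("http://example.com/shop.php?prod=1&PHPSESSID=abc123"))
# → http://example.com/shop.php?prod=1
```

A plain ?prod=1 URL survives this untouched, which is consistent with the poster's 3,000 session-ID-free dynamic pages getting indexed without trouble.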
Google indexes web pages even without links to them; this can happen when the Google Toolbar or AdSense appears on the pages.
Even 1 percent of the web is pretty good when you think of the huge amount of data out there.
Interesting things go on in the hidden web, it seems!
Thank goodness Googlebot does not crawl all the stuff on the Web.
Amen! The index is already too big.
It just crawls the relevant pages.
There is so much crap in some sectors that the database is already way too big. Of the 142,000,000 listed results for "travel" we could probably lose 90% of them and not miss a thing. Y! has the right idea by introducing a bit more human review into the process to improve results. It would be great if they used the SiteMatch stuff to include more of the "hidden web", including research databases, patent filings and other things that might be of use to people.
As I mentioned before, this stuff is coming out now because Yahoo is pumping SiteMatch. There is a "hidden web", of course, but it is either useless or duplicate content, secured (private) content, or pay-to-access content (subscription stuff). None of this "hidden web" is ever going to get indexed. One other laughable aspect of SiteMatch is that Yahoo claims that if you pay for inclusion, you will be given a ranking equivalent to sites which have not paid. Say you have an unspiderable site (because some plank of a designer used session IDs on every page - I know, because I've inherited one of these). If you pay, either they keep to their word and you are on the last page of the SERPs - meaning you've wasted your money - or they are lying and you can buy your juicy spot at the top of the results (thus making their SERPs irrelevant, precisely because they can be bought).
"And classified ads... do you really want those coming back in your search results? That's what ebay is for.. ;) "
I can tell you for a fact that it does give you back classifieds in the SERPs... here in France we have an eBay clone by the name of Kelkoo...
For virtually any search term here there are fewer than 10,000 competing pages (French as a language is not as widely used as French people would like to believe)... if I search for my word "widgets" in English I get back nearly 4 million results (the search for the French translation of "widgets" gives just 9,600)...
Search "widgets" in French and guess what... Kelkoo has the number 2 and number 3 slots... the link, when clicked, goes straight into their classified pages for the subject "widgets"...
Imagine trying to SEO your way around that one ..!
BTW... the number one slot is taken by a one-page redirected, cloaked, hidden-texted etc. etc. site... it hasn't ever moved off the spot in 3 years, through all dances, updates, etc. - any time, any day, any datacenter, www2, www3, it's always there... makes you wonder if Google really does have the anti-cheat stuff it says it does, or if it's just there to frighten us into being good guys and not trying this stuff...
Anybody ever meet first-hand someone who did get "banned" from Google for this... or are all the stories just hearsay?
So anyway, the rest of us are just fighting it out over the #4 and below positions... the original glass-ceiling SERPs!
(I just know I'm gonna get crisped for the "French" remark ;) ... true nonetheless!... or "true all the same," for the Francophiles...)
And I have tons of dynamic pages that are less than PR3 that are indexed and rank well. Dynamic content isn't a problem in most cases.