Forum Moderators: open
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
original article [salon.com]
I think this article is referring to the uncrawlable pages: pages on secure servers, pages that require a login, pages crawlers are told not to crawl (robots.txt), pages with excessive session IDs, and so on. If you think about all the pages you have to log in to in your daily work as a webmaster, you can see where they're getting those sorts of figures from.
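As a concrete illustration of the "told not to crawl" case, a site's robots.txt can fence off whole sections from every crawler. This is a generic made-up example (the paths are hypothetical, not from any real site):

```
User-agent: *
Disallow: /account/
Disallow: /checkout/
Disallow: /search
```

Everything under those paths stays out of the index even though the pages are perfectly reachable by a logged-in human, which is exactly the sort of content the article counts as "deep Web".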
Actually, those pages are the ones that Tim (Yahoo) said SiteMatch is aimed at. Uncrawlable.
If they are uncrawlable, there must be a reason. If the reason is that the authors do not want them crawled, there is no sense in crawling them against their will, no matter how much you would like Google to index those pages. That would violate the authors' rights.
If the reason is that the pages employ cloaking or other spamming techniques, intentionally or not, it is up to the site owners to get a clue and fix their pages.
I just do not think that more is better. Google explicitly said they are not going to crawl sites that will only consume resources and pollute their index. The results are clear, their index is the cleanest.
The indexes of competitors are full of garbage and badly sorted results, a consequence of the clueless mentality that more is better. While that mentality persists, Google does not need to do anything to keep its large market lead until its competitors finally see how far they are from what matters.
As others have said, there are good reasons why thousands of pages are uncrawlable - I'm not sure I want my online bank statements showing up in the SERPS.
Can you imagine if Google tried to index all the flight info for every airline in the world? It changes constantly and is usually built in response to a query... even if the airlines tried to produce pages of the most popular flights and times, they would be constantly changing.
And classified ads... do you really want those coming back in your search results? That's what ebay is for.. ;)
The web changes every second. Imagine if XYZ Airline put its "ON TIME / DELAYED / CANCELLED" notification system on the web and updated it every 10 seconds... for every flight and every airport they flew to.
Then there is deep but somewhat static data (archived databases, daily updates, etc.). It's up to the owners of the data to publish it to the web (i.e., generate static HTML pages from the data, put them into a public web directory, and allow them to be indexed) if and when they want to.
10:01:00 ON TIME
10:01:10 ON TIME
10:01:20 ON TIME
10:01:30 ON TIME
10:01:40 ON TIME
10:01:50 ON TIME
10:02:00 ON TIME
10:02:10 DELAYED
10:02:20 ON TIME
10:02:30 DELAYED
10:02:40 ON TIME
10:02:50 DELAYED
10:03:00 ON TIME
10:03:10 DELAYED
10:03:20 DELAYED
10:03:30 DELAYED
10:03:40 DELAYED
10:03:50 DELAYED
10:04:00 DELAYED
10:04:10 DELAYED
10:04:20 DELAYED
10:04:30 CANCELLED
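The "generate static HTML from your data" approach mentioned above can be sketched in a few lines of Python. The database schema, table, and file names here are all hypothetical, just to show the shape of the idea: one crawlable page per database row, dropped into a public directory.

```python
import sqlite3
from pathlib import Path

# Hypothetical catalog: a "products" table with id, name, description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, description TEXT)")
conn.execute("INSERT INTO products VALUES (1, 'Widget', 'A fine widget')")

outdir = Path("public")
outdir.mkdir(exist_ok=True)

# One static, crawlable HTML page per database row.
for pid, name, desc in conn.execute("SELECT id, name, description FROM products"):
    page = (
        f"<html><head><title>{name}</title></head>"
        f"<body><h1>{name}</h1><p>{desc}</p></body></html>"
    )
    (outdir / f"product-{pid}.html").write_text(page)
```

The resulting files sit in an ordinary public directory with plain URLs, so any crawler can pick them up without ever touching the database.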
Actually, the brain works in a similar way - most of us retain the definitions of thousands of words somewhere in our memories. But I read somewhere that the average adult's monthly vocabulary consists of only a few hundred words. Or was that my wife just describing me at a party the other night. Don't remember right off hand. :-)
Let's make it simpler, Google does NOT crawl dynamic (non-session ID) pages of PR3 or below.
Well, I have 3,000 pages indexed with PR2 from a MySQL database using php?prod='1' with no session ID, so I think you're a little mistaken there. Google will crawl anything that is in the main public directory, is not blocked by robots.txt, and has short URLs.
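The session-ID distinction matters because a crawler has to collapse session-tracked URLs down to one canonical form or it will see the same page under endless different addresses. A rough sketch of that canonicalization (the parameter names are hypothetical examples of session trackers, not an actual crawler's list):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session-tracking parameters a crawler might discard.
SESSION_PARAMS = {"sid", "sessionid", "PHPSESSID"}

def canonicalize(url: str) -> str:
    """Drop session-ID query parameters, keeping the rest in order."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

print(canonicalize("http://example.com/shop.php?prod=1&PHPSESSID=abc123"))
# → http://example.com/shop.php?prod=1
```

A plain ?prod=1 URL survives this untouched, which is consistent with the poster's 3,000 session-ID-free dynamic pages getting indexed without trouble.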
Google indexes web pages even without links to them; this can happen when the Google Toolbar or AdSense appears on the pages.
Even 1 percent of the web is pretty good when you think of the huge amount of data out there.
Interesting things go on in the hidden web, it seems!
Thank goodness Googlebot does not crawl all the stuff on the Web.
Amen! The index is already too big.
It just crawls the relevant pages.
There is so much crap in some sectors that the database is already way too big. Of the 142,000,000 listed results for "travel" we could probably lose 90% of them and not miss a thing. Y! has the right idea by introducing a bit more human review into the process to improve results. It would be great if they used the SiteMatch stuff to include more of the "hidden web", including research databases, patent filings and other things that might be of use to people.
As I mentioned before, this stuff is coming out now because Yahoo is pumping SiteMatch. There is a "hidden web", of course, but it is either useless or duplicate content, secured (private) content, or pay-to-access content (subscription stuff). None of this "hidden web" is ever going to get indexed. One other laughable aspect of SiteMatch is that Yahoo claims that if you pay for inclusion, you will be given a ranking equivalent to sites which have not paid. Say you have an unspiderable site (because some plank of a designer used session IDs on every page - I know, because I've inherited one of these). If you pay, either they keep to their word and you are on the last page of the SERPs - meaning you've wasted your money - or they are lying and you can buy your juicy spot at the top of the results (thus making their SERPs irrelevant, precisely because they can be bought).
"And classified ads... do you really want those coming back in your search results? That's what ebay is for.. ;) "
I can tell you for a fact that it does give you back classifieds in the SERPs... here in France we have an eBay clone by the name of Kelkoo...
For virtually any search term here there are fewer than 10,000 competing pages (French as a language is not as widely used as French people would like to believe)... if I search for my word "widgets" in English I get back nearly 4 million results (the search for the French translation of "widgets" gives just 9,600)...
Search "widgets" in French and guess what... Kelkoo has the number 2 and number 3 slots... the link, when clicked, goes straight into their classified pages for the subject "widgets"...
Imagine trying to SEO your way around that one ..!
BTW... the number one slot is taken by a one-page redirected, cloaked, hidden-texted etc. etc. site... it hasn't ever moved off the spot in 3 years, through all dances, updates, etc. - any time, any day, any datacenter, www2, www3, it's always there... makes you wonder if Google really does have the anti-cheat stuff it says it does, or if it's just there to frighten us into being good guys and not trying this stuff...
Anybody ever meet first-hand someone who did get "banned" from Google for this... or are all the stories just hearsay?
So anyway, the rest of us are just fighting it out over the #4 and below positions... the original glass-ceiling SERPs!
(I just know I'm gonna get crisped for the "French" remark ;) ... true nonetheless!... or "true all the same," for the Francophiles...)
And I have tons of dynamic pages that are less than PR3 that are indexed and rank well. Dynamic content isn't a problem in most cases.