Google has always been very good at spidering SERPs on google.com itself; maybe that's "the" reason.
A fun thing to do is to enter portals and search engines into Google, do a site: query on them to see how many of their result pages have been indexed, and then check out the robots.txt on those sites ;)
How to increase the number of indexed pages in your own search engine?
Erm...Delete the robots.txt file of course!
So next month's total number of pages indexed by Google is
6 billion
that should make us the biggest search engine again.
Never mind Alltheweb, you did beat them for a short while :)
[google.com...]
google.com (no www) doesn't need a robots.txt because every page there redirects to the same page on www.google.com
Google doesn't use invisible redirects, so while your browser may follow a 301 or 302 redirect, a robot will not.
As it currently stands, any bot looking for a robots.txt on google.com (without the www) will not find one and will have every right, under the standard set forth in the robots exclusion protocol, to fully spider any portion of that website.
What you fail to understand is a robot can follow any www.google.com link it finds (or on any subdomain at Google for that matter) if it can't find a robots.txt at google.com. Robots.txt files are not supposed to redirect. Even GoogleGuy would agree with that.
But I think they're working on fixing the problem now.
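For anyone who wants to see the "no robots.txt means everything is allowed" behaviour for themselves, here is a minimal sketch using Python's standard urllib.robotparser. It is only an illustration: the bot name "ExampleBot" and the rule set are made up, and an empty rule list is used as a stand-in for a robots.txt that could not be found. No network access is involved.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Build a parser from the lines of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


# Assumption: an empty rule list stands in for a host with no robots.txt.
missing = build_parser([])

# Hypothetical rules, for illustration only.
restricted = build_parser(["User-agent: *", "Disallow: /search"])

# With no rules, nothing is off limits to the bot.
print(missing.can_fetch("ExampleBot", "http://google.com/search?q=test"))

# With the hypothetical rules, /search is blocked but other paths are not.
print(restricted.can_fetch("ExampleBot", "http://www.google.com/search?q=test"))
print(restricted.can_fetch("ExampleBot", "http://www.google.com/about"))
```

Note that this only shows what a well-behaved parser concludes from the rules it is handed. Which host those rules apply to is a separate question.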
What you fail to understand is a robot can follow any www.google.com link it finds (or on any subdomain at Google for that matter) if it can't find a robots.txt at google.com.
That makes no damn sense. You're falling for the "example.com == www.example.com" fallacy. It's not true. "google.com" and "www.google.com" are allowed to be completely different servers (content-wise, physically, and administratively). Quoting from that most Holy of SEO Holies, the robots.txt spec [robotstxt.org]:
The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt".
"local URL", in this (slightly archaic) instance means "relative URI". google.com/robots.txt is not a relative URI for anything on www.google.com, because they've got different absolute roots.
The only robots.txt that applies to a given URI is the robots.txt accessed using the same fully-qualified domain name. That's the spec, and it's the spec for a damn good reason: managing webservers in the .uk TLD (among others) would be chaos if things were done your way.
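To make the point concrete, here is a rough sketch of how a robot could work out which robots.txt governs a given URL. The helper name robots_url_for is made up for this example. It just keeps the scheme and host of the URL being fetched and swaps the path for /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(url):
    """Return the robots.txt URL on the same scheme and host as `url`."""
    parts = urlsplit(url)
    # Keep scheme and host (netloc) exactly; drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("http://www.google.com/search?q=x"))
print(robots_url_for("http://google.com/"))
print(robots_url_for("http://www.example.co.uk/page"))
```

Because the host is carried over untouched, www.google.com and google.com each map to their own robots.txt, and a host under .co.uk is never governed by a file anywhere else in that TLD.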