| 1:07 am on Nov 25, 2010 (gmt 0)|
Note that none of this applies to the new Google Web Preview bot.
It only exists to bypass the robots.txt restrictions placed on Googlebot.
| 3:41 am on Nov 25, 2010 (gmt 0)|
A nice reference about Google's interpretation of robot directives. The only problem is that most of us use robot directives to keep unwanted bots out, rather than let Googlebot in. So Google's interpretation is only of limited value in the real world.
| 4:28 am on Nov 25, 2010 (gmt 0)|
Gee, that's not me. Most bots that I worry about don't even take a look at robots.txt or x-robots directives - and the robots meta tag is irrelevant for them. Those bots need very special treatment.
With Google indexing so important in North America, I'm very happy to know how they interpret these aging technical specs.
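Back to the "very special treatment" bit: since those bots never read robots.txt, the blocking has to happen at the server itself. Here's a minimal Python WSGI sketch of the idea - the user-agent fragments are made-up placeholders, not a recommendation of what to block:

```python
# Minimal WSGI middleware sketch: refuse requests whose User-Agent
# matches a local blocklist. The fragments below are hypothetical
# placeholders, not real bot names.
BLOCKED_UA_FRAGMENTS = ("BadBot", "EvilScraper")

def block_bad_bots(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(frag.lower() in ua.lower() for frag in BLOCKED_UA_FRAGMENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)  # pass legitimate traffic through
    return middleware
```

In practice most people do this in the web server config rather than in application code, but the principle is the same.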
| 5:52 am on Nov 25, 2010 (gmt 0)|
"robots.txt"? We don't need no steenkin' robots.txt
Except for those that do honor whitelisting, etc. Google, as usual, seeks to bend standards to their standard. And, sad to say, the sheeple webmasters will follow.
| 8:33 am on Nov 25, 2010 (gmt 0)|
Useful document, clarifies a lot of things.
Needs very careful reading.
| 9:52 am on Nov 25, 2010 (gmt 0)|
Ironically, a related document with a 2010 (c) date listing their "crawlers" conflicts with it...
Appendix: Google's website crawlers [code.google.com]
Googlebot-News (née Feedfetcher?)
"Google also uses some other user-agents, not listed here, to fetch content in real time in response to a user's action." (That sounds like the troublesome new Google Web Preview [webmasterworld.com]...)
(in no particular order)
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; [...etc.]
Google Web Preview
Google Wireless Transcoder
Google Keyword Generator
Any UA... "(via translate.google.com)"
The Appendix is referenced in G's robots.txt docs rehash and refers to G's listed agents as "crawlers." That could be a doublespeakish way of saying the list-AWOL UAs ignore robots.txt. Because they do.
| 12:20 pm on Nov 25, 2010 (gmt 0)|
|Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404. |
Nice to see that in writing from Google. I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow. Anything more than 2 and you're in the Kiss of Death zone. ;)
It's nice to have an official resource to point to. That robotstxt.org website was fed to the dogs a couple of years ago.
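For anyone curious what that hop cap looks like in practice, here's a rough Python sketch of fetching robots.txt with a 5-redirect limit, treating anything beyond that like a 404 (i.e. no robots.txt). It's just my illustration of the documented behaviour, not how Googlebot actually does it:

```python
import urllib.error
import urllib.parse
import urllib.request

MAX_HOPS = 5  # RFC 1945 (HTTP/1.0) suggests clients stop after 5 redirects

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # Refuse automatic redirects so we can count the hops ourselves.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def fetch_robots(url, max_hops=MAX_HOPS):
    """Fetch robots.txt, following at most max_hops redirects.
    Returns the body text, or None when the hop limit is exceeded
    (which, per Google's doc, gets treated like a 404)."""
    opener = urllib.request.build_opener(_NoRedirect())
    for _ in range(max_hops + 1):
        try:
            with opener.open(url) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            if err.code in (301, 302, 303, 307, 308) and "Location" in err.headers:
                # Follow the redirect manually and count it as one hop.
                url = urllib.parse.urljoin(url, err.headers["Location"])
                continue
            raise
    return None  # hop limit exceeded: treat as "no robots.txt"

# Example: print(fetch_robots("http://www.example.com/robots.txt"))
```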
| 8:04 am on Nov 26, 2010 (gmt 0)|
@tangor, how are Google trying to bend the standard? As far as I can see the differences are the addition of the wildcard character (which all the major search engines support), accepting UTF-8 (with optional BOM), the addition of the sitemap directive, and the limit on the number of redirects. All should be backward compatible under any reasonable circumstances. Am I missing anything that matters?
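For the wildcard point, Google's extension treats "*" as "match any sequence of characters" and "$" as an end-of-URL anchor. A rough Python sketch of how I read that matching (not Google's actual code, and the standard library's robotparser doesn't handle wildcards, so this is hand-rolled):

```python
import re

def rule_to_regex(rule):
    """Turn a robots.txt path rule using Google's extensions
    ('*' = any sequence, trailing '$' = end of URL) into a regex.
    My reading of the documented behaviour, nothing more."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"   # honour the end-of-URL anchor
    return re.compile(pattern)

# Disallow: /*.pdf$  -- block any URL ending in .pdf
blocker = rule_to_regex("/*.pdf$")
print(bool(blocker.match("/docs/report.pdf")))      # True: blocked
print(bool(blocker.match("/docs/report.pdf?x=1")))  # False: query string, not blocked
```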
| 9:00 am on Nov 26, 2010 (gmt 0)|
|I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow. |
It should be noted that the statement you quoted refers specifically to Googlebot's requests for the robots.txt file. In typical Google fashion, it says nothing specific about whether Googlebot will endure more or fewer than 5 hops while requesting robots.txt, nor about how many hops are acceptable for any other URLs Googlebot requests.