
Google's Current Specifications for Robots Directives

     
11:44 pm on Nov 24, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


This is a pretty cool reference - the exact information about Google's current handling of robots directives:

robots.txt specifications [code.google.com]

robots meta tags and x-robots directives [code.google.com]

I've already learned something new: Google will look for and obey an FTP robots.txt file located at ftp://example.com/robots.txt
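
Here's a minimal sketch of that behavior, assuming Python's stdlib as the client (nothing in the docs prescribes a tool; example.com is just the placeholder host): urllib fetches ftp:// URLs the same way it fetches http:// ones.

import urllib.request

# Fetch the FTP-hosted robots.txt that Google says it will look for and obey.
with urllib.request.urlopen("ftp://example.com/robots.txt") as resp:
    print(resp.read().decode("utf-8", errors="replace"))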

And here's the cover page for the entire robots collection [code.google.com].
1:07 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


Note that none of this applies to the new Google Web Preview bot.

It only exists to bypass the robots.txt restrictions placed on Googlebot.

...
3:41 am on Nov 25, 2010 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
posts: 2932
votes: 20


A nice reference about Google's interpretation of robot directives. The only problem is that most of us use robot directives to keep unwanted bots out, rather than let Googlebot in. So Google's interpretation is only of limited value in the real world.
4:28 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Gee, that's not me. Most bots that I worry about don't even take a look at robots.txt or x-robots directives - and the robots meta tag is irrelevant for them. Those bots need very special treatment.
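
For illustration, one flavor of that special treatment as a WSGI sketch - request-time blocking by User-Agent, since a bot that never reads robots.txt can only be stopped when it shows up. The UA substrings below are hypothetical examples, not real bot names.

# A minimal sketch: refuse requests whose User-Agent matches a blocklist.
BLOCKED_UA_SUBSTRINGS = ("BadBot", "ContentScraper")  # hypothetical names

def ua_firewall(app):
    """Wrap a WSGI app and send 403 to blocked user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware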

With Google indexing so important in North America, I'm very happy to know how they interpret these aging technical specs.
5:52 am on Nov 25, 2010 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7574
votes: 512


"robots.txt"? We don't need no steenkin' robots.txt

Except for those that do honor whitelisting, etc. Google, as usual, seeks to bend standards to their standard. And, sad to say, the sheeple webmasters will follow.
8:33 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Useful document, clarifies a lot of things.

Needs very careful reading.
9:52 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Ironically, a related document with a 2010 (c) date listing their "crawlers" conflicts with those specs...

FYI
Appendix: Google's website crawlers [code.google.com]

NEW (?)
Googlebot-News (née Feedfetcher?)
Googlebot-Video
(also)
"Google also uses some other user-agents, not listed here, to fetch content in real time in response to a user's action." (That sounds like the troublesome new Google Web Preview [webmasterworld.com]...)

AWOL
(in no particular order)
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; [...etc.]
Google Web Preview
Google Wireless Transcoder
Google Keyword Generator
Google-Site-Verification
AppEngine-Google spawn
Any UA... "(via translate.google.com)"
NO UA

FWIW
The Appendix is referenced in G's robots.txt docs rehash and refers to G's listed agents as "crawlers." That could be a doublespeakish way of saying the list-AWOL UAs ignore robots.txt. Because they do.
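
To make that distinction concrete, a minimal sketch with Python's stdlib parser (the robots.txt rules and paths here are invented): robots.txt only governs agents that actually request and honor it.

from urllib import robotparser

# An invented robots.txt: Googlebot gets more access than everyone else.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
""".splitlines())

print(rp.can_fetch("Googlebot", "/page.html"))           # True
print(rp.can_fetch("Google Web Preview", "/page.html"))  # False - falls into *
# A UA that never fetches robots.txt is untouched by any of this.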
12:20 pm on Nov 25, 2010 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12172
votes: 60


Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404.


Nice to see that in writing from Google. I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow. Anything more than 2 and you're in the Kiss of Death zone. ;)
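
Here's a minimal stdlib sketch of the behavior that quote describes (an illustration of the hop limit, not Google's actual code): follow at most 5 redirects, then give up and treat the fetch as a 404.

import urllib.request
from urllib.parse import urljoin

MAX_HOPS = 5  # the RFC 1945 allowance Google cites

class _Return3xx(urllib.request.HTTPRedirectHandler):
    # Hand back the 3xx response instead of following it automatically,
    # so the loop below can count hops itself.
    def http_error_302(self, req, fp, code, msg, headers):
        return fp
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib.request.build_opener(_Return3xx())

def fetch_with_hop_limit(url, max_hops=MAX_HOPS):
    for _ in range(max_hops + 1):
        resp = opener.open(url)
        loc = resp.headers["Location"]
        if resp.status in (301, 302, 303, 307) and loc:
            url = urljoin(url, loc)  # one more hop
            continue
        return resp.read()
    return None  # hop limit exceeded: treat it as a 404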

It's nice to have an official resource to point to. That robotstxt.org website was fed to the dogs a couple of years ago.
8:04 am on Nov 26, 2010 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2645
votes: 83


@tangor, how are Google trying to bend the standard? As far as I can see, the differences are the addition of the wildcard character (which all the major search engines support), accepting UTF-8 (with optional BOM), the addition of the sitemap directive, and the limit on the number of redirects. All should be backward compatible under any reasonable circumstances. Am I missing anything that matters?
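
For reference, the wildcard extension boils down to two rules: '*' matches any run of characters, and a trailing '$' anchors the match at the end of the URL path. A minimal sketch of those semantics (an illustration, not Google's matcher):

import re

def rule_to_regex(rule):
    """Translate a Google-style Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + pattern + ("$" if anchored else ""))

rule = rule_to_regex("/private*.html$")
print(bool(rule.match("/private-2010.html")))  # True
print(bool(rule.match("/private.html?x=1")))   # False: '$' anchors the end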
9:00 am on Nov 26, 2010 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10858
votes: 67


I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow.

it should be noted that the statement you quoted refers specifically to googlebot's requests for the robots.txt file. in typical google fashion, it says neither exactly how many hops (more or fewer than 5) googlebot will endure while requesting robots.txt, nor how many hops are acceptable for any other urls googlebot requests.
 
