
Google's Current Specifications for Robots Directives

     
11:44 pm on Nov 24, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


This is a pretty cool reference - the exact information about Google's current handling of robots directives:

robots.txt specifications [code.google.com]

robots meta tags and x-robots directives [code.google.com]

I've already learned something new: Google will look for and obey an FTP robots.txt file located at ftp://example.com/robots.txt
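
Here's a minimal sketch of that behavior, assuming Python's stdlib as the client (nothing in the docs prescribes a tool; example.com is just the placeholder host): urllib fetches ftp:// URLs the same way it fetches http:// ones.

import urllib.request

# Fetch the FTP-hosted robots.txt that Google says it will look for and obey.
with urllib.request.urlopen("ftp://example.com/robots.txt") as resp:
    print(resp.read().decode("utf-8", errors="replace"))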

And here's the cover page for the entire robots collection [code.google.com].
1:07 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
posts:1312
votes: 0


Note that none of this applies to the new Google Web Preview bot.

It only exists to bypass the robots.txt restrictions placed on Googlebot.

...
3:41 am on Nov 25, 2010 (gmt 0)

Senior Member from KZ 

WebmasterWorld Senior Member lammert is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 10, 2005
posts: 2932
votes: 20


A nice reference about Google's interpretation of robot directives. The only problem is that most of us use robot directives to keep unwanted bots out, rather than let Googlebot in. So Google's interpretation is only of limited value in the real world.
4:28 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Gee, that's not me. Most bots that I worry about don't even take a look at robots.txt or x-robots directives - and the robots meta tag is irrelevant for them. Those bots need very special treatment.
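
For illustration, one flavor of that special treatment as a WSGI sketch - request-time blocking by User-Agent, since a bot that never reads robots.txt can only be stopped when it shows up. The UA substrings below are hypothetical examples, not real bot names.

# A minimal sketch: refuse requests whose User-Agent matches a blocklist.
BLOCKED_UA_SUBSTRINGS = ("BadBot", "ContentScraper")  # hypothetical names

def ua_firewall(app):
    """Wrap a WSGI app and send 403 to blocked user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware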

With Google indexing so important in North America, I'm very happy to know how they interpret these aging technical specs.
5:52 am on Nov 25, 2010 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:7574
votes: 512


"robots.txt"? We don't need no steenkin' robots.txt

Except for those that do honor whitelisting, etc. Google, as usual, seeks to bend standards to their standard. And, sad to say, the sheeple webmasters will follow.
8:33 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Useful document, clarifies a lot of things.

Needs very careful reading.
9:52 am on Nov 25, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Ironically, a related document with a 2010 (c) date listing their "crawlers" conflicts with those specs...

FYI
Appendix: Google's website crawlers [code.google.com]

NEW (?)
Googlebot-News (née Feedfetcher?)
Googlebot-Video
(also)
"Google also uses some other user-agents, not listed here, to fetch content in real time in response to a user's action." (That sounds like the troublesome new Google Web Preview [webmasterworld.com]...)

AWOL
(in no particular order)
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; [...etc.]
Google Web Preview
Google Wireless Transcoder
Google Keyword Generator
Google-Site-Verification
AppEngine-Google spawn
Any UA... "(via translate.google.com)"
NO UA

FWIW
The Appendix is referenced in G's robots.txt docs rehash and refers to G's listed agents as "crawlers." That could be a doublespeakish way of saying the list-AWOL UAs ignore robots.txt. Because they do.
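
To make that distinction concrete, a minimal sketch with Python's stdlib parser (the robots.txt rules and paths here are invented): robots.txt only governs agents that actually request and honor it.

from urllib import robotparser

# An invented robots.txt: Googlebot gets more access than everyone else.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
""".splitlines())

print(rp.can_fetch("Googlebot", "/page.html"))           # True
print(rp.can_fetch("Google Web Preview", "/page.html"))  # False - falls into *
# A UA that never fetches robots.txt is untouched by any of this.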
12:20 pm on Nov 25, 2010 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 27, 2001
posts: 12172
votes: 60


Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404.


Nice to see that in writing from Google. I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow. Anything more than 2 and you're in the Kiss of Death zone. ;)
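
Here's a minimal stdlib sketch of the behavior that quote describes (an illustration of the hop limit, not Google's actual code): follow at most 5 redirects, then give up and treat the fetch as a 404.

import urllib.request
from urllib.parse import urljoin

MAX_HOPS = 5  # the RFC 1945 allowance Google cites

class _Return3xx(urllib.request.HTTPRedirectHandler):
    # Hand back the 3xx response instead of following it automatically,
    # so the loop below can count hops itself.
    def http_error_302(self, req, fp, code, msg, headers):
        return fp
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib.request.build_opener(_Return3xx())

def fetch_with_hop_limit(url, max_hops=MAX_HOPS):
    for _ in range(max_hops + 1):
        resp = opener.open(url)
        loc = resp.headers["Location"]
        if resp.status in (301, 302, 303, 307) and loc:
            url = urljoin(url, loc)  # one more hop
            continue
        return resp.read()
    return None  # hop limit exceeded: treat it as a 404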

It's nice to have an official resource to point to. That robotstxt.org website was fed to the dogs a couple of years ago.
8:04 am on Nov 26, 2010 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2645
votes: 83


@tangor, how are Google trying to bend the standard? As far as I can see, the differences are the addition of the wildcard character (which all the major search engines support), accepting UTF-8 (with optional BOM), the addition of the sitemap directive, and the limit on the number of redirects. All should be backward compatible under any reasonable circumstances. Am I missing anything that matters?
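
For reference, the wildcard extension boils down to two rules: '*' matches any run of characters, and a trailing '$' anchors the match at the end of the URL path. A minimal sketch of those semantics (an illustration, not Google's matcher):

import re

def rule_to_regex(rule):
    """Translate a Google-style Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + pattern + ("$" if anchored else ""))

rule = rule_to_regex("/private*.html$")
print(bool(rule.match("/private-2010.html")))  # True
print(bool(rule.match("/private.html?x=1")))   # False: '$' anchors the end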
9:00 am on Nov 26, 2010 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10858
votes: 67


I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow.

it should be noted that the statement you quoted refers specifically to googlebot's requests for the robots.txt file. in typical google fashion, it says neither exactly how many hops (more or fewer than 5) googlebot will endure while requesting robots.txt, nor how many hops are acceptable for any other urls googlebot requests.
 
