Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Google's Current Specifications for Robots Directives
tedster
Msg#: 4234719 posted 11:44 pm on Nov 24, 2010 (gmt 0)

This is a pretty cool reference - the exact information about Google's current handling of robots directives:

robots.txt specifications [code.google.com]

robots meta tags and x-robots directives [code.google.com]

I've already learned something new: Google will look for and obey an FTP robots.txt file located at ftp://example.com/robots.txt.

And here's the cover page for the entire robots collection [code.google.com].
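For anyone who wants to poke at directive handling locally, here's a minimal sketch using Python's standard-library urllib.robotparser with a made-up rule set. One caveat: Python's parser applies rules in file order, while the Google spec documents most-specific-match semantics, so the Allow line is deliberately listed first here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set illustrating User-agent groups, Allow and Disallow.
# Note: Python's parser honors rule ORDER, whereas Google documents
# most-specific-match semantics -- hence Allow before Disallow below.
rules = """\
User-agent: Googlebot
Allow: /private/public-page.html
Disallow: /private/

User-agent: *
Disallow: /tmp/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "/private/public-page.html"))  # True
print(parser.can_fetch("Googlebot", "/private/secret.html"))       # False
print(parser.can_fetch("SomeOtherBot", "/tmp/cache.html"))         # False
print(parser.can_fetch("SomeOtherBot", "/index.html"))             # True
```

The `Googlebot` group takes precedence over the `*` group for Googlebot, which is exactly the group-selection behavior the spec describes.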

 

Samizdata
posted 1:07 am on Nov 25, 2010 (gmt 0)

Note that none of this applies to the new Google Web Preview bot.

It only exists to bypass the robots.txt restrictions placed on Googlebot.

...

lammert
posted 3:41 am on Nov 25, 2010 (gmt 0)

A nice reference about Google's interpretation of robot directives. The only problem is that most of us use robot directives to keep unwanted bots out, rather than let Googlebot in. So Google's interpretation is only of limited value in the real world.

tedster
posted 4:28 am on Nov 25, 2010 (gmt 0)

Gee, that's not me. Most bots that I worry about don't even take a look at robots.txt or x-robots directives - and the robots meta tag is irrelevant for them. Those bots need very special treatment.

With Google indexing so important in North America, I'm very happy to know how they interpret these aging technical specs.

tangor
posted 5:52 am on Nov 25, 2010 (gmt 0)

"robots.txt"? We don't need no steenkin' robots.txt

Except for those that do honor whitelisting, etc. Google, as usual, seeks to bend standards to their standard. And, sad to say, the sheeple webmasters will follow.

g1smd
posted 8:33 am on Nov 25, 2010 (gmt 0)

Useful document, clarifies a lot of things.

Needs very careful reading.

Pfui
posted 9:52 am on Nov 25, 2010 (gmt 0)

Ironically, a related document with a 2010 copyright date listing their "crawlers" conflicts with it...

FYI
Appendix: Google's website crawlers [code.google.com]

NEW (?)
Googlebot-News (née Feedfetcher?)
Googlebot-Video
(also)
"Google also uses some other user-agents, not listed here, to fetch content in real time in response to a user's action." (That sounds like the troublesome new Google Web Preview [webmasterworld.com]...)

AWOL
(in no particular order)
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; [...etc.]
Google Web Preview
Google Wireless Transcoder
Google Keyword Generator
Google-Site-Verification
AppEngine-Google spawn
Any UA... "(via translate.google.com)"
NO UA

FWIW
The Appendix is referenced in G's robots.txt docs rehash and refers to G's listed agents as "crawlers." That could be a doublespeakish way of saying the list-AWOL UAs ignore robots.txt. Because they do.

pageoneresults
posted 12:20 pm on Nov 25, 2010 (gmt 0)

Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404.


Nice to see that in writing from Google. I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow. Anything more than 2 and you're in the Kiss of Death zone. ;)

It's nice to have an official resource to point to. That robotstxt.org website was fed to the dogs a couple of years ago.
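To make the hop counting concrete, here's a toy sketch of the behavior the quoted passage describes. This is not Google's actual code; the 5-hop cutoff and the loop check follow the quoted spec, and `redirect_map` is a hypothetical stand-in for live 3xx responses.

```python
def resolve(url, redirect_map, max_hops=5):
    """Follow redirects up to max_hops, then give up (treat as a 404).

    redirect_map is a hypothetical {url: target} dict standing in for
    real 3xx responses. Returns the final URL, or None for "404".
    """
    seen = {url}
    for _ in range(max_hops):
        if url not in redirect_map:
            return url  # reached a non-redirecting resource
        url = redirect_map[url]
        if url in seen:
            return None  # redirect loop recognized
        seen.add(url)
    # still redirecting after max_hops: treated as a 404
    return None if url in redirect_map else url

chain = {f"/hop{i}": f"/hop{i + 1}" for i in range(6)}  # /hop0 -> ... -> /hop6
print(resolve("/hop3", chain))                  # "/hop6" (3 hops, within the limit)
print(resolve("/hop0", chain))                  # None    (6 hops, over the limit)
print(resolve("/a", {"/a": "/b", "/b": "/a"}))  # None    (loop)
```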

graeme_p
posted 8:04 am on Nov 26, 2010 (gmt 0)

@tangor, how are Google trying to bend the standard? As far as I can see, the differences are the addition of the wildcard character (which all the major search engines support), acceptance of UTF-8 (with optional BOM), the addition of the Sitemap directive, and the limit on the number of redirects. All should be backward compatible under any reasonable circumstances. Am I missing anything that matters?
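For reference, the wildcard extension can be sketched in a few lines. This is an illustrative reimplementation of the documented semantics (`*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is a prefix match), not Google's code:

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    # Translate a robots.txt path rule into a regex: '*' becomes '.*',
    # a trailing '$' becomes an end anchor, all else is matched literally.
    # Plain rules are prefix matches, so no end anchor is added otherwise.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, path) is not None

print(rule_matches("/*.php$", "/index.php"))       # True
print(rule_matches("/*.php$", "/index.php?x=1"))   # False (doesn't end in .php)
print(rule_matches("/private*", "/private/page"))  # True
print(rule_matches("/fish", "/fishing/lake"))      # True (prefix match)
```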

phranque
posted 9:00 am on Nov 26, 2010 (gmt 0)

I've been referencing RFC 1945 for years and some folks have argued with me on how many hops Googlebot will follow.

it should be noted that the statement you quoted refers specifically to googlebot's requests for the robots.txt file, and in typical google fashion it is neither specific about how many hops (more or fewer than 5) googlebot will endure while requesting robots.txt, nor about how many hops are acceptable for any other urls requested by googlebot.
