Forum Moderators: open
And Now Google's Doing It. JS Stats Show GoogleBot
In a nutshell
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.
It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
If you have a User-agent: * section and a User-agent: GoogleBot section, Google completely IGNORES the User-agent: * section. You need to copy the directives from the User-agent: * section into the User-agent: GoogleBot section if you want Google to see them.
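A minimal robots.txt sketch of that behavior: when a Googlebot-specific section exists, shared rules must be repeated inside it or Google never sees them (the disallowed paths here are hypothetical examples):

```
# Googlebot matches only the most specific User-agent section,
# so rules meant for it must be duplicated there.
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: Googlebot
Disallow: /private/
Disallow: /tmp/
```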
It's their web preview ... I guess they think it's cool to break protocol and standards when it comes to making their visitors happy and possibly keeping them on their site rather than just saying 'this site does not allow previews' and sending them to the site in the results.
They're actually making the request too ... The system checks for an X-Forwarded-For so if they were 'proxy requesting' for visitors that click on the preview it should show the visitor's IP Address, not theirs.
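That X-Forwarded-For check could be sketched roughly like this (a hypothetical helper, not tied to any particular server framework):

```python
def client_ip(remote_addr, xff_header=None):
    """Return the originating client IP for a request.

    If the request came through a proxy that sets X-Forwarded-For,
    the first address in that comma-separated list is the original
    client; otherwise the socket's remote address is the client.
    """
    if xff_header:
        # Header format: "client, proxy1, proxy2"
        return xff_header.split(",")[0].strip()
    return remote_addr

# A direct request from the preview fetcher shows Google's own IP:
print(client_ip("66.249.81.5"))
# A true proxy-style request would carry the visitor's IP in the header:
print(client_ip("66.249.81.5", "203.0.113.9, 66.249.81.5"))
```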
The first thing to establish is whether it is a genuine GoogleBot.
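Establishing that can be done with forward-confirmed reverse DNS, the method Google itself documents: reverse-resolve the IP, check that the hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch (the live DNS lookups will obviously depend on your environment):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(host):
    # A genuine crawler's reverse-DNS name ends in googlebot.com or
    # google.com (strip any trailing dot from the PTR record first).
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_genuine_googlebot(ip):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]       # reverse lookup
    except socket.herror:
        return False
    if not is_google_hostname(host):
        return False
    try:
        return socket.gethostbyname(host) == ip  # forward-confirm
    except socket.gaierror:
        return False
```

The forward-confirmation step matters because anyone can point a PTR record at a googlebot.com name; only Google can make the forward lookup resolve back to the same IP.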
Far as I can tell, 64.233.x.x, 72.14.x.x, 74.125.x.x are strictly Google Web Preview
72.14.x.x includes Google Wireless Transcoder, Google Translate
I still haven't seen GoogleBot disregard robots.txt
66.249.71.109 - - [10/May/2011:13:35:00 -0700] "GET /robots.txt HTTP/1.1" 200 480 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
66.249.71.218 - - [10/May/2011:23:45:29 -0700] "GET /{off-limits directory}/{filename1}.html HTTP/1.1" 200 3485 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:01:18:02 -0700] "GET /{off-limits directory}/{filename2}.html HTTP/1.1" 200 2693 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:02:28:06 -0700] "GET /{off-limits directory}/{filename3}.html HTTP/1.1" 200 4324 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:03:38:07 -0700] "GET /{off-limits directory}/{filename4}.html HTTP/1.1" 200 3841 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Googlebots doing their stuff without reference to robots.txt.
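Log entries like the ones above can be scanned mechanically. A rough sketch, assuming an Apache combined-log format like the lines quoted here (the /private/ prefix stands in for whatever directory your robots.txt disallows):

```python
import re

# Minimal pattern for combined-log lines: IP, timestamp, method, path, status.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})')

def disallowed_hits(lines, disallowed_prefix, ua_token="Googlebot"):
    """Yield (ip, timestamp, path) for crawler requests into a disallowed path."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and ua_token in line and m.group(4).startswith(disallowed_prefix):
            yield m.group(1), m.group(2), m.group(4)

sample = ('66.249.71.218 - - [10/May/2011:23:45:29 -0700] '
          '"GET /private/page.html HTTP/1.1" 200 3485 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
print(list(disallowed_hits([sample], "/private/")))
# [('66.249.71.218', '10/May/2011:23:45:29 -0700', '/private/page.html')]
```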
I'm talking real world. G, Y, M$ all crawl disallowed files.
There are worse bots to worry about - ones that offer no potential benefits.
Googlebots doing their stuff without reference to robots.txt
Or using a previously cached copy, perhaps.
IDK how they decide which page to fetch or when and I really don't want to take the time to research it and find out
As on-the-fly rendering is only done based on a user request (when a user activates previews), it’s possible that it will include embedded content which may be blocked from Googlebot using a robots.txt file.
In order for images to be embedded in previews, it is important that they are not disallowed by your robots.txt file. In order to block crawlable images from being indexed, you can use a "noindex" directive in an X-Robots-Tag HTTP header.
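One way to send that header is in the server config. A sketch for Apache (assumes mod_headers is enabled; the matched extensions are just examples):

```
# Keep images crawlable (no robots.txt Disallow) but out of the index.
<FilesMatch "\.(png|jpe?g|gif)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```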
If yours are not being indexed, it's probably because there aren't enough links to them for them to be indexed without the content being known, but it's got nothing to do with the robots.txt block, because those pages can still be indexed as URL-only listings ... Check out the Google forum if you haven't ... This is one of the biggest points of confusion ... If you have seen the Google forum, then you should know locations blocked in robots.txt are easily (and often) indexed.