
Google IP grabbing JS and CSS only - bot?

         

Receptional Andy

1:57 pm on Feb 24, 2009 (gmt 0)



These solitary requests in the log file for a development (although publicly available) site raised an eyebrow:


74.125.75.xx - - [24/Feb/2009:01:23:45 -0500] "GET /robots_excluded/javascript.js HTTP/1.1" 200 701 "http://www.example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7"

74.125.75.xx - - [24/Feb/2009:01:23:45 -0500] "GET /robots_excluded/style.css HTTP/1.1" 200 31117 "http://www.example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7"

Googlebot requested the index page of this site (which references the files above) a couple of days ago. Note that the CSS and JS files are robots-excluded. There were no other requests from this IP.

No rDNS, but it seems like pretty botty behaviour to me. I note the reference to this IP range in the appengine [webmasterworld.com] post as well, but I don't think this is the same thing?
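
For reference, one way to run that check is a forward-confirmed reverse DNS lookup. A minimal PHP sketch (the IP shown is illustrative only, since the logs above mask the final octet):

<?php
// Reverse lookup first; gethostbyaddr() returns the IP string unchanged on failure.
$ip = '74.125.75.1'; // illustrative only
$host = gethostbyaddr($ip);

if ($host === false || $host === $ip) {
    echo "No rDNS for $ip\n";
} elseif (preg_match('/\.google(bot)?\.com$/i', $host)
        && gethostbyname($host) === $ip) {
    // The forward lookup must resolve back to the same IP,
    // otherwise the PTR record could be spoofed.
    echo "Verified Google host: $host\n";
} else {
    echo "Unverified host: $host\n";
}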

incrediBILL

7:51 pm on Feb 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sounds more like a cache server that cached the page but not the other components. Maybe the Web Accelerator, but I thought that was axed.

Samizdata

9:39 pm on Feb 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have always assumed that these requests were part of Google's automated checks to see if anything untoward or deceptive is going on - I have seen them for a long time and allow them.

Both CSS and JavaScript files are disallowed in my robots.txt - which Googlebot honours.

But these requests are never from Googlebot.

...

maybe the Web Accelerator but I thought it was axed

Google have indeed withdrawn it recently, but it is still available elsewhere.

...

enigma1

2:09 pm on Feb 25, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



SEs will access pages and folders even if they are excluded in robots.txt, and Google is no exception. Here are a couple of ways someone can force that, both external:

1. A link posted by another entity elsewhere, e.g. http://www.example.com/robots_excluded/javascript.js

2. From another server, by sending redirect headers in response to a request (e.g. in PHP), forcing the bot to follow the link:

// Issue a 301 so any client that requested this URL, bots included,
// is redirected to the robots-excluded file.
header("HTTP/1.1 301 Moved Permanently");
header('Location: http://www.example.com/robots_excluded/javascript.js');
exit;

Other ways are also possible, and of course internal requests can be forced in a similar manner.

I ran some tests and it seems that the above can happen. I also asked here
[webmasterworld.com...]
about landing pages and Slurp in the past, when I noticed some strange behaviour.

What I haven't confirmed yet is whether pages accessed this way also end up indexed in the SE results. So far I haven't seen them indexed from that alone.

Of course, in your case it may not be foul play; the request may have come as a verification step from Google for the secondary files referenced by the page content.

keyplyr

8:55 am on Feb 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A robots.txt disallow statement is a request that the specific file not be crawled. Most bots then do not request that file, but the standard is purely advisory and nothing enforces compliance.
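
For example, a rule like this (using the directory name from the logs above) only asks compliant bots to stay away:

User-agent: *
Disallow: /robots_excluded/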

Samizdata

2:30 pm on Feb 26, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think the point here is that "Googlebot" does not violate robots.txt instructions (at least in my experience). But that is not the whole story, just good public relations from Mountain View.

In order to check for attempts to manipulate the SERPs, the search engines have to look at things such as CSS and JavaScript (even where disallowed), and the idea that this is always done by manual review seems far-fetched to me, given the size of the web.

I believe these requests are automated checks - a robot from Google that does not declare itself.

I allow them because I accept that they are necessary and I have nothing to hide. From what I have read in this forum, others block them with impunity, which suggests that a manual check will be forced in such circumstances.

Either way, Googlebot itself gets a clean bill of health.

All a bit of a sham, but pragmatism suits me.
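
For anyone who does want to block them, the usual approach is something like this Apache 2.2-era htaccess snippet (the range and extensions are illustrative, not a recommendation):

<FilesMatch "\.(css|js)$">
    Order Allow,Deny
    Allow from all
    Deny from 74.125.75.0/24
</FilesMatch>

Requests for CSS or JS from the denied range then get a 403, while everything else is served normally.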

...