And Now Google's Doing It. JS Stats Show GoogleBot


lucy24 - 11:24 pm on May 14, 2011 (gmt 0)


72.14.x.x includes Google Wireless Transcoder, Google Translate

###. Forgot to check for those-- and Google Translate is used a lot by legitimate visitors to one page.

I still haven't seen GoogleBot disregard robots.txt

I have. Grumbled about it here [webmasterworld.com].

66.249.71.109 - - [10/May/2011:13:35:00 -0700] "GET /robots.txt HTTP/1.1" 200 480 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
...
66.249.71.218 - - [10/May/2011:23:45:29 -0700] "GET /{off-limits directory}/{filename1}.html HTTP/1.1" 200 3485 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:01:18:02 -0700] "GET /{off-limits directory}/{filename2}.html HTTP/1.1" 200 2693 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:02:28:06 -0700] "GET /{off-limits directory}/{filename3}.html HTTP/1.1" 200 4324 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.218 - - [11/May/2011:03:38:07 -0700] "GET /{off-limits directory}/{filename4}.html HTTP/1.1" 200 3841 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
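For anyone who wants to check their own logs for the same thing, here is a rough Python sketch. It assumes Apache combined log format like the lines above and uses the standard library's `urllib.robotparser` to test each Googlebot request against your robots.txt rules. The function names (`parse_line`, `disallowed_hits`) are made up for illustration, and the standard-library matcher is only an approximation of how Google itself interprets robots.txt.

```python
import re
from datetime import datetime
from urllib.robotparser import RobotFileParser

# Apache combined log format:
# ip identd user [time] "method path protocol" status size "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def disallowed_hits(log_lines, robots_txt, agent_token="Googlebot"):
    """List requests by `agent_token` for paths that robots_txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        entry = parse_line(line)
        # Skip unparseable lines and requests from other user agents.
        if entry is None or agent_token not in entry["agent"]:
            continue
        if not rules.can_fetch(agent_token, entry["path"]):
            hits.append(entry)
    return hits
```

Feed it the lines of your access log and the text of your robots.txt; anything it returns is a request that claimed to be Googlebot and asked for a path your rules disallow. It matches on the user-agent string only, so it cannot tell a real Googlebot from something faking the name.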


I decided a while back that Google is "outsourcing" its robots.txt handling. That is, instead of a robot visiting robots.txt, assimilating its contents, and keeping them in mind as it continues its crawl, the robots.txt is picked up by one robot, processed behind the scenes, and only later passed on to the general-crawling robots.

A closer look at some randomly selected raw logs shows a curious pattern. Each request for robots.txt is immediately followed by a single request for a page-- generally one that Google already knows no longer exists (301 or 410)-- and then that Googlebot goes away, to be followed later in the day by other Googlebots doing their stuff without reference to robots.txt.
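The pattern described above can be pulled out of a log with something like the following Python sketch. It assumes Apache combined log format with lines in chronological order, and it pairs each robots.txt fetch with the very next request from the same IP. The function name `followups_after_robots` is hypothetical, and pairing by IP is my own simplification, not anything Google documents.

```python
import re
from datetime import datetime

# Only the fields needed here: client IP, timestamp, requested path, status.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def followups_after_robots(log_lines):
    """For each robots.txt fetch, report the next request from the same IP.

    Returns a list of (ip, path, status, seconds_after_robots_fetch).
    Assumes log_lines are already in chronological order.
    """
    pending = {}  # ip -> time of its latest robots.txt fetch, not yet paired
    results = []
    for line in log_lines:
        m = LOG_RE.match(line)
        if m is None:
            continue
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        ip, path = m["ip"], m["path"]
        if path == "/robots.txt":
            pending[ip] = when
        elif ip in pending:
            # First non-robots request from this IP since its robots.txt fetch.
            gap = (when - pending.pop(ip)).total_seconds()
            results.append((ip, path, int(m["status"]), gap))
    return results
```

If the pattern holds, most of the returned entries should show a gap of a few seconds and a 301 or 410 status, and the same IP should not show up again soon afterwards.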


Thread source: http://www.webmasterworld.com/search_engine_spiders/4312058.htm