Page is not externally linkable
Sgt_Kickaxe - 11:43 pm on Sep 4, 2011 (gmt 0)
Confirmed: Googlebot has requested a page in my 3-day-old honeypot, a page that has never existed and, obviously, has never been linked to. The ONLY places it is even mentioned online are the .htaccess file that controls the honeypot and the robots.txt file itself.
Parsing the logs shows me that Googlebot also likes to check whether my site is WordPress-based by attempting to load example.com/xmlrpc.php, a standard WordPress file. My site isn't WordPress-based, so the request returns an error instead of the WordPress default "XML-RPC server accepts POST requests only." message. I would guess that Google does this to see if the site runs WordPress, or perhaps they want to see if the site is secure? Whichever the case, Googlebot can and does ignore robots.txt on occasion.
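For anyone who wants to run the same check on their own logs, here is a rough Python sketch. It is not the script I used, and it assumes the access log is in Apache's Combined Log Format and that you pass in your own robots.txt text. The function name `disallowed_googlebot_hits` is made up for this example. It matches on the user-agent string only, which anyone can spoof, so treat hits as leads to verify rather than proof.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumed layout (Apache Combined Log Format):
# ip - - [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def disallowed_googlebot_hits(log_lines, robots_txt):
    """Return (ip, time, path, status) for requests whose user agent
    claims to be Googlebot and whose path robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        if "googlebot" not in match["agent"].lower():
            continue  # user-agent check only; this can be spoofed
        if not parser.can_fetch("Googlebot", match["path"]):
            hits.append(
                (match["ip"], match["time"], match["path"], int(match["status"]))
            )
    return hits
```

Feed it the lines of your access log and the contents of robots.txt; anything it returns is a request that claimed to be Googlebot and landed on a disallowed path.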
Perhaps we need a list from Google explaining EXACTLY under which circumstances they will ignore robots.txt, since they are essentially acting like a spam bot or scraper at that point. It's no longer a matter of IF they ignore it but WHY and under what conditions.
P.S. For WordPress-based sites: if you have remote publishing turned off, you can set up .htaccess to immediately ban any IP address attempting to load that file.