I have noticed a few strange 404s generated by Googlebot this week. These are not pages that have been deleted but files that are only available on the server console.
18.104.22.168 - - [23/Dec/2011:11:32:16 +0000] "GET /mysqltuner.pl HTTP/1.1" 404 6426 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"
That is Googlebot requesting the popular mysqltuner.pl script. Why is it requesting this script? It is not on the website. It cannot even be run from the website.
The file is located on the server in a utils directory I have. This is not part of document root and is never run via the website.
Other files include:
show_s, a script I use for examining server mail
tl abc, a script which parses and tails some logfiles
And yet this week Googlebot is attempting to crawl these files.
Let me repeat: these are files that are nowhere on the "website". They are files in secure locations on the server. They are just like files that are in /var/log, /usr/bin, /etc/postfix.
So why the heck is Google requesting these files? How the heck does it know they exist?
It is crawling them because they are in the sitemap. For some reason the sitemap file contains entries like this:
<loc>http://www.example.co.uk/show_s</loc>
Checking the 404 log for the past 3 months I see this:
22.214.171.124 - - [19/Dec/2011:23:06:54 +0000] "GET /show_s HTTP/1.1" 404 6016 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"
This is now self-perpetuating. Googlebot sees the show_s entry in the sitemap and then a few days later requests the page again.
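To see how many of these stray entries the sitemap holds, a check along the following lines might help. It is only a sketch under assumptions: the function name stray_sitemap_entries and the docroot argument are made up for illustration, the sitemap is assumed to use the standard sitemaps.org namespace, and every URL path is assumed to map directly to a file under the document root (which will not hold for dynamically generated pages).

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by standard sitemap files (sitemaps.org protocol 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def stray_sitemap_entries(sitemap_xml, docroot):
    """Return the <loc> URLs whose path does not exist under docroot.

    Hypothetical helper: assumes URL paths map one-to-one to files
    under the document root.
    """
    root = ET.fromstring(sitemap_xml)
    stray = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        url = (loc.text or "").strip()
        # Turn "/show_s" into "show_s" so it can be joined to docroot.
        relative_path = urlparse(url).path.lstrip("/")
        if not os.path.exists(os.path.join(docroot, relative_path)):
            stray.append(url)
    return stray
```

Anything this returns is a candidate for removal from the sitemap, which should break the cycle of Googlebot re-requesting the same missing pages.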
Did I Type it in the Browser?
One reason for this could be that I had mistakenly typed show_s in the browser.
http://www.example.co.uk/show_s
If I did that, it would of course generate a 404, get logged, end up in the sitemap and thus give a crumb for Googlebot to follow.
But there is no way that I entered half a dozen server script commands into a browser's URL bar! And I cannot find my IP address in the 404 logfile.
So what is going on here? Is Google monitoring what I type on a server console via an SSH window?
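The check for my own IP address can be done by hand with grep, but a small script makes it easier to repeat for every suspicious path. The sketch below is an assumption-laden example: the regular expression is modelled on the two log lines quoted above (other log formats would need a different pattern), and requesters_of_404s is a made-up name.

```python
import re
from collections import defaultdict

# Pattern modelled on the log lines quoted in this post:
# ip, two fields, [timestamp], "METHOD path protocol", status, size,
# "referer", "user agent". Trailing fields are ignored.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def requesters_of_404s(lines, paths):
    """Map each watched path to the set of (ip, user agent) pairs
    that received a 404 for it. Lines that do not match are skipped."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("status") == "404" and m.group("path") in paths:
            seen[m.group("path")].add((m.group("ip"), m.group("agent")))
    return dict(seen)
```

Feeding it the lines of the 404 log and a set such as {"/show_s", "/mysqltuner.pl"} shows at a glance whether anything other than Googlebot ever asked for those paths.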