WebmasterWorld
Google Requesting Server Command Line Files
Frank_Rizzo
Strange 404

I have noticed a few strange 404s generated by Googlebot this week. These are not pages that have been deleted but files that are only available on the server console.

188.8.131.52 - - [23/Dec/2011:11:32:16 +0000] "GET /mysqltuner.pl HTTP/1.1" 404 6426 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"

That is Googlebot requesting the popular mysqltuner.pl script. Why is it requesting this script? It is not on the website. It cannot even be run from the website. The file is located on the server in a utils directory I have. This is not part of the document root and is never run via the website.

Other files include:

show_s: a script I use for examining server mail
tl abc: a script which parses and tails some logfiles

And yet this week Googlebot is attempting to crawl these files. Let me repeat: these are files that are nowhere on the "website". They are files in secure locations on the server. They are just like files that are in /var/log, /usr/bin, /etc/postfix.

So why the heck is Google requesting these files? How the heck does it know they exist?

Sitemap

It is crawling them because they are in the sitemap. For some reason the sitemap file contains entries like this:

<loc>http://www.example.co.uk/show_s</loc>

Checking the 404 log for the past 3 months I see this:

184.108.40.206 - - [19/Dec/2011:23:06:54 +0000] "GET /show_s HTTP/1.1" 404 6016 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"

This is now self-perpetuating. Googlebot sees the show_s entry in the sitemap and then a few days later requests the page again.

Did I Type it in the Browser?

One reason for this could be that I had mistakenly typed show_s in the browser:

http://www.example.co.uk/show_s

If I did that it would of course generate a 404, get logged, end up in the sitemap and thus give a crumb for Googlebot to follow.
But there is no way that I entered half a dozen server script commands into a browser's URL bar! And I cannot find my IP address in the 404 logfile. So what is going on here? Is Google monitoring what I type on a server console via an SSH window?
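For anyone retracing a puzzle like this, the check described above (which clients asked for these paths, and with what status) could be sketched roughly as follows. This is only a minimal sketch: it assumes the log layout of the two lines quoted in the post, and the names LOG_LINE, SAMPLE_LINES and requests_for are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the two log lines quoted in the post:
# ip - - [time] "METHOD path PROTO" status bytes "referer" "agent" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The two entries from the post, used here as sample input.
SAMPLE_LINES = [
    '188.8.131.52 - - [23/Dec/2011:11:32:16 +0000] "GET /mysqltuner.pl HTTP/1.1" 404 6426 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"',
    '184.108.40.206 - - [19/Dec/2011:23:06:54 +0000] "GET /show_s HTTP/1.1" 404 6016 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"',
]


def requests_for(lines, paths):
    """Group (time, ip, status, agent) tuples by path, for the watched paths only."""
    hits = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("path") in paths:
            hits[m.group("path")].append(
                (m.group("time"), m.group("ip"), int(m.group("status")), m.group("agent"))
            )
    return dict(hits)


if __name__ == "__main__":
    watched = {"/mysqltuner.pl", "/show_s"}
    for path, entries in sorted(requests_for(SAMPLE_LINES, watched).items()):
        for time, ip, status, agent in entries:
            print(path, time, ip, status)
```

Run over the full access log rather than just the 404 log, the same grouping would also show any earlier request for these paths that got a different status.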
It is crawling them because they are in the sitemap. For some reason the sitemap file contains entries like this: <loc>http://www.example.co.uk/show_s</loc>

I'd say getting those files out of the sitemap is your first order of business. Something isn't right in your sitemap generation.
Frank_Rizzo
I am using the Google sitemap_gen.py script, which generates the sitemap files solely from whatever is in the access_log file. I can see that I should have filtered out the 404s before sitemap_gen is run, but that's a moot point. If there is an entry in the log file for example.com/mysqltuner.pl or example.com/my_custom_SHELL_script_name, what put it there in the first place?
g1smd
My first thought, having read only the first few lines of the original post, was that perhaps Google are compiling a list of compromised servers to then either alert webmasters to potential problems or to be alert to the server filling up with spam at a later date. Having read on, it appears the sitemap generation script and maybe some other processes are flawed/compromised and wholly unsafe.
Frank_Rizzo
The simple answer could be that I typed them into the browser's URL line, and they then got into the sitemap. But this is very unlikely. Not all files on the server are crawled, only the few I use most on the server console.
tedster
According to the help files for the sitemap_gen.py Python sitemap script, the only URLs added to the sitemap from the access logs should be those that got a 200 response:
When reading access log entries, the sitemap generator will include in the sitemap only the URLs that return HTTP response status 200 (OK). It is thus necessary, in order to avoid inclusion of non-existent URLs, to have a website set-up that will return 404 (not found) HTTP response status for non-existent URLs, not a redirection to a page returning HTTP status 200 (OK). [...]
smart-it-consulting.com

So it still sounds like a config problem to me: perhaps your server is not sending a true 404 status in the HTTP header, but only a "404 page" with a 200 status.
Frank_Rizzo
Found it! I traced it to a new server I set up last week. It looks as if I used the home directory of the main site as a transit directory. The new server performed wget example.com/various_script_files in batches. As the files were in the transit directory they *were* visible in public_html and thus crawlable and returning a 200 status, which is how they ended up in the access log and then in the sitemap. My mistake. Thanks for the interest.
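A quick way to catch this kind of slip before a crawler does might look like the sketch below. It simply walks the document root and reports any file whose name matches a list of server-side scripts. The function name exposed_files and the example paths in the usage comment are hypothetical, not anything from the thread's actual setup.

```python
from pathlib import Path


def exposed_files(doc_root, names):
    """Return sorted paths (relative to doc_root) of files whose name is in names."""
    root = Path(doc_root)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and p.name in names
    )


# Hypothetical usage, assuming the site lives under /home/example/public_html:
#   for rel in exposed_files("/home/example/public_html", {"mysqltuner.pl", "show_s"}):
#       print("exposed under document root:", rel)
```

Anything it reports is reachable over HTTP unless the server config blocks it, so it would return a 200 and qualify for the sitemap.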