| 7:08 pm on Dec 23, 2011 (gmt 0)|
|It is crawling them because they are in the sitemap. For some reason the sitemap file contains entries like this: |
<loc>http://www.example.co.uk/show_s</loc>
I'd say getting those files out of the sitemap is your first order of business. Something isn't right in your sitemap generation.
| 7:36 pm on Dec 23, 2011 (gmt 0)|
I am using the Google sitemap_gen.py script. It generates the sitemap files solely from whatever is in the access_log file.
I can see that I should have filtered the 404s out of the log before sitemap_gen is run, but that's beside the point.
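A pre-filter like that is easy to sketch. The following is a minimal example, assuming the server writes the common Apache "combined" log format (adjust the regex if your log lines differ); only entries with a 200 status are passed through to the file that sitemap_gen.py reads:

```python
import re

# Matches the request and status fields of an Apache combined-format
# log line, e.g.: "GET /index.html HTTP/1.1" 200
# (an assumption -- adapt to your server's actual log format)
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def keep_line(line):
    """Keep only entries that returned 200, so 404s and probe
    requests never reach the sitemap generator."""
    m = LOG_LINE.search(line)
    return bool(m) and m.group("status") == "200"

def filter_log(src, dst):
    """Copy src to dst, dropping every non-200 entry."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(line for line in fin if keep_line(line))
```

You would then point sitemap_gen.py at the filtered file instead of the raw access_log.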
If there is an entry in the log file for example.com/mysqltuner.pl or example.com/my_custom_SHELL_script_name what put it there in the first place?
| 7:59 pm on Dec 23, 2011 (gmt 0)|
My first thought, having read only the first few lines of the original post, was that Google might be compiling a list of compromised servers, either to alert webmasters to potential problems or to watch for those servers filling up with spam at a later date.
Having read on, it appears the sitemap generation script and maybe some other processes are flawed/compromised and wholly unsafe.
| 8:40 pm on Dec 23, 2011 (gmt 0)|
The simple answer could be that I typed them into the URL bar myself, and they got into the sitemap that way. But this is very unlikely.
Not all files on the server are crawled - just the few I use most often on the server console.
| 9:07 pm on Dec 23, 2011 (gmt 0)|
According to the help files for the sitemap_gen.py Python sitemap script, the only URLs added to the sitemap from the access logs should be those that got a 200 response:
|When reading access log entries, the sitemap generator will include in the sitemap only the URLs that return HTTP response status 200 (OK). It is thus necessary, in order to avoid inclusion of non-existent URLs, to have a website set-up that will return 404 (not found) HTTP response status for non-existent URLs, not a redirection to a page returning HTTP status 200 (OK). |
So it still sounds like a config problem to me - that your server is not sending a true 404 status in the HTTP header, but only a "404 page" with a 200 status.
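A quick way to rule that out is to check the raw status code the server actually sends, without following redirects. A minimal sketch (the function name and the use of HEAD requests are my own choices, not anything from sitemap_gen.py):

```python
import http.client
from urllib.parse import urlsplit

def head_status(url):
    """Return the raw HTTP status code for url, with no redirect
    following, so a 'soft 404' (a missing page served with 200)
    is immediately visible."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    conn.request("HEAD", parts.path or "/")
    status = conn.getresponse().status
    conn.close()
    return status
```

If `head_status("http://www.example.co.uk/no_such_page")` comes back as 200 (or a 301/302 to a 200 page) rather than 404, the server config, not the sitemap script, is the root of the problem.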
| 10:09 pm on Dec 23, 2011 (gmt 0)|
I traced it to a new server I set up last week. It looks as if I used the home directory of the main site as a transit.
The new server ran wget example.com/various_script_files in batches. Because the files were in the transit directory, they *were* visible in public_html, and so were crawlable and returned a 200 status.
Thanks for the interest.