Google SEO News and Discussion Forum

    
Google Requesting Server Command Line Files
Frank_Rizzo
msg:4400857 - 12:28 pm on Dec 23, 2011 (gmt 0)

Strange 404
I have noticed a few strange 404s generated by Googlebot this week. These are not pages that have been deleted but files that are only available on the server console.

66.249.71.176 - - [23/Dec/2011:11:32:16 +0000] "GET /mysqltuner.pl HTTP/1.1" 404 6426 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"

That is Googlebot requesting the popular mysqltuner.pl script. Why is it requesting this script? It is not on the website and cannot even be run from the website.

The file is located on the server in a utils directory I have. This is not part of document root and is never run via the website.

Other files include:

show_s - a script I use for examining server mail

tl abc - a script which parses and tails some logfiles

And yet this week googlebot is attempting to crawl these files.

Let me repeat: these are files that are nowhere on the "website". They are files in secure locations on the server, just like the files that live in /var/log, /usr/bin or /etc/postfix.

So why the heck is Google requesting these files? How the heck does it know they exist?

Sitemap
It is crawling them because they are in the sitemap. For some reason the sitemap file contains entries like this:

<loc>http://www.example.co.uk/show_s</loc>

Checking the 404 log for the past 3 months I see this:

66.249.72.48 - - [19/Dec/2011:23:06:54 +0000] "GET /show_s HTTP/1.1" 404 6016 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 example.co.uk "-" "-"
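A rough sketch of how that kind of check can be run over the whole log in one pass, tallying every URL Googlebot has received a 404 for. This assumes a standard combined-format access log; the "access_log" filename and the user-agent substring match are illustrative, not part of the actual set-up:

from collections import Counter

# tally every path that Googlebot got a 404 for (combined log format assumed)
hits = Counter()
with open("access_log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        parts = line.split('"')
        try:
            status = parts[2].split()[0]   # ' 404 6016 ' -> '404'
            path = parts[1].split()[1]     # 'GET /show_s HTTP/1.1' -> '/show_s'
        except IndexError:
            continue                       # skip malformed lines
        if status == "404":
            hits[path] += 1

for path, count in hits.most_common():
    print(count, path)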

This is now self-perpetuating. Googlebot sees the show_s entry in the sitemap and then, a few days later, requests the page again.

Did I Type it in the Browser?
One reason for this could be that I had mistakenly typed show_s in the browser.

http://www.example.co.uk/show_s

If I did that it would of course generate a 404, get logged, end up in the sitemap, and thus leave a crumb for Googlebot to follow.

But there is no way that I entered half a dozen server script commands into a browser's URL bar! And I cannot find my own IP address in the 404 logfile.

So what is going on here? Is Google monitoring what I type on a server console via an SSH window?

 

tedster
msg:4400977 - 7:08 pm on Dec 23, 2011 (gmt 0)

It is crawling them because they are in the sitemap. For some reason the sitemap file contains entries like this:

<loc>http://www.example.co.uk/show_s</loc>

I'd say getting those files out of the sitemap is your first order of business. Something isn't right in your sitemap generation.

Frank_Rizzo
msg:4400986 - 7:36 pm on Dec 23, 2011 (gmt 0)

I am using the Google sitemap_gen.py script. It generates the sitemap files solely from whatever is in the access_log file.

I can see that I should have filtered the 404s out of the log before sitemap_gen is run, but that's a moot point.
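A minimal sketch of that kind of pre-filtering, assuming a standard combined-format access log piped in on stdin (the field positions and the suggested filenames are assumptions, not part of the real set-up):

import sys

# keep only access_log lines whose HTTP status is 200, so that sitemap_gen.py
# never sees 404s or other errors (combined log format assumed)
for line in sys.stdin:
    parts = line.split('"')
    try:
        status = parts[2].split()[0]   # ' 200 6426 ' -> '200'
    except IndexError:
        continue                       # skip malformed lines
    if status == "200":
        sys.stdout.write(line)

# usage (hypothetical filenames):
#   python filter_200.py < access_log > access_log.200-only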

If there is an entry in the log file for example.com/mysqltuner.pl or example.com/my_custom_SHELL_script_name what put it there in the first place?

g1smd
msg:4401002 - 7:59 pm on Dec 23, 2011 (gmt 0)

My first thought, having read only the first few lines of the original post, was that perhaps Google are compiling a list of compromised servers, either to alert webmasters to potential problems or to watch for those servers filling up with spam at a later date.

Having read on, it appears the sitemap generation script and maybe some other processes are flawed/compromised and wholly unsafe.

Frank_Rizzo
msg:4401011 - 8:40 pm on Dec 23, 2011 (gmt 0)

The simple answer could be that at some point I typed them into the URL bar and they then got into the sitemap. But this is very unlikely.

Not all files on the server are crawled. It is just the few I use most often on the server console.

tedster
msg:4401023 - 9:07 pm on Dec 23, 2011 (gmt 0)

According to the help files for the sitemap_gen.py Python sitemap script, the only URLs added to the sitemap from the access logs should be those that got a 200 response:

When reading access log entries, the sitemap generator will include in the sitemap only the URLs that return HTTP response status 200 (OK). It is thus necessary, in order to avoid inclusion of non-existent URLs, to have a website set-up that will return 404 (not found) HTTP response status for non-existent URLs, not a redirection to a page returning HTTP status 200 (OK).

[smart-it-consulting.com...]

So it still sounds like a config problem to me - either something in the sitemap generation itself, or your server is not sending a true 404 status in the HTTP header but only a "404 page" with a 200 status.
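One quick way to check which of those it is: request one of the phantom URLs directly and look at the raw status code. A minimal sketch using Python 3's standard library (the URL is the placeholder from this thread, not a live address):

import urllib.request
import urllib.error

def http_status(url):
    # send a plain GET and return the HTTP status code
    req = urllib.request.Request(url, headers={"User-Agent": "status-check"})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status        # 2xx (and followed 3xx) land here
    except urllib.error.HTTPError as err:
        return err.code               # 4xx/5xx raise HTTPError

# a true 404 prints 404; a "soft 404" error page prints 200
print(http_status("http://www.example.co.uk/show_s"))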

Frank_Rizzo
msg:4401037 - 10:09 pm on Dec 23, 2011 (gmt 0)

Found it!

I traced it to a new server I set up last week. It looks as if I used the home directory of the main site as a transit area.

The new server ran wget example.com/various_script_files in batches. Because the files were sitting in that transit directory they *were* visible in public_html, and thus crawlable and returning a 200 status.

My mistake.

Thanks for the interest.
