Error log 404 hits in wrong directory

404 error directory root


Martin_Sach

8:45 am on Sep 15, 2007 (gmt 0)


I use a CGI script, 404helper.cgi, which I have used for years and which sends a regular e-mail listing 404 errors. It sometimes includes the source page for broken links within the site, but mostly it lists just the target file that the user tried to obtain. There is a custom 404 error page in use on the site.

The errors seem to fall into three categories:
1. Hits on pages that used to exist and no longer do. No mystery there.
2. Hits on pages that have never existed. Curious, but it cannot be something under my control.
3. Hits on pages that do exist in a sub-directory, but the 404 error is reported for the same file name in the root directory. This is the phenomenon that I am wondering how to cure. The site uses the UDM Menu system from Brothercake, and I don't know if this has anything to do with it. I am confident that incorrect links, if any, are few and far between. This third type seems to account for the largest number of errors.

Does anyone know what is happening here?
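
To separate the third category from the other two, one rough approach is to check whether the file name from each 404 report exists somewhere below the document root. The following is only a sketch: the function name `find_same_name_in_subdirs` and the example paths are hypothetical, and it assumes that URL paths map directly onto files under a single document root.

```python
import os
from pathlib import Path, PurePosixPath


def find_same_name_in_subdirs(doc_root, missing_url_path):
    """Return site-relative paths of existing files whose file name matches
    the file name of a URL path that produced a 404.

    Assumption: URL paths correspond one-to-one to files under doc_root.
    """
    # Drop any query string, then keep only the final path component.
    name = PurePosixPath(missing_url_path.split("?", 1)[0]).name
    if not name:
        return []
    root = Path(doc_root)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            rel = (Path(dirpath) / name).relative_to(root).as_posix()
            site_path = "/" + rel
            # Skip the requested path itself; only other locations are of interest.
            if site_path != missing_url_path:
                matches.append(site_path)
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical document root and 404'd path, for illustration only.
    print(find_same_name_in_subdirs("/var/www/html", "/widgets.html"))
```

An empty result suggests the request belongs to the first or second category; a non-empty result shows which sub-directory copies the root-level request may have been meant for.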

jdMorgan

2:15 pm on Sep 15, 2007 (gmt 0)


Or it could just be a badly-written site scraper, or even a legitimate search engine robot. Yahoo Slurp, for example, has the annoying habit of trying to fetch 'directories' that are not linked on the site, apparently based upon finding a 'page' in that directory. So, they find /foo/bar.html, and try to fetch /foo/.
This is a problem if the server is configured with "Options -Indexes" and there is no index page at /foo/index.xyz, since the server then returns a 403-Forbidden response, as documented.

I'd suggest you take these e-mail notifications, look up the actual transaction in your raw server access log file based on time and IP address (or remote hostname), and dig into the exact 'circumstances' of the incorrect-directory URL requests. Look at the HTTP referrer (if the client opted to send one) and also look at the requestor's IP and hostname; if the remote host is located at a server farm, it's a good bet that someone was just using a buggy script to scrape your site.
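
The log lookup described above could be sketched roughly as follows. This assumes the access log uses the Apache "combined" format; the function names, sample log lines, and IP addresses are made up for illustration, and the pattern would need adjusting for a different LogFormat.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: Apache "combined" log format. The referrer and user-agent
# fields are optional so that "common" format lines still parse.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def find_requests(lines, host, when, window=timedelta(minutes=2), status="404"):
    """Return entries from `host` with the given status within `window` of `when`.

    `when` must be a timezone-aware datetime.
    """
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if (entry["host"] == host and entry["status"] == status
                and abs(entry["time"] - when) <= window):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Made-up sample lines; in practice, iterate over the real access log file.
    sample = [
        '192.0.2.10 - - [15/Sep/2007:08:40:12 +0000] "GET /widgets.html HTTP/1.1" '
        '404 512 "http://www.example.com/products/index.html" "ExampleBot/1.0"',
        '198.51.100.7 - - [15/Sep/2007:08:41:00 +0000] "GET /old.html HTTP/1.1" '
        '404 512 "-" "Mozilla/5.0"',
    ]
    when = datetime(2007, 9, 15, 8, 40, tzinfo=timezone.utc)
    for hit in find_requests(sample, "192.0.2.10", when):
        print(hit["request"], hit["referer"], hit["agent"])
```

The referrer and user-agent of each match are the fields to inspect first; a reverse lookup of the host would be a separate step not shown here.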

Jim