Hello Forum 39, I'm DerekH - I tend to live in Forum3, where you can see that I've posted quite a lot.
I've just recently put up a new site with a new ISP.
It sits on Unix/Apache and offers me Webalizer stats.
I put up a site just before Christmas, using Dreamweaver, and I'm totally sure the site has no broken links.
I put up a custom 404 page that gives visitors a way back into the site index.
I can see from the Webalizer stats that the 404 page is being served up about 20 times a day.
I can see from the webalizer.current file that the references seem to be coming from outside my site.
And there I'm drawing a blank...
The referrers seem to be bare IP addresses, not URLs, and I can't see WHAT they are trying to access.
Can someone help me find out a little bit more about *why* my 404 page is being served?
Thanks!
DerekH
Try checking your site with Xenu (Xenu's Link Sleuth).
Just checked the site in Dreamweaver again, and it's 100% watertight...
Since it's a new site, I can't imagine where erroneous URLs are coming from...
Thanks for the help - if I can't do it without raw logs, then you've answered my questions!
DerekH
I can't imagine where erroneous URLs are coming from
Log the referrer in the 404 page
Now that sounds like the proper solution, but I don't have the foggiest idea how to do it. I'm quite happy to roll my sleeves up and do what needs doing, but how might I do this? I don't have any databases on the server that I can use to store things. Or have I overlooked a simple way to do this? As I said, I have access to .htaccess, and also PHP and Perl, but not to the raw logs.
Sorry to look so helpless <grin>
DerekH
Log the referrer in the 404 page
Just did a quick test though, and the results weren't very promising. $HTTP_REFERER just gives you the page that requested the offending URL ... not the missing item itself (which is shown on the line above the 404 page access in the raw log file).
HTTP_REFERER tells you the referring page; QUERY_STRING should contain something like "404;http://yoursite.com/missingpage.htm" (under IIS, although there should be something similar under Apache).
Voila, no need to muck around trying to get the logs (although it's always useful to have them!)
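For anyone wanting to try this on Apache without a database, here is a minimal sketch of the idea: have the 404 handler append one line per hit to a flat file. It assumes the error page is a local path in .htaccess (for example `ErrorDocument 404 /404.cgi`, a hypothetical name), because Apache only sets REDIRECT_URL on an internal redirect; a full http:// URL in ErrorDocument causes an external redirect and the original URL is lost. The sketch is in Python purely for illustration, and the function names and the `404.log` file name are made up; the same variables are available in PHP via `$_SERVER` and in Perl via `%ENV`.

```python
import time


def build_404_entry(environ, now=None):
    """Return one tab-separated log line describing a 404 request."""
    # REDIRECT_URL holds the originally requested path when Apache redirects
    # internally to a local ErrorDocument; REQUEST_URI is used as a fallback.
    missing = environ.get("REDIRECT_URL") or environ.get("REQUEST_URI", "-")
    referer = environ.get("HTTP_REFERER", "-")  # linking page, if the client sent one
    client = environ.get("REMOTE_ADDR", "-")
    agent = environ.get("HTTP_USER_AGENT", "-")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    fields = [stamp, client, missing, referer, agent]
    # Collapse tabs and newlines so a crafted header cannot forge extra fields or lines.
    clean = [" ".join(str(field).split()) or "-" for field in fields]
    return "\t".join(clean)


def log_404(environ, path="404.log"):
    """Append the entry to a flat file (hypothetical file name, no database needed)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(build_404_entry(environ) + "\n")
```

In a CGI-style handler you would call `log_404(os.environ)` before printing the friendly error page. Requests with an empty referrer usually mean a typed URL, a bookmark, or a robot rather than a broken link on another site.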
Well, I'm taking it in stages and I'll see what gives...
I get fresh Webalizer stats every 24 hours, so I'm making one change a day and seeing what happens.
First ploy was to move my 404 file somewhere else so I can start counting the new page from zero.
Next was to add a favicon.ico file to see whether that was part of it.
Next I'll add an empty robots.txt
If that doesn't cure it, I'll get into the PHP.
Thanks everyone - I can get there from here now!
DerekH
I have found that you need to have BOTH a robots.txt and a favicon.ico file on your site, as they will be requested by robots and browsers no matter what. I would suggest that instead of using trial and error, you just add both files right away, since each of them will cause 404s if it does not exist.
In more detail:
Any well-behaved bot will check (periodically) for a robots.txt file by making an HTTP request. If you don't have one, the request will generate a 404, just like any other missing file (you will also see 404s for missing collateral files such as jpg, gif, etc.). The simple solution is to put up a blank or simple robots.txt file:
# Standard - Allow Everyone Everywhere Robots Policy File
# server does not parse this file, so no includes allowed
User-agent: *    # all robots
Disallow:        # allow everywhere
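If you want to double-check that an allow-everything policy of this kind really permits every path before uploading it, a rough sketch using Python's standard `urllib.robotparser` could look like this. The helper name `allows_everything`, the bot name, and the sample paths are illustrative assumptions, not anything from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Allow-everyone policy: an empty Disallow value means nothing is off limits.
ROBOTS_TXT = """\
# Standard - Allow Everyone Everywhere Robots Policy File
User-agent: *    # all robots
Disallow:        # allow everywhere
"""


def allows_everything(robots_txt, sample_paths=("/", "/index.html", "/images/logo.gif")):
    """Return True if a generic robot may fetch every sample path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network access
    return all(parser.can_fetch("AnyBot", path) for path in sample_paths)
```

Swapping the last line of the policy for `Disallow: /` should make the same check fail, which is a quick way to confirm the test is actually exercising the rules.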
As to the favicon.ico, MSIE will request that file in the current or root directory of your site if someone bookmarks the page. Many other browsers (Safari, Mozilla, etc.) will also use that image in the address/tab bar for the site. If you don't have a logo for your site, you can find lots of royalty-free icons on the web by searching the usual places.
Ideally, if you had access to your raw logs, you could tell the analysis tool to ignore requests for those files, so that your analysis had the details you need and not the items that you don't care about.
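As a rough illustration of that kind of filtering, assuming you did get raw logs in Apache's common or combined log format, something like the following would list only the 404s worth investigating. The regular expression, the function name, and the ignore list are assumptions made for this sketch and are not a Webalizer feature.

```python
import re

# Matches the start of a common/combined log format line; the referrer is optional.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)")?'
)

# File names that robots and browsers request whether or not anything links to them.
IGNORED_NAMES = ("robots.txt", "favicon.ico")


def interesting_404s(lines, ignored=IGNORED_NAMES):
    """Return (requested path, referrer) pairs for 404s not caused by ignored files."""
    results = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("status") != "404":
            continue
        path = match.group("path").split("?", 1)[0]  # drop any query string
        if path.rsplit("/", 1)[-1] in ignored:  # also catches copies in subdirectories
            continue
        results.append((match.group("path"), match.group("referer") or "-"))
    return results
```

Whatever is left after the filter is either a genuinely broken inbound link (it will have a referrer you can follow up) or a probe for a file that never existed.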
Hope this helps,
Larry