404 error - HTML forum at WebmasterWorld - WebmasterWorld


404 error

A lot of 404 errors that seem to come from nowhere.

Lowkei

12:59 pm on Jul 14, 2005 (gmt 0)

10+ Year Member

Hi all,
I'm new to the webmaster world, and I'm facing some problems with 404 errors. Can you guys give me some guidance? Thanks :)
My site gets a lot of 404 errors, as shown in the server stats. However, there's no referrer address, so I can't track down where my mistake is.

My first question: is there any way to track down those 404s besides clicking every link?

I have actually clicked through every link on my site, and every single link works properly.
Then I found out that the errors happen in my 404.shtml, where I use internal links to point to pages on my site. For example, the page has the URL www.example/abc.htm, and I get a 404 error (URL: www.example/abc/404.htm) when I type www.example/abc.html, with the extra 'l'. Then, in that 404.shtml, I click an internal link (link address: /123.htm) which is supposed to lead to www.example/123.htm. But what happens is that the URL becomes www.example/abc/123.htm, which leads nowhere.

My second question: must we use the full address (with the "www.") in the 404.shtml?

However, after changing all the links in the 404.shtml to the "http://www." form, I tested every link in 404.shtml and found no problems. But the number of 404s in the server stats is still growing.
Can you guys help point out my mistake? Thanks a million in advance. :)

In need of guidance.
Lowkei

James_Lucas

2:59 pm on Jul 14, 2005 (gmt 0)

10+ Year Member

You can download link checking applications from various places on the Internet.

As for your linking problem: when you link using a full address, make sure you add "http://" to it. It doesn't matter whether you add "www" or not (that just depends on the domain's A records, which determine whether the site works with or without the "www").

So for example you'd want to have
<a href="http://server.com/">Link</a>

or

<a href="http://www.server.com/">Link</a>

Without the "http://", the browser treats the address as a relative path, and you're bound to get a 404.
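
To see why the form of the link matters, here is a minimal sketch using Python's standard urllib.parse.urljoin, which resolves links the same way a browser does. The base address is an assumption modelled on Lowkei's description (a 404 page shown for a URL one directory below the root); the resolve helper is only for illustration.

```python
from urllib.parse import urljoin

# Assumed address of the page the visitor is currently on
# (a 404 page served one directory below the site root).
BASE = "http://www.example.com/abc/404.htm"


def resolve(href, base=BASE):
    """Return the absolute URL a browser would request for href."""
    return urljoin(base, href)


if __name__ == "__main__":
    links = (
        "http://www.server.com/",  # full address with scheme
        "www.server.com/",         # host name without scheme: treated as a path
        "/123.htm",                # root-relative: resolved from the site root
        "123.htm",                 # relative: resolved from the current directory
    )
    for href in links:
        print(f"{href!r:28} -> {resolve(href)}")
```

Note that a root-relative link such as /123.htm resolves correctly without "http://"; it is the plain relative form (123.htm) and the scheme-less host name that end up under /abc/.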

- James.

JAB Creations

3:26 pm on Jul 14, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I viciously hunt down 404s and set up 301s in their place, or tell badly behaved bots to go away.

As of version 6.4, the major security bug in AWStats is fixed. Download AWStats (http://awstats.sourceforge.net/); although it can be a pain to install, near the bottom of its report you will find the various HTTP codes, including 404s. It will show you a list of all 404s, the number of requests per 404, and typically the referrer.

While that may help you some, it won't answer all your questions, and you'll still probably get a few 404s that you can't figure out.

The only way to figure those out is via your access logs. An access log is a line-by-line log file. Each line is a single hit, and each file requested is a hit, so a page with 20 images = 21 hits (one file plus 20 images), and therefore your logfile would gain 21 lines. You most likely have Apache (hopefully), and if your host is any good you should have no problem getting (or asking for) your access logs. If you have any problem getting them, simply change hosts, because running a site without logs is hopeless.

To understand many of your 404s, you should look at more than just the lines with a 404. Here is an example of a few hits from an older access log of mine (with the IPs X'ed out).

A quick note: make sure you associate your logfiles with Notepad (on Windows). Open Notepad (before your log) and make sure word wrap is turned off. Next, since you've most likely (and hopefully) downloaded the log from a Linux server, after you extract your logfile (usually from a .gz file), first open the log in WordPad. Just open it, hit the save button, THEN open it in Notepad. Doing that converts the Unix line endings so that Notepad shows one hit per line. I use Notepad afterwards because on my system (which is high-end) Notepad simply appears to be much faster.

xxx.xxx.xx.xxx - - [31/Jul/2004:06:56:20 +0000] "GET /gamer/du HTTP/1.1" 404 1690 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95) VoilaBot BETA 1.2 (http://www.voila.com/)"

All lines start with an IP address. The two dashes are placeholders for the identity and user name (empty here). Then come the date and time, where +0000 is the offset from UTC, then the method (GET) and the path requested.

The path requested is relative to your site root. So "/gamer/du" is really "http://www.example.com/gamer/du".

Next is the HTTP version, then the HTTP status code (in this case 404, which means file not found). After the status code is the number of bytes sent. If the byte counts don't make sense, find a file, write down its size, and find two requests for that file in your log (make sure both have a 200 code, which means "OK"); you'll see the same number of bytes for each request. Add the byte counts up to get your monthly bandwidth. Next is the referrer in quotes ("-" means none was sent), and last is the user agent (UA), which identifies the browser and OS. This UA is a spider called VoilaBot, which thankfully tells us where to get information.
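
The field-by-field reading above can be sketched as a small parser. This is only an illustration, assuming the log uses Apache's "combined" format (the layout of the sample line above); the names LOG_PATTERN and parse_line are made up for this example.

```python
import re
from typing import Optional

# One named group per field of the "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Split one access-log line into named fields, or return None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The VoilaBot hit from the post, with the IP X'ed out.
SAMPLE = (
    'xxx.xxx.xx.xxx - - [31/Jul/2004:06:56:20 +0000] "GET /gamer/du HTTP/1.1" '
    '404 1690 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95) '
    'VoilaBot BETA 1.2 (http://www.voila.com/)"'
)

if __name__ == "__main__":
    for field, value in parse_line(SAMPLE).items():
        print(f"{field:9} {value}")
```

Lines that do not fit the pattern (for example from a custom log format) simply come back as None rather than raising an error.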

Anyway a good method to manually look for 404s is to find " 404 " with a space before and after the "404" as other parts of your access log may contain the numbers 404 elsewhere.
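
As a rough sketch of that search, the snippet below keeps only lines containing " 404 " with the surrounding spaces. The two sample lines are hypothetical (including the ExampleBot user agent); the second one is a successful request that merely happens to contain the digits 404 in its path and byte count.

```python
def lines_with_404_status(lines):
    """Keep only lines that contain ' 404 ' with a space on each side."""
    return [line for line in lines if " 404 " in line]


# Hypothetical log lines for illustration; only the first is a real 404.
SAMPLE_LINES = [
    'xxx.xxx.xx.xxx - - [31/Jul/2004:06:56:20 +0000] '
    '"GET /gamer/du HTTP/1.1" 404 1690 "-" "ExampleBot/1.0"',
    'xxx.xxx.xx.xxx - - [31/Jul/2004:06:57:01 +0000] '
    '"GET /error404.htm HTTP/1.1" 200 14042 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    for line in lines_with_404_status(SAMPLE_LINES):
        print(line)
```

Searching for the bare digits would match both lines here; with the spaces only the genuine 404 is left. A byte count of exactly 404 would still slip through, so treat this as a quick filter rather than a strict one.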

Now here is an example of abuse that you can spot in your access logs.

xxx.xxx.xxx.xx - - [31/Jul/2004:09:21:14 +0000] "GET /cgi-bin/guestbook.cgi HTTP/1.1" 404 1690 "http://www.active-scripts.net/cgi-bin/countsiteclick3.pl?site=3383" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)"

This is an example of a bad bot that is faking its UA (to look like it is using Netscape; it MIGHT actually be using it, but that's unlikely).

The way you can tell this is abuse is that this request (if you saw the whole log file) was the ONLY logged hit from this IP. It went straight for a guestbook script (which doesn't even exist), which tells us that they are interested in guestbooks. Why would someone be interested in guestbooks? Email addresses... and who likes those? Spammers!

If you have set up a bad-bot trap, you could in this instance set up a 301 redirect so that requests for this non-existent file send the bot to the trap file. (I won't explain bot traps here, though.)

Anyway, this line tells us that the referrer was active-scripts.net. But here we know the referrer, and we still have no clue where AWStats is finding 404s without referrers!

Here is an example of a 404 without a referrer!

66.196.90.70 - - [31/Jul/2004:09:25:13 +0000] "GET /gallery/matrix/bonus/viewer-038.htm HTTP/1.0" 404 1690 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Now I see, said the blind man to his deaf dog! Yahoo is what is causing the referrer-less 404s. In my experience Yahoo is a more thorough crawler than Google (search results are a matter of opinion, but Yahoo crawls 2-3 times more, and deeper, than Google).
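
To find out which visitors are behind the referrer-less 404s in your own log, one option is to tally them by user agent. This is a minimal sketch assuming the combined log format; the function name and the third sample line are invented for illustration, while the first two lines are the VoilaBot and guestbook hits shown earlier.

```python
import re
from collections import Counter

# Matches a 404 whose referrer field is "-" and captures the user agent.
NO_REFERRER_404 = re.compile(r'" 404 \S+ "-" "(?P<agent>[^"]*)"')

VOILA_AGENT = (
    "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95) "
    "VoilaBot BETA 1.2 (http://www.voila.com/)"
)


def agents_behind_blind_404s(lines):
    """Count referrer-less 404 hits per user agent."""
    counts = Counter()
    for line in lines:
        match = NO_REFERRER_404.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


SAMPLE_LINES = [
    'xxx.xxx.xx.xxx - - [31/Jul/2004:06:56:20 +0000] '
    '"GET /gamer/du HTTP/1.1" 404 1690 "-" "' + VOILA_AGENT + '"',
    # Has a referrer, so it is not counted.
    'xxx.xxx.xxx.xx - - [31/Jul/2004:09:21:14 +0000] '
    '"GET /cgi-bin/guestbook.cgi HTTP/1.1" 404 1690 '
    '"http://www.active-scripts.net/cgi-bin/countsiteclick3.pl?site=3383" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) '
    'Gecko/20030624 Netscape/7.1 (ax)"',
    # Hypothetical second hit from the same crawler.
    'xxx.xxx.xx.xxx - - [31/Jul/2004:07:02:45 +0000] '
    '"GET /gamer/old HTTP/1.1" 404 1690 "-" "' + VOILA_AGENT + '"',
]

if __name__ == "__main__":
    for agent, hits in agents_behind_blind_404s(SAMPLE_LINES).most_common():
        print(hits, agent)
```

If one crawler dominates the tally, its requests are the ones worth covering with redirects first.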
Yahoo will crawl your links even months after it notes them down to be crawled later.

I have yet to spot an instance of Yahoo making a double 301 request.

A 301 is a permanent redirect (the HTTP status "Moved Permanently"), which Apache can send for you. I believe Yahoo understands HTTP status codes. It will keep looking for that file until 1.) it gives up ~or~ 2.) you set up a 301. Once it sees that 301 code, not only will it stop requesting the missing URL (from my understanding), but it will then request the file that the 301 redirects to.

Here is where Apache comes in REALLY handy. Instead of having dozens, hundreds, even thousands of 404s, you can set up 301s in their place. I actually find it fun, and it's a VERY professional thing to do if you are still learning and need to change file names.

Changing file names is not a bad thing. While changing a file name could be like cutting yourself and then bleeding a little, you can always put a bandaid on that cut.
In your FTP client (as you're not using some lame file browser via a control panel, I hope) you should be able to see a file in your root public directory called ".htaccess". No file name, just an eight-letter extension. If your FTP client does not show it, you either don't have one OR your client does not have the "show hidden files" filter turned on. Look around your FTP client for filters and add "-a".

My base folder is "public_html", so the index.php file there is my .com front page, for example. The .htaccess file should be there as well.

An IMPORTANT note, but with an easy fix! If you mess up your .htaccess file, your site will be completely disabled. If you have a lot of traffic, try this when you know your traffic is light. If you upload and reload your site only to see an error, there are two VERY quick things you can do to keep others from seeing that error.
1.) Delete the .htaccess file on the server. It does not HAVE to be there, and it's better to have none than one that's keeping people from seeing your site while you're testing stuff out for the first time.
2.) Upload a blank version. Usually you can at least see your local copy of the file, but if you can't figure the filter out then you can just CTRL+A, CTRL+X, CTRL+S, upload, CTRL+V and figure it out from there.

Now here is an example of a 301.
Starting from a completely blank .htaccess file, just add this line:
Redirect permanent /home/NAME=none http://www.example.com/home/

That's it! This is an example of a bad bot request that I set up a 301 for.
With .htaccess you can repeat this for each 404 you see. For a site that is supposed to make money, this is a VERY good way to make sure customers FIND the products they are looking for if, for example, you moved the file. Now you regain the chance of them wanting to buy it, versus seeing an "I can't find it, all is doomed!" message or whatever.
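
If you have many moved files, writing one Redirect line per 404 by hand gets tedious. Here is a minimal sketch that generates the lines from a table of old paths and new addresses. The redirect_lines helper and the second entry in MOVED are made up for illustration; the first entry is the example above. Paste the output into .htaccess yourself and test it at a quiet time.

```python
def redirect_lines(moved):
    """Build one 'Redirect permanent' line per old path -> new URL pair."""
    lines = []
    for old_path, new_url in sorted(moved.items()):
        if not old_path.startswith("/"):
            raise ValueError(f"old path must start with '/': {old_path!r}")
        if " " in old_path or " " in new_url:
            # Paths with spaces would need quoting; keep this sketch simple.
            raise ValueError(f"spaces are not handled here: {old_path!r}")
        lines.append(f"Redirect permanent {old_path} {new_url}")
    return lines


# Old path -> new address. The second entry is hypothetical.
MOVED = {
    "/home/NAME=none": "http://www.example.com/home/",
    "/old-page.htm": "http://www.example.com/new-page.htm",
}

if __name__ == "__main__":
    print("\n".join(redirect_lines(MOVED)))
```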
I hope this helps! :-D
[edited by: encyclo at 9:31 pm (utc) on July 14, 2005]

tedster

7:17 pm on Jul 14, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

If you watch your server log's 404 errors, you may find that a number of links exist around the web to a url you do not have, or used to have but no longer have for some reason. You may be able to create a new page for that address that almost instantly ranks because of those inbound links. I've done it several times - it's like "found money".
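
One way to spot such addresses is to count how often each missing URL is requested. The sketch below assumes the combined log format; the function name and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Captures the requested path of any hit that ended in a 404.
REQUEST_404 = re.compile(r'"\S+ (?P<path>\S+) [^"]*" 404 ')


def most_requested_missing(lines, top=10):
    """Return the most frequently requested 404 paths with their counts."""
    counts = Counter()
    for line in lines:
        match = REQUEST_404.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts.most_common(top)


# Hypothetical log lines for illustration.
SAMPLE_LINES = [
    'xxx.xxx.xx.xxx - - [31/Jul/2004:06:56:20 +0000] '
    '"GET /old-page.htm HTTP/1.1" 404 1690 "http://www.example.org/links.htm" "ExampleBrowser/1.0"',
    'xxx.xxx.xx.xxx - - [31/Jul/2004:07:10:02 +0000] '
    '"GET /old-page.htm HTTP/1.1" 404 1690 "-" "ExampleBot/1.0"',
    'xxx.xxx.xx.xxx - - [31/Jul/2004:07:11:40 +0000] '
    '"GET /typo.htm HTTP/1.1" 404 1690 "-" "ExampleBot/1.0"',
    'xxx.xxx.xx.xxx - - [31/Jul/2004:07:12:05 +0000] '
    '"GET /index.htm HTTP/1.1" 200 5120 "-" "ExampleBrowser/1.0"',
]

if __name__ == "__main__":
    for path, hits in most_requested_missing(SAMPLE_LINES):
        print(hits, path)
```

Paths near the top of the list that also show outside referrers are the likeliest candidates for a new page or a 301.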

Lowkei

7:37 pm on Jul 15, 2005 (gmt 0)

10+ Year Member

Aha... I didn't know about the "http" thing. Thanks :)

Anyway a good method to manually look for 404s is to find " 404 " with a space before and after the "404" as other parts of your access log may contain the numbers 404 elsewhere.

JAB Creations, what do you mean by a space before and after the "404"? Can you please explain more? :) Oh, and about the log file: what can I actually learn from it regarding the 404s? I don't really understand that. I have managed to get the Notepad file, which shows a format similar to the one you showed.

I'm using AWStats and I got the list of 404s. But the thing is, the URLs requested are kind of weird. They appear like this: www.example.com/abc/abc/abc/abc/abc/abc.htm. The correct URL should be www.example.com/abc.htm. The same "abc" keeps repeating. :( And sometimes the URL requested comes with symbols such as "20%", like www.example.com/a20%b20%c20%.htm. Any hints on that?

Anyway, thank you guys for the input! :) Sorry if I have asked a stupid question. :)

encyclo

1:23 am on Jul 16, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Sorry if I have asked a stupid question

It's not stupid at all, Lowkei, it's a good question. :)

www.example.com/abc/abc/abc/abc/abc/abc.htm

This first error looks like an infinite loop produced by a server-side redirect. Are you redirecting any pages in your .htaccess, for example?

www.example.com/a20%b20%c20%.htm

In this case, the %20 is an encoded space, so you should look for links with spaces in them.
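
A short sketch of that encoding, using Python's standard urllib.parse: unquote turns %20 back into a space, and quote shows what a browser sends for a path that contains spaces. The paths are modelled on Lowkei's example, with the percent sign placed before the 20.

```python
from urllib.parse import quote, unquote

# Path as it would appear in the log, and the link text that produces it.
ENCODED = "/a%20b%20c%20.htm"
WITH_SPACES = "/a b c .htm"

if __name__ == "__main__":
    print(repr(unquote(ENCODED)))     # the path with real spaces
    print(repr(quote(WITH_SPACES)))   # the path as requested from the server
```

So a logged request full of %20 usually points back to a link or file name that was typed with spaces in it.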

A useful tool for automatically checking links is the W3C link checker:

[validator.w3.org...]

You can check your whole site recursively (down several levels).

Lowkei

3:32 am on Jul 16, 2005 (gmt 0)

10+ Year Member

encyclo, thanks for that. I think I've got it. I will try settling that first. Hehe, too much to test :)

Anyway, I don't think I redirect any pages, as I don't know how to use redirects yet :) Thanks for the input.

JAB Creations

3:42 am on Jul 16, 2005 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

A space as in a space. Notice the two spaces around the 404 inside the quotes. There are five characters: two spaces, two fours, and one zero.

Weird referrers can come from screwy "search" engines: amateurs who use other engines' results and try to create the illusion that they have their own index and search capabilities. In the process they produce bad code, and eventually someone actually uses that "search" and finds your site.