404 error processing

What am I doing wrong?

yosmc

1:28 am on Feb 21, 2003 (gmt 0)

Ages (>6 months) ago I changed my .html pages to .php extensions. I set up a custom 404.php file to redirect visitors to the new locations/file names. The first line of the file says

header ("HTTP/1.0 404 Not Found");

I thought this should be enough to tell the Google Spider that the .html version doesn't exist anymore, but today I have discovered that the .html pages are still in the index. Obviously, the Googlebot follows the redirections in spite of the 404 header.

What am I doing wrong? Does anybody have a suggestion?

Thanks!

kris

2:08 am on Feb 21, 2003 (gmt 0)

I have always heard that you should use 301 redirects to let Google know the page has been permanently moved.

You can find more here:
[google.com...]

yosmc

7:38 pm on Feb 21, 2003 (gmt 0)

Well, the problem is that I use the 404.php file for ALL error processing, and files can go missing for various reasons. I have a very sophisticated system of naming my files so I can move them to different categories and the user will still get there. Also, if not indicated otherwise, errors will simply be forwarded to my main page. In all these cases it would simply not seem legitimate to send a 301 code, and I fear that there would even be a slim chance that Google would be upset if I did.

Also, if it's possible to send a 301 code that is respected by Google, shouldn't it also be possible to send a comprehensible 404? (In other words, maybe my syntax is wrong, or there is a more effective method?)

jdMorgan

8:04 pm on Feb 21, 2003 (gmt 0)

yosmc,

Your method should conform to the intended use of the server codes:

404 - Not Found (This means the user typed in an invalid URL, or the webmaster needs to fix an error on the site)
301 - Moved Permanently (Should be used when you rename a page or resource)
410 - Gone (The resource is gone and has not been replaced)

If you send a proper code in response to a search engine spider's request, most of them will do the right thing. 410's will be removed from their databases quickly, 404's a bit slower (since it may be a mistake), and page/resource URLs returning a 301 will be updated in their database. You may still see occasional requests for missing/relocated resources if the spiders find links to them on other sites, but using the correct codes does help.
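As a rough illustration of the three cases above, here is a minimal PHP sketch. The function name `chooseStatus` and the `$moved` and `$gone` lookup tables are hypothetical and not part of the original posts; a real site would fill them from its own records.

```php
<?php
// Hypothetical helper: pick a status code for a request path.
// $moved maps old paths to their new locations; $gone lists retired paths.
function chooseStatus(string $path, array $moved, array $gone): array
{
    if (isset($moved[$path])) {
        return [301, $moved[$path]]; // renamed: Moved Permanently plus new location
    }
    if (in_array($path, $gone, true)) {
        return [410, null];          // removed on purpose: Gone
    }
    return [404, null];              // unknown URL: Not Found
}
```

The function only decides which code applies; actually sending the status line and any Location header is left to the caller.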

Jim

yosmc

8:35 pm on Feb 21, 2003 (gmt 0)

Jim,

The problem is that it just doesn't work. The html page I have removed >6 months ago is still in there, and now it has the title/description of the .php page. In other words, the Googlebot simply goes beyond...

header ("HTTP/1.0 404 Not Found");

...and follows the redirection. I hope you see my point - it's not a question whether Google removes this entry from the database more or less quickly, but in spite of the 404 header Google believes that the .html page is still there and updates it according to what it finds on the .php page.

jdMorgan

8:50 pm on Feb 21, 2003 (gmt 0)

yosmc,

Use a 301 - That's what the 'bot expects in this case.
A 404 means the resource is not available, but you gave it a link to a page, so it listed it.
A 301 is the proper response.

See RFC2616 at ftp://ftp.isi.edu/in-notes/rfc2616.txt

Jim

andreasfriedrich

8:53 pm on Feb 21, 2003 (gmt 0)

>>simply goes beyond...

>>header ("HTTP/1.0 404 Not Found");

>>...and follows the redirection

How is that possible? Not found and redirection are mutually exclusive. Either you send a 404 code to tell the UA that the resource was not found or you return a 30[12] status code and a Location header field with the URL where the resource can be found. If you send a 404 status code there should be no such Location header. And even if there were one, any HTTP compliant UA would ignore it, since it has no defined meaning for a 404 status code.

Andreas

yosmc

10:07 pm on Feb 21, 2003 (gmt 0)

Ok, I've heard and understood the fact that I should probably be sending a 301 code.

Yet, I share andreasfriedrich's confusion. How is that possible?

Here is the relevant part of my php 404 file:

<?php
header ("HTTP/1.0 404 Not Found");
$uri=getenv("REQUEST_URI");
if (eregi (".html", $uri))
{
    $uri = str_replace(".html", ".php", $uri);
    header("Location: $uri");
}
else
{
    header("Location: [mydomain.com...]
}
?>

1. The first part is handling files that have been "permanently moved", so sending a 301 code would be appropriate. The second part is supposed to handle true 404 errors, so here sending a 301 code would seem wrong. Yet, the only ones I am sending the 404 code for (Search Engine Spiders) don't seem to get the message, and I don't know why.

2. Are you sure that if I follow the same path as above, sending a 301 header first and then a redirect header later on, the Googlebot (and other spiders) will understand what I want? After all, with the 404 header, they obviously didn't... :)

jdMorgan

10:43 pm on Feb 21, 2003 (gmt 0)

yosmc,

Andreas' question had to do with 404 responses, not 301s. A 404 response *can* contain a new resource URL, but it is not part of the header, it is appended as part of the body of an informational page. NO user-agent is required to process that URL, but some (like Google) do.

Best results will be obtained if you follow the formal procedure described in the RFC to the letter, rather than basing your method on "what seems to work for me."

If the resource is totally gone, send a 410.
If the resource has moved or has been replaced with an equivalent, send a 301, plus a new location.
If a user "invents" a URL and the resource does not exist, or if your php script can't handle a situation and fails, make sure your server sends a 404 as a last resort. This should be done as a server configuration, not within a script.

That's how it is supposed to work, and in fact, does work - I have moved/replaced/deleted many pages without ever having any of the problems you describe.

Reading the server response code section of the RFC I cited above may be helpful to you. Read especially the recommendation for sending a 410 when it is known that the resource is gone and has not been replaced. (The idea is that a 404 means that a resource is unavailable for unknown reasons. You should not use it "on purpose" in a script to redirect users - It should be reserved for truly unknown problems. Webmasters who do not include any custom error handling tend to think that 404 is a "normal" code, but you are beyond that and are trying to implement an advanced handler. Your handler should return 301 or a 410 as appropriate, and the server itself should return 404 if and only if your script cannot determine what to do - maybe never.)

I've never used PHP, but I would suggest something more like this (this may not work, it's just an example):


<?php
$uri=getenv("REQUEST_URI");
if (eregi (".html", $uri))
{
    header ("HTTP/1.0 301 Moved Permanently");
    $uri = str_replace(".html", ".php", $uri);
    header("Location: $uri");
}
else
{
    header ("HTTP/1.0 410 Gone");
}
?>

I'm apparently not doing so well at making the concepts of 301 vs. 410, vs. 404 clear, so I'll leave you to the RFC and hope it helps.

Best,
Jim

andreasfriedrich

11:21 pm on Feb 21, 2003 (gmt 0)

Well, I am no longer confused. Your source code helped to enlighten (hope that's not too much of a political statement ;) me. The second header() [php.net] call with the Location string will override the 404 status code since using the Location header automagically sends a 302 status code. See the PHP [php.net] manual.

Jim's code will work, since the 302 status code is not sent if a 3xx status code has already been sent.
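To make this concrete: PHP's `header()` also accepts an optional third argument that sets the response code, so the status and the Location header can be sent in one call. The following is only a minimal sketch, assuming a hypothetical helper `movedLocation()` and assuming that every `.html` path was simply renamed to `.php`.

```php
<?php
// Hypothetical helper: map an old .html request to its .php replacement.
// Returns null when the request does not look like a renamed page.
// Note: the query string, if any, is dropped in this sketch.
function movedLocation(string $requestUri): ?string
{
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path) || !preg_match('/\.html$/i', $path)) {
        return null;
    }
    return preg_replace('/\.html$/i', '.php', $path);
}

// Send the redirect. The third argument sets the status code together with
// the Location header, so PHP does not fall back to its default 302.
function redirectPermanently(string $target): void
{
    header("Location: $target", true, 301);
}
```

A handler would call `movedLocation()` on the request URI and only call `redirectPermanently()` when the result is not null.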

Andreas

jdMorgan

11:33 pm on Feb 21, 2003 (gmt 0)

Wow! - A lucky guess... :)
Jim

yosmc

12:29 am on Feb 22, 2003 (gmt 0)

First of all: Thanks everybody for the help!

What you have said about the 302 status code makes complete sense to me. However, one problem remains: how can I redirect all other file not found errors to my start page? Sending a 410 status code seems the correct thing to do, but if I redirect to the main page afterwards, it will turn into a 302 status code and the Googlebot will happily follow it.

Additional problem: I have also developed the habit of linking to a page that's not there yet. For example, if I post a new article, I already point all links to the place in the archive where I will move the file a week later. If the file isn't in the archive yet, no problem: my 404.php file will automatically forward to the location where I post my current articles. This system has proven to be very handy, but the last thing I can use in this situation is a 302 status code!

yosmc

12:09 pm on Feb 24, 2003 (gmt 0)

Anybody who can help with this one? :)

andreasfriedrich

1:34 pm on Feb 24, 2003 (gmt 0)

>>the last thing I can use in this situation
>>is a 302 status code!

Why is that? The URL identifying your article is the URL pointing to the archive. Temporarily, i.e. while the article is new, it is available at the current article URL. I do not see a problem with that.
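A minimal sketch of that temporary case, assuming hypothetical `/archive/` and `/current/` directories and an `$exists` callback standing in for a real file check (none of these names come from the posts above):

```php
<?php
// Hypothetical helper: if an archive URL has no file yet, return the
// temporary location of the article; otherwise return null.
function temporaryLocation(string $path, callable $exists): ?string
{
    if (strpos($path, '/archive/') !== 0 || $exists($path)) {
        return null;
    }
    return '/current/' . basename($path);
}

// Send the temporary redirect. A 302 signals that the archive URL
// remains the one to keep using.
function redirectTemporarily(string $target): void
{
    header("Location: $target", true, 302);
}
```

Once the article has been moved into the archive, `temporaryLocation()` returns null and no redirect is sent.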

Andreas

yosmc

2:15 pm on Feb 24, 2003 (gmt 0)

Ok, good question! :) Looks like I mixed up 301 and 302 codes. Stupid me...

Of course you're right, a 302 status code would be just the correct thing in such a situation. Thanks! However, it seems like a lot of pages out there have been penalized for using redirects, so I'm simply afraid to make a mistake here. Are you sure that Google likes 302 status codes?

The other problem is that jdMorgan pointed out that I should be sending a 410 status code when a page doesn't exist. Makes sense to me, but how can I redirect to my main page without sending a 302 header instead? Isn't that possible?

Thanks again!

andreasfriedrich

3:22 pm on Feb 24, 2003 (gmt 0)

The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
[faqs.org...] (10.4.5 404 Not Found)

If you have a resource that matches those criteria, return a 410 status code. The body of that message should contain an HTML document with a human-readable explanation of the error. There is nothing that prevents you from adding a link to your homepage, or even adding the content of your homepage to all error pages with only a small yet prominent explanation of the error that occurred.

As for SEs not liking redirection, Google themselves recommend using redirection via 30x status codes. I believe it is safe to use it.

>>how can I redirect to my main page

You can't. 410 Gone is about informing the UA that the requested resource is gone and that there is nowhere else to look for it. Of course you could use some redirection techniques to redirect from the 410 Gone page to your homepage, but I would advise against it.
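As a minimal sketch of a 410 page that links to the homepage instead of redirecting to it (the function names and the `$homeUrl` parameter are assumptions for illustration only):

```php
<?php
// Build the HTML body for a 410 page. $homeUrl is an assumed parameter.
function goneBody(string $homeUrl): string
{
    $href = htmlspecialchars($homeUrl, ENT_QUOTES);
    return "<html><head><title>410 Gone</title></head><body>"
         . "<h1>This page has been removed</h1>"
         . "<p>You may want to visit the <a href=\"$href\">home page</a>.</p>"
         . "</body></html>";
}

// Send the status first, then the body. No Location header is involved,
// so the response stays a 410.
function sendGone(string $homeUrl): void
{
    http_response_code(410);
    echo goneBody($homeUrl);
}
```

Because no Location header is sent, PHP has no reason to replace the 410 with a 302, and human visitors still get a way back to the site.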

Andreas

yosmc

8:43 pm on Mar 7, 2003 (gmt 0)

Just a thought: what if I disallow my 404.php file via robots.txt? Wouldn't that even give the most paranoid webmaster some peace of mind? Or wouldn't that work either for some reason?