header ("HTTP/1.0 404 Not Found");
I thought this should be enough to tell the Google Spider that the .html version doesn't exist anymore, but today I have discovered that the .html pages are still in the index. Obviously, the Googlebot follows the redirections in spite of the 404 header.
What am I doing wrong? Does anybody have a suggestion?
Thanks!
You can find more here:
[google.com...]
Also, if it's possible to send a 301 code that is respected by Google, shouldn't it also be possible to send a 404 that it understands? (In other words, maybe my syntax is wrong, or there is a more effective method?)
Your method should conform to the intended use of the server codes:
404 - Not Found (This means the user typed in an invalid URL, or the webmaster needs to fix an error on the site)
301 - Moved Permanently (Should be used when you rename a page or resource)
410 - Gone (The resource is gone and has not been replaced)
If you send a proper code in response to a search engine spider's request, most of them will do the right thing. 410's will be removed from their databases quickly, 404's a bit slower (since it may be a mistake), and page/resource URLs returning a 301 will be updated in their database. You may still see occasional requests for missing/relocated resources if the spiders find links to them on other sites, but using the correct codes does help.
Jim
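As a minimal sketch, the three codes listed above could be turned into status lines in PHP like this. The helper name status_line is an assumption made for illustration and is not part of the original post.

```php
<?php
// Hypothetical helper: build the HTTP status line for the three codes
// discussed above (301, 404, 410).
function status_line(int $code): string
{
    $reasons = [
        301 => 'Moved Permanently',
        404 => 'Not Found',
        410 => 'Gone',
    ];
    if (!isset($reasons[$code])) {
        throw new InvalidArgumentException("Unsupported status code: $code");
    }
    return "HTTP/1.0 $code {$reasons[$code]}";
}

// Usage (inside a web request, before any output is sent):
// header(status_line(410));
```

The status line has to be sent before any page output; otherwise PHP can no longer change the response headers.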
The problem is that it just doesn't work. The html page I have removed >6 months ago is still in there, and now it has the title/description of the .php page. In other words, the Googlebot simply goes beyond...
header ("HTTP/1.0 404 Not Found");
...and follows the redirection. I hope you see my point - it's not a question whether Google removes this entry from the database more or less quickly, but in spite of the 404 header Google believes that the .html page is still there and updates it according to what it finds on the .php page.
>>header ("HTTP/1.0 404 Not Found");
>>...and follows the redirection
How is that possible? Not found and redirection are mutually exclusive. Either you send a 404 code to tell the UA that the resource was not found or you return a 30[12] status code and a Location header field with the URL where the resource can be found. If you send a 404 status code there should be no such Location header. And even if there were one, any HTTP compliant UA would ignore it, since it has no defined meaning for a 404 status code.
Andreas
Yet, I share andreasfriedrich's confusion. How is that possible?
Here is the relevant part of my php 404 file:
<?php
header ("HTTP/1.0 404 Not Found");
$uri=getenv("REQUEST_URI");
if (eregi (".html", $uri))
{
$uri = str_replace(".html", ".php", $uri);
header("Location: $uri");
}
else
{
header("Location: [mydomain.com...]
}
?>
1. The first part is handling files that have been "permanently moved", so sending a 301 code would be appropriate. The second part is supposed to handle true 404 errors, so here sending a 301 code would seem wrong. Yet, the only ones I am sending the 404 code for (Search Engine Spiders) don't seem to get the message, and I don't know why.
2. Are you sure that if I follow the same path as above, sending a 301 header first and then a redirect header later on, the Googlebot (and other spiders) will understand what I want? After all, with the 404 header, they obviously didn't... :)
Andreas' question had to do with 404 responses, not 301s. A 404 response *can* contain a new resource URL, but it is not part of the header, it is appended as part of the body of an informational page. NO user-agent is required to process that URL, but some (like Google) do.
Best results will be obtained if you follow the formal procedure described in the RFC to the letter, rather than basing your method on "what seems to work for me."
If the resource is totally gone, send a 410.
If the resource has moved or has been replaced with an equivalent, send a 301, plus a new location.
If a user "invents" a URL and the resource does not exist, or if your php script can't handle a situation and fails, make sure your server sends a 404 as a last resort. This should be done as a server configuration, not within a script.
That's how it is supposed to work, and in fact, does work - I have moved/replaced/deleted many pages without ever having any of the problems you describe.
Reading the server response code section of the RFC I cited above may be helpful to you. Read especially the recommendation for sending a 410 when it is known that the resource is gone and has not been replaced. (The idea is that a 404 means that a resource is unavailable for unknown reasons. You should not use it "on purpose" in a script to redirect users - It should be reserved for truly unknown problems. Webmasters who do not include any custom error handling tend to think that 404 is a "normal" code, but you are beyond that and are trying to implement an advanced handler. Your handler should return 301 or a 410 as appropriate, and the server itself should return 404 if and only if your script cannot determine what to do - maybe never.)
I've never used PHP, but I would suggest something more like this (this may not work, it's just an example):
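<?php
// Note: eregi() was removed in PHP 7; preg_match() with the /i flag replaces it.
$uri = getenv("REQUEST_URI");
if (preg_match('/\.html/i', $uri))
{
    // The page was renamed: send a permanent redirect to the .php version.
    header("HTTP/1.0 301 Moved Permanently");
    $uri = str_replace(".html", ".php", $uri);
    header("Location: $uri");
}
else
{
    // Nothing to forward to: report the resource as gone.
    header("HTTP/1.0 410 Gone");
}
?>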
Best,
Jim
Jim's code will work: PHP's header("Location: ...") normally sets a 302 status code on its own, but the 302 is not sent if a 3xx status code has already been sent.
Andreas
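The rule Andreas describes can be modeled in a few lines. This is only a sketch of the behavior documented for PHP's header() function (a Location header sets a 302 unless a 201 or a 3xx status code has already been set). The function name effective_status is hypothetical; the real logic lives inside PHP itself.

```php
<?php
// Hypothetical model of the documented header("Location: ...") rule:
// PHP switches the response status to 302 unless a 201 or a 3xx
// status code has already been set.
function effective_status(int $statusAlreadySet, bool $sendsLocation): int
{
    if (!$sendsLocation) {
        return $statusAlreadySet;
    }
    $isRedirect = $statusAlreadySet >= 300 && $statusAlreadySet < 400;
    if ($statusAlreadySet === 201 || $isRedirect) {
        return $statusAlreadySet;
    }
    return 302;
}

// The original 404.php: 404 first, then Location, so the client sees 302.
echo effective_status(404, true), "\n";
// Jim's version: 301 first, then Location, so the 301 is kept.
echo effective_status(301, true), "\n";
```

This is why the spider kept following the redirect: the 404 sent first was replaced by a 302 as soon as the Location header was added.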
What you have said about the 302 status code makes complete sense to me. However, one problem remains: how can I redirect all other file not found errors to my start page? Sending a 410 status code seems the correct thing to do, but if I redirect to the main page afterwards, it will turn into a 302 status code and the Googlebot will happily follow it.
Additional problem: I have also developed the habit of linking to a page that's not there yet. For example, if I post a new article, I already point all links to the place in the archive where I will move the file a week later. If the file isn't in the archive yet, no problem, my 404.php file will automatically forward to the location where I post my current articles. This system has proven to be very handy, but the last thing I can use in this situation is a 302 status code!
Of course you're right, a 302 status code would be just the correct thing in such a situation. Thanks! However, it seems like a lot of pages out there have been penalized for using redirects, so I'm simply afraid to make a mistake here. Are you sure that Google likes 302 status codes?
The other problem is that jdMorgan pointed out that I should be sending a 410 status code when a page doesn't exist. Makes sense to me, but how can I redirect to my main page without sending a 302 header instead? Isn't that possible?
Thanks again!
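For the "link ahead" case described above, where the archive URL is the permanent one and the article only lives elsewhere for a while, a temporary redirect could be sketched as follows. The function name and the example path are hypothetical.

```php
<?php
// Hypothetical helper: headers for an archive URL whose article has not
// been moved into the archive yet. 302 marks the redirect as temporary,
// so the archive URL remains the one clients should keep using.
function temporary_redirect_headers(string $currentUrl): array
{
    return [
        'HTTP/1.0 302 Moved Temporarily',
        'Location: ' . $currentUrl,
    ];
}

// Usage inside the error handler (the path is made up for illustration):
// foreach (temporary_redirect_headers('/current/article.php') as $line) {
//     header($line);
// }
```

Sending the 302 status line explicitly makes the intent visible in the script instead of relying on the status that the Location header sets implicitly.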
The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code [404] is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable. [faqs.org...] (10.4.5 404 Not Found)
If you have a resource that matches those criteria, return a 410 status code. The body of that message should contain an HTML document with a human-readable explanation of the error. There is nothing that prevents you from adding a link to your homepage, or even adding the content of your homepage to all error pages with only a small yet prominent explanation of the error that occurred.
As for SEs not liking redirection, Google themselves recommend using redirection via 30x status codes. I believe it is safe to use it.
>>how can I redirect to my main page
You can't. 410 Gone is about informing the UA that the requested resource is gone and that there is nowhere else to look for it. Of course you could use some redirection techniques to redirect from the 410 Gone page to your homepage, but I would advise against it.
Andreas
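The suggestion of a 410 response whose body explains the error and links to the homepage might look roughly like this. The function name, the wording of the page, and the "/" homepage URL are assumptions made for illustration.

```php
<?php
// Hypothetical helper: build the HTML body for a 410 Gone response.
// It links to the homepage instead of redirecting to it.
function gone_page(string $requestUri, string $homeUrl = '/'): string
{
    // Escape both values before placing them into HTML.
    $safeUri  = htmlspecialchars($requestUri, ENT_QUOTES, 'UTF-8');
    $safeHome = htmlspecialchars($homeUrl, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html><head><title>410 Gone</title></head><body>\n"
        . "<h1>This page has been removed</h1>\n"
        . "<p>The page {$safeUri} no longer exists and has not been replaced.</p>\n"
        . "<p><a href=\"{$safeHome}\">Continue to the homepage</a></p>\n"
        . "</body></html>\n";
}

// Usage inside the error handler (no Location header, so the 410 is kept):
// header('HTTP/1.0 410 Gone');
// echo gone_page((string) getenv('REQUEST_URI'));
```

Because no Location header is sent, the status stays 410 for spiders, while human visitors still get a way to reach the homepage.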