|Custom Error Pages - Beware the Server Header Status Code|
This is the problem that just won't go away. We've discussed the issue in many threads, but never devoted a topic to it. And it's so widespread that getting it wrong seems to be more common than getting it right!
Although any server can be misconfigured to return an incorrect http status in the server header, Microsoft IIS has a particular liability here. In a nutshell:
- You tried to set up a "custom 404" page, so your users get a friendly, non-geek message.
- When a bad url is requested, the server responds with a 302 redirect to the custom error page (often called 404.html for example). Then the custom message is returned with a 200 OK http status.
The problem here - the server never told the user agent that the status for the original url was supposed to be 404! Instead, it just sent a 302 temporary redirect. Even if the on-page message says "404", that does not mean anything to a crawler, googlebot included.
Under this incredibly common approach to a custom error page, any bad url can be indexed according to the standard handling of an internal 302 redirect:
a) the content of the redirect's target url is indexed
b) with the original url as the location
So the bad urls can start piling up as duplicate urls for the same exact content.
The Microsoft IIS server and .NET platform are particularly vulnerable to this problem. Although the default error handling does return a correct 404 status code, the problem is with the way custom error messages are often set up.
Google does try to catch this problem with test spidering. That's one reason people see googlebot requesting strange urls. If googlebot notices this error handling problem, Google generates a notice in the Webmaster Tools account. Then the webmaster can't even validate the site or access reports until it gets fixed.
However, it is not wise to hand over this responsibility to Google. Get it fixed, so a 404 actually returns a 404 http status code.
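To see what a bad url really returns, it helps to look at every hop rather than only the final page. The following is a minimal Python sketch of such a check, not a replacement for a proper header checker. The names fetch_once and trace_status_chain are made up for this illustration, and the sketch assumes plain http/https urls with no login required.

```python
import http.client
from urllib.parse import urlsplit, urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url):
    """Request one url WITHOUT following redirects; return (status, location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": "status-check-sketch"})
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_status_chain(url, fetch=fetch_once, max_hops=5):
    """Return [(url, status), ...], one entry per hop, following redirects by hand."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current url
            url = urljoin(url, location)
        else:
            break
    return chain
```

Calling trace_status_chain on a url you know does not exist should ideally give a single hop with status 404. A chain whose statuses read 302 then 200 is the misconfiguration described above.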
By the way, there are incorrect instructions all over the place about setting up custom redirects this way on IIS. Don't believe them, not even when they occur in hard cover books with the MS seal of approval on them, not even when they occur on Microsoft's own forums, not even when they occur on blog posts of otherwise very savvy people. If the server header isn't 404, then the potential for trouble is there.
Because the details of a fix can depend on many factors, including the version of IIS, please take up the technical how-to questions in our Microsoft IIS Server forum [webmasterworld.com]. You can use Site Search [webmasterworld.com] to find many related threads that are already published.
[edited by: tedster at 5:42 pm (utc) on April 14, 2008]
For the IIS user, there is one other caution I should mention about 404 handling. If you are using .NET, then there are two levels of error handling: at the IIS level and at the .NET level. It is also common to find that only one of these two levels is set up correctly. So when you're stress testing your URLs, try a bad url with an .asp (.aspx) extension, and also try a bad url with a .htm extension.
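As a rough illustration of that two-extension test, the Python sketch below builds one bad url per extension so each level of error handling gets exercised. The function name probe_urls and the "no-such-page" prefix are invented for this example. The random token is only there to make it very unlikely that the url really exists.

```python
import uuid
from urllib.parse import urljoin


def probe_urls(base_url, extensions=(".htm", ".asp", ".aspx")):
    """Build one almost-certainly-nonexistent url per file extension."""
    token = uuid.uuid4().hex  # random, so the page should not exist
    return [urljoin(base_url, f"/no-such-page-{token}{ext}") for ext in extensions]
```

Each of the returned urls can then be run through a header checker. If the .htm probe and the .aspx probe come back with different statuses, only one of the two levels is set up correctly.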
I felt it was important to focus on this issue in a dedicated thread - it's coming up much too often, both in threads here and in the sites of new clients that I evaluate.
The problem is by no means confined to IIS, either. For example, I see it on Apache servers running Tomcat for .jsp pages.
[edited by: tedster at 5:40 pm (utc) on April 14, 2008]
never neglect your status headers.
A troubling factor is that the web has become so easy to use, you can be a publisher of thousands of websites without understanding the machinery that makes it work. Status headers are like webmaster 101. But many publishers are driving without their WebmasterWorld diploma!
It behooves every good webmaster to make sure their headers are engineered to properly represent the content being delivered. Serving sites with muddled status headers is like driving with the parking brake on. Yeah, once in a while you'll do it. By accident. What I fear is the driver who doesn't know what that lever does, or why their car is making that squealing noise and burning smell.
WebmasterWorld offers this tool in the Control Panel:
Tedster, this is an easy fix, even I figured it out. But it's one that gets overlooked just as easily as it gets fixed.
The problem comes in when IIS is set up for the custom 404 page. You're correct that the header doesn't throw a 404 but a 302 to a 200, so all that needs to be added is this code in the head of the page.
It works fine on all urls that are not on the server.
Response.Status = "404 Not Found"
Make the 404 page an asp page and the problem is fixed.
Thanks for bringing this up. I got a good idea from a fellow member about adding search to the 404 page, so I will do that now.
I won't give details as to how I set it up, but Tedster is correct: follow the instructions for your IIS, point it at the custom 404.asp page, add the code to the head of the page, and check by entering a bad URL.
If the 404 page resolves, then check the same url in a header checker to make sure the status code throws a 404 and nothing but a 404.
I see this stuff all the time, and it's one of many reasons why every site I deal with is instead hosted on Apache - or changes to Apache if I am going to be doing any work on it whatsoever. I see many such howlers all over the place.
Wordpress offers a 301 Redirect to a 404 page. The 404 page does deliver the proper 404 header. What do you think? Is it sufficient? I'd be more comfortable if the system just delivered 404 from the get-go.
A redirect to some other URL that then delivers a 404 status is far from ideal.
|Wordpress offers a 301 Redirect to a 404 page|
|Does your WW control panel's server header checker tell you you're getting a 404 returned? |
If so, it's fine. if not..
|Wordpress offers a 301 Redirect to a 404 page. The 404 page does deliver the proper 404 header. What do you think? Is it sufficient? I'd be more comfortable if the system just delivered 404 from the get-go. |
Use a status checker (like the webmaster world one) and see if a fake page returns 404.
OK I figured out what is going on.
I have a Redirect in my .htaccess that redirects "example.com" to "www.example.com". It's a plain Regex that adds the www's if they're missing.
It does this redirection before Wordpress figures out whether the page exists or not.
thus this URL:
returns a 404 error.
returns 301, pointing Location to the URL above with the "www", which then returns a 404.
Perfect - no. But sufficient?
Good point. I hadn't thought of that at all myself, but I did a check and it does 301 to the 404 page, so it is fine.
As long as the ending page is a 404 and it is a 301 to it, you are fine. Good to go.
tedster fingers the 302 as the culprit in these status header crimes... what about the venerable 301?
As someone who claims to be an expert on handling bad requests, I'm embarrassed not to know whether a chain of 301s ending with a 404 is as good as serving a 404 from the get-go. I have always assumed so.
I doubt I'll be able to sleep soundly until I've cracked open the HTTP spec to confirm this...
y'know, a few years ago, I had problems with IIS and ISAPIRewrite (I forget which version) not doing proper 301's - they were all 302, despite flagging rules with [RP]. It was a real problem. I wonder if it still is?
The oddest status code I've ever dealt with was "999", an unorthodox code returned by certain Yahoo services when you overflow your bandwidth limits. If the HTTP spec doesn't provide an adequate code, are we allowed to make up our own?
Your first post has a 200 return code and the other three are 302's. Now they are all indexed and you will soon be penalized by WebmasterWorld for duplicate content problems ;)
The thing is, with the 301 set up to the www, it is never going to be without the 301. Any request to the non-www 301s to the www version.
I am fine with the 301 header, since the request comes from the non-www with the 404 as the ending page, and I will be able to sleep tonight.
Hey, you better lay off the Dr Peppers, you've got that finger fired up today. :)
|tedster fingers the 302 as the culprit in these status header crimes... what about the venerable 301? |
The indexing problems do seem to be specifically tied to the 302 redirect - that's because the parent url still gets indexed with the target url's content. A 301 redirect does not cause Google to add the originating url to the index, so no such issue ever seems to show up.
In most IIS operations, 302 is the default and you need to check an extra box to make the redirect "permanent". The interface often just calls a 302 status "a redirect" and that lack of clarity is a source of trouble. Even microsoft.com had this issue in several areas, and finally added some vbscript to fix it.
As I mentioned earlier, Google has been aware of this problem for a while and has actively worked to address it. My concern is that even though the bogus urls may no longer show up in the reporting functions, such as the site: operator, they still may be mucking things up in hidden ways.
In short, I no longer trust the site: operator to be accurate as much as I used to, and I much prefer knowing that a true 404 status is actually returned for anything that isn't there.
The canonical 301 redirect (no-www to with-www or vice versa) is what Google recommends, and I've never seen a problem from going from a 301 to a 404. Since I've been having all my clients set up the canonical fix this way for years, both on IIS and Apache, I feel safe with it.
[edited by: tedster at 3:40 am (utc) on April 15, 2008]
Some of the control panel software and big hosting farms do some weird stuff with 404 pages and there are many cases of a page returning 200 named "404.asp" or something equally as obnoxious.
However, with that said, I have witnessed Google, Yahoo and MSN hit my sites with a few random junk page names just to see how the site responds to 404s, and whether it redirects to a custom 404 page that doesn't return the proper code.
Therefore, I would assume they'll eventually figure out the mis-configured 404 page but it may take quite a bit of time and there's really no excuse for not sending the proper error code in the first place.
TIP: You can make a script to display your friendly 404 page that puts a 404 response code in the HTTP header which works 100% and doesn't depend on any server configuration whatsoever.
[edited by: incrediBILL at 10:02 pm (utc) on April 14, 2008]
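A script along the lines of that tip could look something like the Python sketch below. It is only an illustration under stated assumptions: it uses a plain WSGI callable, and the name friendly_404_app and the page text are invented here. The thread does not say which language or framework was meant.

```python
FRIENDLY_404_HTML = """<!DOCTYPE html>
<html>
<head><title>Page not found</title></head>
<body>
<h1>Sorry, that page was not found</h1>
<p>Try the home page or the site search.</p>
</body>
</html>
"""


def friendly_404_app(environ, start_response):
    """WSGI callable: friendly message in the body, real 404 in the status line."""
    body = FRIENDLY_404_HTML.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    # The status set here is what crawlers act on, not the text on the page
    start_response("404 Not Found", headers)
    return [body]
```

To try it locally, wsgiref.simple_server.make_server from the standard library can serve this callable. The point is the same as the ASP one-liner earlier in the thread: the script that prints the friendly page is also the one that sets the 404 status.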
A chain of 301 redirects is not ideal, and a chain of 301 redirects that ends in a 404 status is even less so.
Most search engines seem to get a reasonable clue as to how to correctly handle that.
However, as mentioned, if a 302 redirect gets in there, then all havoc can break loose.
This problem is common on Apache as well. There are many servers out there with a custom 404 error page set up like this:
ErrorDocument 404 http://www.example.com/error404.html
The result of this is a 302 redirect response to a request for the non-existent URL, client redirection to http://www.example.com/error404.html, and a response code on error404.html of 200-OK.
This is the documented behavior [httpd.apache.org] when a full ErrorDocument URL is specified as above. The correct syntax needed to avoid this problem uses only a local URL-path:
ErrorDocument 404 /error404.html
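For anyone auditing a pile of .htaccess or httpd.conf files, a small Python sketch like the one below can flag the risky form. The function name risky_error_documents is made up for this example. It only looks for ErrorDocument lines whose target starts with http:// or https://, and it makes no attempt to parse the rest of the Apache config syntax.

```python
import re

ERROR_DOCUMENT = re.compile(r"ErrorDocument\s+(\d{3})\s+(\S+)", re.IGNORECASE)
FULL_URL = re.compile(r"https?://", re.IGNORECASE)


def risky_error_documents(config_text):
    """Return (code, target) for ErrorDocument lines that point at a full URL."""
    findings = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip commented-out directives
        match = ERROR_DOCUMENT.match(stripped)
        if match and FULL_URL.match(match.group(2)):
            findings.append((int(match.group(1)), match.group(2)))
    return findings
```

Any line it reports is a candidate for switching to the local URL-path form shown above.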
tedster, I set up my custom 404 just as you said, but it didn't throw a good 404. I had to change the name of the 404 page and add the script to the head to get the correct header thrown.
I set up my 404 by going into IIS for the domain, selecting the 404 page, changing it to point to the 404 page name, and checking "permanent".
But when I checked, the header was a 302 to the 404.
I had to do a bunch of looking to find the fix, and it was adding the above script to return the correct header.
My server is using IIS 6.
I am not sure you can set up a good 404 from IIS. I may be wrong, but mine was set up by the book and it didn't work correctly until I added the extra code to fix it.
Even a 302 redirect WITH the appropriate script to change the status of the target custom error message page has not caused duplicate url indexing problems for any site I work with. In other words, 302 to 404 has not caused any problems in practice for me. The problems show up when you serve a 302 to a 200.
A 301 to 200 custom error handling would only mean that the custom error page url gets indexed once. Add a 404 script to that error page, and then it does not even get indexed once - that's what you want.
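The combinations discussed so far can be summed up in a small Python sketch. The function name assess_bad_url_chain and the verdict labels are invented for this illustration. The verdicts only restate the experience reported in this thread for requests to urls that do not exist. They are not documented search engine rules.

```python
def assess_bad_url_chain(statuses):
    """Label the list of status codes seen when requesting a nonexistent url.

    statuses is ordered by hop, e.g. [302, 200] for a redirect to a 200 page.
    """
    if not statuses:
        raise ValueError("need at least one status code")
    *hops, final = statuses
    if final == 404:
        # A direct 404 is the goal; redirects that still end in 404 were
        # reported in the thread as not causing problems in practice.
        return "ideal" if not hops else "acceptable"
    if final == 200 and 302 in hops:
        # The bad url can get indexed with the error page's content.
        return "duplicate-risk"
    if final == 200 and hops and all(code == 301 for code in hops):
        # Only the error page url itself gets indexed, once.
        return "error-page-indexed"
    return "unclassified"
```

A bare 200 with no redirect at all comes back as "unclassified" here only because this sketch sticks to the redirect cases. It is still not a real 404.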
|I am not sure you can set up a good 404 from IIS. I maybe wrong but mine was set up by the book and it didn't work correctly until I added the extra code to fix it. |
All my IIS sites are running .NET, and they let the IIS errors remain default, which is fine. I suspect that you're right, bwnbwn. You need to manually add a script to the custom error page to get the 404 status.
In header checker, my WordPress fake-page request returns:
HTTP/1.1 200 OK
Date: Tue, 15 Apr 2008 13:12:25 GMT
Status: 404 Not Found
What does this mean and what to do if any?
This problem is very common on very large websites (big newspapers, for example) that need to use a caching service. The companies that offer this (e.g. Akamai) always return a 302 redirect in this case, and there is no fix available to the webmaster.
[edited by: Errioxa at 2:44 pm (utc) on April 15, 2008]
rytis, that snippet you posted shows two different status codes, with a 200 showing up first. Very hard to know what's going on with incomplete information, but it looks strange. Read the server header more closely and associate each url request with the status code the server returns for that specific url.
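When reading a dump like that, it can help to separate the HTTP status line from any header that merely happens to be named "Status". The Python sketch below does just that. The function name split_status_sources is invented for this example. Whether the "Status: 404" line in the posted snippet is a leaked CGI-style header is an assumption and cannot be confirmed from the partial output. What is certain is that the code on the "HTTP/1.1 ..." status line is the one a client treats as the response status.

```python
def split_status_sources(raw_headers):
    """Return (status_line_code, status_header_code) from a raw header dump.

    Either value is None when that kind of line is not present.
    """
    status_line_code = None
    status_header_code = None
    for line in raw_headers.strip().splitlines():
        line = line.strip()
        if line.upper().startswith("HTTP/"):
            # e.g. "HTTP/1.1 200 OK" -> 200
            status_line_code = int(line.split()[1])
        elif line.lower().startswith("status:"):
            # e.g. "Status: 404 Not Found" -> 404
            status_header_code = int(line.split(":", 1)[1].split()[0])
    return status_line_code, status_header_code
```

If the two values disagree, as they do in the snippet above, the response is effectively a 200 as far as a crawler is concerned, whatever the extra header says.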