Forum Moderators: phranque

Correctly Creating a 404 that generates a 404

A little research . . .a few questions . . .

         

rocknbil

10:01 pm on Aug 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One of the most discussed topics here is the importance of a 404 page returning a "real" 404 header. Amazingly, there are thousands of "custom 404 error tutorials" on the web that don't touch this issue. If you set up a custom 404 using .htaccess, the server will still return a status of 200 OK, because the original status is lost in the redirect. This is a "big deal."

For the unaware, let's take the worst case: you set up a 404 to redirect to the home page. Page not found, put them back on the main page. Bad idea: if the search engines capture a few bad links, they are going to index your 404 page, which is duplicate content for your main page and can earn your main page a duplicate content penalty. OUCH.

Secondly, many have had trouble creating their Google site verification page because they use a custom 404 page, as Google won't verify a site that returns a 200 for a non-existent page.

So I've spent a few hours this AM looking around. Unfortunately, I was only able to locate two veins of comments on how to get a custom error page to return a 404 header. The first was helpful comments that supplied no real resolution:

a URL that does not exist should be returning a 404 [Not Found]

Some domains have improperly configured their custom error page . . .


You need to make sure your custom error page returns a proper 404 header. If you're returning a 200 OK, Sitemaps won't verify your site.

OKAY, we get it, the custom error "page" needs to generate a 404 header. So how do we do that? The reason for quoting "page" became apparent when I encountered this:

DO NOT use static html pages for this -- you can't change the result code on a static html page.

Now we're getting down to it: when we think of "pages" most of us think of static html. An .asp page, a php page, or a perl script is really dynamic page output and not a static page. Yet almost every resource on the web on custom error pages instructs you to create a static page as the custom error page.

<debate>
The second bit of misinformation I encountered is that using a forward-slash path in your directive will preserve the 404 status:

ErrorDocument 404 /my_error.html

This is also (apparently?) false. I have tested it thoroughly: an error document at root still returns 200 headers when the page is displayed, because it loses the original 404 in the redirect.
</debate>

The second vein of comments has to do with solving this issue but what I still consider workarounds:

Temporarily disable custom 404 error pages for your site/Check if your web server return correct 404 code for non existing pages/verify file/Restore your 404 custom error pages.

This will work for getting past the Google verification, but you still have a permanent issue with non existent pages generating a 200 OK.

Because static HTML can't set HTTP headers, a scripting method is the suggested approach in most cases.

PHP:


<?php
header("HTTP/1.0 404 Not Found");
?>

ASP:


<%
Response.Status = "404 Not Found"
%>

Perl:


# The CGI Status header takes only the code and reason phrase, with no "HTTP/1.1" prefix.
print "Status: 404 Not Found\n";
print "Content-type: text/html\n\n";

So here is the question: I still consider these workarounds. I have my perl scripts printing the 404 status where appropriate, but it seems like these shouldn't be necessary. Is there a CORRECT way to simply use the 404 directive to our advantage without relying on an external tool to return the appropriate header?

jdMorgan

10:23 pm on Aug 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I guess I don't understand the question.

If you have a dirt-simple site with no custom error document and no URL-rewriting going on, then any request for a URL that does not resolve to an existing file or directory-index returns a default server-generated 404 page and a 404-Not Found header.

If you declare a custom error document like so:

 ErrorDocument 404 /my-404-page.html 

and still do no URL rewriting, then any request for a URL that does not resolve to a file returns that custom 404 page and a 404-Not Found header.

If you declare your error page like this:

 ErrorDocument 404 http://www.example.com/my-404-page.html 

then you'll get a 302-Found redirect server response when you expect a 404. This behaviour is documented.

Now if you're doing a bunch of rewriting, using content-negotiation, or using AcceptPathInfo to pass control of some or all URL-requests to scripts, then you can get any number of interactions, and many of them will result in a 200-OK response. If a rewrite is too broad (i.e. if its pattern is too ambiguous) this can easily happen. And if a script is used to handle most or all URL-requests, then that script must properly define and handle "missing page" responses.
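As a rough illustration of that last point, here is a minimal Python sketch of a catch-all handler that decides for itself when a page is missing. It assumes a WSGI-style script that receives every request; the PAGES table and the app function are hypothetical stand-ins for a real database lookup, not code from any actual site.

```python
# Hypothetical content table standing in for a database lookup.
PAGES = {
    "/": "<h1>Home</h1>",
    "/widgets": "<h1>Widgets</h1>",
}

NOT_FOUND_HTML = '<h1>Not found</h1><p>Try the <a href="/">home page</a>.</p>'


def app(environ, start_response):
    """Catch-all WSGI handler that sets the status itself for every request."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        # Unknown URL: send the 404 status and the helpful page in ONE response.
        status = "404 Not Found"
        body = NOT_FOUND_HTML
    else:
        status = "200 OK"
    data = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(data))),
    ])
    return [data]
```

With something like this, a request for an unknown path gets the helpful body and the 404 status in the same response, and no redirect is involved.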

I have warned against using dynamic error pages, but not because it has any direct effect on the server response code; I'm against introducing additional layers of complexity on top of an error condition, so I say that error documents should be dirt-simple static html pages with few or no external dependencies -- No external images, JavaScripts, CSS, PHP or SSI include files, etc. That is so that if you get one error caused by or affecting any of these potentially-external resources, you won't get a cascade of errors. Such an error cascade can sometimes make the root problem very hard to find.

Jim

[edited by: jdMorgan at 10:24 pm (utc) on Aug. 6, 2007]

jdMorgan

10:27 pm on Aug 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I should also add that if you directly request your error document (example.com/my-404-page.html in the post above), then of course it will return a 200-OK response, because that URL resolves to an existing file.

Jim

rocknbil

2:32 am on Aug 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you again, Jim. :-) Yes, I'm aware that the Apache docs describe the http:// form and call for a local path as the ErrorDocument. More tersely,

ErrorDocument 404 /my_error.html

was returning a status 200 when requesting domainname/somenonexistentpage.html. The question is how to ensure it returns a 404 at the server level, rather than creating a scripted header. But you are correct, there may be other factors at play, with mod_rewrite munging the headers. I will look into it.

jdMorgan

3:08 am on Aug 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well with that ErrorDocument declaration, and no rewriting (of any kind) or scripts on the server, you will indeed get a 404 response if you request a non-existent page.

So question number one is, "Do you use any scripts on this site?"
And question two is, "If so, how do you transfer control to them?"

If the answer is that mod_rewrite, content-negotiation, or AcceptPathInfo is involved, then each of those brings its own complications and all may involve changing the default server error handling.

For example, the box-stock WordPress mod_rewrite code basically says, "If the requested URL does not resolve to an existing file or directory, then rewrite the request to the WordPress script." So, there goes your 404-handling right there; any and all non-existent-file requests get passed to WordPress, and unless WordPress internally defines and handles "missing pages" properly, you may never see a 404 on a WP site. (Caveat to WordPress users: I don't use WordPress, so I don't know whether or how it handles errors, and I don't mean to scare anyone with this post.)

Jim

rocknbil

6:19 pm on Aug 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Well with that ErrorDocument declaration, and no rewriting (of any kind) or scripts on the server, you will indeed get a 404 response if you request a non-existent page.

Yes, it SHOULD, and I thought it did, but some of the posts here are misleading or I've misread them (see my original post). I understood "sites with their custom error page misconfigured" to mean a misconfigured static error page somehow returning a 200.

So the original question is moot as you've already answered it, but it was how one would "correctly configure" an incorrectly configured error page. It didn't make a whole lot of sense to me since the headers are generated from the server and once you get to a static page . . it's just a page.

To answer your question: yes sir, redirects are involved, using the same !-d / !-f mod_rewrite logic. HOWEVER, those scripts are structured as follows:

- Script parses $ENV{REQUEST_URI} for a "friendly URL"

- If a corresponding item is found in the database, set the "id" variable. This allows the system to function with /friendly_url or item_id=1234. Script will continue on by its normal search mechanism.

- If NOT found, immediately print "Status: 404 Not Found" and return to a *helpful* error page that clearly states not found and offers other links, including a new search.

I thoroughly tested this with various link checkers; it's definitely printing a 404 for a non-existent item and page.

jdMorgan

7:19 pm on Aug 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> If NOT found, immediately print "Status: 404 Not Found" and return to a *helpful* error page that clearly states not found

This mechanism may be flawed then, hinging on what exactly you mean by "return to a *helpful* error page."

There can and should be no "returning" involved here. The script must output a 404-Not Found status and then either output the error page contents directly, or "#include" it from another file and output it. If you redirect to the error page or refer to it by URL instead of by filepath, then that page will be served with a 200-OK status, because a redirect terminates the current HTTP transaction and starts a new one. The redirect status will override the 404 status, tell the client to ask for the error page URL using HTTP, and terminate the current HTTP transaction. Then the client will request the error page in a new HTTP transaction, and get a 200-OK.
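To make the "output it directly" option concrete, here is a minimal sketch in Python of a CGI-style response builder. It assumes the error page is a plain HTML file readable from disk; the function name and any file path passed to it are made up for illustration.

```python
def not_found_response(error_page_path):
    """Build CGI output: a 404 Status header followed by the error page body."""
    # Read the error page by filesystem path, NOT by URL, so no second
    # HTTP transaction (and no 200 OK) is ever involved.
    with open(error_page_path, encoding="utf-8") as fh:
        body = fh.read()
    headers = (
        "Status: 404 Not Found\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        "\r\n"
    )
    return headers + body
```

A CGI script would simply write the returned string to standard output. The thing to notice is what is absent: there is no Location header, so the client never leaves the original request.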

I just get the feeling that there's something procedurally wrong with your error-handling. The above analysis/guess may not be right, but this isn't an Apache problem, it's a script problem.

Also, beware of on-line link checkers; Some show only the final request/response status. I recommend the "Live HTTP Headers" extension for Firefox/Mozilla for a clear, straightforward view of the client/server transactions.
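If a browser extension isn't handy, the same hop-by-hop view can be approximated with a short script. The following is only a sketch in Python using the standard http.client module; fetch_once and status_chain are hypothetical helper names, and the script follows redirects by hand so that every transaction's status is recorded rather than just the last one.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_once(url):
    """Issue one GET without following redirects; return (status, Location or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        resp = conn.getresponse()
        resp.read()  # drain the body before closing the connection
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def status_chain(url, fetch=fetch_once, max_hops=5):
    """Return [(url, status), ...] with one entry per HTTP transaction."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status in REDIRECT_CODES and location:
            # A redirect ends this transaction; the next request is a new one.
            url = urljoin(url, location)
        else:
            break
    return hops
```

Calling status_chain on a non-existent URL of a correctly configured site should give a single hop with status 404. A 301 or 302 hop followed by a 200 is the signature of an error page that is being reached by redirect.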

Jim

g1smd

12:20 am on Aug 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> If you set up a custom 404 using .htaccess, the server will still return a status of 200 OK, because the original status is lost in the redirect. <<

Serving the 404 error page is NOT a redirect.

If I ask for domain.com/doesnotexist.html I should be served the content of the file at /error404.html but the URL as shown in the browser should NOT change.

However, IF you specify the error page location using a domain name in the ErrorDocument directive, it will cause a 302 redirect to be issued.

Do not include a domain name in your ErrorDocument 404 directive. Specify only the local path to the file, beginning with a slash, as seen internally on the server that pulls the file.

A redirect is where the URL as seen in the browser changes, because the browser was forced to re-request a new URL. An error page is not a redirect.

g1smd

12:25 am on Aug 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If someone directly accesses domain.com/error404.html then that URL will return "200 OK" because there really is a file at that location.

In that case simply use the <meta name="robots" content="noindex"> tag to ensure that the error page itself can never be indexed.

rocknbil

9:58 pm on Aug 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> This mechanism may be flawed then, hinging on what exactly you mean by "return to a *helpful* error page."

My bad, I didn't mean return to, I mean the script continues on to output the not found information.

# CGI Status header: code and reason phrase only, no "HTTP/1.1" prefix.
print "Status: 404 Not Found\n";
print "Content-type: text/html\n\n";
print "not found, try these links or this search";

Thanks for your patience Jim, it's always helpful to just put the thoughts out in a post. :-)

Asia_Expat

8:10 am on Aug 13, 2007 (gmt 0)

10+ Year Member



May I participate in this thread... I have my own concerns and a specific issue...

I've been setting up my own custom error for some time and indeed, I think I have it working well now, except for one problem.
First, I will describe what I have done, for you to analyse...

I created the error page and named it '404.php'.
At the very top of the source for that page, I added <?php header("HTTP/1.0 404 Not Found");?>

Thereafter, I used the ErrorDocument directive in my .htaccess file, which now looks like this...


RewriteEngine on
Options +FollowSymLinks
AddType application/xhtml+xml .xhtml
AddType x-mapp-php4 .html .htm .xhtml
RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,NC]
RewriteCond %{QUERY_STRING} .
RewriteRule ^index\.php$ /this_file_does_not_exist.html? [L]

ErrorDocument 404 /404.php

Everything seems to work fine. 404 headers are created and the URLs do not change or redirect or anything like that.
However, the custom error page I created ONLY shows up if the non-existent URL is .php
If the extension is .html, or any other extension, the 404 header is still produced but not with the custom error page. The server just produces the standard server-generated error message.

Do you have any idea why this might be happening and how I can correct it?

[edited by: Asia_Expat at 8:12 am (utc) on Aug. 13, 2007]

jdMorgan

2:07 pm on Aug 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> If the extension is .html, or any other extension, the 404 header is still produced but not with the custom error page.

The problem may be that the request is for some other filetype, but is being internally rewritten (by the 404 error handler itself) to the custom 404 error page, which is PHP. The PHP error page will think that it was called with the original URL, not its own filepath. So this may cause it to malfunction.

Alternatively, it may be that the invocation of the PHP script during error-handling does not re-invoke the previously-completed phase where AddType and AddHandler are evaluated, meaning that if a page is requested with a .html extension, the error page won't be parsed for PHP code.

It is also possible that there is an error in the server configuration -- But what, I can't guess.

I have made the recommendation many times that unexpected-error-handling should be kept very simple, and that it should introduce no further dependencies into what is already a problem. For this reason, I recommend plain, static, HTML error documents with no external dependencies -- no images, no external includes, no scripts.

Jim

g1smd

9:08 pm on Aug 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> At the very top of the source for that page, I added <?php header("HTTP/1.0 404 Not Found");?> <<

>> Thereafter, I used the Errordocument directive in my htaccess file, which now looks like this... <<
>> ErrorDocument 404 /404.php <<

This generates TWO headers. The ErrorDocument directive already makes it generate the 404 header. I am not sure why you need the script to generate yet another one.

Additionally, it returns HTTP/1.0; and that may clash with the other header which I assume might return HTTP/1.1.

I am not sure of the outcome of that; but I expect that some browsers or bots may have a problem with that.

.

What is this?

RewriteRule ^index\.php$ /this_file_does_not_exist.html? [L]

I don't like it.

I redirect index file filenames to "/".

jdMorgan

9:33 pm on Aug 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The PHP status header should overwrite the server's header.

The purpose of the code rewriting to "/this_file_does_not_exist.html?" is to create a 404 if index.php is requested with any (non-blank) query string. By rewriting the requested URL-path to a known-non-existent file, a 404 response is forced.

If a 410-Gone is desired, then the "RewriteRule ^index\.php$ - [G]" construct can be used. But a 404 was what the OP requested.

Jim

g1smd

10:24 pm on Aug 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I assume that there is still a Duplicate Content problem from both "/" and "/index.php" resolving in that case?

jdMorgan

10:53 pm on Aug 13, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Possibly, but that is off-topic to the discussion of why he's having difficulty generating a correct 404 response. Such 'auxiliary subjects' are best left until the OP's main question is answered.

Jim

Asia_Expat

5:21 am on Aug 14, 2007 (gmt 0)

10+ Year Member



I used to have a redirect from index.php to / but I removed it, I can't actually remember why I removed it, it was probably interfering with something more important... and I decided that even if there was a dupe happening, it was only going to be with the index page and not the main content pages.

-------------------------

EDIT: I just added that back into the htaccess in the root and it doesn't seem to have broken anything and it now all looks like this...


RewriteEngine on
RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,NC]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/ [R=301,L]
RewriteCond %{QUERY_STRING} .
RewriteRule ^index\.php$ /this_file_does_not_exist.html? [L]

ErrorDocument 404 /404.php

However, it only redirects the main index page and is useless for all other directories, or pages that have html or xhtml extensions. I did have one in place that covered all index.php pages in sub directories but it broke the registration page of my forum. I wonder if there is some way to redirect all index pages to / by putting a specific htaccess command in each individual directory?

-------------------------

Regarding the 404, yes, I kept the page as simple as possible, there are no includes, relative URLs or javascript... but I wonder if my issue has something to do with forcing my html pages to run through the php parser? (which is what I do to make php execute on all pages, as I use a lot of includes for easy management).

(BTW thanks for your comments, I really appreciate the help and suggestions)

[edited by: Asia_Expat at 6:02 am (utc) on Aug. 14, 2007]

Asia_Expat

5:32 am on Aug 14, 2007 (gmt 0)

10+ Year Member



... incidentally, I just changed
<?php
header("HTTP/1.0 404 Not Found");
?>

to...

<?php
header("HTTP/1.1 404 Not Found");
?>

... but it had no effect.

g1smd

8:36 pm on Aug 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The index file redirect needs to go before the one that is first now.

As you have it now, there will be a redirection chain for non-www index files: one redirect to www and then another to remove the index file filename.

jdMorgan

9:34 pm on Aug 14, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> but I wonder if my issue has something to do with forcing my html pages to run through the php parser?

You could try using filetypes that are not parsed for PHP to name your custom error pages. For example, if .html files are parsed for PHP, then name your 404 page "404error.htm" or even "404.err" -- The search engines won't care, because they will never (normally) ask for your error pages by name.

If you do use an odd file extension like ".err", you'll need to add


AddType text/html .err

to .htaccess so that the server will know what Content-type header to send with a .err file's contents.

Jim

Asia_Expat

11:42 pm on Aug 14, 2007 (gmt 0)

10+ Year Member



I gave that a try... there was no effect...

However, I have confirmed the cause... It is definitely being caused by running html pages through the php parser. I removed AddType x-mapp-php4 .html .htm .xhtml from the .htaccess file and the custom 404 started appearing when it was supposed to, in place of the standard 404, for all extensions. Of course, none of my php includes worked any more, so removing that is not an option.

I guess there's no way around this problem :-(?