My rewrite rule swallows 404 errors

Redirect should return 404, but returns 302 instead

         

stevewillis

2:51 pm on Oct 4, 2005 (gmt 0)



Greetings!

I'm having a hard time with a rewrite rule. I want to redirect any URL that does not begin with a particular directory name. I am using this as my rule:


RewriteCond %{HTTP_HOST} ^.*mydomain\.com [NC]
RewriteCond %{REQUEST_URI} !^/SOME_DIR$
RewriteRule (.*) http://www.mydomain.com/SOME_DIR/ [R,L]

The result is that any URL entered that does not begin with [mydomain.com...] gets redirected to the SOME_DIR index page. The problem is that invalid URLs are now returning 302 status codes where they were returning 404 before the above rule. For example:


http://www.mydomain.com/SOME_DIR/bogus.html

now returns 302, even though it starts with SOME_DIR. I don't understand why invalid URLs that correctly match the pattern are being rewritten. I need 404 errors to work for invalid URLs that DO begin with SOME_DIR.

Thanks!

jdMorgan

4:32 pm on Oct 4, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



stevewillis,

Welcome to WebmasterWorld!

You'll need to add a check for 'file exists' then. Something like this:


RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{REQUEST_URI} !^/SOME_DIR$
RewriteCond /SOME_DIR%{REQUEST_FILENAME} -f
RewriteRule .* http://www.example.com/SOME_DIR/ [R=301,L]

In this case, the redirect is not invoked if the file /SOME_DIR/requested-local-URL-path does not exist.

File-exists checking is inefficient, and so should be done only when necessary -- as the *last* RewriteCond in this example.

I changed the redirect to a 301 to avoid the well-known problems with Google's handling of 302s, and cleaned up a few more instances of unnecessary regular-expression tokens, like ^.* and the unused back-reference, in the interest of efficiency.

In some cases, it is necessary to use the construct


RewriteCond %{DOCUMENT_ROOT}/SOME_DIR%{REQUEST_FILENAME} -f

in order for this to work. It depends on your server configuration.

If you have trouble
A major problem with this technique is that the expansion of the file-exists check is invisible. So if it doesn't work, it's hard to figure out why. There's a good possibility that I've got a slash in the wrong place, for example, in which case the resulting malformed URL-path will never exist, and the rule will always be applied.

So, for a temporary test, you can copy the path into a query string, where you can see it in your browser, in order to reveal what is being tested for existence:


RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{REQUEST_URI} !^/SOME_DIR$
RewriteCond /SOME_DIR%{REQUEST_FILENAME} -f [OR]
RewriteCond /SOME_DIR%{REQUEST_FILENAME} !-f
RewriteRule .* http://www.example.com/SOME_DIR/?tested-path=/SOME_DIR%{REQUEST_FILENAME} [R=301,L]

This will show the path being tested for file-exists, whether it exists (-f) or not (!-f). Then you can try it with and without the %{DOCUMENT_ROOT} path element prepended, and with URLs that you know do exist and URLs that you know don't exist, to see how it behaves. It's usually either a case of working right away, or being a bit of a pain to debug. Hopefully, this will make it easier for you.

For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].

Jim

stevewillis

6:36 pm on Oct 4, 2005 (gmt 0)



Hi Jim,

Thanks for the quick reply! I'm still having some problems, and perhaps you can clarify my thinking on this. Consider my original rewrite rule (which you cleaned up for me):


RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{REQUEST_URI} !^/SOME_DIR$
RewriteRule .* http://www.example.com/SOME_DIR/ [R=301,L]

You added a test for file existence to the lines above, but I don't think that is working. Using just the three lines above, shouldn't a REQUEST_URI that starts with "/SOME_DIR" never be rewritten? The problem is that a URL of this form:

http://www.example.com/SOME_DIR/bogus.htm

should return a 404 error. It shouldn't be rewritten by the three lines above, because it does start with /SOME_DIR, but should be 404 because bogus does not exist. It works this way when I remove the three lines above from my httpd.conf, but when I add them, I get 302 instead. It seems that even if a REQUEST_URI begins with /SOME_DIR, it is being rewritten to return a 302.
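
One thing worth checking here is the "$" at the end of the condition pattern. The following is a minimal sketch, using Python's re module as a stand-in for mod_rewrite's regex matching (an assumption; for a pattern this simple the two should agree), of which request paths the condition "!^/SOME_DIR$" lets through to the RewriteRule. The helper name rule_applies and the PREFIX pattern are made up for illustration and are not from the posted configuration.

```python
import re

# Pattern copied from the posted RewriteCond, without the leading "!".
POSTED = re.compile(r"^/SOME_DIR$")
# Hypothetical alternative: treat /SOME_DIR as a path prefix instead.
PREFIX = re.compile(r"^/SOME_DIR(/|$)")

def rule_applies(pattern, request_uri):
    """Mimic a negated RewriteCond: true when the pattern does NOT match."""
    return pattern.search(request_uri) is None

for uri in ["/SOME_DIR", "/SOME_DIR/", "/SOME_DIR/bogus.htm", "/other"]:
    print(uri, rule_applies(POSTED, uri), rule_applies(PREFIX, uri))
```

As written, the end-anchored pattern exempts only the exact path /SOME_DIR, so /SOME_DIR/bogus.htm satisfies the negated condition and would be redirected. Whether this is what is happening on the server in question cannot be confirmed from the posted rules alone.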

I actually only need to catch one particular 404 (I should have mentioned this in my original message, but didn't want to complicate things.) The three lines above do exactly what I want, but they break a Java applet on my site. Because of a bug in Java 1.5, the applet will always send a request for:


http://www.example.com/SOME_DIR/META-INF/services/javax.xml.parsers.DocumentBuilderFactory

This request must result in a 404 error. If it results in some other HTTP code, it tries to load the page returned as though it were Java code and, of course, fails. If there is some way to modify my rules above so that any request for that particular document returns a 404, it would fix all my problems. I tried this, but it doesn't seem to work:

RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{REQUEST_URI} !^/SOME_DIR$
RewriteCond %{REQUEST_URI} !^DocumentBuilderFactory$
RewriteRule .* http://www.example.com/SOME_DIR/ [R=301,L]

I would think that the above code would not rewrite any URL that contains "DocumentBuilderFactory", but, here I am! If there is an explicit way to force any URL that contains "DocumentBuilderFactory" to return a 404, that would be perfect.

By the way, I'd like to thank you for all your efforts on this site. I know that you hear only the problems people are having most of the time. I'd like to say that lurking on your forums has helped me solve all the many other rewrite rule questions I've had. It is much appreciated!

[Edited by jdMorgan at 8:02 pm (utc) on Oct. 4, 2005. Edit reason: Example.com]

jdMorgan

8:00 pm on Oct 4, 2005 (gmt 0)




Because your URL

http://www.example.com/SOME_DIR/META-INF/services/javax.xml.parsers.DocumentBuilderFactory

has a local URL-path that starts with "SOME_DIR" it should not be affected by your code. Adding the special exclusion is not necessary.

However, if it *were* necessary, your pattern would not work because you have start-anchored it with "^". Since the REQUEST_URI in this case would be "/SOME_DIR/META-INF/services/javax.xml.parsers.DocumentBuilderFactory" it would not *start* with "DocumentBuilderFactory" and so the start-anchored pattern would never match.
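
The anchoring point can be checked in isolation. This is a small sketch, with Python's re module standing in for mod_rewrite's regex matching (an assumption; the patterns are simple enough that the behaviour should be the same), comparing the anchored pattern with an unanchored one against the REQUEST_URI quoted above.

```python
import re

uri = "/SOME_DIR/META-INF/services/javax.xml.parsers.DocumentBuilderFactory"

# Anchored at both ends: matches only a URI that is exactly this one word.
anchored = re.search(r"^DocumentBuilderFactory$", uri)
# Unanchored: matches the word anywhere in the URI.
unanchored = re.search(r"DocumentBuilderFactory", uri)

print("anchored matches:", anchored is not None)
print("unanchored matches:", unanchored is not None)
```

Because the anchored pattern never matches, the negated condition "!^DocumentBuilderFactory$" is true for every request and so never blocks the rule.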

Because you are seeing 'strange' behaviour associated with that special URL, I suspect that it has been aliased or proxied to another URL, and that a 302 redirect is being applied by *some other code*. If it was being affected by the code I posted, you'd be seeing a 301, not a 302.

As a test, you could try rewriting that request to a known-non-existent URL-path, thus creating a 404:


RewriteRule ^/?SOME_DIR/META-INF/services/javax\.xml\.parsers\.DocumentBuilderFactory$ /abc_123_this_here_file_will_never_be.html [L]

The other possibility is that the applet is making an internal file request, not an HTTP request. I don't know. But in that case, no .htaccess code will have any effect, because .htaccess only applies to HTTP requests.

Oh, and be sure to flush your browser cache before testing any change to access-control code.

Jim

stevewillis

8:40 pm on Oct 4, 2005 (gmt 0)



Hi Jim,

Again, thanks!

> Because your URL [...] has a local URL-path that starts with "SOME_DIR" it should not be affected by your code. Adding the special exclusion is not necessary.

I totally agree, which is why I am so confused. The fact is, when I comment out all three lines of my code, accessing my URL returns a 404 error as it should, but with it, it is 30X. This URL definitely does not exist:

http://www.example.com/SOME_DIR/META-INF/services/javax.xml.parsers.DocumentBuilderFactory

> If it was being affected by the code I posted, you'd be seeing a 301, not a 302.

Sorry...I'm still testing with 302 so I don't have to clear my browser cache every time I make a mistake. I intend to change to 301 when everything is working. Please consider all mention of "302" in my previous posts to be "301".

> The other possibility is that the applet is making an internal file request, not an HTTP request.

No, the request shows up in my Apache access_log, so it is an actual HTTP request and not a local file access request. With my code it appears as a 30X, and 404 without it.

By the way, I am not putting any of this code in an .htaccess file. It is directly in the VirtualHost directive of my httpd.conf file, where all my other (working) rewrites are. I wouldn't think this would matter, though.

I'm really glad that you think this should work the way I think it should. At least that means I have a real problem, and not just a stupid regular expression error! If I could impose to ask you just one other question: since we both think that the code above shouldn't make a difference for requests starting with "/SOME_DIR", but it does, is there any "quick fix" way to return a 404 status any time a particular URL is requested? Again, I don't need 404 to work on any other URLs...just this one in particular, so my applets stop freaking out.

stevewillis

8:47 pm on Oct 4, 2005 (gmt 0)



Oh, I forgot to address this point you made:

> Because you are seeing 'strange' behaviour associated with that special URL, I suspect that it has been aliased or proxied to another URL, and that a 302 redirect is being applied by *some other code*.

This behaviour is not unique to this one URL. If I comment out my code, typing in any random, non-existent resource path returns a 404 properly:

http://www.example.com/SOME_DIR/blah_blah_blah.htm

...etc.

If I uncomment my code, the above (completely bogus) URL redirects to:

http://www.example.com/SOME_DIR/

...which is fine, except for one particular request, which MUST return a 404 when it doesn't exist because of a Java bug. That's the only thing that makes this particular URL special...a human can just be redirected to my main page when they type an invalid location, but that one nonexistent URL has to return 404. That's why I don't care whether the solution explicitly returns a 404 for that one URL, or whether all nonexistent resources return 404 properly.

jdMorgan

9:32 pm on Oct 4, 2005 (gmt 0)




Well, just brute-force it then:

RewriteCond %{REQUEST_URI} !DocumentBuilderFactory
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{REQUEST_URI} !^/SOME_DIR$
RewriteRule .* http://www.example.com/SOME_DIR/ [R=301,L]

If this does not work, then that URL is being redirected, somehow, somewhere, before this code gets control.

Jim

stevewillis

3:59 pm on Oct 5, 2005 (gmt 0)



Hi Jim,

Still no dice. Because there is no other rewrite code (my httpd.conf is about as plain as it gets), I'm not sure what's going on. I have a new theory, though. Is it possible that the original rewrite rule is working correctly, but because the error documents do not start with SOME_DIR, they are being rewritten? Here's how I'm picturing the flow:

(1) A user requests http://www.example.com/SOME_DIR/bogus.htm, which does not exist.

(2) The request begins with /SOME_DIR, so the rewrite rule does not apply.

(3) The 404 error page is fetched, but the URL for that page does not begin with /SOME_DIR, so the rewrite rule catches it and rewrites it to a 30X.

I think I'm probably wrong about this, because then the 30X would probably be rewritten in the same way and cause a loop. I'm grasping at straws, though. I can't understand why the rule is redirecting /SOME_DIR/nonexistent.htm to /SOME_DIR...it should leave it alone!

Thanks for all your help on this.

jdMorgan

4:13 pm on Oct 5, 2005 (gmt 0)




Alias or ScriptAlias directives can also do 'redirects' of a sort.

If you suspect your ErrorDocuments are being rewritten, add one or more RewriteConds to exclude them.

The quickest way to master this subject is to experiment and make lots of mistakes, and get lots of 500-Server Error responses... and then fix them. :)

Once you've got it working, then a quick review to optimize the code is all that'll be needed.

Jim

stevewillis

6:07 pm on Oct 5, 2005 (gmt 0)



I think I just have to accept that this won't work. There are only the two standard "Alias" directives in my httpd.conf--one for icons, and one for error documents. I don't think I was correct about the error documents causing a 30X rewrite...there is not a separate HTTP request generated in the log, they are not custom, and in any case the rule excluding them didn't work. I think it is safe to say that there is no other code redirecting non-existent documents to 30X; my server does not have any ScriptAlias directives, does not allow any .htaccess files, and when my rules are removed, 404 errors are generated correctly.

Maybe you can suggest an alternative solution to the problem I originally intended to solve. I have a Web site at:

http://www.example.com/SOME_DIR/

This site exists entirely within SOME_DIR. The URL is printed in a book, so I can't move it around (I can't simply bump SOME_DIR's contents up to the root folder.)

Here is what I want to accomplish:

(1) There is no index page at http://www.example.com, so I want requests for that URL to be redirected permanently to http://www.example.com/SOME_DIR/

(2) People keep finding new ways to misspell SOME_DIR, or use mixed case letters (SoME_DiR), etc. I was using symbolic links to handle the most common misspellings, but I'd like to have any request that does not start with SOME_DIR to be redirected permanently.

(3) I need 404 errors to be correctly generated for resources that do not exist under SOME_DIR even after the solutions for #1 and #2 are applied.

I thought that the rules I proposed would solve my problem. They did solve #1 and #2, but created a problem for #3. Is there a better way to do this?

Thanks!

stevewillis

7:15 pm on Oct 5, 2005 (gmt 0)



I'll post my results in case someone else might be looking for the same solution. Here's what I ended up doing. I did move the entire site into the root directory, so now the site is accessible at:

http://www.example.com

Then I wrote a rewrite ruleset that handles the most common misspellings of the subdirectory and redirects the request to the root directory. Up to now, I've been using SOME_DIR in my posts, but I'll need to be more specific here so my rule makes sense. The subdirectory is named "CE06"...that's C-E-Zero-Six. People frequently misspell this as C-E-O-Six (letter Oh substituted for zero), or use lower case letters. Further, this URL was misprinted as CT06 at one source. Here are my rules:


RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteCond %{REQUEST_URI} ^/C[ET][0O]6(.*)$ [NC]
RewriteRule .* http://www.example.com%1 [R=301,L]

There are many negative aspects to this solution. My users are all typing in this URL from a printed book, so all of them have to undergo at least one redirection. Also, this solution only handles the most common misspellings of the subdirectory. But, it does work, and 404 errors aren't being mysteriously derailed by my rules, so I'll stick with it.
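
For anyone adapting this, here is a rough sketch of which type-ins the condition pattern catches and what the redirect target would look like. It uses Python's re module as a stand-in for mod_rewrite's regex matching (an assumption), with re.IGNORECASE playing the role of [NC]. The function name redirect_target and the sample paths are invented for illustration.

```python
import re

# Condition pattern from the ruleset; re.IGNORECASE stands in for [NC].
COND = re.compile(r"^/C[ET][0O]6(.*)$", re.IGNORECASE)

def redirect_target(request_uri):
    """Return the redirect URL the rule would build, or None if no match."""
    m = COND.search(request_uri)
    if m is None:
        return None
    # %1 in the RewriteRule is the first group captured by the RewriteCond.
    return "http://www.example.com" + m.group(1)

for uri in ["/CE06/index.html", "/ceo6/notes.htm", "/CT06", "/index.html"]:
    print(uri, "->", redirect_target(uri))
```

Requests that already point at the root, such as /index.html, do not match and are left alone, which is why 404s for missing files there are no longer affected.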

Thanks, Jim, for all your help!

jdMorgan

7:39 pm on Oct 5, 2005 (gmt 0)




If these are all type-ins, and you're not trying to 'correct' old/incorrect search engine listings, then there's no reason to do an external redirect. Instead, internally rewrite the request to the proper URL-path:

RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteRule ^/C[ET][0O]6(.*)$ /$1 [NC,L]

With this approach, there is no 'penalty' except for the processing overhead of the rules themselves.

Even if you do keep the redirect, note that the pattern (and [NC] flag) can be moved into RewriteRule itself, eliminating the second RewriteCond.
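
To see what the internal rewrite produces for a given type-in, here is a minimal sketch that applies the rule's pattern and its "/$1" substitution in Python. The re module is only a stand-in for mod_rewrite's regex matching (an assumption), and the VARIANT pattern, the function name and the sample paths are hypothetical additions for comparison.

```python
import re

# Pattern from the RewriteRule; re.IGNORECASE stands in for [NC].
AS_POSTED = re.compile(r"^/C[ET][0O]6(.*)$", re.IGNORECASE)
# Hypothetical variant: consume an optional slash before the captured group.
VARIANT = re.compile(r"^/C[ET][0O]6/?(.*)$", re.IGNORECASE)

def internal_rewrite(pattern, url_path):
    """Apply the substitution "/$1"; paths that do not match pass through."""
    m = pattern.search(url_path)
    return "/" + m.group(1) if m else url_path

for path in ["/CE06/index.html", "/ct06", "/about.html"]:
    print(path, "->", internal_rewrite(AS_POSTED, path),
          "|", internal_rewrite(VARIANT, path))
```

With the pattern as posted, the captured group already begins with a slash, so the rewritten path starts with two slashes. Adding "/?" before the group, as in the variant, is one way to avoid that.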

Jim

stevewillis

7:48 pm on Oct 5, 2005 (gmt 0)



Wow, thanks, that is a good tip! I'm not trying to correct a search engine listing, and all the hits will be type-ins (the site is supplementary material for a published textbook.)

Best regards,

Steve