black magic and subdirectories

         

launcelot

7:39 am on Jan 4, 2008 (gmt 0)

10+ Year Member



One question, probably not a simple one: how exactly do rewrite rules extend to subdirectories of the directory where they are applied, when the regular expressions are such that they also match URLs pointing into those subdirectories?

I've found this in the Apache 2.0 docs:

Note that, by default, rewrite configurations are not inherited. This means that you need to have a RewriteEngine on directive for each virtual host in which you wish to use it.

This is quite vague, and it does not say what happens to subdirectories.

Here's food for thought. I'm mostly here to share a couple of interesting phenomena I have witnessed, and maybe to get a few clues. I don't know whether anyone can solve my problem head-on, because what I'm doing is quite complicated and intricate to begin with. Here are the rewrite rules I have put inside the <Directory> section for my document root:

RewriteEngine on
RewriteRule ^constructeur.php$ constructeur.php [L]
RewriteRule ^(.+)/$ $1
RewriteRule ^.+/(.+)/(.+)\.gif$ $1/$2.gif [L]
RewriteCond %{QUERY_STRING} "^(.+)"
RewriteRule ^(.+)\..+$ constructeur.php?build=$1&%1 [L]
RewriteCond %{QUERY_STRING} "^(.+)"
RewriteRule ^(.+)$ constructeur.php?build=$1&%1 [L]
RewriteRule ^(.+)\..+$ constructeur.php?build=$1 [L]
RewriteRule ^(.+)$ constructeur.php?build=$1 [L]
RewriteRule ^constructeur.php$ constructeur.php [L]

The two rules preceded by a RewriteCond are the core of the concept; the others are there to counter various undesirable effects, with more or less success. In case it matters, I'm working directly in the conf file, not in a .htaccess file.

It works about as expected and intended, except in one case: when the URL requested by the client is an actual subdirectory of my document root, without a trailing slash or any other extra text. Say I have a subdirectory named abcd in my document root. Typing www.mydomain.com/abcd into the browser results in the rewrites happening and the page constructeur.php being loaded, but also in the URL shown in the browser's address bar being changed, which is never supposed to happen! Some text gets appended, and the address ends up as www.mydomain.com/abcd/?build=abcd

On the other hand, if I type a slash after abcd in the first place, things happen normally, just as you would expect. And if I type, for example, www.mydomain.com/abcd/foo.html?x=666, I end up requesting constructeur.php, and the variables build=abcd/foo and x=666 are passed to my constructeur.php script via GET. This is the intended effect, and for the most part it is what I want. Some of you may have realized by now that if I have a script named foo.php that uses a variable x, in a subfolder called abcd of my document root, I can pull it in dynamically through an include in constructeur.php, and I can also use the string stored in the variable build for some extra dynamic effects in constructeur.php. That is the basic idea behind this system. I could just as well request, say, tellmeastory/wishmemybirthday/abcd/foo.php.

Except that, first, I also want to have subdirectories that are not subject to the above set of rewrite rules, and second, I want to understand what's happening. So I've more or less gone by trial and error, and I have discovered that I can protect a subdirectory by adding a rewrite rule to it. At first I was adding rules such as ^(.+)$ _ , but then I noticed that such subdirectory rules never appeared in my rewrite log (I understood the reason later thanks to the clue in the 2.0 docs; see the end of this paragraph). In fact, and this is where things get really black-magicky, the rule itself is unimportant; all that matters is that there is a rule. In other words, you can put in any rule you want, and even if the server never uses it, it protects your subdirectory from being caught by the rewrite rules of the document root. As I write this, I'm using the following rule to protect one of my subdirectories: RewriteRule ^qsdsfds$ qqsdqf. Your subdirectory is even protected if its rewrite rule is inactive because RewriteEngine on has not been set there.

It also works if at some point the subdirectory is reached by a request after an internal redirect, which is of great use in conjunction with my second rule (the one for the .gif files), but I have no time to bore you guys with too much detail (especially since I'm already boring you quite a bit and it's getting quite a bit late in France).

So, okay, I seem to understand that the RewriteEngine directive is not inherited, and that RewriteRule directives are always inherited when there is no RewriteRule directive in the child directory. What happens when there are rewrite rules in both the parent and the child is unclear to me. I've also seen some strange behavior, with "pass through" written in the rewrite log, where the rules of both the parent and the child directory were applied, though I can't reproduce it now. What about other mod_rewrite directives? Are there other Apache directives that work like this? I know Allow and Deny do not work this way: I tried, and it appeared that an Allow in a child directory does not cancel an Allow in a parent. The Allow in the parent is only neutralized by a Deny or Order that contradicts it. Is there a place where all this override and inherit behavior is explained in detail for all directives?
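
For reference, a minimal sketch of how a subdirectory can be exempted explicitly instead of relying on a dummy rule. The filesystem paths are hypothetical, and the guard rule for constructeur.php is an addition that is not in the rule set above; RewriteEngine Off itself is a documented mod_rewrite setting.

```apache
# Hypothetical layout: document root at /var/www/html
<Directory "/var/www/html">
    RewriteEngine On
    # Leave the front controller alone ("-" means: no substitution)
    RewriteRule ^constructeur\.php$ - [L]
    # Send everything else to the script
    RewriteRule ^(.+)$ constructeur.php?build=$1 [L]
</Directory>

# Exempt one subtree from rewriting, stated explicitly
<Directory "/var/www/html/abcd">
    RewriteEngine Off
</Directory>
```

For the opposite case, where a child section has rules of its own but should also keep the parent's rules, mod_rewrite provides RewriteOptions inherit. How the inherited patterns and substitutions resolve relative to the child directory is best verified in the rewrite log before relying on it.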

When there are sub-subdirectories, strange new interplays seem to happen. Or maybe I'm just too tired, because right now I can't reproduce them either. I'm almost scared of what's going to happen when I throw virtual hosts into the equation (because I want to).

[edited by: launcelot at 7:48 am (utc) on Jan. 4, 2008]

launcelot

7:40 am on Jan 4, 2008 (gmt 0)

10+ Year Member



Sorry about the line breaks. Just a silly mistake on my part...

Fixed it.

jdMorgan

3:20 pm on Jan 4, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Say I have a subdirectory named abcd at my document root

OK, that will invoke only these two rules:


RewriteRule ^(.+)$ constructeur.php?build=$1 [L]
RewriteRule ^constructeur.php$ constructeur.php [L]

The request will first be rewritten to constructeur.php?build=abcd by the first rule shown above, and mod_rewrite processing will stop. However, because the URL-path has been modified, the server will re-invoke mod_rewrite and several other modules so that access controls on the new URL-path can be checked.

Contrary to popular belief, in .htaccess specifically, the [L] flag does not terminate mod_rewrite processing for the current HTTP request. It simply terminates the current mod_rewrite parser pass through the .htaccess file. If any URL-path changes have been made, then parsing will be restarted as described, until no more URL-path changes are detected.

When mod_rewrite is reinvoked, the second rule shown above will be invoked, and will rewrite constructeur.php to itself, leaving any query string as it was. Therefore, this rule does nothing, except to create an 'infinite' rewriting loop! This is probably not what you intended to do here. However, because this rule does not change the URL-path, I cannot tell what was intended, and so cannot suggest any improvement.

In order to prevent problems caused by the first rule shown here, you will have to decide how to tell the server which URL-paths should be rewritten by this first rule, and which are slashless directory paths which should not be rewritten. How is the server to know the difference? There are several ways, and I'll list them in most-efficient to least-efficient order:

Using RewriteCond:
1. Detect URL-paths with extensions such as ".gif". These are not slashless directories, so do not rewrite.
2. Do a "-d" check on %{REQUEST_FILENAME}. If the URL-path resolves to a directory, do not rewrite it.
3. Create a list of all paths which should not be rewritten by this rule. This may create a long-term maintenance problem if the list changes.

In general, the first method is far more efficient than the other two. File-exists checks require a call to the file management system --and possibly may even require a disk read-- making this method CPU-intensive. The last method, as described, can become a maintenance nightmare. My advice? If it is a directory, link to it with a slash. If there is no slash, expect it to be treated as a file. Write your rules accordingly.
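
The first two approaches might look roughly like this inside the per-directory rule set. This is only a sketch: the list of extensions is an assumption, the guard rule for constructeur.php is added to avoid rewriting the script to itself, and the QSA flag (which appends the original query string) stands in for the paired RewriteCond %{QUERY_STRING} rules from the original post.

```apache
RewriteEngine On

# Leave the front controller itself alone ("-" means: no substitution)
RewriteRule ^constructeur\.php$ - [L]

# Approach 1: cheap pattern test, no filesystem access.
# Requests ending in one of these (assumed) extensions are not rewritten.
RewriteCond %{REQUEST_URI} !\.(gif|jpe?g|png|css|js)$
# Approach 2: filesystem test. Skip anything that resolves to a real directory.
RewriteCond %{REQUEST_FILENAME} !-d
# Both conditions must hold (RewriteCond lines are ANDed by default)
RewriteRule ^(.+)$ constructeur.php?build=$1 [QSA,L]
```

In practice one condition or the other may be enough. Note that this sketch passes the full path, extension included, in build, unlike the original rules that strip the extension.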

Jim

launcelot

7:18 pm on Jan 4, 2008 (gmt 0)

10+ Year Member



Thanks for the quick answer. I will probably use a mix of your first and last methods.

I cannot flat-out prevent all rewriting of .gif files. Or at least, not yet. Why? Ultimately, this rewrite strategy of mine is intended to let humans know that they have a dynamic site to which they can issue various commands, while persuading robots that they are simply surfing a good old static site.

You see, say my page contains a link to some image "foo.gif" which is located in the "images" folder. In constructeur.php the path of the link is, of course, relative. Now, say a user requests the page "wishmemybirthday/tellmeastory," and foo.gif needs to be loaded on the page: the image URI will be wishmemybirthday/tellmeastory/images/foo.gif. I need this address rewritten or the image won't display. That's where my second rule kicks in:

RewriteRule ^.+/(.+)/(.+)\.gif$ $1/$2.gif

After this rule has been applied, I can tell Apache not to rewrite .gif files anymore, as you suggest. But maybe I should simply do as you suggest and leave the job of removing the unwanted parts of the URI path to the PHP script... Yes, that's probably better, especially since I'm already doing this for my hypertext links; there's nothing against doing the same for image URIs.

Tonight I will try to get rid of this annoying appending of the dynamic part of the rewritten URL to the requested URL in the browser's address bar, by setting DirectorySlash Off in the directory where my set of rewrite rules lives. I really think this is the core of the problem. I will re-enable the directive in those subdirectories that are not rewritten, and I will add the slash "manually" with a rewrite rule for the case where a subdirectory of the document root is itself requested.
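
A minimal sketch of that plan, with hypothetical filesystem paths and a hypothetical non-rewritten directory named static. DirectorySlash is a mod_dir directive; the redirect rule for abcd is only an assumption about how the slash would be added by hand.

```apache
<Directory "/var/www/html">
    # Stop mod_dir from redirecting /abcd to /abcd/ on its own
    DirectorySlash Off

    RewriteEngine On
    # Add the trailing slash "manually" with an external redirect
    RewriteRule ^abcd$ /abcd/ [R=301,L]
</Directory>

# Restore the default behavior where no rewriting takes place
<Directory "/var/www/html/static">
    DirectorySlash On
    RewriteEngine Off
</Directory>
```

The mod_dir documentation warns that turning the slash redirect off can expose directory listings when Options +Indexes is in effect, so that setting is worth checking as well.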

This seems to lead us naturally to the maintenance nightmare you mention. Well... my server will never become that huge; after all, it's only me, my home computer and my personal Internet connection, plus maybe a few little spaces for online and/or IRL friends. And I can keep the number of subdirectories low, fix the list of subdirectories once and for all, decide which ones will and will not be rewritten (as well as their contents), and then rely on sub-subdirectories that will be protected (or not) against the rewrite by virtue of being children of the appropriate subdirectory. The idea is that directories containing PHP files for includes get rewritten, while directories holding content that is not meant to be displayed as part of the main site page are not rewritten, at least not before their directory section or .htaccess file gets its turn to play in the little game of shortest match goes first.

BTW, you have probably noticed that my last rule is absolutely useless (at least I think so). Oh well, this whole thing is still in the course of being developed.

QUOTE > File-exists checks require a call to the file management system --and possibly may even require a disk read-- making this method CPU-intensive.

A reasonable but imperfect solution would be to end directory names with a given string, like, say, "-dir." The image directory would be images-dir. And then something like:

^(.+)-dir _ [L]

However, I would prefer not to use this trick if I can avoid it; it hurts URL aesthetics. I'll keep you posted on my progress.

jdMorgan

8:21 pm on Jan 4, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> A reasonable but imperfect solution would be to end directory names with a given string, like, say, "-dir." The image directory would be images-dir. And then something like:

A more reasonable solution --and one that complies with best practices and the HTTP specification-- is to end directory paths with a slash, and files with no slash... :) I believe you will find that it will save you a lot of unnecessary work. This takes no more effort than simply using consistent links on your pages.

Image files (and others) that are centrally-located and 'shared' can be 'normalized' from any subdirectory to a common directory path with something like this:


RewriteCond $1 !^images/$
RewriteRule ^([^/]*/)*([^./]+)\.gif$ /images/$2.gif [L]

$1 'consumes' any leading path information, while $2 contains no slashes or periods (full stops) and therefore will contain only the filename. The RewriteCond prevents a potential 'infinite loop.'
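
One detail that may be worth testing in the rewrite log: with a repeated group such as ([^/]*/)*, a backreference normally holds only the last repetition, so for a request like a/images/foo.gif the value of $1 would be images/ and the condition would block the rewrite. Below is a variant sketch that captures the whole leading path in $1 instead. It is an assumption about the intent, not something confirmed in this thread.

```apache
# $1 = entire leading path (may be empty), $2 = bare filename
# Skip only requests that already point at images/ directly under this directory
RewriteCond $1 !^images/$
RewriteRule ^(.*/)?([^./]+)\.gif$ /images/$2.gif [L]
```

With this pattern, wishmemybirthday/tellmeastory/images/foo.gif and foo.gif would both be sent to /images/foo.gif, while images/foo.gif is left untouched.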

Jim

launcelot

11:56 am on Jan 5, 2008 (gmt 0)

10+ Year Member



After testing, it appears that DirectorySlash Off does remove my problem of www.mydomain.com/abcd being rewritten to www.mydomain.com/abcd/?build=abcd.

QUOTE: >
RewriteCond $1 !^images/$
RewriteRule ^([^/]*/)*([^./]+)\.gif$ /images/$2.gif [L]

> $1 'consumes' any leading path information, while $2 contains no slashes or periods (full stops) and therefore will contain only the filename. The RewriteCond prevents a potential 'infinite loop.'

That's really interesting, thanks, albeit not exactly usable as is in my setting. I will certainly use an adaptation of this regular expression.

launcelot

11:59 am on Jan 5, 2008 (gmt 0)

10+ Year Member



QUOTE: > A more reasonable solution --and one that complies with best practices and the HTTP specification-- is to end directory paths with a slash, and files with no slash... :) I believe you will find that it will save you a lot of unnecessary work. This takes no more effort than simply using consistent links on your pages.

I'm not sure I understand what you mean. When designing a web page, I add a slash at the end of directory paths, and no slash at the end of a file path (of course).

But it's not possible to control what visitors do when they type an address into their browser.