mod_rewrite hell

replacing unlimited numbers of characters in a url

         

grebo444

4:22 pm on Jul 12, 2005 (gmt 0)

10+ Year Member



Hi,

Firstly, sorry for asking such dumb questions! I've tried, really tried but certain aspects of this are causing me major headaches and I'm sure there's a really simple answer!

I'm trying to write some rules which will pull out '_Q_', '_E_', and '_A_' from a URL and replace them with '?', '=' and '&' respectively, without knowing in advance how many times these patterns may appear in the URL. Here's what I have at the moment - all rewrite options are within a virtual host directive:

RewriteRule ^/(.*)_E_(.*) [sunworld-preview_s.rwa-net.co.uk...]
RewriteRule ^/(.*)_A_(.*) [sunworld-preview_s.rwa-net.co.uk...]
RewriteRule ^/(.*)_Q_(.*) [sunworld-preview_s.rwa-net.co.uk...]

These rules work, sort of, and URLs such as:

http://example.com_Q_param_E_56_A_param2_E_43

are successfully converted to the correct:

http://example.com?param=56&param2=43

The major problem is that Apache matches a single instance of a pattern and then redirects the URL, matches another instance on the next request, and so on - this results in about 20 requests for a single page! (Which I presume will drive search engines crazy!)

I have a few questions therefore:

1. If I write this rule (and most others I've written) without specifying the domain portion of the new URL, the redirect fails - what determines whether you need to specify a full URL or just a path?

2. Is there any way I can get Apache to match these patterns multiple times, and therefore not have all these extra requests going round in a loop?

3. Assuming I use something like this:
RewriteRule ^([^_]*)_([^_]*)_([^_]*)_([^_]*)_(.*)$ http://www.example.com/$1-$2-$3-$4-$5

I presume that gets around my initial problem (partly, I'd still need three requests to match all three patterns) - would this be the way to go, and have multiple rules to match all possible numbers of the patterns?

I'd appreciate some advice from experts/experienced people as to which is the best way to go about this and how I can make life easier for myself :-)

Thanks in advance,
Matt

jdMorgan

5:30 pm on Jul 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Matt,

Welcome to WebmasterWorld!

Let me preface this by saying you might be far better off rewriting (not redirecting) these URL requests to a script, which could then replace the character sequences, output a 301 redirect response header, and so redirect the client.

The advantage is that most scripting languages would be far more efficient in doing multiple replacements -- See php preg_replace, for example.
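
For illustration, here is a minimal Python sketch of the script-side approach described above (Python is swapped in for the PHP preg_replace mentioned, purely as an example). The function names decode_path and redirect_for and the base host default are hypothetical; only the token map comes from the original question. It replaces every token in a single pass and reports a 301 target only when something actually changed.

```python
import re
from typing import Optional, Tuple

# Token-to-character map from the original question:
# _Q_ -> ?, _E_ -> =, _A_ -> &
TOKENS = {"_Q_": "?", "_E_": "=", "_A_": "&"}
TOKEN_RE = re.compile(r"_[QEA]_")


def decode_path(path: str) -> Tuple[str, bool]:
    """Replace every token in one pass; return (decoded_path, changed)."""
    decoded = TOKEN_RE.sub(lambda m: TOKENS[m.group(0)], path)
    return decoded, decoded != path


def redirect_for(path: str, base: str = "http://example.co.uk") -> Optional[Tuple[int, str]]:
    """Return (301, target_url) if the path needed decoding, else None.

    The base host is an assumption taken from the example rules in this thread.
    """
    decoded, changed = decode_path(path)
    if not changed:
        return None
    return 301, base + decoded
```

A real script would still need to be wired to the web server and emit the Location header itself; this only shows the replacement step and the "redirect only if changed" decision.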

That said, using mod_rewrite, your problem has two levels. First you need to replace one or more instances of each character sequence, then check whether any more character sequences need to be replaced. Only once this procedure has run and no more sequences remain do you send a redirect response to the client and give it the new URL. This is the general case; I'm not sure whether, for example, you might have multiple instances of "_E_" in a request.

So the trick is to pass the information that a character sequence has been replaced on to the rule that does the redirect. This is required so that the redirect only happens if a character sequence has been replaced, and so that only one redirect is needed after all character sequences are replaced.
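
The difference between "redirect on every match" and "replace everything, then redirect once" can be sketched in Python. This is only a model of the control flow, not of mod_rewrite itself: it treats the URL as one plain string and ignores how Apache splits off the query string at "?". The function names are hypothetical, and the redir variable merely stands in for the flag described above.

```python
import re

# The three replacements from the thread, written as Python regexes.
# Greedy (.*) means each application replaces the LAST occurrence of a token.
RULES = [
    (re.compile(r"^/(.*)_E_(.*)$"), r"/\1=\2"),
    (re.compile(r"^/(.*)_A_(.*)$"), r"/\1&\2"),
    (re.compile(r"^/(.*)_Q_(.*)$"), r"/\1?\2"),
]


def redirect_per_match(path):
    """Every matching rule answers with its own redirect (one hop per token)."""
    hops = 0
    while True:
        for pattern, template in RULES:
            new_path, n = pattern.subn(template, path)
            if n:
                path, hops = new_path, hops + 1
                break  # the client is redirected and requests the new URL
        else:
            return path, hops  # no rule matched: nothing left to replace


def replace_then_redirect(path):
    """Rewrite internally until nothing is left, then redirect at most once."""
    redir = False  # plays the role of the "redir" flag
    changed = True
    while changed:
        changed = False
        for pattern, template in RULES:
            new_path, n = pattern.subn(template, path)
            if n:
                path, redir, changed = new_path, True, True
    return path, (1 if redir else 0)
```

For a path such as "/page_Q_param_E_56_A_param2_E_43", the first function takes one hop per token, while the second arrives at the same final path with a single redirect.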

If you have only one instance, then something like the following should work:


# Replace characters, set redirect required flag if characters are replaced
RewriteRule ^/(.*)_E_(.*)$ /$1=$2 [E=redir:yes]
RewriteRule ^/(.*)_A_(.*)$ /$1&$2 [E=redir:yes]
RewriteRule ^/(.*)_Q_(.*)$ /$1?$2 [E=redir:yes]
# Redirect only if redir flag is set
RewriteCond %{ENV:redir} ^yes$
RewriteRule (.*) http://example.co.uk/$1 [R=301,L]

If you have multiple occurrences of "_E_" or "_A_" then we'll have to discuss the use of RewriteRule's [N] flag. :)

The variable name "redir" is arbitrary; you can call it anything you like as long as there's no conflict with a system variable name or another user-defined variable name.

If you know anything about the requested URLs that can be used to limit the scope of these rules, you should add that information to the RewriteRules. For example, if only pages of type "xyz" are subject to needing this character sequence replacement treatment, then the pattern in the final rule should be "([^.]+\.xyz)". Similarly, you can scope the rules on subdirectory or filename, too. Basically, you want to avoid the ".*" pattern and prevent these rules from running on every HTTP request if possible.

Jim

grebo444

7:32 am on Jul 13, 2005 (gmt 0)

10+ Year Member



Hi Jim,

Thanks for that!

The only problem is the one you mention at the end, in that there will/may be more than one instance of any of these patterns in the URL.

I've tried using the [N] flag, but it made no difference to the behaviour. Maybe that's because I need to use it in conjunction with this redirect flag?

Matt

grebo444

3:58 pm on Jul 13, 2005 (gmt 0)

10+ Year Member



Hi again Jim,

Sussed it! I'm using:

# Replace characters, set redirect required flag if characters are replaced
RewriteRule ^/(.*)_E_(.*)$ /$1=$2 [N,E=redir:yes]
RewriteRule ^/(.*)_A_(.*)$ /$1&$2 [N,E=redir:yes]
RewriteRule ^/(.*)_Q_(.*)$ /$1?$2 [N,E=redir:yes]
# Redirect only if redir flag is set
RewriteCond %{ENV:redir} ^yes$
RewriteRule (.*) http://example.co.uk/$1 [R=301,L]

Which does all the rewriting in one go and then sends a redirect to the client :-)

One remaining question - the response header contains the converted url, along with a 301 code. Is there any way to NOT send back the new URL, and have the client think they're still using the mangled URL? I'm thinking of this for search engines which don't like equals signs and stuff in URLs, so I want to keep them thinking they're using _E_ and not =.......

Thanks for all the help though - feel much better about it all now :-)

cheers,
Matt

jdMorgan

5:21 pm on Jul 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As an efficiency improvement, I'd suggest:

# Replace characters, set redirect required flag if characters are replaced
RewriteRule ^/(.*)_E_(.*)$ /$1=$2 [E=redir:yes]
RewriteRule ^/(.*)_A_(.*)$ /$1&$2 [E=redir:yes]
# Restart if any more "_E_" or "_A_" sequences need to be replaced
RewriteRule _[EA]_ - [N]
# Convert "_Q_" last: everything after "?" moves to the query string, where later RewriteRule patterns no longer see it
RewriteRule ^/(.*)_Q_(.*)$ /$1?$2 [E=redir:yes]
# Redirect only if redir flag is set
RewriteCond %{ENV:redir} ^yes$
RewriteRule (.*) http://example.co.uk/$1 [R=301,L]

In order to do a purely-internal rewrite as opposed to an external redirect, simply remove the final RewriteRule and RewriteCond. You can also remove the [E=redir:yes] since that's no longer needed.

Note that any code that uses [N] should be placed as close to the beginning of your .htaccess file as possible without affecting the operation of other directives or rules; when invoked by the [N] flag, mod_rewrite must re-parse your .htaccess file from the beginning, and re-execute any applicable rules. So, putting the code close to the beginning eliminates wasted CPU effort. If you have a known, small number of characters that need to be replaced, and the code cannot be placed near the beginning of the file, it is often more efficient to just duplicate the rules, rather than using the [N] flag.

Jim

grebo444

10:11 am on Jul 14, 2005 (gmt 0)

10+ Year Member



Hi,

I tried removing the final rewriterule, but this stopped it working completely! :-)

To elaborate, we have Resin serving these dynamic pages, running behind Apache. What seems to be happening, is that Apache is correctly rewriting the URL (the logs show this working ok), but Resin is still attempting to process the mangled URL, not the new URL that Apache has rewritten.

So, I think we're in new territory here with an incompatibility between mod_rewrite and Resin....Imma go Googling :-)

But again, thanks for the help.

Matt

jdMorgan

2:09 pm on Jul 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't know anything about Resin, but if Resin is loaded as a module, make sure it is loaded *before* mod_rewrite; mod_rewrite won't execute for any file request that is handled by modules loaded after it (on Apache 1.x, execution order and priority are the reverse of the LoadModule list order).

Similarly, if Resin uses a ScriptAlias in httpd.conf, control may be diverted to it before mod_rewrite can run. And finally, if neither of these is true, then it may be necessary to use the [PT] flag in RewriteRules to pass rewritten URLs to Resin.

Just some ideas...

Jim