RewriteRule to eliminate recursive parameters

need to correct URLs ending in &lang=en&lang=en&lang=en

         

senoner

1:42 pm on Aug 19, 2011 (gmt 0)

10+ Year Member



The following rules:

RewriteEngine on
RewriteRule ^/(.+)lang=..&lang=(..)$ /$1lang=$2 [R=301]

should redirect URLs like the following:
/index.php?id=15&lang=de&lang=it&lang=de
/index.php?lang=en&lang=en&lang=en&lang=en
to simply:
/index.php?id=15&lang=de
/index.php?lang=en
always using the last lang-parameter and ignoring the other lang=xx

My Apache seems to ignore that rule. But mod_rewrite is installed correctly, as other rules for other virtual hosts on the same server work flawlessly.

g1smd

7:34 pm on Aug 19, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule cannot see the query string. It sees only the path part of the URL request.

You need a preceding RewriteCond looking at QUERY_STRING in order to detect the parameter details.
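
A small Python sketch can make the split visible. It is only an illustration: `urlsplit` stands in for the way the request is divided into path and query string, and the helper name `rule_sees` is made up for this example.

```python
import re
from urllib.parse import urlsplit

# The pattern from the first post, written as a Python regex.
RULE_PATTERN = re.compile(r"^/(.+)lang=..&lang=(..)$")


def rule_sees(url: str) -> str:
    """Return the part of the request that a RewriteRule pattern is tested against."""
    return urlsplit(url).path


url = "/index.php?id=15&lang=de&lang=it&lang=de"
path = rule_sees(url)           # "/index.php"
query = urlsplit(url).query     # "id=15&lang=de&lang=it&lang=de"

# The lang= parameters live in the query string, so the path alone never matches.
rule_matches = RULE_PATTERN.match(path) is not None
print(path, query, rule_matches)
```

Because the parameters are only present in `query`, the test has to be made against QUERY_STRING.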

senoner

9:56 am on Aug 20, 2011 (gmt 0)


Perfect! That's what I missed.

The following rule works now:

RewriteEngine on
RewriteCond %{QUERY_STRING} ^(.*)lang=..&lang=(..)$
RewriteRule ^/(.*) /$1?%1lang=%2 [R=301]
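
Each pass of this condition and rule removes only one surplus lang= parameter, so a long string of them is cleaned up over several redirects. The Python sketch below simulates that on the query string alone; the function names `redirect_once` and `redirect_chain` are invented for the illustration.

```python
import re

# Same expression as the RewriteCond above.
COND = re.compile(r"^(.*)lang=..&lang=(..)$")


def redirect_once(query):
    """Return the rewritten query string, or None if the condition does not match."""
    m = COND.match(query)
    if m is None:
        return None
    # Mirrors the substitution "%1lang=%2".
    return m.group(1) + "lang=" + m.group(2)


def redirect_chain(query):
    """Follow the redirects until the condition no longer matches."""
    chain = [query]
    while True:
        nxt = redirect_once(chain[-1])
        if nxt is None:
            return chain
        chain.append(nxt)


for step in redirect_chain("id=15&lang=de&lang=it&lang=de"):
    print(step)
```

For the first example URL this gives two redirects: first to `id=15&lang=de&lang=de`, then to `id=15&lang=de`.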

Thank you very much!

g1smd

10:08 am on Aug 20, 2011 (gmt 0)


Every RewriteRule needs the [L] flag.

Never use (.*) at the beginning of a pattern. It means "match the entire input".

senoner

2:24 pm on Aug 20, 2011 (gmt 0)


Really? I want to force the browser to issue a new HTTP request, and [R] works fine for this purpose.

I want to leave the path as-is; only the query string should be rewritten, so "match the entire input" is exactly what I need.

So far I haven't encountered any problems with the above rule. But I'm happy to learn about possible side effects.

Thanks

g1smd

6:09 pm on Aug 20, 2011 (gmt 0)


(.*)lang=..&lang=(..)$
The (.*) matches "everything" and then you confuse the RegEx parser by saying after "everything" there's some other stuff.

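The parser then realises you didn't actually mean "everything", so it has to "back off and retry": it gives back one character at a time and re-tests the rest of the pattern at each position until it finds out what you actually wanted in there. On every request that does not match, it backs off all the way to the start before giving up.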

NEVER use (.*) at the beginning or in the middle of a pattern.
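
The back-off behaviour can be imitated in a few lines of Python. This is a simplified model, not how Apache's regex library is actually implemented: it assumes the greedy `(.*)` first takes the whole input and then gives back one character per retry. The name `count_trials` is made up for the sketch.

```python
import re

# Everything that follows the leading (.*) in the pattern under discussion.
TAIL = re.compile(r"lang=..&lang=(..)$")


def count_trials(query):
    """Count how many positions are tried before the tail matches (or all fail)."""
    trials = 0
    # Start with (.*) holding the whole string, then shrink it one character at a time.
    for pos in range(len(query), -1, -1):
        trials += 1
        if TAIL.match(query, pos):
            break
    return trials


print(count_trials("id=15&lang=de&lang=it"))  # matching query string
print(count_trials("id=15&lang=de"))          # non-matching: every position is tried
```

In this model the number of retries grows with the length of the input, and a request that does not match costs the most, because every position is tried before the engine gives up.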

lucy24

9:45 pm on Aug 20, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As a matter of habit, include the [L] flag in all rules unless you have a clear and specific reason for omitting it. If you can execute Chains or Skips without batting an eye, you probably do not need this forum ;) If the rule winds up with [F], the [L] is redundant but does no harm.

The pattern ^(.*)blahblah means "there may or may not be other stuff before the blahblah". If you are not capturing the other stuff, you can achieve the same result, with much less work for your computer, by saying simply blahblah without anchor.

If you do need to capture the first part of the text, use a more specific wording so the computer doesn't also start capturing the "blahblah" before realizing it wasn't supposed to. You know that the specified text string only occurs once, but the computer doesn't.

It may help conceptually to think of the computer as working in one dimension. It can't see with its eyeballs that there's a single instance of "blahblah" at one place in the request; it has to go letter by letter until it hits "Whoops, that's the end of the text, I guess I was supposed to have done something with that blahblah I met a while back".
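
As a quick check of the "same result without the anchor" point, the following Python sketch compares the two forms on a few invented request strings. `blahblah` is the placeholder from the post, not a real path.

```python
import re

with_capture = re.compile(r"^(.*)blahblah")  # anchored, with an unused leading capture
unanchored = re.compile(r"blahblah")         # the shorter equivalent

samples = ["/shop/blahblah/item", "blahblah", "/shop/other/item"]

# For each sample, record whether each form finds a match.
results = [
    (with_capture.search(s) is not None, unanchored.search(s) is not None)
    for s in samples
]
print(results)
```

Both forms agree on every sample; the only difference is the extra work done to fill a capture group that is never used.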

senoner

8:10 am on Aug 21, 2011 (gmt 0)


You're right about the [L], thanks - I've added it.

But I don't see another solution for the following goal:

/?lang=en&lang=en => /?lang=en
/index.php?id=15&lang=en&lang=it&lang=de => /index.php?id=15&lang=en&lang=de => /index.php?id=15&lang=de
/subdir/do.php?pid=22&gid=5&lang=fr&lang=it => /subdir/do.php?pid=22&gid=5&lang=it


There could be various subdirectories and various parameters in the query-string, but the lang-parameters are always last.

This rule seems to work well:

RewriteEngine on
RewriteCond %{QUERY_STRING} ^(.*)lang=..&lang=(..)$
RewriteRule ^/(.*) /$1?%1lang=%2 [R=301,L]

Note: the last part "lang=..&lang=(..)" is of fixed length (15 chars). I don't know whether the regex parser is intelligent enough to recognise this.
Could you please suggest a better (less CPU-heavy) regular expression?

lucy24

8:31 am on Aug 21, 2011 (gmt 0)


The problem here is that your Condition + Rule only takes care of two lang=.. sets and it looks as if you might have long strings of them.

There's no perfect way to do it, but one approach is

^(.*?)(?:&lang=..)+(&lang=..)$

That's assuming Apache recognizes the .*? syntax (Apache 2.x uses PCRE, which does). It means, generically, "If you have a choice between several actions, take the stingiest one". Here it means "Don't include '&lang=..' in your first capture if you can dump it on someone else".

The ?: isn't essential. It means "don't capture this bit even though it's in parentheses". It's just so you don't have to remember to skip from %1 to %3. And since you've got that first &lang=.. in parentheses, you may as well do the same for the final set. It lets you drop the &lang= part from your Rule, for a savings of six bytes ;).
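
To see what the two captures hold, here is the same expression tried in Python on the query strings from this thread. It is only a sketch: `collapse` is a made-up helper that joins the two captures the way a substitution of `%1%2` would.

```python
import re

LAZY = re.compile(r"^(.*?)(?:&lang=..)+(&lang=..)$")


def collapse(query):
    """Return capture 1 + capture 2, or the query unchanged if there is no match."""
    m = LAZY.match(query)
    if m is None:
        return query
    return m.group(1) + m.group(2)


print(collapse("id=15&lang=en&lang=it&lang=de"))
print(collapse("pid=22&gid=5&lang=fr&lang=it"))
print(collapse("lang=en&lang=en"))
```

The first two collapse in a single step to `id=15&lang=de` and `pid=22&gid=5&lang=it`. The third is left unchanged, because the first `lang=` in a query string like `lang=en&lang=en` has no leading `&`, so that case would need separate handling.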

g1smd

8:34 am on Aug 21, 2011 (gmt 0)


If it is always the last parameter, you could use
^([^&]+&)*lang=(..)$
which reads "read characters that are 'not &' one or more times, until you find '&', and then repeat that function zero or more times until you find the final '&', then read 'lang=' and then capture the two final characters in the backreference".

Use a second RewriteCond to grab the rest of the query string that you want to re-use.

However you do it, the goal is to avoid using (.*) at the beginning or in the middle of the RegEx pattern.
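
A short Python sketch shows why the second RewriteCond is needed with this pattern. The variable names are chosen only for this illustration.

```python
import re

LAST_LANG = re.compile(r"^([^&]+&)*lang=(..)$")

m = LAST_LANG.match("id=15&lang=en&lang=it&lang=de")
last_lang = m.group(2)      # the two characters after the final "lang="
last_repeat = m.group(1)    # a repeated group keeps only its last iteration

# A query string with a single lang= parameter matches as well.
single = LAST_LANG.match("id=15&lang=de")

print(last_lang, last_repeat, single is not None)
```

The first group ends up holding only `lang=it&`, not the whole prefix, so the part of the query string to keep has to be captured separately. Since the pattern also matches a query string with just one `lang=`, it would presumably need to be paired with a test for a duplicate parameter to avoid redirecting URLs that are already clean.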