Weird thing with a rule - Apache Web Server forum at WebmasterWorld - WebmasterWorld

Weird thing with a rule

moroandrea

11:02 am on Feb 12, 2013 (gmt 0)

10+ Year Member

Hello folks,

I'm puzzled about a weird scenario I'm trying to get sorted out.

This is the incoming URL:

[misite.co.uk...]

and these are three of the many rules in the .htaccess file:

RewriteCond %{QUERY_STRING} id=37
RewriteCond %{QUERY_STRING} view=article
RewriteRule .* this [R=301,L]

RewriteCond %{QUERY_STRING} id=36
RewriteCond %{QUERY_STRING} view=article
RewriteRule .* that [R=301,L]

RewriteCond %{QUERY_STRING} id=3
RewriteCond %{QUERY_STRING} view=article
RewriteRule .* those [R=301,L]

The rules are in this exact order, although I suspect the order is not the problem.

When I request the URL above, the THAT rule is executed, and this doesn't make sense to me. If a partial match were the cause of the problem, there is another id=3x rule above it, and that one should in theory be executed first.

However, after adding an & at the end of the pattern:

RewriteCond %{QUERY_STRING} id=36&

everything works fine.

Aren't the query string parameters parsed by the Apache module and assessed in a key=value matching way?

Many thanks for your help.

Andrea

moroandrea

11:23 am on Feb 12, 2013 (gmt 0)

Ah well, so to make it clear, I'd like to understand:

- why I need to add the & to get everything working
- whether or not the query string condition checks for the key=value only (although this doesn't appear to be the case, as the & is not stripped out)
- why, given the rule order above, the condition that matched was the second (that) and not the first (this)

phranque

1:35 pm on Feb 12, 2013 (gmt 0)

WebmasterWorld Administrator

it's trying to match a perl compatible regular expression so it makes perfect sense.

it would probably be better for your solution to prefix the regular expression with the ampersand:


RewriteCond %{QUERY_STRING} &id=36

moroandrea

2:08 pm on Feb 12, 2013 (gmt 0)

Hi Phranque,

thanks for your reply. Can you please elaborate a bit more on this "perl compatible" and perhaps similar dialects in other languages, so I can fully understand the situation?

Thanks
Andrea

phranque

2:21 am on Feb 13, 2013 (gmt 0)

the "perl compatible" part is practically irrelevant in this specific instance.
concentrate on the "match a regular expression" concept.

lucy24

2:27 am on Feb 13, 2013 (gmt 0)

WebmasterWorld Senior Member

overlapping phranque

?option=com_content&view=article&id=3&Itemid=36

Woo Hoo!

Only yesterday (really) we were talking about the need for anchors around Conditions looking at query strings, and you have provided a flawless example

:)

The string "id" is contained within the string "Itemid"
The string "3" is contained within the string "36"
The string "id=3" is contained within the string "Itemid=36"

It is probably too late to change the names of your parameters, although that would be the best solution. Instead what you need is anchoring like this

... %{QUERY_STRING} (^|&)id=3($|&)

where (^|&) means "this is either the very beginning of the whole query string or the beginning of an individual parameter" and ($|&) means the same thing, replacing "beginning" with "end". In each case there are two pipe-separated options. You have to use the (a|b) construction instead of the simpler [ab] because in each case one option is a literal string (the & piece) and the other is an anchor (the ^ and $ pieces).

Do you see how that works?
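
Applied to the three rules from the first post, that anchoring might look like the sketch below. It assumes the redirect targets this, that and those are the same placeholders used in the original question, so they would need replacing with real paths.

```apache
# Sketch only: "this", "that" and "those" are the placeholder targets
# from the original question, not real paths.
#
# (^|&) = start of the query string, or a parameter separator
# (&|$) = a parameter separator, or the end of the query string

RewriteCond %{QUERY_STRING} (^|&)id=37(&|$)
RewriteCond %{QUERY_STRING} (^|&)view=article(&|$)
RewriteRule .* this [R=301,L]

RewriteCond %{QUERY_STRING} (^|&)id=36(&|$)
RewriteCond %{QUERY_STRING} (^|&)view=article(&|$)
RewriteRule .* that [R=301,L]

RewriteCond %{QUERY_STRING} (^|&)id=3(&|$)
RewriteCond %{QUERY_STRING} (^|&)view=article(&|$)
RewriteRule .* those [R=301,L]
```

With ?option=com_content&view=article&id=3&Itemid=36, only the third pair of conditions should match: id=3 sits between two & characters, while Itemid=36 no longer satisfies (^|&)id=36 because the character before "id" is an "m".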

phranque

6:07 am on Feb 13, 2013 (gmt 0)

"It is probably too late to change the names of your parameters, although that would be the best solution."

those are joomla urls, so...

moroandrea

8:07 am on Feb 13, 2013 (gmt 0)

@lucy24

Thanks for clarifying this for me. As I said, I thought the module was performing a "split" of the query string under the bonnet, leaving the condition to check only the key and the value.

Now that I know it treats everything as a string, everything is much clearer.

Thanks for the regex quirk. I will implement it straight away.

Best
Andrea

g1smd

8:52 am on Feb 13, 2013 (gmt 0)

It's not a quirk. It's designed to work that way.

Proper "anchoring" in the RegEx pattern is key to making the rule work the way you want it to.

Depending on what "that" is, there is a possibility of an endless redirect loop appearing for some of your redirects.

You may need to look at THE_REQUEST to make sure you redirect only direct client requests and not requests that have previously been rewritten internally.

moroandrea

8:58 am on Feb 13, 2013 (gmt 0)

@g1smd thanks for your input.
Can you please show me an example of how to implement these checks?

Thanks

g1smd

9:24 am on Feb 13, 2013 (gmt 0)

Roughly one in every five of the previous 87 000 threads in the Apache sub-forum uses THE_REQUEST in some way or other. Have a look through a few of those, and the Apache documentation, and see what you can come up with.

moroandrea

9:44 am on Feb 13, 2013 (gmt 0)

Ok, so this is interesting.

"By testing THE_REQUEST using a RewriteCond, you prevent the redirect from being invoked as a result of a previously-rewritten request"

So, wouldn't it in theory be more useful to always test THE_REQUEST?

g1smd

10:07 am on Feb 13, 2013 (gmt 0)

You only need to test THE_REQUEST when you are looking to redirect a request. There is one particular scenario to look out for. This is where you redirect a request with parameters to a "friendly" URL and that request is then rewritten internally back to a form with parameters.

If you are not testing THE_REQUEST in your redirecting rule, once the internal rewrite has occurred in a later rule, htaccess is parsed again and this means the request will match the redirecting rule and be redirected again. This exposes the recently rewritten path back out on to the web as a new URL. That request is then likely to match your parameter-to-friendly redirecting rule again, and you now have an infinite loop.

Using THE_REQUEST does make for slightly more complicated code. THE_REQUEST is the literal:

GET /thispage?some=parameters HTTP/1.1

request sent by the browser. The RegEx pattern needed to match is obviously a bit more complex.

moroandrea

10:34 am on Feb 13, 2013 (gmt 0)

@g1smd
thanks, but assuming the original request had a query string and the rewritten URLs do not, looking at THE_REQUEST or QUERY_STRING is essentially the same (apart from the part that is analysed).

Would THE_REQUEST preserve the original path moving forward to the redirects?

g1smd

12:03 pm on Feb 13, 2013 (gmt 0)

Consider this simple scenario:

RewriteCond %{QUERY_STRING} ^digits=([0-9]+)&letters=([a-z]+)$
RewriteRule ^(index\.php)?$ http://www.example.com/%1-%2? [R=301,L]

RewriteRule ^([0-9]+)-([a-z]+) /index.php?digits=$1&letters=$2 [L]

The first rule redirects requests for

example.com/index.php?digits=123&letters=abc

or for

example.com/?digits=123&letters=abc

to

www.example.com/123-abc

The browser then requests

www.example.com/123-abc

which is internally rewritten by the second rule to

/index.php?digits=123&letters=abc

This internal request should then invoke the index.php file, pass the parameters to it and the PHP should then deliver the page of HTML and content.

Unfortunately, the internally rewritten pointer now matches the pattern in the redirecting rule and

www.example.com/index.php?digits=123&letters=abc

is exposed back out on to the web as a URL and the user is redirected again in a loop. The PHP file never gets invoked.

The redirecting rule should test that THE_REQUEST contained query string parameters. This stops the rule being invoked when the internal pointer has parameters as a result of a previous internal rewrite. In that case, the requested URL was

www.example.com/123-abc

without parameters.

The pattern for matching

THE_REQUEST

is necessarily more complex. It usually begins

^[A-Z]{3,9}\ /

and often ends

\ HTTP/

with various other stuff in the middle to match optional index.php, e.g.

/(index\.php)?

followed by parameters. Rather than use

\?digits=[0-9]+&letters=[a-z]+

here, the parameters part can often be generalised to

\?[^\ ]+

or similar.

REQUEST_URI

is modified as a result of internal rewrites,

THE_REQUEST

is not.
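
Assembled from the pieces above, the redirecting rule from the simple scenario might be guarded as in the sketch below. It assumes the same www.example.com host and the same digits and letters parameters. The THE_REQUEST condition is placed first so that %1 and %2 still refer to the groups captured by the QUERY_STRING condition, since those back-references come from the last matched condition.

```apache
# Sketch only: host name and parameter names follow the example scenario.

# Redirect only when the client itself asked for "/" or "/index.php"
# with a query string, i.e. the raw request line looks like
#   GET /index.php?digits=123&letters=abc HTTP/1.1
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(index\.php)?\?[^\ ]+\ HTTP/
RewriteCond %{QUERY_STRING} ^digits=([0-9]+)&letters=([a-z]+)$
RewriteRule ^(index\.php)?$ http://www.example.com/%1-%2? [R=301,L]

# Internal rewrite of the friendly URL back to the script.
# THE_REQUEST stays "GET /123-abc HTTP/1.1" here, so the
# redirecting rule above is not triggered a second time.
RewriteRule ^([0-9]+)-([a-z]+)$ /index.php?digits=$1&letters=$2 [L]
```

The trailing ? on the redirect target drops the original query string, as in the scenario's first rule.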

lucy24

1:54 pm on Feb 13, 2013 (gmt 0)

"Would THE_REQUEST preserve the original path moving forward to the redirects?"

You may need to get a grip on one more thing.

The "Request" isn't what the human user asked for by clicking a link, selecting a bookmark or typing in a name. It's what the browser asked for. Hence the term "User Agent". It's doing things on your behalf.

Every time you go to a www page, the browser takes a quick look at the html, makes a shopping list of images and css files, and asks the server to send them along, one by one. In the server logs, each of those will look exactly the same as the original page request.

If the original human request runs into a redirect, your browser doesn't come back and say "Oops, we've got a problem, do you want to try the side door?" It just puts in a new request and your address bar magically changes. But, again, the site's logs will show two separate requests.

Each request is an island. If you get redirected and the browser puts in a fresh request and you're back at the same site a millisecond later, the server will act as if it has never seen you before in its life. That's why you sometimes get a browser error telling you that a redirect is going around in circles so it's going to pull the plug. The server can't tell; the browser has to step in and rescue you.

An internal rewrite is different, because now there are two things. One is what the browser asked for; the other is what the server is giving it. Obvious example: You ask for example.com/index.html, you get redirected to example.com/ alone (assuming it's been coded properly) and end up seeing a page... whose filename happens to be index.html.

Behind the scenes, mod_rewrite has processed two separate requests for "index.html". One gets redirected; the other gets waved on through. The difference is the %{THE_REQUEST} line, which essentially tells the server "It is OK to hand over this file, but only if the visitor didn't ask for it." Which may sound a bit perverse, but there you are.

In some cases, you can avoid looking at %{THE_REQUEST} by using the [NS] flag instead. This means essentially the same thing: "The browser didn't ask for this file, the server did." In my case, f'rinstance, almost everything in .php gets the [NS]: the server is welcome to use the files, but humans should keep their grubby hands off ;)

phranque

1:56 pm on Feb 13, 2013 (gmt 0)

"Would THE_REQUEST preserve the original path moving forward to the redirects?"

if by "moving forward to the redirects" you mean the subsequent request after the 301 response, the answer is no, that's a new request so THE_REQUEST will contain the new request.

moroandrea

2:48 pm on Feb 13, 2013 (gmt 0)

@All

thanks for your time and effort in explaining this to me. Much appreciated, though I still have some doubts.

@lucy24

"The difference is the %{THE_REQUEST} line, which essentially tells the server 'It is OK to hand over this file, but only if the visitor didn't ask for it.'"

Not sure I get this. If {THE_REQUEST} is processed all the time, what could be the benefit?

I'm perplexed because, reading further, in the next paragraph you say [NS] is the same as {THE_REQUEST}.

So should I consider the {THE_REQUEST} and [NS] useful or perhaps ok to be used only when I'm treating everything that has not been:
a) typed in the URL address bar by a user
b) a link clicked from somewhere

According to @phranque, {THE_REQUEST} will always contain the new request, hence if I have a URL rewritten from a.html to b.html, the second {THE_REQUEST} should be something like GET b.html HTTP etc. etc.

I believe that if a working example can be made on top of what @g1smd did, I should probably be able to understand the scenario.

Thanks for your patience guys.

g1smd

2:57 pm on Feb 13, 2013 (gmt 0)

"if I have a URL rewritten from a.html to b.html, the second {THE_REQUEST} should be something like GET b.html HTTP"

I think you are mixing up redirects and rewrites.

If a URL request is redirected to a new URL, there will be a second request for that new URL. THE_REQUEST will be different each time.

If a URL request is rewritten, the server tries to serve the content from the rewritten internal location - unless another rule accidentally matches and does something unexpected. THE_REQUEST will contain the originally requested URL.
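
For the a.html and b.html case, the two behaviours could be sketched as follows. The host name www.example.com and the HTTP/1.1 request lines in the comments are assumptions, and the two rules are alternatives: use one or the other, not both.

```apache
# Alternative 1: external REDIRECT.
# First request:   THE_REQUEST is "GET /a.html HTTP/1.1" -> 301 response
# Second request:  THE_REQUEST is "GET /b.html HTTP/1.1" (a brand new request)
RewriteRule ^a\.html$ http://www.example.com/b.html [R=301,L]

# Alternative 2: internal REWRITE.
# Only one request is made. The content of /b.html is served, but
# THE_REQUEST is still "GET /a.html HTTP/1.1" throughout.
RewriteRule ^a\.html$ /b.html [L]
```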

moroandrea

3:19 pm on Feb 13, 2013 (gmt 0)

@g1smd yes, possibly I did confuse redirects and rewrites in this particular instance.

In any case, rules are executed from the top to the bottom of the file, so I need to pay attention to what I am asking for and avoid clashes.

As usual there is not just one route to achieve the goal.

Now going back to your answer.

"If a URL request is rewritten, the server tries to serve the content from the rewritten internal location"

Does this mean that rewritten URL because they are not making a new request, they will keep processing the rules without the .htaccess to be parsed again from the beginning (unless the [L] won't stop processing)?

lucy24

10:35 pm on Feb 13, 2013 (gmt 0)

"So should I consider the {THE_REQUEST} and [NS] useful or perhaps ok to be used only when I'm treating everything that has not been:
a) typed in the URL address bar by a user
b) a link clicked from somewhere"

NO. The server doesn't know whether the request came from a human or from the human's browser. To reiterate: If something is redirected, the human has not taken any new action. But their browser (or other User Agent) has-- and that action results in a whole new request. And, unless you've put some special extra business in the redirect target* there is no way to know that this second request was prompted by a redirect.

That's why I used the index.html example. The human starts out by typing "index.html" and ends up seeing the page whose name happens to be "index.html" BUT the page display is not a response to what the human user typed.

* Like, say,

RewriteCond %{THE_REQUEST} index
RewriteRule ^index\.html http://www.example.com/?redir=1 [R=301,L]

or

RewriteCond %{THE_REQUEST} index
RewriteRule ^index\.html http://www.example.com/ [R=301,L,CO=redir:1:.example.com]

g1smd

11:23 pm on Feb 13, 2013 (gmt 0)

"Does this mean that rewritten URL because they are not making a new request, they will keep processing the rules without the .htaccess to be parsed again from the beginning (unless the [L] won't stop processing)?"

URLs don't get rewritten. URL requests get rewritten - or redirected - depending on your htaccess rules.

Let's be sure you fully understand the process.

htaccess doesn't make URLs for content. htaccess acts on URL requests after a link on a page is clicked or after a URL is typed into the browser address bar or a browser bookmark is selected.

Each Apache module parses the htaccess file in turn. Each module ignores rules belonging to other modules. Each module parses the rules in the order they are presented in the file. Once a rule has been invoked, the htaccess file is parsed again from the beginning to make sure no other rules match this request.

URLs are used "out there" on the web. Filepaths and files are used "here" inside the server. They are two different reference systems and are related merely by your server configuration.

You might have a file called /index.html stored on the server hard drive.

You might be able to access that file by requesting the URL www.example.com/index.html

However, it's much better if you use the URL www.example.com/ instead. When you do that, an inbuilt default internal rewrite that associates the URL path "/" with the internal file "/index.html" kicks in.

This raises a problem. You now have two URLs that can return the same content - duplicate content.

You fix this with a redirect. Request the URL example.com/index.html or www.example.com/index.html and the server now returns 301 status and informs the browser to make a new request for the URL www.example.com/

The browser makes the new URL request for www.example.com/ and the internal rewrite serves the content from the /index.html file - without letting on where inside the server that content comes from.

The redirecting rule MUST test that THE_REQUEST contained "index.html" to avoid the infinite loop.

The REDIRECT

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]

The REWRITE

RewriteRule ^$ /index.html [L]

The final rule is actually implemented using

DirectoryIndex index.html

in real life.