
Apache Web Server Forum

    
Weird thing with a rule
moroandrea



 
Msg#: 4544878 posted 11:02 am on Feb 12, 2013 (gmt 0)

Hello folks,

I'm puzzled about a weird scenario I'm trying to get sorted out.

This is the incoming URL:

[misite.co.uk...]

and these are two (of the many rules) in the .htaccess file

RewriteCond %{QUERY_STRING} id=37
RewriteCond %{QUERY_STRING} view=article
RewriteRule .* this [R=301,L]

RewriteCond %{QUERY_STRING} id=36
RewriteCond %{QUERY_STRING} view=article
RewriteRule .* that [R=301,L]

RewriteCond %{QUERY_STRING} id=3
RewriteCond %{QUERY_STRING} view=article
RewriteRule .* those [R=301,L]

Rules are in this exact order, although I do suspect the problem is not this.

By querying the URL above, the THAT rule is executed and this doesn't make sense to me. Assuming a partial match is the cause of the problem, there is another id=3x rule above and that should in theory be executed first.

However, after adding an & at the end of the pattern:
RewriteCond %{QUERY_STRING} id=36&

everything works fine.

Aren't the query string parameters parsed by the Apache module and assessed in a key=value matching way?

Many thanks for your help.

Andrea

 

moroandrea



 
Msg#: 4544878 posted 11:23 am on Feb 12, 2013 (gmt 0)

Ah well, so to make it clear, I'd like to understand:

- why I need to add the & to get everything working
- whether or not the query string condition checks for the key=value only (although this doesn't appear to be the case, as the & is not stripped out)
- why, given the rule order above, the condition that matched was the second (that) and not the first (this)

phranque

WebmasterWorld Administrator phranque us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4544878 posted 1:35 pm on Feb 12, 2013 (gmt 0)

it's trying to match a perl compatible regular expression so it makes perfect sense.

it would probably be better for your solution to prefix the regular expression with the ampersand:


RewriteCond %{QUERY_STRING} &id=36

moroandrea



 
Msg#: 4544878 posted 2:08 pm on Feb 12, 2013 (gmt 0)

Hi Phranque,

thanks for your reply. Can you please elaborate a bit more on this "perl compatible" part, and perhaps on the similar dialects in other languages, so I can fully understand the situation?

Thanks
Andrea

phranque




 
Msg#: 4544878 posted 2:21 am on Feb 13, 2013 (gmt 0)

the "perl compatible" part is practically irrelevant in this specific instance.
concentrate on the "match a regular expression" concept.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4544878 posted 2:27 am on Feb 13, 2013 (gmt 0)

overlapping phranque
?option=com_content&view=article&id=3&Itemid=36

Woo Hoo!

Only yesterday (really) we were talking about the need for anchors around Conditions looking at query strings, and you have provided a flawless example

:)

The string "id" is contained within the string "Itemid"
The string "3" is contained within the string "36"
The string "id=3" is contained within the string "Itemid=36"

It is probably too late to change the names of your parameters, although that would be the best solution. Instead what you need is anchoring like this

... %{QUERY_STRING} (^|&)id=3($|&)

where (^|&) means "this is either the very beginning of the whole query string or the beginning of an individual query" and ($|&) means the same thing, replacing "beginning" with "end". In each case there are two pipe-separated options. You have to use the (a|b) construction instead of the simpler [ab] because in each case one option is a literal string (the & piece) and the other is an anchor (the ^ and $ pieces).

Do you see how that works?
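
As a rough sketch only, here is how the three rules from the first post might look with that anchoring applied. The targets /this, /that and /those are placeholders based on the names in the original post, and the trailing ? (which drops the old query string from the redirect target) is an assumption about what is wanted, not something stated in the thread:

```apache
# Each condition must match a whole key=value pair: the pair starts at the
# beginning of the query string or right after an &, and ends at an & or at
# the end of the query string. "Itemid=36" can no longer satisfy "id=36".
RewriteCond %{QUERY_STRING} (^|&)id=37(&|$)
RewriteCond %{QUERY_STRING} (^|&)view=article(&|$)
RewriteRule .* /this? [R=301,L]

RewriteCond %{QUERY_STRING} (^|&)id=36(&|$)
RewriteCond %{QUERY_STRING} (^|&)view=article(&|$)
RewriteRule .* /that? [R=301,L]

RewriteCond %{QUERY_STRING} (^|&)id=3(&|$)
RewriteCond %{QUERY_STRING} (^|&)view=article(&|$)
RewriteRule .* /those? [R=301,L]
```

With ?option=com_content&view=article&id=3&Itemid=36, only the third block should match: "id=3" sits between two ampersands, while the "id=36" inside "Itemid=36" is preceded by "m" rather than by & or the start of the string.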

phranque




 
Msg#: 4544878 posted 6:07 am on Feb 13, 2013 (gmt 0)

It is probably too late to change the names of your parameters, although that would be the best solution.


those are joomla urls, so...

moroandrea



 
Msg#: 4544878 posted 8:07 am on Feb 13, 2013 (gmt 0)

@lucy24

Thanks for clarifying this for me. As I said, I thought the regex module was performing a "split" of the query string under the bonnet, leaving the condition only the task of checking the key and the value.

Now that I know it treats everything as a string, everything is much clearer.

Thanks for the regex quirk. I will implement straight away.

Best
Andrea

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4544878 posted 8:52 am on Feb 13, 2013 (gmt 0)

It's not a quirk. It's designed to work that way.

Proper "anchoring" in the RegEx pattern is key to making the rule work the way you want it to.

Depending on what "that" is, there is a possibility of an endless redirect loop appearing for some of your redirects.

You may need to look at THE_REQUEST to make sure you redirect only direct client requests and not redirect previously internally rewritten requests.

moroandrea



 
Msg#: 4544878 posted 8:58 am on Feb 13, 2013 (gmt 0)

@g1smd thanks for your bit.
Can you please show me an example on how to implement these checks?

Thanks

g1smd




 
Msg#: 4544878 posted 9:24 am on Feb 13, 2013 (gmt 0)

Roughly one in every five of the previous 87 000 threads in the Apache sub-forum uses THE_REQUEST in some way or other. Have a look through a few of those, and the Apache documentation, and see what you can come up with.

moroandrea



 
Msg#: 4544878 posted 9:44 am on Feb 13, 2013 (gmt 0)

Ok, so this is interesting.

"By testing THE_REQUEST using a RewriteCond, you prevent the redirect from being invoked as a result of a previously-rewritten request"

So, wouldn't it in theory be more useful to always test THE_REQUEST?

g1smd




 
Msg#: 4544878 posted 10:07 am on Feb 13, 2013 (gmt 0)

You only need to test THE_REQUEST when you are looking to redirect a request. There is one particular scenario to look out for. This is where you redirect a request with parameters to a "friendly" URL and that request is then rewritten internally back to a form with parameters.

If you are not testing THE_REQUEST in your redirecting rule, once the internal rewrite has occurred in a later rule, htaccess is parsed again and this means the request will match the redirecting rule and be redirected again. This exposes the recently rewritten path back out on to the web as a new URL. That request is then likely to match your parameter-to-friendly redirecting rule again, and you now have an infinite loop.

Using THE_REQUEST does make for slightly more complicated code. THE_REQUEST is the literal:
GET /thispage?some=parameters HTTP/1.1
request sent by the browser. The RegEx pattern needed to match is obviously a bit more complex.

moroandrea



 
Msg#: 4544878 posted 10:34 am on Feb 13, 2013 (gmt 0)

@g1smd
thanks, but assuming the original request had the query string and the rewritten URLs do not, looking at THE_REQUEST or QUERY_STRING is essentially the same (apart from the part of the request that gets analysed)

Would THE_REQUEST preserve the original path moving forward to the redirects?

g1smd




 
Msg#: 4544878 posted 12:03 pm on Feb 13, 2013 (gmt 0)

Consider this simple scenario:

RewriteCond %{QUERY_STRING} ^digits=([0-9]+)&letters=([a-z]+)$
RewriteRule ^(index\.php)?$ http://www.example.com/%1-%2? [R=301,L]


RewriteRule ^([0-9]+)-([a-z]+) /index.php?digits=$1&letters=$2 [L]

The first rule redirects requests for
example.com/index.php?digits=123&letters=abc or for example.com/?digits=123&letters=abc to www.example.com/123-abc

The browser then requests
www.example.com/123-abc which is internally rewritten by the second rule to /index.php?digits=123&letters=abc

This internal request should then invoke the index.php file, pass the parameters to it and the PHP should then deliver the page of HTML and content.

Unfortunately, the internally rewritten pointer now matches the pattern in the redirecting rule and
www.example.com/index.php?digits=123&letters=abc is exposed back out on to the web as a URL and the user is redirected again in a loop. The PHP file never gets invoked.

The redirecting rule should test that THE_REQUEST contained query string parameters. This stops the rule being invoked when the internal pointer has parameters as a result of a previous internal rewrite. In that case, the requested URL was
www.example.com/123-abc without parameters.

The pattern for matching
THE_REQUEST is necessarily more complex. It usually begins ^[A-Z]{3,9}\ / and often ends \ HTTP/ with various other stuff in the middle to match optional index.php, e.g. /(index\.php)? followed by parameters. Rather than use \?digits=[0-9]+&letters=[a-z]+ here, the parameters part can often be generalised to \?[^\ ]+ or similar.

REQUEST_URI is modified as a result of internal rewrites, THE_REQUEST is not.
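
A minimal sketch of the scenario above with the redirect keyed on THE_REQUEST instead of QUERY_STRING. It reuses the example.com host and the digits/letters parameters from this post, and it assumes the parameters always arrive in that order, so treat it as an illustration rather than a drop-in rule set:

```apache
# External redirect: fires only when the browser itself asked for the
# parameterised URL, because THE_REQUEST does not change during internal rewrites.
# %1 is the optional index.php, so the digits are %2 and the letters are %3.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(index\.php)?\?digits=([0-9]+)&letters=([a-z]+)\ HTTP/
RewriteRule ^(index\.php)?$ http://www.example.com/%2-%3? [R=301,L]

# Internal rewrite: maps the friendly URL back to the script.
RewriteRule ^([0-9]+)-([a-z]+)$ /index.php?digits=$1&letters=$2 [L]
```

After the internal rewrite of /123-abc, THE_REQUEST still reads "GET /123-abc HTTP/1.1", so the redirecting condition fails and the loop does not start.
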
lucy24




 
Msg#: 4544878 posted 1:54 pm on Feb 13, 2013 (gmt 0)

Would THE_REQUEST preserve the original path moving forward to the redirects?

You may need to get a grip on one more thing.

The "Request" isn't what the human user asked for by clicking a link, selecting a bookmark or typing in a name. It's what the browser asked for. Hence the term "User Agent". It's doing things on your behalf.

Every time you go to a www page, the browser takes a quick look at the html, makes a shopping list of images and css files, and asks the server to send them along, one by one. In the server logs, each of those will look exactly the same as the original page request.

If the original human request runs into a redirect, your browser doesn't come back and say "Oops, we've got a problem, do you want to try the side door?" It just puts in a new request and your address bar magically changes. But, again, the site's logs will show two separate requests.

Each request is an island. If you get redirected and the browser puts in a fresh request and you're back at the same site a millisecond later, the server will act as if it has never seen you before in its life. That's why you sometimes get a browser error telling you that a redirect is going around in circles so it's going to pull the plug. The server can't tell; the browser has to step in and rescue you.

An internal rewrite is different, because now there are two things. One is what the browser asked for; the other is what the server is giving it. Obvious example: You ask for example.com/index.html, you get redirected to example.com/ alone (assuming it's been coded properly) and end up seeing a page... whose filename happens to be index.html.

Behind the scenes, mod_rewrite has processed two separate requests for "index.html". One gets redirected; the other gets waved on through. The difference is the %{THE_REQUEST} line, which essentially tells the server "It is OK to hand over this file, but only if the visitor didn't ask for it." Which may sound a bit perverse, but there you are.

In some cases, you can avoid looking at %{THE_REQUEST} by using the [NS] flag instead. This means essentially the same thing: "The browser didn't ask for this file, the server did." In my case, f'rinstance, almost everything in .php gets the [NS]: the server is welcome to use the files, but humans should keep their grubby hands off ;)

phranque




 
Msg#: 4544878 posted 1:56 pm on Feb 13, 2013 (gmt 0)

Would THE_REQUEST preserve the original path moving forward to the redirects?

if by "moving forward to the redirects" you mean the subsequent request after the 301 response, the answer is no, that's a new request so THE_REQUEST will contain the new request.

moroandrea



 
Msg#: 4544878 posted 2:48 pm on Feb 13, 2013 (gmt 0)

@All

thanks for your time and effort in explaining this to me. Much appreciated, though I still have some doubts.

@lucy24
The difference is the %{THE_REQUEST} line, which essentially tells the server "It is OK to hand over this file, but only if the visitor didn't ask for it."

Not sure I get this. If {THE_REQUEST} is processed all the time, what could be the benefit?

I'm perplexed because, reading further, in the next paragraph you mention [NS] as being the same as {THE_REQUEST}.

So should I consider the {THE_REQUEST} and [NS] useful or perhaps ok to be used only when I'm treating everything that has not been:
a) typed in the URL address bar by a user
b) a link clicked from somewhere

According to @phranque, {THE_REQUEST} will always contain the new request, hence if I have a URL rewritten from a.html to b.html, the second {THE_REQUEST} should be something like GET b.html HTTP etc. etc.

I believe that if a working example can be made on top of what @g1smd did, I should probably be able to understand the scenario.

Thanks for your patience guys.

g1smd




 
Msg#: 4544878 posted 2:57 pm on Feb 13, 2013 (gmt 0)

if I have a URL rewritten from a.html to b.html, the second {THE_REQUEST} should be something like GET b.html HTTP

I think you are mixing up redirects and rewrites.

If a URL request is redirected to a new URL, there will be a second request for that new URL. THE_REQUEST will be different each time.

If a URL request is rewritten, the server tries to serve the content from the rewritten internal location - unless another rule accidentally matches and does something unexpected. THE_REQUEST will contain the originally requested URL.
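
To make the contrast concrete, here is a small sketch with one rule of each kind. a.html and b.html come from the question; c.html, d.html, the example.com host and the HTTP/1.1 request lines in the comments are made up for illustration:

```apache
# Redirect: the server answers 301, the browser sends a second request,
# and THE_REQUEST for that second request is "GET /b.html HTTP/1.1".
RewriteRule ^a\.html$ http://www.example.com/b.html [R=301,L]

# Rewrite: nothing is sent back to the browser at this point. The server
# serves /d.html, and THE_REQUEST still reads "GET /c.html HTTP/1.1".
RewriteRule ^c\.html$ /d.html [L]
```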

moroandrea



 
Msg#: 4544878 posted 3:19 pm on Feb 13, 2013 (gmt 0)

@g1smd yes, possibly I did confuse redirects and rewrites in this particular instance.

In any case rules will be executed from top to the bottom of the file. So what I need to pay attention to is what I am asking and avoid clashes.

As usual there is not just one route to achieve the goal.

Now going back to your answer.

If a URL request is rewritten, the server tries to serve the content from the rewritten internal location


Does this mean that rewritten URLs, because they are not making a new request, will keep being processed by the remaining rules without the .htaccess being parsed again from the beginning (unless [L] stops processing)?

lucy24




 
Msg#: 4544878 posted 10:35 pm on Feb 13, 2013 (gmt 0)

So should I consider the {THE_REQUEST} and [NS] useful or perhaps ok to be used only when I'm treating everything that has not been:
a) typed in the URL address bar by a user
b) a link clicked from somewhere

NO. The server doesn't know whether the request came from a human or from the human's browser. To reiterate: If something is redirected, the human has not taken any new action. But their browser (or other User Agent) has-- and that action results in a whole new request. And, unless you've put some special extra business in the redirect target* there is no way to know that this second request was prompted by a redirect.

That's why I used the index.html example. The human starts out by typing "index.html" and ends up seeing the page whose name happens to be "index.html" BUT the page display is not a response to what the human user typed.


* Like, say,

RewriteCond %{THE_REQUEST} index
RewriteRule ^index\.html http://www.example.com/?redir=1 [R=301,L]


or

RewriteCond %{THE_REQUEST} index
RewriteRule ^index\.html http://www.example.com/ [R=301,L,CO=redir:1:.example.com]

g1smd




 
Msg#: 4544878 posted 11:23 pm on Feb 13, 2013 (gmt 0)

Does this mean that rewritten URLs, because they are not making a new request, will keep being processed by the remaining rules without the .htaccess being parsed again from the beginning (unless [L] stops processing)?

URLs don't get rewritten. URL requests get rewritten - or redirected - depending on your htaccess rules.

Let's be sure you fully understand the process.

htaccess doesn't make URLs for content. htaccess acts on URL requests after a link on a page is clicked or after a URL is typed into the browser address bar or a browser bookmark is selected.

Each Apache module parses the htaccess file in turn. Each module ignores rules belonging to other modules. Each module parses the rules in the order they are presented in the file. Once a rule has been invoked, the htaccess file is parsed again from the beginning to make sure no other rules match this request.

URLs are used "out there" on the web. Filepaths and files are used "here" inside the server. They are two different reference systems and are related merely by your server configuration.

You might have a file called /index.html stored on the server hard drive.

You might be able to access that file by requesting the URL www.example.com/index.html

However, it's much better if you use the URL www.example.com/ instead. When you do that, an inbuilt default internal rewrite that associates the URL path "/" with the internal file "/index.html" kicks in.

This raises a problem. You have now two URLs that can return the same content - duplicate content.

You fix this with a redirect. Request the URL example.com/index.html or www.example.com/index.html and the server now returns 301 status and informs the browser to make a new request for the URL www.example.com/

The browser makes the new URL request for www.example.com/ and the internal rewrite serves the content from the /index.html file - without letting on where inside the server that content comes from.

The redirecting rule MUST test that THE_REQUEST contained "index.html" to avoid the infinite loop.

The REDIRECT
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]


The REWRITE
RewriteRule ^$ /index.html [L]

The final rule is actually implemented using
DirectoryIndex index.html in real life.
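
Put together, the real-life arrangement might look roughly like this. It is only a sketch: the RewriteEngine line and the www.example.com host are assumptions about the surrounding configuration:

```apache
# mod_dir maps a request for a directory URL to its index.html file,
# so no explicit rewrite rule is needed for that step.
DirectoryIndex index.html

RewriteEngine On

# Redirect only when the browser literally asked for .../index.html.
# A request for "/" has THE_REQUEST "GET / HTTP/1.1", so it is left alone.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html$ http://www.example.com/$1 [R=301,L]
```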