Fun with regex: grouping strings

         

csdude55

6:12 pm on Apr 4, 2020 (gmt 0)

This is really more of a general regex question, but I'm specifically playing with my .htaccess so I put it in the Apache forum :-)

I have this:

RewriteCond %{REQUEST_URI} ^/(ads|images|cgi-bin)

I grouped ads|images|cgi-bin because I only want to match if the URI begins with / followed by any one of those.

But then I found that it matches the same if I leave out the ():

RewriteCond %{REQUEST_URI} ^/ads|images|cgi-bin

My question is, why? I thought that "/ads" would be seen as the first string, then "images" (without the /) would be the second, so "images" would have never matched... but it does.


Second question...

I'm also wondering about processing speed vs. load time of the .htaccess file. In this case I'm really talking about a few bytes, so this is more for my education than for actual results.

I have conditions that look like:

RewriteCond %{REQUEST_URI}(/index\.php)? !-f
RewriteCond %{REQUEST_URI} !-d
RewriteRule ^(.*) /foo/$1 [L]

(The rule is obviously not real, but I have those 2 conditions before just about every RewriteRule)

I have (/index\.php)? grouped to make the string optional, but I never need to reference %1. So the question is, should I use (?:index\.php)? to keep it from being stored as %1? Or do the 2 extra bytes of file size (meaning, slightly longer to load) offset any performance gain from it?

lucy24

7:06 pm on Apr 4, 2020 (gmt 0)

> But then I found that it matches the same if I leave out the ():
>
> RewriteCond %{REQUEST_URI} ^/ads|images|cgi-bin
>
> My question is, why?
Because it isn’t only matching ^/images. It’s also matching
/directory/images/
/directory/subdir/images/
/allimages/
/imagesensor/
et cetera, et cetera. This strikes me as perilous. Besides, it means the server has to check all the way through the request string, instead of only at the beginning.

> should I use (?:index\.php)?
I think you should use the [NS] flag, so as to bypass all internal requests for /blahblah/index.php (I assume you do not have it in visible URLs).
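
As a rough sketch of what that could look like, reusing the placeholder rule from the first post (the /foo/ target is the thread's stand-in, not a real rule, and the flag order is arbitrary):

```apache
# [NS] (long form: nosubreq) makes mod_rewrite skip this rule when the
# current request is an internal subrequest, such as the lookups mod_dir
# performs when it resolves a directory request to its index file.
RewriteRule ^(.*) /foo/$1 [NS,L]
```

Whether this removes the need for the index.php check depends on how index.php is reached on the site in question, so treat it as a starting point rather than a drop-in fix.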

> but I have those 2 conditions before just about every RewriteRule
Frankly that sounds wildly inefficient, since it means the server has to go looking for files and directories over and over and over again. I don't know how big your site is, but is listing everything by name in the body of the rule an option? In particular, anything involving a rewrite to
/foo/blahblah
should start with a RewriteCond along the lines of
RewriteCond %{REQUEST_URI} !^/foo/
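
Put together with the placeholder rule from the first post, that guard might look like this (a sketch only; /foo/ is the thread's stand-in, not a real path):

```apache
# Skip anything already under /foo/, so the rewritten URL is not
# rewritten a second time when the rules are processed again.
RewriteCond %{REQUEST_URI} !^/foo/
RewriteRule ^(.*) /foo/$1 [L]
```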

w3dk

9:54 pm on Apr 5, 2020 (gmt 0)

> RewriteCond %{REQUEST_URI} ^/ads|images|cgi-bin
>
> My question is, why? I thought that "/ads" would be seen as the first string, then "images" (without the /) would be the second, so "images" would have never matched... but it does.

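You seem to have ignored the ^ (start-of-string anchor) in your analysis. Alternation has the lowest precedence in a regex, so the pattern reads as "^/ads" OR "images" OR "cgi-bin", and only the first branch is anchored. So, as lucy24 has pointed out, it matches too much.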
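
To make the difference concrete, here are the two patterns from the first post side by side, plus a stricter variant. The stricter variant is an assumption about what was intended (whole path segments only), not something from the original post. These are alternatives for comparison, not conditions to use together:

```apache
# Ungrouped: the | splits the whole pattern into three branches.
#   ^/ads     anchored to the start (also matches /adsense)
#   images    matches anywhere, e.g. /directory/images/ or /allimages/
#   cgi-bin   matches anywhere
RewriteCond %{REQUEST_URI} ^/ads|images|cgi-bin

# Grouped: the ^ and the / now apply to every alternative.
RewriteCond %{REQUEST_URI} ^/(ads|images|cgi-bin)

# Grouped, non-capturing, and ending at a segment boundary, so that
# /imagesensor/ and /adsense no longer match.
RewriteCond %{REQUEST_URI} ^/(?:ads|images|cgi-bin)(?:/|$)
```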



> RewriteCond %{REQUEST_URI}(/index\.php)? !-f
> RewriteCond %{REQUEST_URI} !-d



These two conditions don't make much sense. They will always be successful because the first will never map to a real file and the second will almost never map to a real directory (it could only do so if a directory at the root of your filesystem happened to match the public URL-path, which is very unlikely).

The first argument to the RewriteCond directive is the TestString. This is not a regex, so "(/index\.php)?" is taken literally (except for the backslash before the dot). The trailing "?" does not make the preceding group "optional"; it is a literal "?"!

The REQUEST_URI is the document-root-relative URL-path, not the absolute filesystem path that the -f and -d operators expect.
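
A commonly used alternative is to test REQUEST_FILENAME, which in .htaccess holds the filesystem path the request has been mapped to. A minimal sketch, again with the placeholder rule from the first post, and assuming the intent was "rewrite only when no such file or directory exists":

```apache
# REQUEST_FILENAME is a filesystem path in this context, so -f and -d
# test what actually exists on disk.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*) /foo/$1 [L]
```

%{DOCUMENT_ROOT}%{REQUEST_URI} is another way to build a filesystem path for these tests when the URL maps directly onto the document root.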