Forum Moderators: Ocean10000 & phranque

mod rewrite and mod substitute

     
8:55 pm on Apr 22, 2015 (gmt 0)

New User

joined:Apr 20, 2015
posts:7
votes: 0


Hello All!

After reading a multitude of posts here and elsewhere as well as the Apache docs on the mod_rewrite and mod_substitute topics, I believe I have a better understanding of how to approach my problem. I am new to working with an Apache server and will undoubtedly make mistakes and have misunderstandings. My goal in reaching out to you is to minimize those mistakes and understand why they are mistakes/errors/etc and how to avoid making them in the future.

I've been tasked with creating rewrite rules to match existing ones in an IIS web.config. The IIS URL Rewrite module consists of inbound rules/redirects to modify requests and outbound rules to modify responses. If I am understanding the Apache module capabilities correctly, mod_rewrite can take care of the requests and mod_substitute can handle the responses. I wanted to get the ball rolling and have some general rewrite/substitute templates ready to go for when a test environment is set up in a few days.

I have broken down the rules needed to a small set that should be applicable (essentially repeatable with slight modifications) to the vast majority of the urls. I'll also need to implement caching and compression - not sure if that has anything to do with a .htaccess file yet, but I'd like to concentrate on the rewrite and substitution for now. Perhaps each module in question deserves a different post. If that is indeed the case, please advise me so.

Apache httpd version 2.2.15



---mod_rewrite---


These rules will be implemented in a .htaccess in the root.

The basic structure of a page and its URLs:
- page: category-sub-category.php
- static friendly URL: http://example.com/category/sub-category
- dynamic friendly URL type 1: http://example.com/category/sub-category?querystring
- dynamic friendly URL type 2: http://example.com/parameter1/parameter2

What needs 'translation' from IIS:
I. Resource white-list rule
II. External 301 redirects
    A. Avoid duplicate content on static URLs
    B. Canonical redirect
III. Internal rewrites to process incoming user-friendly URLs
    A. 1 rule per category
    B. 1 rule for each type of dynamic URL
        1. Type 1
        2. Type 2

#I. Resource Whitelist
#I just want it to "do nothing" and stop processing rules [L]
#if the file extension is matched (css|js|gif|ico|img|jpg|php|png)
#in given directories below the root. This would be the first rule to save time.
RewriteCond %{REQUEST_FILENAME} !^(.*)/(.+)\.(css|js|gif|ico|img|jpg|php|png)$
RewriteRule ^(d1/(.*)|d2/(.*)|d3/(.*)|d4/(.*)) - [L]

#II A. External redirects for each sub-category
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /category(.+)\.php\ HTTP/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^category-sub-category\.php(.*) http://example.com/category/sub-category/$1 [R=301,L]

#II B. External redirect for non-canonical hostname requests to canonical hostname
RewriteCond %{HTTP_HOST} ^www\.example\.com(.*)$
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(.*)$ http://example.com(/)?(%1)? [R=301,L]

#III A. Internal rewrite for category/sub-category
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^category/([^/]+)$ category-$1.php [L]

#III B 1. Internal rewrite for category/sub-category?querystring
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^category/([^_/]+)/([^/]+)$ category-$1.php?$2 [NC,L]

#III B 2. Internal rewrite for example.com/parameter1/parameter2
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(?!category1|category2|category3)([^/]+)/([^/]+)$ page.php?parameter1=$1&parameter2=$2 [NC,L]


-Questions-
Do any of you recommend or caution against using positive and negative lookaheads to reduce the number of rules needed? I could make 15 rules, all a little bit different, or 1 rule with 15 lookahead values; i.e. adding a new category would only involve adding '|newcategory' to the lookahead list as opposed to adding another rule. I'm not sure of the potential speed benefits/costs or if it ends up being inconsequential.

Does mod_rewrite handle + signs in URLs by default (e.g. http://example.com/search/term1+term2+term3)? I ask because IIS required double escaping to be enabled to process them, and that took me forever to figure out.


---mod_substitute---


For the mod_substitute (response) side at minimum any <a href links will need to be substituted.

Dynamic url links need to be in the proper form for the rewrite section to catch them.

I'll need to search for "<a href= (.*)category-sub-category.php" and substitute the friendly URL versions.


AddOutputFilterByType SUBSTITUTE text/html

#Substitution to example.com/parameter1/parameter2
Substitute "s|<a\ href="(.*)page\.php\?parameter1=([^=/&amp;]+)&amp;(?:amp;)?parameter2=([^=/&amp;]+)">|<a href="$1/$2/$3">|i"

#Substitution to example.com/category/sub-category?querystring
Substitute "s|<a\ href="(.*)category-([^/]+)\.php\?([^/]+)">|<a href="$1/$2?$3">|i"

#Substitution to example.com/category/sub-category
Substitute "s|<a\ href="(.*)category-([^/]+)\.php">|<a href="$1/category/$2">|"

#Substitution to example.com/category
Substitute "s|<a\ href="(.*)category\.php">|<a href="$1/category">|"



-Questions-
Should I be escaping the spaces, ., /, and ? in the substitution regexes?

What is the significance of using quotation marks around the entire substitution pattern|replacement? Are they used only when using a regex pattern and not used when the n flag is set? That's about the only difference I could see in the docs.

What does 'flattening' buckets mean/do for substitutions?

10:39 pm on Apr 22, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15944
votes: 890


RewriteCond %{REQUEST_FILENAME} !^(.*)/(.+)\.(css|js|gif|ico|img|jpg|php|png)$
RewriteRule ^(d1/(.*)|d2/(.*)|d3/(.*)|d4/(.*)) - [L]
I assume the RewriteCond is intended to intercept requests containing exactly one directory slash. But as written it will instead match all requests containing one or more directory slashes. The pattern you want is
^[^/]+/[^/.]+\.xtn
but really, this should go in the body of the rule. Is it literally d1 d2 etc or did you just make those up for posting? If those were the actual directory names the pattern would be
^d[1-4]/blahblah
where "blahblah" means "we need to hash this out some more".

RewriteRule ^category-sub-category\.php(.*) http://example.com/category/sub-category/$1 [R=301,L]
I don't think this rule does what you intend it to do, unless you've got weird URLs. The body of the rule looks only at the path. If you're trying to capture a query string, that needs to go in a RewriteCond. Two of 'em in fact, to avoid infinite loops:
RewriteCond %{THE_REQUEST} \?
RewriteCond %{QUERY_STRING} (.*)
RewriteRule ^category-sub-category\.php http://example.com/category/sub-category/%1 [R=301,L]
Note the change from $ to % to match a capture from a condition. You can only capture from the most recently matched condition (generally that means the one right before the Rule itself) so pay attention to order of Conditions.
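
One caveat worth checking when testing the rule above: mod_rewrite re-appends the original query string to the target unless the target itself contains a question mark, so the redirect as written would probably carry the query string twice (once in the path, once after a ?). A minimal sketch, assuming the intent is to move the query string into the path and drop the original, ends the target with a bare question mark:

```apache
# Only act on requests whose original request line carried a query string
RewriteCond %{THE_REQUEST} \?
# Capture the whole query string; as the last condition it supplies %1
RewriteCond %{QUERY_STRING} (.*)
# The trailing "?" tells mod_rewrite to discard the original query string
# instead of re-appending it to the redirect target
RewriteRule ^category-sub-category\.php$ http://example.com/category/sub-category/%1? [R=301,L]
```

The bare question mark itself is not sent in the Location header. The anchoring $ after \.php is an addition here, not part of the rule as posted.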

II B. External redirect for non-canonical hostname requests to canonical hostname
RewriteCond %{HTTP_HOST} ^www\.example\.com(.*)$
This should be the very last external redirect, and the condition is best expressed as
!^(example\.com)?$
meaning "exactly this or exactly nothing". There is absolutely no point to doing a filename lookup here; in fact you've got way too many of them everywhere. Looking up to see whether a file exists will be more work for the server than just issuing the darned redirect regardless. (Analogies present themselves.)

Does mod_rewrite handle + signs in urls by default

Plus signs have their ordinary RegEx meaning, so in patterns they have to be escaped as \+
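
For the search URL in the question, a short sketch may help. It assumes a hypothetical search.php that reads a q parameter, neither of which appears in the thread. A negated character class such as [^/] matches a plus sign like any other character, so the escape is only needed when the pattern names the plus sign itself:

```apache
# ([^/]+) captures "term1+term2+term3"; the + after the class is the quantifier
RewriteRule ^search/([^/]+)$ search.php?q=$1 [L]

# A literal plus sign in the pattern has to be escaped
# (hypothetical URL /c++/ and hypothetical target cplusplus.php)
RewriteRule ^c\+\+/?$ cplusplus.php [L]
```

Once the plus signs land in the query string, PHP would normally decode them as spaces, which is usually what a search page wants.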

Should I be escaping the spaces, ., /, and ? in the substitution regexes?

In a target, absolutely nothing needs to be escaped. In a pattern, escape only . and ? from the list you gave. Not commas or slashes. You have to escape spaces everywhere, because a space has syntactic meaning in Apache. (In some situations, such as mod_setenvif, you can put a string in quotation marks instead. Don't try it in mod_rewrite. In one or two Apache mods-- again, not in mod_rewrite-- the slash / has syntactic meaning, so then it needs to be escaped. Otherwise no.)

But of course your targets will never contain blank spaces!

What is the significance of using quotation marks around the entire substitution pattern|replacement?

Nothing. It isn't needed. There are rare situations where quotation marks are used in a RewriteCond, but normally they're not used for anything.

I probably missed some stuff, but that's a start. Welcome to Apache!

2:37 pm on Apr 23, 2015 (gmt 0)

New User


Hi Lucy and thank you for the welcome and the helpful reply.

I assume the RewriteCond is intended to intercept requests containing exactly one directory slash. But as written it will instead match all requests containing one or more directory slashes.


This is actually intended due to there being resources used in directories one directory slash below and beyond. So the urls for those will always have one slash and could possibly have more.

The d1|d2|d3 directories were made up for posting. Real examples are css|fonts|images|js. I apologize for not making that clear.

It might be better if I rewrote what I'd like this to accomplish. Stop processing rewrite rules on any request for any of the listed file extensions within any level of the listed directories. With your suggestions, I changed this to:


#I. Resource Whitelist
RewriteCond %{REQUEST_FILENAME} ^(css/(.*)|fonts/(.*)|images/(.*)|js/(.*))
RewriteRule ^(.*)/(.+)\.(css|js|gif|ico|img|jpg|php|png) - [L]





#II A. External redirects for each sub-category
RewriteCond %{THE_REQUEST} \?
RewriteCond %{QUERY_STRING} (.*)
RewriteRule ^category-sub-category\.php http://example.com/category/sub-category/%1 [R=301,L]

You can only capture from the most recently matched condition (generally that means the one right before the Rule itself) so pay attention to order of Conditions.

That's extremely useful to know. You just answered a question I didn't know I had. I'm assuming multiple captures in a condition may be accessed with %1... %N.




#II B. External redirect for non-canonical hostname requests to canonical hostname
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

There is absolutely no point to doing a filename lookup here; in fact you've got way too many of them everywhere.

Thank you for this bit too. I get your point. In this rule, it's absolutely irrelevant. In the others I guess I am essentially asking the server to look for files or directories that aren't going to be there. A bit of a fruitless errand. The fog is lifting. With your help in mind, here are the modified rewrite rules for the remaining situations

#III A. Internal rewrite for category/sub-category
RewriteRule ^category/([^/]+)$ category-$1.php [L]

#III B 1. Internal rewrite for category/sub-category/querystring
RewriteRule ^category/([^_/]+)/([^/]+)$ category-$1.php?$2 [L]

#III B 2. Internal rewrite for example.com/parameter1/parameter2
RewriteRule ^(?!category1|category2|category3)([^/]+)/([^/]+)$ page.php?parameter1=$1&parameter2=$2 [L]




I think the only current outstanding items I was curious about were lookaheads - ease of use vs. performance, if it even makes a difference in a url rewrite situation, and if the substitutions below were syntactically correct.


#Substitution to example.com/parameter1/parameter2
Substitute s|<a\ href="(.*)page\.php\?parameter1=([^=/&amp;]+)&amp;(?:amp;)?parameter2=([^=/&amp;]+)">|<a href="$1/$2/$3">|i

#Substitution to example.com/category/sub-category?querystring
Substitute s|<a\ href="(.*)category-([^/]+)\.php\?([^/]+)">|<a href="$1/$2?$3">|i

#Substitution to example.com/category/sub-category
Substitute s|<a\ href="(.*)category-([^/]+)\.php">|<a href="$1/category/$2">|

#Substitution to example.com/category
Substitute s|<a\ href="(.*)category\.php">|<a href="$1/category">|


Thanks for your help so far Lucy!

6:44 pm on Apr 23, 2015 (gmt 0)

New User


I think I may have misinterpreted what was meant by body for the whitelist rule so would this be better and accomplish the same thing?

#I. Resource Whitelist
RewriteRule ^(css|fonts|images|js)/(.*)\.(css|js|gif|ico|img|jpg|php|png) - [L]

8:00 pm on Apr 23, 2015 (gmt 0)

Senior Member from US

lucy24


Short version: never put something in a RewriteCond that can go in the body of the rule. Yes, "body" means the RewriteRule itself. mod_rewrite works on a "two steps forward, one step back" system where conditions are only evaluated if the rule itself can potentially apply. For example if you have
RewriteCond blahblah1
RewriteCond blahblah2
RewriteRule blahblah3 do-stuff-here
then mod_rewrite won't even look at blahblah1 and blahblah2 unless the request contains the element "blahblah3".
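
The evaluation order described above can be annotated on a concrete block. This is only an illustrative sketch: the two conditions are placeholders chosen for the example, not rules anyone in the thread needs.

```apache
# Order of evaluation for this block:
#   1. the RewriteRule pattern is tested against the URL path first
#   2. only if it matches are the RewriteCond lines tested, top to bottom
#   3. only if every condition passes is the substitution applied
RewriteCond %{REQUEST_METHOD} ^GET$
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^category/([^/]+)$ category-$1.php [L]
```

A request for /about.php never reaches the two conditions, because the pattern ^category/ fails in step 1.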

Any time you have a RewriteCond involving a positive match on %{REQUEST_URI} (i.e. no ! negative), that's something that can go in the rule itself. (Sometimes you may still have a Request URI condition to simplify capturing, but then it's a pseudo-condition that you already know has been met.)

I'm assuming multiple captures in a condition may be accessed with %1... %N.

Yes, but you can only go to one digit. If you happen to have something with nested elements, or pipe-separated groups, you can use (?:blahblah) no-capture markup so you don't have to keep count and end up with a $1$4%2 etcetera in the target.
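
A small sketch of the no-capture markup, using made-up names (shoes, hats, listing.php, item, page) purely for illustration:

```apache
# (?:...) groups without capturing, so the only numbered groups are
#   %1 = the digits captured in the condition
#   $1 = the path segment captured in the rule
RewriteCond %{QUERY_STRING} ^(?:page|p)=([0-9]+)$
RewriteRule ^(?:shoes|hats)/([^/]+)$ listing.php?item=$1&page=%1 [L]
```

Without the ?: markers the same values would be %2 and $2, and every added alternative group would shift the numbering again.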

Now, about that "stop right here" rule. Do there exist supporting files that you do want to redirect? If not, you could always just say
RewriteRule \.(css|js|gif|ico|img|jpg|php|png)$ - [L]

But, on the other hand, your css etc directories don't contain any pages, do they? If they don't, it's even easier-- for both you and the server-- to say
RewriteRule ^(css|fonts|images|js)/ - [L]

and then it doesn't matter what the rest of the URL looks like, because you already know nothing further will be happening.


Edit: I don't speak Apache-- I just do Regular Expressions-- so phranque will come around by and by to deal with questions specific to mod_substitute.

Further edit: I doubt there's any advantage to using a lookahead instead of a RewriteCond. In some situations maybe, but I'm inclined to think it's the same amount of work for the server either way. In htaccess, unlike in the config file, Regular Expressions have to be re-compiled on the fly each and every time.
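
For comparison, here are the two forms side by side, as a sketch built on the placeholder names from the original post (category1 to category3, page.php). Both are intended to behave the same way; which reads better is a matter of taste.

```apache
# Form 1: negative lookahead inside the rule pattern.
# The lookahead does not capture, so the two segments are $1 and $2.
RewriteRule ^(?!category1|category2|category3)([^/]+)/([^/]+)$ page.php?parameter1=$1&parameter2=$2 [L]

# Form 2: the same exclusion as a negated condition.
# REQUEST_URI keeps its leading slash, unlike the path seen by a rule in .htaccess.
RewriteCond %{REQUEST_URI} !^/(category1|category2|category3)
RewriteRule ^([^/]+)/([^/]+)$ page.php?parameter1=$1&parameter2=$2 [L]
```

Only one of the two forms should be used in a real file. In either form a new category is added by extending the pipe-separated list.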

8:47 pm on Apr 23, 2015 (gmt 0)

New User


Now, about that "stop right here" rule. Do there exist supporting files that you do want to redirect? If not, you could always just say...
I'll have to take a look and double check once the test environment is set up and I can actually make sure that these urls are matching up with the current ones. There are one or more directories that do need redirection and that was the reasoning behind explicitly listing those that did not.

But, on the other hand, your css etc directories don't contain any pages, do they? If they don't, it's even easier-- for both you and the server-- to say...
The majority of the directories do not contain pages. I can always start out with the general rule referencing the directories and see how it functions. If need be then I can get more specific.

Thank you so much for your help on the regexes and rewrites! It really helps to have a knowledgeable pair of eyes dissect, point out mistakes, and make suggestions and improvements.

12:54 pm on Apr 30, 2015 (gmt 0)

New User


Quick update. The internal rewrites are all working smoothly. I'm currently working on the substitutions. Once they are set up, I will check the redirects to avoid the dreaded duplicate content.

I am running into a couple issues with the substitutions.

I originally wanted to be able to substitute by looking for html tags as seen below.
#Substitution to example.com/category
Substitute s|<a\ href="(.*)category\.php">|<a href="$1/category">|
With my limited knowledge, all I can say is that the server did not like the escaped space. When I tried beginning with the href=,
#Substitution to example.com/category
Substitute s|href="(.*)(category)\.php">|href="$1$2">|
things worked like a charm for static links.
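
A possible explanation for the escaped space, offered as an assumption and not verified on 2.2.15: Substitute appears to receive its argument through Apache's generic directive parser, which splits on whitespace and does not treat backslash-space as an escape the way mod_rewrite's own parser does. If that is the cause, a pattern containing a space has to be passed as one quoted argument. A sketch with single quotes on the outside, so the double quotes inside need no escaping:

```apache
AddOutputFilterByType SUBSTITUTE text/html

# The whole s|pattern|replacement|flags argument is wrapped in single quotes;
# the space after <a and the inner double quotes are then taken literally.
# With double quotes on the outside, each inner quote would be written as \"
Substitute 's|<a href="/category\.php">|<a href="/category">|i'
```

This would bring back the tag-anchored form from the first attempt, if matching on the full <a href= prefix is still wanted.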

One of the issues I'm bumping up against is links in 'facets' - checkbox or dropdown choices. These links are generated dynamically with php and appear inside custom html tags. This hearkens back to my original notion of substitution by tags which didn't work out so well. Perhaps the solution is as simple as adding a capture (.*) for a querystring on the end of the existing substitution entry. Perhaps this is a case for the buckets I was curious about in my OP.

The dropdown facets appear in the response as:
<option value="/category.php?parameter1=value1">
<option value="/category.php?parameter1=value2">
<option value="/category.php?parameter1=value3">


so phranque will come around by and by to deal with questions specific to mod_substitute.

I'm looking forward to this!

4:53 pm on Apr 30, 2015 (gmt 0)

New User


I worked out the dropdown facets

It was only substituting the last <option, and that was due to the wildcard (.*) in the pattern. When I changed it to (/?) and fixed another small issue, it substituted for all of them.
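
That behaviour is consistent with greedy matching: mod_substitute works through the response line by line, and a (.*) placed ahead of category\.php swallows everything up to the last occurrence on the line, so only that one is rewritten. A sketch of a tighter pattern for the option values, assuming the markup is exactly as posted above and that /category/value is the wanted friendly form (the thread does not say what the final URLs look like):

```apache
# [^"]+ cannot run past the closing quote of the attribute,
# so each <option> on the line is matched and rewritten separately
Substitute 's|value="/category\.php\?parameter1=([^"]+)"|value="/category/$1"|'
```

A negated character class like this is generally a safer default than .* whenever several candidates can appear on one line of output.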

I did manage to get my other potential hangup taken care of, but I came across another problem.

The site is responsive, and some of the substitutions don't work when in full-blown 'desktop' but work when the site gets to mobile dimensions. It's easy enough to write the desired urls in the handful of places, but I just thought it was odd that it was occurring in the first place.

5:58 pm on Apr 30, 2015 (gmt 0)

Senior Member from US

lucy24


First explanation that comes to mind: Process A is happening before process B, and you need it to happen after process B. I realize that was an awfully generic answer, but it's the line of inquiry I'd follow.

8:21 pm on Apr 30, 2015 (gmt 0)

New User


I'm picking up what you're putting down. Thanks for the tip!