RewriteEngine On
RewriteBase /
#1 - Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.html?$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.
#2 - Redirect index.html or .htm in any directory to root of that directory and force www
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index\.html?[^\ ]*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]
#3 - Redirect all .html requests to .htm on canonical host.
RewriteRule ^([^.]+)\.html$ http://www.example.com/$1.htm [R=301,L]
#4 - Redirect direct client request for old URL with .htm extension
# to new extensionless URL if the .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.htm\ HTTP/
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(([^/]+/)*[^.]+)\.htm$ http://www.example.com/$1 [R=301,L]
#5 - Redirect any request for a URL with a trailing slash to extensionless URL
# without a trailing slash unless it is a request for an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)/$ http://www.example.com/$1 [R=301,L]
#6 - Redirect requests for non-www/ftp/mail subdomain to www subdomain.
RewriteCond %{HTTP_HOST} !^(www|ftp|mail)\.example\.com$
RewriteRule ^([^.]+)$ http://www.example.com/$1 [R=301,L]
#7 - Internally rewrite extensionless URL request
# to .htm file if the .htm file exists
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^./]+)$ /$1.htm [L]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
The only issue I found so far is that something like example.com/bogus. first gets a 301 to example.com/bogus, and then example.com/bogus gets its 404.
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ [NC]
...
It does work - and for multiple trailing chars after I got daring and added the + before the $ in the pattern. But am I skating on thin ice here?
Am I the only one who noticed that the OP's original pattern was simpler *and* already worked correctly *and* didn't ignore a huge swath of valid URLs?
# Redirect URL containing valid characters to remove trailing invalid characters
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt|any-other-file-type-thats-not-a-page)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^([/0-9a-z_\-]*)[^/0-9a-z_\-]+$ http://www.example.com/$1 [NC,R=301,L]
# Redirect URL containing valid characters to remove trailing invalid characters
RewriteCond %{REQUEST_URI} !index\.com
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt|any-other-file-type-thats-not-a-page)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule ^((?:[^/.]+/)*(?:[^/.]+(?:\.pdf)?)?)[^a-zA-Z0-9].* http://www.example.com/$1 [R=301,L]
# Redirect URL containing valid characters to remove trailing punctuation
RewriteRule ^(^\ )[^/0-9a-z]+$ http://www.example.com/$1 [NC,R=301,L]
I think the problem with my first version of #11 is that it's greedy and promiscuous.
Take a look at every framework, every CMS, basically every open source project out there, and I think you'll find that .* is used liberally.
When you really start digging into regex processing and the recursion often caused by the use of .*, especially when you're not "just matching everything" on the line, I think you'll find that many times almost anything else is more efficient.
So far as I can tell, I'm the only person on these forums who has ever bothered to actually run a benchmark.
Where did you post the benchmarking test and results?
And if it was an .htaccess test, which botnet did you use to slam the server to simulate a large number of simultaneous requests for a large variety of resources?
It looks like our profile pages list only a limited number of past posts. Is there a way to see my full list of posts?
Where the performance gains are usually made is when there is not a match, because by "breaking the check" quickly it's possible to save hundreds, or thousands, of "possible match checks" over "grab and test it all".
/some/path,". Which means of course that a non-matching URL would be just "/some/path". Here's Lucy's non-.* pattern:
^([\w/-]+(\.\w+)?)?[^a-zA-Z\d].*
^([/0-9a-z_\-]*)[^/0-9a-z_\-]+$
The [\w/-]+ or [/0-9a-z_\-]* parts, respectively, right at the beginning would still match a non-matching URL all the way to the end before the engine realizes it needs to backtrack. It hasn't solved or improved anything. I suspect the .* pattern will be the performance winner here.
In the ruleset above, we don't ever hit the condition unless there's something other than a letter, number or / at the end of a line, so if there's not an invalid character we have a single pass break and if there is, then we just grab everything up to it in a single pass and redirect.
Very clever indeed. Later today I'll test it with siege and see what kind of difference it makes.
It works, but it also redirects css files. It's weird, I added the same RewriteCond as above rule to block out css requests, but it still redirects css files.
Thanks, and I'm definitely interested in knowing the results of what you find.
The difference could have to do with the .htaccess having to be compiled for every request
It would be horrible to discover that you're better off using a bad RegEx in htaccess...
An optimization catches some of the more simple cases such as (a+)*b where a literal character follows. Before embarking on the standard matching procedure, PCRE checks that there is a "b" later in the subject string, and if there is not, it fails the match immediately. However, when there is no following literal this optimization cannot be used. You can see the difference by comparing the behaviour of (a+)*\d with the pattern above. The former gives a failure almost instantly when applied to a whole line of "a" characters, whereas the latter takes an appreciable time with strings longer than about 20 characters.
[php.net...]
So for example, say you want to optimize a sub-expression like ".*a". If the character a is located near the end of the input string it is better to use the greedy quantifier "*". If the character is located near the beginning of the input string it would be better to use the reluctant quantifier "*?" and change the sub-expression to ".*?a". Generally, I've noticed that the lazy quantifier is a little faster than its greedy counterpart.
Another tip is to be specific when writing a regular expression. Use general sub-constructs like ".*" sparingly because they can backtrack a lot, especially when the rest of the expression can't match the input string. For example, if you want to retrieve everything between two as in an input string, instead of using "a(.*)a", it's much better to use "a([^a]*)a".
[javaworld.com...]
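To make the quoted tip concrete, here is a minimal sketch in Python. Python's re module is only a stand-in here; Apache uses PCRE, but for patterns this simple the matching rules should be the same. The subject string is made up for illustration. Note that the greedy form and the negated class are not interchangeable once the input contains more than two "a" characters: they capture different text, so the swap is only safe when the shorter capture is what you want.

```python
import re

# Hypothetical input with three "a" characters, chosen for illustration.
subject = "xa123a456ax"

# Greedy: .* runs to the end of the string, then backtracks to the LAST "a".
greedy = re.search(r"a(.*)a", subject).group(1)

# Lazy: .*? expands one character at a time and stops at the FIRST following "a".
lazy = re.search(r"a(.*?)a", subject).group(1)

# Negated class: [^a]* can never run past an "a", so there is nothing to give back.
negated = re.search(r"a([^a]*)a", subject).group(1)

print(greedy)   # text between the first and last "a"
print(lazy)     # text between the first and second "a"
print(negated)  # same capture as the lazy form
```

This only shows what each form captures. It says nothing about which one is faster on a live server, which is the part that still needs measuring.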
It would be horrible to discover that you're better off using a bad RegEx in htaccess...
Easy to understand and easy to maintain are the next most significant.
RewriteCond %{REQUEST_URI} ^/([/0-9a-z_\-]*)
RewriteRule [^/0-9a-z]$ http://www.example.com/%1 [R=301,L]
I mean why match and store everything for every request when we can "implicitly match" (no storage, no back-tracking, works for all URL patterns) then check the end of the line for an invalid character and if we find one we can "grab the good stuff" from the beginning of the URL and redirect?
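For anyone who wants to try the idea without touching a live server, here is a rough simulation of that condition and rule pair in Python. It is a sketch under assumptions: Python's re stands in for PCRE, redirect_target is a made-up helper name, and the order of evaluation (rule pattern first, then the condition) follows how mod_rewrite is documented to work.

```python
import re

# Patterns copied from the RewriteCond / RewriteRule pair above.
VALID_PREFIX = re.compile(r"^/([/0-9a-z_\-]*)")  # RewriteCond %{REQUEST_URI}
BAD_LAST_CHAR = re.compile(r"[^/0-9a-z]$")       # RewriteRule pattern


def redirect_target(request_uri):
    """Return the redirect URL, or None when the rule would not fire.

    Hypothetical helper: in .htaccess the rule pattern sees the path
    without its leading slash, while REQUEST_URI keeps it.
    """
    path = request_uri[1:]
    # mod_rewrite tests the rule pattern first; conditions only run on a match.
    if BAD_LAST_CHAR.search(path) is None:
        return None
    cond = VALID_PREFIX.match(request_uri)
    if cond is None:
        return None
    # %1 is the first capture group of the last matched condition.
    return "http://www.example.com/" + cond.group(1)


for uri in ["/bogus.", "/some/path,", "/some/path", "/folder/"]:
    print(uri, "->", redirect_target(uri))
```

The clean URLs never reach the condition at all, which is the "single pass break" described above.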
Rather than excluding non-page extensions in a RewriteCond This becomes vastly easier when you've gone extensionless-- a detail I'd forgotten when I posted-- because then all page URLs come down to
^([^.]*)$
So don't put anything in a condition that you could put in the body of the rule. This particularly applies to conditions in the form "the requested URL is such-and-such".
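A quick way to see why the extensionless pattern makes the extension-excluding condition unnecessary: the pattern itself already rejects anything containing a dot. The sketch below uses Python's re as a stand-in for PCRE, and the sample paths and the is_page name are made up for illustration.

```python
import re

# The rule pattern from above: the whole path, with no dot anywhere in it.
PAGE = re.compile(r"^([^.]*)$")


def is_page(path):
    """Hypothetical helper: True when the rule pattern would match the path."""
    return PAGE.match(path) is not None


# Paths as a per-directory RewriteRule would see them (no leading slash).
for path in ["folder/page", "css/style.css", "images/logo.png", ""]:
    print(repr(path), is_page(path))
```

Requests for css, images and the like fail at the pattern, so no RewriteCond is ever evaluated for them.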
Yeah that is pretty slick. It works. I've been testing it for the past 20 mins or so. Thanks! I think this is a keeper.
RewriteCond %{REQUEST_URI} ^/([/0-9a-z_\-]*)
RewriteRule [^/0-9a-z]$ http://www.example.com/%1 [R=301,L]
That is, it can't fail. (I assume the - and _ that are present in the condition but absent from the rule are typos.)
They're in the condition and not the rule because they're valid URL characters, but not valid endings of a URL (I'm assuming) and I don't know the exact construction of the URLs on the site we're dealing with well enough to eliminate them... I'd limit the condition or the rule more if I could.
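One thing worth checking before relying on that mismatch: if a request ends in - or _, the rule fires (those characters are not in the rule's class) but the condition captures them (they are in the condition's class), so the computed target appears to be the URL that was just requested. The sketch below simulates this in Python as a stand-in for PCRE. The target helper and the sample URLs are hypothetical, and whether this loops on a real server is an assumption that should be tested. The revised pair further down the thread uses the same class in both places, so a trailing - or _ no longer triggers the rule there.

```python
import re

# First version: "-" and "_" allowed by the condition but not by the rule.
COND = re.compile(r"^/([/0-9a-z_\-]*)")
RULE = re.compile(r"[^/0-9a-z]$")


def target(request_uri):
    """Hypothetical helper: path the rule would redirect to, or None."""
    if RULE.search(request_uri[1:]) is None:
        return None
    cond = COND.match(request_uri)
    return "/" + cond.group(1) if cond else None


for uri in ["/page_", "/page-", "/page."]:
    result = target(uri)
    note = "same URL again" if result == uri else "changed"
    print(uri, "->", result, "(" + note + ")")
```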
# Redirect URL containing valid characters to remove trailing invalid characters
RewriteCond %{REQUEST_URI} ^/([/\w\-]*)
RewriteRule [^/\w\-]$ http://www.example.com/%1 [R=301,L]
Most of the rules in this thread are Insurance Rules: the kind you don't need until you need them. Doubled directory slashes, extraneous path info, punctuation at the end of an URL, some other stuff which I've forgotten. What the OP is trying to do is construct an htaccess for the ages, so that when he becomes the next YouTube he doesn't have to keep adding more rules to deal with typos.
Oh, wait. If you set up the rule to match only one character, then you can no longer use the self-same rule to get rid of extraneous path info. Once you've got an extension, anything after it will normally be garbage. But this is the thread that started out with mixed html and htm, wasn't it?
RewriteEngine On
RewriteBase /
#1 Redirect requests for old URLs to new URLs
RewriteRule ^old-page\.htm$ http://www.example.com/new-folder/new-page [R=301,L]
# Then repeat the above 80 times.
#2 Redirect index requests in any directory to root of that directory
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/]+/)*index(\.[a-z0-9]+)?[^\ ]*\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index(\.[a-z0-9]+)?$ http://www.example.com/$1? [NC,R=301,L]
#8 Redirect remaining .htm or .html requests to extensionless URL
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+\.html?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*[^.]+)\.html?$ http://www.example.com/$1 [NC,R=301,L]
#9 Redirect URLs containing valid characters to remove query string except for specific folders
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^?#\ ]*)\?[^\ ]*\ HTTP/
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
#11 Redirect URLs containing valid characters to remove trailing invalid characters
RewriteCond %{REQUEST_URI} ^/([/\w\-]*)
RewriteRule [^/\w\-]$ http://www.example.com/%1 [R=301,L]
#5 Redirect requests with trailing slash to extensionless URL if .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^/\ ]+/)*[^.\ ]+/\ HTTP/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^(([^/]+/)*[^.]+)/ http://www.example.com/$1 [R=301,L]
#6 Redirect requests for non-www and non-webmail subdomains to www subdomain
RewriteCond %{HTTP_HOST} !^(www|webmail)\.example\.com$ [NC]
RewriteCond %{HTTP_HOST} example\.com [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#13 Redirect https requests to http except for specific file types, folders, and file
RewriteCond %{SERVER_PORT} ^443$
RewriteCond $1 !\.(css|gif|jpe?g|bmp|png|js|ico|xml|txt)$ [NC]
RewriteCond $1 !^(shopping-cart-folder|site-stats-folder)/
RewriteCond $1 !^file1
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
#7 Internally rewrite extensionless URL requests to .htm file if .htm file exists
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /[^.]+[^./]\ HTTP/
RewriteCond %{REQUEST_FILENAME}.htm -f
RewriteRule ^([^.]+[^./])$ /$1.htm [L]
RewriteRule ^(([^/]+/)*[^.]+)\.html?$