
Apache Web Server Forum

    
Secure 301 Redirect that doesn't create infinite loop
kimmiem




msg:4636074
 3:10 am on Jan 9, 2014 (gmt 0)

This is my first .htaccess file redirect, so I appreciate help making sure it's done right and it's secure. I needed to 301 redirect the non-www to the www, and 301 redirect index.html and default.asp to the homepage 'example.org'. So far, everything seems to be working great. Here is the code:

RewriteEngine On
RewriteCond %{THE_REQUEST} /index\.html? [NC]
RewriteRule ^(.*/)?index\.html?$ /$1 [R=301,L]
RewriteCond %{THE_REQUEST} /default\.asp? [NC]
RewriteRule ^(.*/)?default\.asp?$ /$1 [R=301,L]
RewriteCond %{HTTP_HOST} !^www\..* [NC]
RewriteRule ^(.*) http://www.%{HTTP_HOST}/$1 [R=301,L]

Is there any way to improve it for security? Will I have any problems with this code if there is a WordPress blog in a subdirectory on the site? (I read somewhere that WP redirects everything to index.php?) I was concerned initially about creating an infinite loop, but when tested with DNS Tools, everything checks out.

My only other concern is that there are 2 redirects for the homepage if I start without the www - one drops the index.html or default.asp, the other adds the www. Is there code that can do this in one step?

Thanks so much! Believe it or not, I spent hours figuring this out!

[edited by: phranque at 4:36 am (utc) on Jan 9, 2014]
[edit reason] unlinked url for clarity [/edit]

 

kimmiem




msg:4636083
 3:49 am on Jan 9, 2014 (gmt 0)

Actually, the final line should look like this: RewriteRule ^(.*) http://www.%{HTTP_HOST}/$1 [R=301,L]

As for my last question, I just read this in the forum: 'Always give the full protocol-plus-domain in the target of a redirect. Otherwise some requests will get redirected twice if they came in asking for the wrong form of the domain name.' I suspect that helps, but I'm not sure how to implement it.

[edited by: phranque at 4:37 am (utc) on Jan 9, 2014]
[edit reason] unlinked url for clarity [/edit]

kimmiem




msg:4636084
 3:53 am on Jan 9, 2014 (gmt 0)

Darn it, the final line isn't showing correctly.
Instead of this [%{HTTP_HOST}...]
it should be http://www.%{HTTP_HOST}/$1

I learned to preview;)

kimmiem




msg:4636093
 4:37 am on Jan 9, 2014 (gmt 0)

Maybe I can use the OR flag to consolidate the first 2 conditions since the rule is the same?

Should I only be using the L flag once?

To solve the final question in my original post and avoid 2 redirects, can I change

RewriteRule ^(.*/)?index\.html?$ /$1 [R=301,L]
to
RewriteRule ^(.*/)http://www.%{HTTP_HOST}/$1 [R=301,L]

(but with a space before the http)?

I'm learning as I go, trying to figure this out as much as I can - sorry for all the additional messages and thanks.

phranque




msg:4636094
 5:01 am on Jan 9, 2014 (gmt 0)

welcome to WebmasterWorld, kimmiem!


the problem with using %{HTTP_HOST} is it will serve whatever hostname was requested.
if your server is accepting requests for http://example.com/ and also http://www.example.com/ then only one of those should be the canonical hostname and requests for the non-canonical hostname should be 301-redirected to the other hostname.


I just read this in the forum: 'Always give the full protocol-plus-domain in the target of a redirect. Otherwise some requests will get redirected twice if they came in asking for the wrong form of the domain name.' I suspect that helps, but I'm not sure how to implement it.

whenever you have an external redirect the target should provide the canonical protocol and hostname.
assuming www.example.com is your preferred hostname, your redirects would look like this:

RewriteEngine On

# redirect user agent requests for index.html or default.asp in any directory to directory index (trailing "/")
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*(index\.html|default\.asp)\ HTTP/
RewriteRule ^(([^/]+/)*)(index\.html|default\.asp)$ http://www.example.com/$1 [R=301,L]

# redirect non-canonical hostname variants to www.example.com
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
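
For anyone who wants to convince themselves off the server that these two rulesets give a single redirect and no loop, here is a rough Python sketch of the same logic. It is only an approximation: the function name `redirect_target` is made up for illustration, the regular expressions are copied from the rules above, query strings are ignored, and Python's `re` module stands in for Apache's regex engine.

```python
import re

CANONICAL = "http://www.example.com/"

# Same patterns as the RewriteCond/RewriteRule pairs above.
INDEX_REQUEST = re.compile(r"^[A-Z]{3,9}\ /([^/]+/)*(index\.html|default\.asp)\ HTTP/")
INDEX_PATH = re.compile(r"^(([^/]+/)*)(index\.html|default\.asp)$")
CANONICAL_HOST = re.compile(r"^(www\.example\.com)?$", re.IGNORECASE)


def redirect_target(method, host, path, protocol="HTTP/1.1"):
    """Return the 301 target for a request, or None if no rule fires.

    `path` is what a RewriteRule in .htaccess sees: the URL path
    without its leading slash. (Hypothetical helper, not an Apache API.)
    """
    request_line = f"{method} /{path} {protocol}"

    # Ruleset 1: drop index.html / default.asp and go straight to the
    # canonical hostname, so no second redirect is needed.
    path_match = INDEX_PATH.match(path)
    if path_match and INDEX_REQUEST.search(request_line):
        return CANONICAL + path_match.group(1)

    # Ruleset 2: everything else that arrived on a non-canonical hostname.
    if not CANONICAL_HOST.search(host):
        return CANONICAL + path

    return None


# One hop from the "worst case" request to the canonical URL:
first_hop = redirect_target("GET", "example.com", "index.html")
# Requesting the redirect target triggers nothing further, so no loop:
second_hop = redirect_target("GET", "www.example.com", "")
```

Because the first ruleset already names the canonical hostname in its target, a request for example.com/index.html never reaches the second ruleset.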

lucy24




msg:4636107
 6:38 am on Jan 9, 2014 (gmt 0)

Sanity-saving rule: leave a blank line after each RewriteRule, so you can see where one ruleset (rule + preceding conditions) ends and the next one begins. Blank lines have no syntactic meaning in mod_rewrite (or, I think, anywhere else in apache).

RewriteCond %{THE_REQUEST} /index\.html? [NC]
RewriteRule ^(.*/)?index\.html?$ /$1 [R=301,L]
RewriteCond %{THE_REQUEST} /default\.asp? [NC]
RewriteRule ^(.*/)?default\.asp?$ /$1 [R=301,L]

Is "default.asp" only used as a directory-index name? If so, you can probably consolidate the rules as ... oops, n/m, phranque said that already.

You only need the "html?" form if both "index.html" and "index.htm" actually occur on your site. Or, I guess, if you get a lot of type-ins from diehard computer geeks who supply a filename when there isn't one. Otherwise it is perfectly OK to return a 404 for imaginary filenames.

In your index redirect, the exact form of the condition looking at {THE_REQUEST} depends partly on-- again-- what filenames actually occur. If you do not have, and never will have, any other files or directories containing the string "index" or "default" then you could get away with

RewriteCond %{THE_REQUEST} index|default

In the body of the rule
RewriteRule ^(([^/]+/)*)(index\.html|default\.asp)$ et cetera

you could even leave off the closing anchor. That way you'll also grab requests with trailing garbage. That's assuming for the sake of discussion that you will never have directories named "index.html" or similar. And, hm, may as well express the capture as [^./] to get you out of there a little sooner. Mine tend to say \w instead, but that's only because hyphens give me the fantods.

kimmiem




msg:4636189
 4:15 pm on Jan 9, 2014 (gmt 0)

First of all, thank you! Yesterday was my first attempt at .htaccess or rewriting or anything outside of html, css, or basic php so bear with me, thanks. The confusing thing is that there is not one solution, but many. Mind if I ask for clarification so I understand what I'm doing for future reference?

phranque: Double redirect problem solved clearly, thank you! There are however symbols in your code that I don't understand the meaning of.

1.I'm assuming this [A-Z]{3,9} means any letter from a to z or number from 3 to 9, but why not 0,1,or 2?

2. I understand ^ starts the pattern, but do I not need $ to end it in RewriteCond?

3. I don't entirely understand the purpose of this: \ /([^/]+/)* OR \ HTTP/

lucy24: thank you for clarifying things as well!

4.I love the idea of getting out sooner:) How does [^./] do that?

5.From your explanation, should my rewrite look like this?
RewriteCond %{THE_REQUEST} index|default
RewriteRule ^(([^/])*)(index|default) http://www.example.com/$1 [R=301,L]

6. And why drop HTTP/ from phranque's example?

Thx

phranque




msg:4636198
 4:59 pm on Jan 9, 2014 (gmt 0)

1.I'm assuming this [A-Z]{3,9} means any letter from a to z or number from 3 to 9, but why not 0,1,or 2?

actually it's from A to Z.
an HTTP Request starts with a method such as GET or POST.
GET is the shortest possible method - don't remember at the moment what the longest is but it's 9 characters, all uppercase.

2. I understand ^ starts the pattern, but do I not need $ to end it in RewriteCond?

once you've matched what looks like a url preceded by a space and an HTTP Request method and followed by a space and "HTTP/" you're good to go without getting to the end.

3. I don't entirely understand the purpose of this: \ /([^/]+/)* OR \ HTTP/

"\ /([^/]+/)*" matches a blank (the backslash escaped the literal character that follows) followed by a slash followed by zero or more "path segments with trailing slashes"
"\ HTTP/" matches a blank followed by HTTP followed by a slash.


How does [^./] do that?

that character class matches "not a dot or slash"
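
A quick way to see what each piece of that RewriteCond pattern does is to run it against a few request lines. The sketch below uses Python's `re` module as a stand-in for Apache's regex engine, and the sample request lines are invented for illustration. PROPPATCH (a WebDAV method) is used as an example of a nine-letter method.

```python
import re

# phranque's RewriteCond pattern, unchanged.
pattern = re.compile(r"^[A-Z]{3,9}\ /([^/]+/)*(index\.html|default\.asp)\ HTTP/")


def matches(request_line):
    """True if the pattern matches the given request line."""
    return pattern.search(request_line) is not None


# {3,9} is a repeat count (3 to 9 uppercase letters), not the digits 3-9.
short_method = matches("GET /index.html HTTP/1.1")
long_method = matches("PROPPATCH /a/b/default.asp HTTP/1.0")

# [A-Z] is uppercase only, so a lowercase method does not match.
lowercase_method = matches("get /index.html HTTP/1.1")

# "\ HTTP/" pins index.html to the end of the path: here it is
# followed by "/page", so the pattern does not match.
not_final_segment = matches("GET /index.html/page HTTP/1.1")
```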

lucy24




msg:4636235
 8:24 pm on Jan 9, 2014 (gmt 0)

I like people who ask questions :) It means they're planning to learn.

2. I understand ^ starts the pattern, but do I not need $ to end it in RewriteCond?

You may have misunderstood the purpose of ^ and $. In mod_rewrite, there is only one syntactic element, and that's the blank space " ". That's why literal spaces have to be escaped \  in mod_rewrite, though this isn't part of ordinary Regular Expressions syntax. So ^ doesn't mean "Now we're starting on the request / URL path / body of the rule"; the preceding blank space has already indicated this. And the same for $ at the end. Instead, the anchors mean "The text we're currently looking at must begin (or end) in such-and-such a way".

In the body of a rule, an opening anchor means "this is the very first thing in the URL, so if you don't see it right away you can stop looking". A closing anchor in the body of a rule is most often used when looking at an extension; it's rarely necessary otherwise. Anchors are also necessary when a capture pattern involves a negative like [^/] "any non-slash" or [^.] "any non-period". If you don't use the anchor, a Regular Expression-- anywhere, not just in mod_rewrite within Apache --will start capturing after the part that doesn't fit. Or stop capturing when it meets a non-match. (This part depends on RegEx engine; I don't know which one Apache uses if it's got a choice, but the point is that you don't want the tool making the decision for you.) So:

[^.]+
= "the part of the text before or after the period, but not both"

^[^.]+$
means "the entire text, but only if it contains no period anywhere"
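
The difference is easy to see by trying it. This is a small sketch in Python, whose `re` module behaves the same way as mod_rewrite for these simple patterns; the sample strings are made up, and the anchored pattern is written as `^[^.]+$`, with the `+` so that it can cover the whole text.

```python
import re

with_period = "index.html"
without_period = "index"

# Unanchored: the engine just grabs the first run of non-period characters.
first_run = re.search(r"[^.]+", with_period).group(0)

# The same pattern can equally match the part after the period.
every_run = re.findall(r"[^.]+", with_period)

# Anchored at both ends: matches only if the whole text has no period.
anchored_with_period = re.search(r"^[^.]+$", with_period)
anchored_without_period = re.search(r"^[^.]+$", without_period)
```

Here `first_run` is "index", `every_run` is the two pieces "index" and "html", and only the period-free string matches the anchored pattern.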

3. I don't entirely understand the purpose of this: \ /([^/]+/)* OR \ HTTP/

See above. Literal spaces have to be \ escaped because a space has syntactic meaning.
[^/] = any non-slash character
[^/]+/ = one or more non-slashes followed by a slash, i.e. one directory in an URL path. Note here that the / has no special meaning in mod_rewrite, though it does in a few Apache mods and maybe in other languages you know. Here it's just the literal / character.

4.I love the idea of getting out sooner:)! How does [^./] do that?

[^/] = any non-slash. The last part of an URLpath, such as "index.html", matches this pattern, so your RegEx engine then has to backtrack "Oh, oops, I was supposed to leave room for 'index.html'"
[^/.] = anything that's neither a slash nor a period. The RegEx engine will still try to pick up "index" -- there's no way to prevent this --but then as soon as it hits the . before "html" it stops short and begins backtracking.

This is assuming for the sake of discussion that you have no literal periods . in your directory names. Periods are perfectly legal-- Apache itself has them all over the place-- but it saves a ### of a lot of trouble if the only place you ever use a . is as an extension-delimiter. (Also of course in your hostname, but that's not part of the path-- the piece a RewriteRule looks at-- so it doesn't matter.)

5.From your explanation, should my rewrite look like this?
RewriteCond %{THE_REQUEST} index|default
RewriteRule ^(([^/])*)(index|default) http://www.example.com/$1 [R=301,L]

The capture looks like this:
^(([^/]+/)*)(index|default)
This is the minimalist form. It can be used if the elements "index" and "default" never occur anywhere in your URL paths except in the "index.html" or "default.asp" filenames. If you happen to have something like a directory called /index/ or maybe /defaultwidgets/ then you need to include the extension.

6. And why drop HTTP/ from phranque's example?

Again, this is the minimalist version. You only need to match things that actually occur, so you can sometimes save time by being slapdash.


%{THE_REQUEST} is the part you see in your raw logs between one set of quotation marks. In

66.249.84.204 - - [08/Jan/2014:02:03:19 -0800] "GET /favicon.ico HTTP/1.1" 200 1750 "-" "Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google favicon"

The request is the piece
GET /favicon.ico HTTP/1.1

All of these Regular Expressions will match the request in mod_rewrite:

^GET\ /favicon\.ico\ HTTP/1\.1$
GET\ /favicon.ico\ HTTP/1.1

(here with intentional but non-lethal mistake of non-escaped periods)
[A-Z]{3}\ /favicon\.ico\ HTTP/1\.[01]
[A-Z]{3,9}\ /\w+\.ico\ HTTP/1\.[01]
[A-Z]{3,9}\ /[a-z]+\.ico\ HTTP/1\.[01]
favicon\.ico
\.ico
HTTP/1\.1
[A-Z]{3,9}\ /[^.\s]+\.\w+

et cetera et cetera. Some of those patterns will also match other requests. When looking at %{THE_REQUEST}, the function of [A-Z]{3,9}\ / is to delimit the beginning of the URL path, since it isn't the very first thing in the request line.
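
To see how loose some of these patterns are, they can be tried against a second request line. The sketch below is Python rather than mod_rewrite, uses the first eight patterns from the list, and the second request line is invented for illustration.

```python
import re

request = "GET /favicon.ico HTTP/1.1"
other_request = "POST /img/favicon.ico HTTP/1.1"  # made-up second request

patterns = [
    r"^GET\ /favicon\.ico\ HTTP/1\.1$",
    r"GET\ /favicon.ico\ HTTP/1.1",
    r"[A-Z]{3}\ /favicon\.ico\ HTTP/1\.[01]",
    r"[A-Z]{3,9}\ /\w+\.ico\ HTTP/1\.[01]",
    r"[A-Z]{3,9}\ /[a-z]+\.ico\ HTTP/1\.[01]",
    r"favicon\.ico",
    r"\.ico",
    r"HTTP/1\.1",
]

# Every pattern matches the favicon request from the log line.
all_match = all(re.search(p, request) for p in patterns)

# Only the loosest ones also match the unrelated POST request.
also_match_other = [p for p in patterns if re.search(p, other_request)]
```

The patterns that include the method and the leading slash reject the second request; the three short ones accept it, which is the trade-off between minimal and specific.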


Reminder! (You may already know this.) The Request doesn't necessarily mean that the human user asked for the file. That's why a browser is called a User Agent. You ask for a page, and the browser on your behalf asks for all the stylesheets and javascript and images ... and ads and analytics and other things that the human may have no idea they're getting ;) But the Request doesn't include SSIs or php includes or error documents or directory-index files or, most importantly, files resulting from any internal rewrite. Some of these can be excluded with the [NS] flag; others require looking at %{THE_REQUEST}.

phranque




msg:4636323
 6:14 am on Jan 10, 2014 (gmt 0)

And why drop HTTP/ from phranque's example?

Again, this is the minimalist version. You only need to match things that actually occur, so you can sometimes save time by being slapdash.

you typically want your regular expression to be as specific as possible while still matching the intended target(s).

lucy24




msg:4636346
 7:57 am on Jan 10, 2014 (gmt 0)

In practice, the reason you might need HTTP at the end is the same reason you need [A-Z]{3,9} at the beginning. It's to delimit the end of the requested URI. You can't have an escaped \ space at the end of a line (yes, I have tested this, ouch) so you have to include at least some of what comes after the space.

kimmiem




msg:4636731
 7:46 pm on Jan 11, 2014 (gmt 0)

WOW this is a lot of new knowledge. But I feel like I have a much better understanding of the reasoning behind the code now. Is this a good final version?

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/.]+/)*(index|default)\ HTTP/
RewriteRule ^(([^/.]+/)*)(index|default) http://www.example.com/$1 [R=301,L]


RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]


lucy24: Notice I added [^/.] to the capture as you mentioned. Did I add it in the right place? to RewriteCond too? And I removed $.

phranque: Does this combination of both of your suggestions keep the code specific while achieving the objective of one redirect instead of two (my initial problem) and with a more consolidated version than my original?

Was the [NC] dropped from your examples for security reasons? I read that had the potential to be abused.

Is this EXACTLY what I use because I've had enough fun with 500 errors for one site;)

Do you suggest any other additions to .htaccess for security on WordPress sites? I've read about bot-blocking code etc. Is there any way to block gosh-darned comment spam? And I read to chmod .htaccess to 644, which I did - is it secure now?

Thanks again for your time and help.

kimmiem




msg:4636732
 7:49 pm on Jan 11, 2014 (gmt 0)

One last question: What are the spaces around /([^/.]+/)*(index|default) for? start and end?

lucy24




msg:4636736
 7:58 pm on Jan 11, 2014 (gmt 0)

(index|default)\ HTTP/

If you're including the " HTTP/" element then you must give the exact extension, because it's part of the request string:

(index\.html|default\.asp)\ HTTP
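
A short check makes the failure visible. This sketch uses Python's `re` module in place of Apache, with a sample request line invented for illustration; the two patterns are the condition with and without the extensions.

```python
import re

request = "GET /index.html HTTP/1.1"

without_extension = r"^[A-Z]{3,9}\ /([^/.]+/)*(index|default)\ HTTP/"
with_extension = r"^[A-Z]{3,9}\ /([^/.]+/)*(index\.html|default\.asp)\ HTTP/"

# After "index" the request continues with ".html", not " HTTP/",
# so the first pattern can never match a real index.html request.
fails = re.search(without_extension, request) is None

# Spelling out the extension lets "\ HTTP/" line up with the request.
works = re.search(with_extension, request) is not None
```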

What are the spaces around /([^/.]+/)*(index|default) for? start and end?

That will teach me to make overly long posts. Information gets buried.
In mod_rewrite, there is only one syntactic element, and that's the blank space " ".

RewriteRule
SPACE
pattern
SPACE
target
SPACE
flags

RewriteCond
SPACE
aspect-of-request-you're-looking-at
SPACE
pattern (or selected other code words such as -f)
SPACE
flags


Edit:
include [NC] only when it's truly necessary. When you're typing up the rule, it's only a couple of keystrokes-- a single mouseclick if you're doing a case-insensitive search. But for the computer it's the difference between searching for the exact text
index.html
and searching for
[Ii][Nn][Dd][Ee][Xx].[Hh][Tt][Mm][Ll]

Or, in a more common case:
\.jpg
means just that. But
\.jpg [NC]
doesn't just mean ".jpg" or ".JPG" with an outside chance of ".Jpg". It means
.jpg
.Jpg
.jPg
.jpG
.JPg
.JpG
.jPG
.JPG
because the computer doesn't know that once you've found .j, you do not have to consider the possibility of PG.

phranque




msg:4636845
 10:38 am on Jan 12, 2014 (gmt 0)

Does this combination of both of your suggestions keep the code specific while achieving the objective of one redirect instead of two (my initial problem) and with a more consolidated version than my original?

that first RewriteCond won't match as described by lucy24

Was the [NC] dropped from your examples for security reasons?

it was done for url canonicalization reasons.
hostnames are case-insensitive, but everything else in the url is case-sensitive.

And I read to chmod .htaccess to 644, which I did - is it secure now?

it partly depends on who "owns" the file and how secure your server is.
it also depends on what response you get when you request http://(www.)example.com/.htaccess
it should be a 4xx status code.

kimmiem




msg:4652391
 2:44 am on Mar 9, 2014 (gmt 0)

Ok it's been a while...and I still need to update that code...

So is this a good final answer? Thanks again.

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/.]+/)*(index\.html|default\.asp)\ HTTP/
RewriteRule ^(([^/.]+/)*)(index|default) http://www.example.com/$1 [R=301,L]


RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

kimmiem




msg:4652399
 2:56 am on Mar 9, 2014 (gmt 0)

You know what? I just went ahead and tried it and it worked! Thanks again for the top-notch help:)

phranque




msg:4652417
 6:36 am on Mar 9, 2014 (gmt 0)

that looks pretty good in general but i haven't reread the thread to see if any specific issues have slipped through.

lucy24




msg:4652421
 7:04 am on Mar 9, 2014 (gmt 0)

I, on the other hand, have just reread the whole thing because it was two months ago and nobody can possibly be expected to remember that far back ;)

Your final two rulesets certainly look clean. I don't think "example.com" and "EXAMPLE.COM" formally count as different domains, unlike with-and-without www. So you can keep the [NC] if it makes you happy.

kimmiem




msg:4652423
 7:31 am on Mar 9, 2014 (gmt 0)

Oh good, I thought it was just me that had to strain a little to remember what it was all about;) Actually I did mean to take off the [NC] after the points you both made. I'll go ahead and do that. I actually made the silly mistake of not changing out 'example' for the real domain, but that only lasted a minute;) classic newb mistake I bet

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved