Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

mod rewrite for domain and subdirectory changes
Redirecting requests from superseded bookmarks
Carob



 
Msg#: 4669640 posted 2:53 am on May 9, 2014 (gmt 0)

To accommodate changes in terminology, I have changed the domain name and various subdirectories in my web site. The old domain name 'oldexample.com' is a parked domain pointing to the same IP address as 'newexample.com', but without a redirection specified in cPanel (at present).

The .htaccess file I have in the public_html directory is as follows (edited down for brevity):
--------------------
Options +FollowSymlinks
Options -Indexes

<FilesMatch "\.[Hh][Tt][AaPpGg].+$">
Order allow,deny
Deny from all
</FilesMatch>

Addhandler application/x-httpd-php5 .html .php

RewriteEngine on
RewriteBase /

# Rule 1
# Block useless bots
RewriteCond %{HTTP_USER_AGENT} ^(.*)Baiduspider(.*) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)YandexBot(.*) [NC]
RewriteRule ^(.*)$ - [F,R=403,L]

# Rule 2
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^newexample\.com$ [NC]
RewriteRule ^(.*)$ http://newexample.com/$1 [R=301,QSA]

# Rule 3
# Redirect to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteCond %{REQUEST_URI} ^/index.html$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/afa/index.html$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/pf1/index.html$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/pf2/index.html$ [NC]
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301]

# Rule 4
# Redirect those seeking 'als' to 'afa' and 'sf1' to 'pf1' and 'sf2' to 'pf2' and 'sfa' to 'pf2'
RewriteCond %{REQUEST_URI} /(als|sf[12a]) [NC]
RewriteRule /als(.*)$ /afa$1 [NC,QSA,L]
RewriteRule /sf([1-2])(.*)$ /pf$1$2 [NC,QSA,L]
RewriteRule /sfa(.*)$ /pf2$1 [NC,QSA,L]

--------------------

Rule 2 is intended to direct requests from:
www.oldexample.com
oldexample.com
www.newexample.com
to newexample.com, and appears to work.

Rule 3 works, but I include it here only because it may have an influence on the working of Rule 4.

Rule 4 is the problem area. I wish to rewrite requests for /als to /afa and for /als/index.html to /afa/index.html, etc. If the domain of the request is any one of www.oldexample.com, oldexample.com, or www.newexample.com, and the request doesn't specify index.html, then the redirect works. If the domain is newexample.com, the result is a 404 Not Found error, and if index.html is specifically required, then domain/index.html/index.html results.

I have persisted in trying to solve these issues for quite some time, mostly referring to WebmasterWorld as the authoritative source, but I have not succeeded in fixing the .htaccess file. The 'experts' at my hosting service intervene when I have asked, but their changes haven't been any help and they may be out of their depth. A complicating issue is that redirects specified in my .htaccess file seem to be inherited into cPanel after a time, and I don't know at what level the cPanel redirects are called, or whether having redirects in two places is a problem.

I would very much appreciate some help.

 

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4669640 posted 5:29 am on May 9, 2014 (gmt 0)

# Rule 3
# Redirect to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteCond %{REQUEST_URI} ^/index.html$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/afa/index.html$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/pf1/index.html$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/pf2/index.html$ [NC]
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301]

But, but, but :: splutter :: why are any of these requests being allowed to pass through?

(a): Users have no business going to /index.html in the first place, so all this should be part of a broader index redirect
(b): With no domain-name canonicalization, and people already typing in "index.html" [NC], it seems as if you're laying yourself open to something like octuplicate content :(

Besides, all those conditions can be collapsed into the body of a single rule involving
^((?:one|two|three)/)?index\.html
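The collapsed pattern can be sanity-checked outside Apache with Python's re module (close enough to PCRE for this construct). Here afa/pf1/pf2 stand in for the placeholder directories, an assumption based on the URLs elsewhere in the thread; the tested strings have no leading slash, as mod_rewrite presents them in .htaccess context:

```python
import re

# One pattern replacing the four separate RewriteConds:
# an optional directory prefix, then index.html
pattern = re.compile(r"^((?:afa|pf[12])/)?index\.html")

assert pattern.match("index.html")                          # root index
assert pattern.match("afa/index.html").group(1) == "afa/"   # $1 keeps the dir
assert pattern.match("pf1/index.html")
assert pattern.match("pf2/index.html")
assert not pattern.match("other/index.html")                # unlisted dirs skip
```

The single capture means $1 in a rewrite target would carry "afa/" (or nothing), which is what allows the index.html part to be dropped from the redirect.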

Potentially fatal error: Rule 3 as posted is missing the [L] flag. I hope this is a typo. The [R] flag by itself never implies [L].

Rule 4 is the problem area. I wish to rewrite

Rewrite or redirect? The prose and the body of the rule say rewrite, but the # comment says redirect.

Never ever use [NC] in a rule that ends in [L] alone-- unless you're rewriting to a php script that will take care of any needed redirecting.

The wording of #4 makes me uneasy, because it's actually three rules. The RewriteCond applies only to the first. Blank lines (present or absent) have no effect; that's simply how mod_rewrite works. But here I don't see what the RewriteCond is even needed for; the body of the rule contains the same information.

RewriteCond %{HTTP_USER_AGENT} ^(.*)Baiduspider(.*) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.*)YandexBot(.*) [NC]
RewriteRule ^(.*)$ - [F,R=403,L]

The locutions ^.* and .*$ are never needed unless you're capturing. Otherwise, simply leave off the anchors. Where did R=403 come from? It's technically permissible, but [F] alone has the identical effect; that's what the flag is for. In fact, your server may become unhappy if you try to use both. And [F] carries an implied [L] so that isn't needed either.

Speaking of not needed: QSA is the default behavior unless you've added a new query string and need to retain the old one as well.

On the plus side, the quoted rules seem to be in the right order. Are there more that you didn't quote?

There's more, but it's late and I'm hungry and this will give you something to chew on. Oops. My fingers typed that against my better judgement.

Carob



 
Msg#: 4669640 posted 8:15 am on May 9, 2014 (gmt 0)

Thanks very much, Lucy, for your reply. Plenty to chew on!

The domain-name canonicalisation is attempted in Rule 2.
octuplicate content
I don't understand.

^((?:one|two|three)/)?index\.html
I don't understand "?:" in this context. Does the following rewrite condition do what you suggest:
RewriteCond %{REQUEST_URI} ^/((?|afa|pf[12])/)?(index\.html)? [NC]
Does "?" here represent nothing ("") as alternatives to "afa" and "pf[12]"?
Does "(index\.html)?" at the end answer your concern?

The lack of an [L] flag after Rule 3 is not a typo, but stems from my possible lack of understanding of its use. "The [L] flag causes mod_rewrite to stop processing the rule set. In most contexts, this means that if the rule matches, no further rules will be processed."

However, the conditions requiring Rules 3 and 4 can both occur at once, so even after Rule 3 is completed, I wish for Rule 4 to be checked. I thought having an [L] flag after Rule 3 would preclude Rule 4 being encountered if the rewriting event in Rule 3 was successful. I gather this understanding is not correct; can you please explain further?

Never ever use [NC] in a rule that ends in [L] alone
If the request in the first RewriteRule of Rule 4 had been for /ALS, would this be rewritten to /afa as required?

The wording of #4 makes me uneasy, because it's actually three rules.
I did at one stage try it as three separate rules, and that didn't work any better. But looking back over that stage of this development, I don't think I had the escapes correct.

On the plus side, the quoted rules seem to be in the right order. Are there more that you didn't quote?
No, there are no more rules. At least I did something right!

The edited Rules 1 to 4 are below, without the preliminaries that remain unchanged:
--------------------
# Rule 1
# Block useless bots
RewriteCond %{HTTP_USER_AGENT} .*Baiduspider.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} .*YandexBot.* [NC]
RewriteRule ^(.*)$ - [F]

# Rule 2
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^newexample\.com$ [NC]
RewriteRule ^(.*)$ http://newexample.com/$1 [R=301]

# Rule 3
# Rewrite to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
# RewriteCond %{REQUEST_URI} ^((?:one|two|three)/)?index\.html [NC]
RewriteCond %{REQUEST_URI} ^/((?|afa|pf[12])/)?(index\.html)? [NC]
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301,L]

# Rule 4
# Rewrite those seeking 'als' to 'afa' and 'sf1' to 'pf1' and 'sf2' to 'pf2' and 'sfa' to 'pf2'
RewriteRule /als(.*)$ /afa$1 [L]
RewriteRule /sf([1-2])(.*)$ /pf$1$2 [L]
RewriteRule /sfa(.*)$ /pf2$1 [L]

--------------------

Rule 3 contains your suggestion, Lucy, commented out, followed by my interpretation. I hope I'm not quite so far off the mark this time. Again, thanks for your help.

[edited by: phranque at 9:09 am (utc) on May 9, 2014]
[edit reason] unlinked urls/disabled graphic smileys [/edit]

lucy24




 
Msg#: 4669640 posted 8:09 pm on May 9, 2014 (gmt 0)

# RewriteCond %{REQUEST_URI} ^((?:one|two|three)/)?index\.html [NC]
RewriteCond %{REQUEST_URI} ^/((?|afa|pf[12])/)?(index\.html)? [NC]
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [R=301,L]

Never put something in a Condition that can go in the body of the rule. Here, you're forcing the server to evaluate every single request all the time, even though the rule is constrained to a specific set of URLs. Hence

RewriteRule ^((?|afa|pf[12])/)?(index\.html)?$ http://%{HTTP_HOST}/$1 [R=301,L,NC]

Here [NC] is OK because you're redirecting, so everyone will end up on the same page. But more importantly, what's %{HTTP_HOST} ? In the end, isn't there just one hostname? Put that in the target too.

If the request in the first RewriteRule of Rule 4 had been for /ALS, would this be rewritten to /afa as required?

No, and that's just the point: Any given content should come from only one URL. If someone is in fact requesting /ALS, they should first be redirected (301) to the correctly cased form. Does this happen often? If so, you may need to rewrite to a quick php page that fixes the casing and then issues the 301.

However, the conditions requiring Rules 3 and 4 can both occur at once, so even after Rule 3 is completed, I wish for Rule 4 to be checked. I thought having an [L] flag after Rule 3 would preclude Rule 4 being encountered if the rewriting event in Rule 3 was successful. I gather this understanding is not correct; can you please explain further?

You're correct on the concept but wrong on the implementation. The only time you leave off an [L] (explicit or implicit) is when a group of rules are meant to work as a package. This can only happen when the last rule is set up to execute all the time; otherwise the request will carry on and potentially meet other rules. And I can't think of any situation where you would do this with different categories of rules, such as mixing external redirect with internal rewrite. One or the other.

Right now it's a bit of a tangle. It may help to backtrack and explain in English what the #3 plus #4 package is intended to do. Then we can work out which rule goes where, and how they should all be worded.

.*Baiduspider.*

You don't need this. Just say Baiduspider. Incidentally, Yandex currently seems to be robots.txt compliant, so it shouldn't be necessary to bring out the heavy artillery. If you have personal experience of attempted violations, find the appropriate thread and post about it.

This in turn reminds me: You need a preliminary RewriteRule that says

RewriteRule ^robots\.txt - [L]

This goes before all other RewriteRules. The idea is to allow everyone to see robots.txt so they've got no excuse to say "Well, I wanted to comply but they wouldn't let me see it!" You need an exclusion like this for each module that issues 403s. So if you don't already have a <Files> envelope for robots.txt that says "Allow from all", add one.

phranque

WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4669640 posted 8:22 pm on May 9, 2014 (gmt 0)

i would simplify these 2 rulesets:
RewriteCond %{HTTP_USER_AGENT} .*Baiduspider.* [NC,OR]
RewriteCond %{HTTP_USER_AGENT} .*YandexBot.* [NC]
RewriteRule ^(.*)$ - [F]

RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^newexample\.com$ [NC]
RewriteRule ^(.*)$ http://newexample.com/$1 [R=301]



like this:
# baidu and yandex requests or spoofs get a 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|YandexBot) [NC]
RewriteRule . - [F]

# hostname canonicalization (use non-www)
RewriteCond %{HTTP_HOST} !^(example\.com)?$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

Carob



 
Msg#: 4669640 posted 4:59 am on May 10, 2014 (gmt 0)

Thank you both for your advice.

The .htaccess file now looks like this:

Options +FollowSymlinks
Options -Indexes

<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

<FilesMatch "\.[Hh][Tt][AaPpGg].+$">
Order Allow,Deny
Deny from all
</FilesMatch>

Addhandler application/x-httpd-php5 .html .php

RewriteEngine on
RewriteBase /

# Rule 1
# Block useless bots
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|YandexBot) [NC]
RewriteRule . - [F]

# Rule 2
RewriteCond %{HTTP_HOST} !^(newexample\.com)?$ [NC]
RewriteRule ^(.*)$ http://newexample.com/$1 [R=301,L]

# Rule 3
# Rewrite to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?|afa|pf[12])/)?(index\.html)?$ http://%{HTTP_HOST}/$1$2 [NC,R=301,L]

# Rule 4
# Rewrite those seeking 'als' to 'afa' and 'sf1' to 'pf1' and 'sf2' to 'pf2' and 'sfa' to 'pf2'
RewriteRule /als(.*)$ /afa$1 [L]
RewriteRule /sf([1-2])(.*)$ /pf$1$2 [L]
RewriteRule /sfa(.*)$ /pf2$1 [L]



However, a request for
http://newexample.com/sfa delivers
"404 Not Found: The requested URL /sfa was not found on this server." The same occurs even if the last rule is modified to:
RewriteRule /sfa(.*)$ http://%{HTTP_HOST}/pf2$1 [L]

So it seems that Rule 4 is not being encountered at all, which was why I had left off the [L] flag from the end of Rule 3 earlier.

My reason for Rule 3 was to avoid non-secure pages being served as https://... to the visitor coming from a secure page or entering a wrongly configured bookmark. For instance, requesting "https://www.oldexample.com/pf2" results in the URL bar displaying "https://www.oldexample.com/pf2" and in the browser window "This Connection is Untrusted ...". Alternatively, requesting "http://www.oldexample.com/pf2" results in the URL bar displaying "http://newexample.com/pf2" as required.

Because the conditions requiring Rules 3 and 4 can be encountered together, I wish to avoid having the event of success in rewriting from either rule stopping the other from being visited.

But more importantly, what's %{HTTP_HOST} ? In the end, isn't there just one hostname?
Over a transition period, there are old and new domain names pointing to the one dedicated IP address and the files there. I have referred to them as "oldexample.com" and "newexample.com". I want content to always be served with "newexample.com" as the host. Note that the "www.oldexample.com" host was not rewritten in the example I gave (second paragraph above) with the "This Connection is Untrusted ..." result when the prefix was "https://...".

A complicating issue is that redirects specified in my .htaccess file seem to be inherited into cPanel after a time, and I don't know at what level the cPanel redirects are called, or whether having redirects in two places is a problem.
When I first raised this post I mentioned the possible complicating issue of the rewrites in my .htaccess file appearing in cPanel as Domain Redirects. I don't understand why parts of my .htaccess file need to be duplicated there, or whether the duplication causes any problems. Can you comment?

Right now it's a bit of a tangle. It may help to backtrack and explain in English what the #3 plus #4 package is intended to do. Then we can work out which rule goes where, and how they should all be worded.
Both a tangle and incorrect functioning in that Rule 4 seems not to be encountered. I have tried to explain the need for both rules to operate. I hope you can help further ...
lucy24




 
Msg#: 4669640 posted 5:30 am on May 10, 2014 (gmt 0)

<FilesMatch "\.[Hh][Tt][AaPpGg].+$">
Order Allow,Deny
Deny from all
</FilesMatch>

I skipped this one before. Why is it so complicated? All you actually need is
<FilesMatch "^\.ht">
et cetera, no ending anchor. Internal requests are case sensitive unless you've got a weird server; if someone comes by asking for .HTACCESS a default 404 response will do just fine. In any case it's almost certainly redundant. Unless you have the worst host in the world, this rule will already be present in the config file, where it applies to all requests everywhere unless someone specifically overrides it. Try commenting-out the lines and requesting .htaccess or .HTACCESS in your browser; I seriously doubt you'll get in.

RewriteRule ^((?|afa|pf[12])/)?(index\.html)?$

Uh-oh, no, you've misunderstood something here. My bad too for missing a cut-and-paste typo. What I originally suggested was
^((?:afa|pf[12])/)?index\.html
The locution ?: means non-capturing group. It is used here to avoid nesting captures; the outer group is needed because the whole package including final directory slash may be absent.
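The capturing difference is easy to demonstrate in Python's re module, which uses the same (?:...) syntax. With the inner group non-capturing, $1 in the rewrite target maps cleanly to the single outer group; with a nested capture, the backreference numbering shifts:

```python
import re

nested = re.compile(r"^((afa|pf[12])/)?index\.html")    # two captures
flat   = re.compile(r"^((?:afa|pf[12])/)?index\.html")  # one capture

m = flat.match("pf1/index.html")
assert m.groups() == ("pf1/",)           # only the outer group: this is $1

m = nested.match("pf1/index.html")
assert m.groups() == ("pf1/", "pf1")     # inner capture would occupy $2
```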

I want content to always be served with "newexample.com" as the host.

Then spell it out in the target.

My reason for Rule 3 was to avoid non-secure pages being served as...

Understood. But the fix is not to omit a possibly crucial flag.

So it seems that Rule 4 is not being encountered at all

Rule 3 does not prevent rule 4 from executing. It simply delays it a little. As soon as the browser meets a redirect, it stops dead in its tracks and issues a fresh request. You can see it in logs as two consecutive requests that look identical. (Same thing you see when someone gets the hostname wrong.)

It looks like this:

#3
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:afa|pf[12])/)?(index\.html)?$ http://www.example.com/$1 [NC,R=301,L]

i.e. if there is a secure request for afa, pf1 or pf2, OR for the root, then redirect to the same page using the ordinary non-secure protocol. This rule concurrently omits an explicit "index.html" if it was present in the original request.

So the browser that originally asked for
https://www.example.com/afa/index.html
is now asking for
http://www.example.com/afa/

There is nothing in this rule that would even potentially apply to the requests covered in rule 4, since this set of rules lists an entirely different group of URLs:
Rewrite those seeking 'als' to 'afa' and 'sf1' to 'pf1' and 'sf2' to 'pf2' and 'sfa' to 'pf2'

I think there's something seriously wrong here. Rule 3 explicitly names /afa/ as part of an URL. So why is Rule 4 rewriting requests for /als/ to /afa/ ? You've now got two different URLs serving the same content.

Are you absolutely positive Rule 4 wasn't supposed to be a redirect? If yes, the whole thing becomes absurdly simple because you just swap the order of rules 3 and 4, making the former Rule 4 just a special case of Rule 3. ("Redirect to these new URLs, and while you're at it, make sure the protocol and port are correct.")

Edit: It is also possible that you've mixed up the "pattern" and "target" sides of a rewrite. It would not be the first time; in fact this may be one of the most commonly overlooked misunderstandings on the part of people who answer questions :(

phranque




 
Msg#: 4669640 posted 6:16 am on May 10, 2014 (gmt 0)

However, a request for http://newexample.com/sfa delivers
"404 Not Found: The requested URL /sfa was not found on this server." The same occurs even if the last rule is modified to:
RewriteRule /sfa(.*)$ http://%{HTTP_HOST}/pf2$1 [L]

i just noticed this - that won't match in .htaccess (or directory context) with the leading slash.

you want something more like this:
RewriteRule sfa(.*)$ /pf2$1 [L]
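The leading-slash point can be reproduced with Python's re module. RewriteRule patterns are not implicitly anchored, so search() is the closer analogue; in per-directory (.htaccess) context the string being tested is "sfa", with the directory prefix and its slash already stripped, so a pattern that begins with / can never match:

```python
import re

with_slash    = re.compile(r"/sfa(.*)$")   # the failing pattern
without_slash = re.compile(r"sfa(.*)$")    # phranque's correction

assert with_slash.search("sfa") is None                 # never fires -> 404
assert without_slash.search("sfa") is not None
assert without_slash.search("sfa/index.html").group(1) == "/index.html"
```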
Carob



 
Msg#: 4669640 posted 8:14 am on May 10, 2014 (gmt 0)

Progress. Thank you both for your patience in staying on my case.

phranque's last suggestion seems to have fixed the action of Rule 4. But a request to "https://www.oldexample.com/als/index.html" (which needs all of Rules 2, 3 and 4) still produces "This Connection is Untrusted ..." without any rewriting having occurred.

I skipped this one before. Why is it so complicated? All you actually need is <FilesMatch "^\.ht">
The version I had was a direct copy from Jim, but I take your point, Lucy, that your form is more compact, and the belt-and-braces approach is redundant anyway, so it's gone.

What I originally suggested was ^((?:afa|pf[12])/)?index\.html
I didn't understand "?:" in this context, and queried it at the time. Thanks for the explanation. As well as not being captured, does it also mean that "" (nothing), "afa", "pf1", and "pf2" are the alternatives?

Are you absolutely positive Rule 4 wasn't supposed to be a redirect?
No, I'm not sure. The site is to advertise to our customers, who are volunteers, the First Aid courses run by a volunteer team, of which I am one. In this field courses change name sometimes. For instance, "als" = "Advanced Life Support" used to have a page of its own, but has been replaced by "Advanced Resuscitation" and "Pain Management" which are addressed on the "afa" = "Advanced First Aid" page.

So Rule 4 is to rewrite or redirect requests for the old pages to the new pages that most closely meet the customers' needs, to avoid use of old bookmarks causing "Not Found" errors. As First Aid qualifications last for three years, it may be that bookmarks up to three years old may be used.

It is also possible that you've mixed up the "pattern" and "target" sides of a rewrite.
I don't think that's the case here.

Here is the updated .htaccess file:
Options +FollowSymlinks
Options -Indexes

<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

Addhandler application/x-httpd-php5 .html .php

RewriteEngine on
RewriteBase /

# Rule 1
# Block useless bots
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|YandexBot) [NC]
RewriteRule . - [F]

# Rule 2
RewriteCond %{HTTP_HOST} !^(newexample\.com)?$ [NC]
RewriteRule ^(.*)$ http://newexample.com/$1 [R=301,L]

# Rule 3
# Rewrite to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:afa|pf[12])/)?(index\.html|page1\.html|page2\.html)?$ http://newexample.com/$1$2 [NC,R=301,L]

# Rule 4
# Rewrite those seeking 'als' to 'afa' and 'sf1' to 'pf1' and 'sf2' to 'pf2' and 'sfa' to 'pf2'
RewriteRule als(.*)$ /afa$1 [L]
RewriteRule sf([1-2])(.*)$ /pf$1$2 [L]
RewriteRule sfa(.*)$ /pf2$1 [L]


It's a vast improvement on what I started this post with - thanks - but it still needs further work to avoid the "Untrusted Connection" response.

Also, I have further complicated Rule 3 to include pages such as "page1.html" and "page2.html" in addition to "index.html" in the directory "public_html". So the "$2" in the target returns. Lucy, I imagine you have an improved and more succinct way of expressing this requirement, and avoiding passing "index.html".

Carob



 
Msg#: 4669640 posted 9:30 am on May 10, 2014 (gmt 0)

Is this a solution for Rule 3?
# Rule 3
# Rewrite to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:afa|pf[12])/)?((?:index|page1|page2)\.html)?$ http://newexample.com/$1$2 [NC,R=301,L]

lucy24




 
Msg#: 4669640 posted 7:56 pm on May 10, 2014 (gmt 0)

Carob, this is going to exasperate you because it's what I should have insisted on in the first place.

Set aside the current content and wording of rulesets #3 and #4. Instead, spell out in English the end goals that you want to achieve. Because...

The site is to advertise to our customers, who are volunteers, the First Aid courses run by a volunteer team, of which I am one. In this field courses change name sometimes. For instance, "als" = "Advanced Life Support" used to have a page of its own, but has been replaced by "Advanced Resuscitation" and "Pain Management" which are addressed on the "afa" = "Advanced First Aid" page.

So Rule 4 is to rewrite or redirect requests for the old pages to the new pages

See, that's what I thought. You want to redirect, not rewrite. The stuff on the left is the old URL. The stuff on the right is the new URL. So this package of rules will need to go before the existing #3, and in its final form it will need an [R] flag.

But don't just rearrange rules. First spell out exactly what you want to do. Then we fine-tune rules to meet those needs. Make sure you spell out exactly which URLs use https and which use http. You'll need rules in both directions.

Carob



 
Msg#: 4669640 posted 2:42 am on May 11, 2014 (gmt 0)

Within the constraints of being general, I thought I had spelled out what I need. I'm sorry if I have not done that in sufficient detail.

    Rule 2 - Host Name Canonicalisation (use non-www form of new domain)

Examples of the intended operation for Rule 2:
"http(s)://www.oldexample.com" -> "http://newexample.com"
"http(s)://oldexample.com" -> "http://newexample.com"
"http(s)://www.newexample.com" -> "http://newexample.com"
"http(s)://newexample.com" -> "http://newexample.com"

    Rule 3 - Rewrite to HTTP for non-secure pages

The site has an area with information for the team members; they access it by logging in, and it has SSL encryption - is https. There is also a publicly-accessible area for entering bookings, which is also https. https is enforced in all these subdirectories by separate .htaccess files.

The bookings entering area is the reason for the SSL certificate in the first place, and the team-accessible area is a by-product.

When visitors, team members or the public, emerge from these areas to http areas that I have mentioned in the .htaccess file I tendered, I want to avoid them seeing "This connection is untrusted ..." by requiring http, not https. I also think it's preferable to have them seeing the correct address and to bookmark that.

I hoped to achieve those things in Rule 3.

Examples of the intended operation for Rule 3:
"https://newexample.com/afa" -> "http://newexample.com/afa/"
"https://newexample.com/afa/index.html" -> "http://newexample.com/afa/"
"https://newexample.com/page1.html" -> "http://newexample.com/page1.html"
"https://newexample.com/afa/page2.html" -> "http://newexample.com/afa/page2.html" (possibly, though not required currently)

    Rule 4 - Redirect those seeking old pages to new pages

Rule 4 is to change requests for superseded pages to requests for currently available pages. I first designed the site in 2009, and there have been many changes in terminology to accommodate since. I expect there would be rather less call for such redirects, but since some customers will only visit once every three years, time enough for some changes to have occurred, they are worth doing. Otherwise customers faced with a 404 might think the team no longer exists.

Examples of the intended operation for Rule 4:
"/als" -> "/afa"
"/sf1" -> "/pf1"
"/sfa" -> "/pf2"

    The present Rules 3 and 4, and even Rule 2, and their performance

Can you please give me an example of what you mean by redirecting these requests in a way that fulfills the needs of Rules 3 and 4? Perhaps use the "als" to "afa" example I explained. Both are or were http pages; the "als" one doesn't exist anymore.

Currently, of all the combinations of "www." or not, "https" or "http", and "oldexample.com" or "newexample.com", the only request that produces the right result from Rules 3 and 4 is: "https://newexample.com/als" -> "http://newexample.com/afa/index.html". This request displays the URL "http://newexample.com/afa/index.html" correctly; "correctly" in the sense that the "$2" is back in Rule 3 to suit those other pages like "page1.html" besides "index.html". If the page sought is "index.html" then there must be a better way to write Rule 3 to avoid showing it, as you prefer. I just haven't devised the way to do that yet.

Conversely, the only request that appears not to have encountered the .htaccess file at all, for it emerges as it was entered, is for "https://www.oldexample.com/als". This presents the "untrusted" message. I do not know why even Rule 2 has not worked.

The other six combinations correctly display the "afa" page, but have "http://newexample.com/als/" in the address bar.

    Generally

I don't have access to the "httpd.conf" file, so I thought mod_rewrite was the best way to tackle the intended purpose in my .htaccess file in Rule 2. If you feel other approaches to Rules 3 and 4, or even Rule 2, are better, then please feel free to suggest them.

lucy24




 
Msg#: 4669640 posted 5:24 am on May 11, 2014 (gmt 0)

htaccess vs config doesn't make any difference; you'd be using mod_rewrite either way. The only difference is in some fine details of pattern formatting. And even then, the real difference is between htaccess on one side and loose in the config file on the other. Rules inside <Directory> sections-- where most config rules would be located-- will generally look the same as in htaccess.

The former rules 4a,4b,4c go before the former rule 3.

RewriteRule ^als(.*)$ http://www.example.com/afa$1 [R=301,L]

RewriteRule ^sf([12].*)$ http://www.example.com/pf$1 [R=301,L]

RewriteRule ^sfa(.*)$ http://www.example.com/pf2$1 [R=301,L]

In each case, the rule now achieves everything at once:
#1 If the request was originally for https, the redirect specifies http.
#2 If the request was originally for any wrong form of the domain name, potentially including a port number, it is now for www.example.com alone.
#3 Whether or not either of #1 and #2 apply, the redirect will ALSO point people to the correct page.

The former rule #3, which will now be rule #4, comes after this. It picks up any incorrectly worded requests for pages other than als or whatever it was you're redirecting from.

Note however that if you've got a fairly small number of directories, it may be appropriate to include the directory name in the body of the rule. For example

RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:open|public|insecure).*)?$ http://www.example.com/$1 [R=301,L]

and conversely

RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:closed|private|secure).+)$ https://www.example.com/$1 [R=301,L]

If possible, point your internal links to the correct protocol, http or https. Sometimes this will be more trouble than it's worth, especially if you've got shared content across all pages.
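The two-direction scheme above can be modelled outside Apache. In this Python sketch the helper name canonical_url is mine, and the secure prefixes follow the thread's bookings2/private examples; the decision depends only on the path prefix and the protocol the request arrived on:

```python
# Areas that must be https; everything else must be plain http
SECURE_PREFIXES = ("bookings2/", "private/")

def canonical_url(path, is_https):
    """Return the redirect target, or None if no redirect is needed."""
    secure = path.startswith(SECURE_PREFIXES)
    if secure and not is_https:
        return "https://newexample.com/" + path
    if not secure and is_https:
        return "http://newexample.com/" + path
    return None   # already on the correct protocol

assert canonical_url("bookings2/page3.html", False) == \
    "https://newexample.com/bookings2/page3.html"
assert canonical_url("afa/index.html", True) == \
    "http://newexample.com/afa/index.html"
assert canonical_url("afa/index.html", False) is None
```

Because each branch issues at most one redirect and then stops, both Apache rules can safely carry [R=301,L]; the browser's follow-up request simply runs through the ruleset again.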

Carob



 
Msg#: 4669640 posted 3:42 am on May 13, 2014 (gmt 0)

You are right, Lucy, in that I am finding this very frustrating, and as a volunteer I am spending too much time on this task and neglecting other things I must do.

The .htaccess file now looks like this:
Options +FollowSymlinks
Options -Indexes

<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

Addhandler application/x-httpd-php5 .html .php

RewriteEngine on
RewriteBase /

# Rule 1
# Block useless bots
RewriteCond %{HTTP_USER_AGENT} (Baiduspider|YandexBot) [NC]
RewriteRule . - [F]

# Rule 2
RewriteCond %{HTTP_HOST} !^(example\.com)?$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

# Former Rule 4
# Rewrite those seeking 'als' to 'afa', 'bookings1' to 'bookings2', 'sf1' to 'pf1', 'sf2' to 'pf2', and 'sfa' to 'pf2'
RewriteRule ^als(.*)$ http://example.com/afa$1 [R=301,L]
RewriteRule ^bookings1(.*)$ https://example.com/bookings2$1 [R=301,L]
RewriteRule ^sf([12].*)$ http://example.com/pf$1 [R=301,L]
RewriteRule ^sfa(.*)$ http://example.com/pf2$1 [R=301,L]

# Former Rule 3
# Rewrite to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:afa|pf[12])/)?((?:index|page[12])\.html)?(.*)$ http://example.com/$1$2$3 [R=301,L]

# New Rule 5
# Rewrite to HTTPS for secure pages
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:bookings2|private)(?:/)?)((?:index|page[34])\.html)?(.*)$ https://example.com/$1$2$3 [R=301,L]

I have tried to accommodate the circumstances of having files in the subdirectories, "page[1234]", other than "index.html", and having bookmarks appended. Also, please read "example.com" as being the new domain I referred to as "newexample.com" previously.

https is forced by having separate .htaccess files in the subdirectories concerned. Therefore, perhaps Rule 5 is unnecessary?

With Rules 3 and 4 in the order shown (4 before 3), requests for the team's secure area yield "Not Found", while a request for "/bookings2/page3" is served the "/bookings2/" index rather than "page3".

If I take out Rules 4 and 5 then the secure area is still "Not Found", so evidently Rule 3 (Rewrite to HTTP for non-secure pages) is wrong.

Retreating to just Rules 1 and 2, I find that requests for "https://www.oldexample.com" yield "This Connection is Untrusted ...", as I reported before, but the other permutations of "www." or not, and "old" versus "new", are redirected successfully.

Perhaps the way ahead is to forget the redirection and design a redirection page for each of the superseded terms? I had thought that redirection using .htaccess might be the less time-consuming and even the nicer way to accomplish the task. Not having access to the logs (trying to view them yields "Internal Server Error 500") means I am stumbling around in the dark.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4669640 posted 7:24 am on May 13, 2014 (gmt 0)

...
...
...

At this point I composed a long reply but deleted it all because we're simply going around in circles.

Stop making new rules and instead address these questions:

How many different directories are there, and what are their names? Only the new directories; the redirected ones have already been dealt with.

Within each of those directories, are all files and/or subdirectories to be treated the same, or are there further subdivisions?

https is forced by having separate .htaccess files in the subdirectories concerned

What does this mean? All redirects should happen in a single htaccess-- the one in the root. Supplementary htaccess files are only for directory-specific content such as different expiration times or index settings. Sure, you can also add extra Allow/Deny directives and throw in an htpasswd file. But all requests, in any form, should be regularized up front.

Carob



 
Msg#: 4669640 posted 10:02 am on May 13, 2014 (gmt 0)

(1)
I find that requests for "https://www.oldexample.com" yield "This Connection is Untrusted ...", ... , but the other permutations of "www." or not, and "oldexample" versus "newexample" are rewritten successfully.
I don't understand why this one case misses out on redirection. This issue occurs when only Rules 1 and 2 are active.

(2) I have mentioned previously how there are not only "index.html" files but others I have called, for the example, "page[1234].html". In the "Former Rule 4" I showed:
RewriteRule ^bookings1(.*)$ https://example.com/bookings2$1 [R=301,L]
to rewrite the page for online bookings (a publicly-accessible secure area), and in "New Rule 5" I have tried to interpret your suggestion for the case of the new subdirectory name "bookings2" having files within it called "page3.html" and "page4.html". I used your
and conversely

RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:closed|private|secure).+)$ https://www.example.com/$1 [R=301,L]
and
RewriteRule ^((?:afa|pf[12])/)?(index\.html)?$
Some of these files have anchors which I wish to preserve in the rewriting process. (I wrongly referred to the "anchors" - like "index.html#Fees" - as "bookmarks" earlier.)

My attempt to accommodate all these requirements produced:
# New Rule 5
# Rewrite to HTTPS for secure pages
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:bookings2|private)(?:/)?)((?:index|page[34])\.html)?(.*)$ https://example.com/$1$2$3 [R=301,L]
covering the tops of the two secure hierarchies, "bookings2" and "private", and the files within, "page3.html", "page4.html", as well as "index.html", and anchors.

(3) In the "bookings2" subdirectory, I have a .htaccess file:
RewriteEngine on

SSLOptions +StrictRequire
SSLRequireSSL
SSLRequire %{HTTP_HOST} eq "example.com"
ErrorDocument 403 https://example.com/bookings2/

RewriteCond %{HTTPS} !=on [NC]
RewriteRule ^.*$ https://%{SERVER_NAME}%{REQUEST_URI} [R,QSA,L]

and at the top of the team-secure area, "private", I have a .htaccess file with access-control as well as the above. Perhaps this is not the correct way to do things, but they have worked well since 2009. From reading as much as I could, it seemed this was the way to set https=on for the directories and files further down the hierarchy.

Given the presence of these .htaccess files, I queried the need for "New Rule 5".

(4) There are 18 subdirectories to the public_html directory, and the team-secure directory "private" has two-deep nesting of subdirectories and further access-controls. I have tried to keep the names simple and few in number for the examples; I can extend the rules to cover the additional names once I know what works.

I am still hoping we can achieve that.

lucy24




 
Msg#: 4669640 posted 9:55 pm on May 13, 2014 (gmt 0)

Some of these files have anchors which I wish to preserve in the rewriting process. (I wrongly referred to the "anchors" - like "index.html#Fees" - as "bookmarks" earlier.)

Do you mean fragments? You can include those in the target of a redirect; it's the textbook reason for including the [NE] flag*. But there is no way to include fragments in the pattern, because the browser never sends the # part to the server.


* This is in the Apache docs, and I have personally used it.
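The escaping that [NE] turns off is ordinary percent-encoding. Apache's exact escaping rules differ in detail, but a rough sketch with Python's urllib shows what would happen to a fragment in the redirect target without the flag:

```python
from urllib.parse import quote

# Without [NE], the '#' in a redirect target gets escaped to %23, so the
# browser treats it as part of the path instead of as a fragment separator.
escaped = quote('/afa/index.html#Fees')
print(escaped)  # /afa/index.html%23Fees
```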

Carob



 
Msg#: 4669640 posted 12:35 am on May 14, 2014 (gmt 0)

Thanks for your comment.

What about requests for downloads, such as:
/page2/access.php?access_file=consent
I am trying also to capture these and pass them to the target.

Can you please comment on what is wrong with the construction in (1), and in the Rules 3, 4, and 5?
Rules 3, 4, and 5 are based on your suggestions, and I have tried to extend various constructs you have suggested to cover these additional requirements.

I appreciate your advice.

lucy24




 
Msg#: 4669640 posted 4:15 am on May 14, 2014 (gmt 0)

What about requests for downloads, such as:
/page2/access.php?access_file=consent

Oh, that's an entirely different thing. The part after a question mark is the query string. By default, RewriteRules quietly ignore and reappend the query; they look only at the path. (The same goes for redirects created with mod_alias.) So the only time you need to say anything at all about the query string is when you've added to it and need to keep the original part. Otherwise it all happens silently and magically in the background.
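The path/query split is easy to see with any URL parser; a quick sketch in Python using the example request above:

```python
from urllib.parse import urlsplit

parts = urlsplit('/page2/access.php?access_file=consent')
# RewriteRule patterns only ever see the path part...
assert parts.path == '/page2/access.php'
# ...while the query string is carried along and reappended automatically.
assert parts.query == 'access_file=consent'
```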

Carob



 
Msg#: 4669640 posted 1:46 am on May 15, 2014 (gmt 0)

Thank you, Lucy.

I gather from your last reply that I possibly need to rewrite paths like "/page2/access.php" (in the last example) and leave the query string to be "silently and magically" appended?

And I should reorganise "Former Rule 3" and "New Rule 5" by taking out the capturing of the suffix and reinserting this captured string by "$3", and instead use an [NE] flag type construction?

lucy24




 
Msg#: 4669640 posted 4:07 am on May 15, 2014 (gmt 0)

You only need [NE] if the target involves a fragment link. (I've once met another situation where [NE] was needed, but it was an obscure, arcane, one-of-a-kind issue. Something involving literal question marks in the query, I think.)

I wasn't kidding when I said that everything involving the query string happens silently and magically. Just pretend it doesn't exist. Unless you want to change an old query string into a new path (prettified URLs) --and it doesn't seem as if you're heading that way.

Carob



 
Msg#: 4669640 posted 9:45 am on May 15, 2014 (gmt 0)

Following your suggestions, I have removed the .htaccess files from the subdirectories, leaving just those that define access. The .htaccess file in public_html now defines where HTTP and HTTPS apply, mostly*. Rules 1 and 2 remain as they have been for some time; after them come:
# Former Rule 4
# Rewrite those seeking 'als' to 'afa', 'bookings1' to 'bookings2', 'sf1' to 'pf1', 'sf2' to 'pf2', and 'sfa' to 'pf2'
RewriteRule ^als(.*)$ http://example.com/afa$1 [R=301,L]
RewriteRule ^bookings1(.*)$ https://example.com/bookings2$1 [R=301,L]
RewriteRule ^sf([12].*)$ http://example.com/pf$1 [R=301,L]
RewriteRule ^sfa(.*)$ http://example.com/pf2$1 [R=301,L]

# Former Rule 3
# Rewrite to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:afa|pf[12])(?:/)?)?((?:index|page[12])\.html)?$ http://example.com/$1$2 [R=301,L]

# New Rule 5
# Rewrite to HTTPS for secure pages
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:bookings2|private)(?:/)?)?((?:index|page[34])\.html)?$ https://example.com/$1$2 [R=301,L]

Most redirection, and even the passing of #... and query strings, seem to work as intended. However, links to the home page do not, instead producing the error message:
The page isn't redirecting properly. Pale Moon has detected that the server is redirecting the request for this address in a way that will never complete.

Note: (*) Regarding "mostly": Earlier, you asked:
How many different directories are there, and what are their names?
Do I need to include reference to all the files and directories further down the hierarchy in "Former Rule 3" and "New Rule 5"? I don't presently have *all* subdirectories included in the rules, just those at the top of each tree, below which all are expected to be treated similarly.

Thanks for your help.

Carob



 
Msg#: 4669640 posted 11:42 pm on May 15, 2014 (gmt 0)

Further information about the error causing the "The page isn't redirecting properly" message: The error is seen only when requesting the home page; requests from within the site to all other pages work successfully. The links are relative, eg. to the home page: "<a href="../index.html" target="_top">Home</a>".

On the note "Regarding 'mostly'": The mod_rewrite conditions and rules, as they are, don't seem to give rise to any issues caused by the lack of some of the deeper subdirectories from the rule. The question is merely following up Lucy's earlier suggestion.

lucy24




 
Msg#: 4669640 posted 1:54 am on May 16, 2014 (gmt 0)

The page isn't redirecting properly.

That's an infinite loop created by an external redirect. An infinite loop created by an internal rewrite would yield an error message from the server; this one's from the browser. Every time the browser puts in a request it starts counting. If the request leads to 10 or 20 or 30 redirects (exact number depends on the browser), it stops asking and instead puts up an error message using its own wording.

It looks as if the culprit is here:
# Former Rule 3
# Rewrite to HTTP for non-secure pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:afa|pf[12])(?:/)?)?((?:index|page[12])\.html)?$ http://example.com/$1$2 [R=301,L]

# New Rule 5
# Rewrite to HTTPS for secure pages
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:bookings2|private)(?:/)?)?((?:index|page[34])\.html)?$ https://example.com/$1$2 [R=301,L]

All those question marks mean that both rules can apply when the request is for the root, so whether it's secure or non-secure the request will be redirected.
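This is easy to verify by feeding the two patterns an empty string, which is what the pattern sees for a request for the root; a sketch in Python's re:

```python
import re

# The two patterns from the rules above.
to_http = re.compile(r'^((?:afa|pf[12])(?:/)?)?((?:index|page[12])\.html)?$')
to_https = re.compile(r'^((?:bookings2|private)(?:/)?)?((?:index|page[34])\.html)?$')

# Every group is optional, so the empty string (the root) satisfies BOTH
# patterns: each protocol's rule redirects the root to the other protocol.
assert to_http.match('')
assert to_https.match('')
```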

And that's why I keep asking about the actual directory names. It's much easier to first figure out what you need to do and then write rules to achieve it, than to first look at rules and then try to adjust them when they misbehave.

If the root is to be https, make sure the http redirect doesn't include requests for the root. If the root is to be http, make sure the https redirect et cetera.

Anyway, there are way too many parentheses and question marks.

There should be a separate pair of index redirects before the fallback http(s) redirects. If the [NS] flag alone doesn't prevent infinite loops (on my server it's all I need), you'll need a condition looking at THE_REQUEST; otherwise both rules are conditionless:

RewriteRule ^({list-all-secure-directories-here}/)index\.html https://www.example.com/$1 [R=301,L,NS]

RewriteRule ^({list-all-nonsecure-directories-here}/)?index\.html http://www.example.com/$1 [R=301,L,NS]

It doesn't matter which one comes first, since they're mutually exclusive; list them in order of likelihood-to-occur. Here I've assumed that the root is non-secure. If instead it's secure, move the ? over to the https rule.

The [NS] flag means "no subrequest". It's meant to exclude server-internal requests such as SSIs or, here, mod_dir requests. The flag does not cover requests created by mod_rewrite itself, only certain other mods.

Again, this pair of rules goes before the generic http(s) redirect-- which then, in turn, loses its "index.html" component:

RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:afa|pf[12])/.*)?$ http://example.com/$1 [R=301,L]

meaning: request for anything in afa, pf1 or pf2 directories, or for the root

RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:bookings2|private).*) https://example.com/$1 [R=301,L]

meaning: request for anything in bookings or private directories.
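The corrected pair can be checked the same way; a sketch in Python's re showing that the root now matches only the http rule, so the loop is gone:

```python
import re

# The corrected fallback patterns from the rules above.
to_http = re.compile(r'^((?:afa|pf[12])/.*)?$')
to_https = re.compile(r'^((?:bookings2|private).*)')

assert to_http.match('')                  # root: http rule only
assert not to_https.match('')             # https rule no longer matches the root
assert to_http.match('afa/page1.html')    # anything under a non-secure directory
assert to_https.match('bookings2/index.html')  # anything under a secure directory
```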

Wait, we're not done yet. I deliberately left out the pages you've shown as "page[12]" and "page[34]" because we need more information. If they're specific pages, they are by definition in specific directories. Rules should then list those by name, and the rules should come before everything I've named here:

RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^(full-path-leading-to-page[12])$ http://example.com/$1 [R=301,L]

RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^(full-path-leading-to-page[34])$ https://example.com/$1 [R=301,L]

If the relevant sets of pages each live in the same directory, you can omit it from the capture, like

RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^full-path-here/(page[12])$ http://example.com/full-path-here/$1 [R=301,L]

and same for page[34] if applicable. This saves the server the work of having to hold on to a capture if it turns out the request is for some other page. (Tiny point, not significant unless you get a colossal amount of traffic-- but isn't that what we all want?)

Carob



 
Msg#: 4669640 posted 9:16 am on May 20, 2014 (gmt 0)

Thanks for your detailed explanation. I can see how index.html satisfied two rules, and how that leads to problems. I have tried to follow your suggestions to correct that and other issues. Your guess that the root is a non-secure area is correct.

The directory names and the .html files are:
HTTP Files
----------
(1) index.html:
index.html /
index.html /afa/
index.html /art/
index.html /bookings1/
index.html /classes/
index.html /els/
index.html /faqs/
index.html /fees/
index.html /forms/
index.html /gallery/
index.html /pf1/
index.html /pf2/
index.html /rfa/

(2) 'other'.html:
tour.html /
example.html /
example-sling.html /
banner.html /art/ - The PHP include header of all pages
buttons-row1.html /art/ - PHP include buttons for all pages
buttons-row2.html /art/ - PHP include buttons for all pages
buttons-row3.html /art/ - PHP include buttons for all pages
contacts.html /art/ - The PHP include footer of all pages

HTTPS Files
-----------
(1) index.html:
index.html /admin/
index.html /bookings2/
index.html /private/
index.html /private/history/
index.html /private/reference/
index.html /private/admin/
index.html /private/admin/counter/
index.html /private/admin/file-log/
index.html /private/admin/files/
index.html /private/admin/page-log/
index.html /private/admin/test/
index.html /private/trainers/

(2) 'other'.html:
booking-entry.html /bookings2/
booking-save.html /bookings2/
identity.html /private/history/
bookmarks.html /private/reference/
buttons-row4.html /private/admin/
booking-entry.html /private/admin/test/ - Development version
booking-save.html /private/admin/test/ - Development version
characters.html /private/admin/test/
course-data-read.html /private/admin/test/
links-index.html /private/admin/test/

After the section rewriting superseded directories ("Rewrite those seeking 'als' to 'afa', ..."), the .htaccess file looks like this:
# Rewrite to HTTP for non-secure 'other'.html pages at the root
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((?:tour|example[-a-z]+)\.html)$ http://example.com/$1 [R=301,L]

# Rewrite to HTTP for non-secure 'other'.html pages in /art/
# RewriteCond %{HTTPS} =on [OR]
# RewriteCond %{SERVER_PORT} 443
# RewriteRule ^((?:banner|buttons-row[123]|contacts)\.html)$ http://example.com/art/$1 [R=301,L]

# Rewrite to HTTPS for secure 'other'.html pages in /bookings2/
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:booking-entry|booking-save)\.html)$ https://example.com/bookings2/$1 [R=301,L]

# Rewrite to HTTPS for secure identity.html page in /private/history/
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^identity\.html$ https://example.com/private/history/identity.html [R=301,L]

# Rewrite to HTTPS for secure bookmarks.html page in /private/reference/
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^bookmarks\.html$ https://example.com/private/reference/bookmarks.html [R=301,L]

# Rewrite to HTTPS for secure buttons-row4.html page in /private/admin/
# RewriteCond %{HTTPS} =off [OR]
# RewriteCond %{SERVER_PORT} !443
# RewriteRule ^buttons-row4\.html$ https://example.com/private/admin/buttons-row4.html [R=301,L]

# Rewrite to HTTPS for secure 'other'.html pages in /private/admin/test/
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:characters|course-data-read|links-index)\.html)$ https://example.com/private/admin/test/$1 [R=301,L]

# Rewrite to HTTP for non-secure 'other'.html pages
# RewriteCond %{HTTPS} =on [OR]
# RewriteCond %{SERVER_PORT} 443
# RewriteRule ^((?:art)/.*)?$ http://example.com/$1 [R=301,L]

# Rewrite to HTTPS for secure 'other'.html pages
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((?:bookings2|private/history|private/reference|private/admin|private/admin/test).*) https://example.com/$1 [R=301,L]

# Rewrite to HTTP for non-secure index.html pages
RewriteCond %{HTTPS} =on [OR]
RewriteCond %{SERVER_PORT} 443
RewriteRule ^((afa|als|art|bookings|classes|els|faqs|fees|forms|gallery|pf1|pf2|rfa|unused)/)?index\.html http://example.com/$1 [R=301,L,NS]

# Rewrite to HTTPS for secure index.html pages
RewriteCond %{HTTPS} =off [OR]
RewriteCond %{SERVER_PORT} !443
RewriteRule ^((admin|bookings2|private|private/history|private/reference|private/admin|private/admin/counter|private/admin/file-log|private/admin/files|private/admin/page-log|private/admin/test|private/trainers)/)index\.html https://example.com/$1 [R=301,L,NS]
But there are still problems:

(1) Requesting "http://oldexample.com/private/admin/test/links-index.html" yields "http://example.com/401.shtml" in the URL bar and "Not Found". Replacing "401.shtml" with "private/admin/test/links-index.html" yields "https://example.com/private/admin/test/links-index.html" correctly, but after requiring authorisation for the "http://..." address, and then again for the "https://..." address.

(2) If, after having provided authorisation, I request "http://example.com/private/trainers/", then that page is served as "http://..." not "https://...".

The rewrite for the "/art/" files produces broken-padlock (mixed content) warnings, from having parts of a secure page served as non-secure items, so I have commented out that section of the .htaccess file. Likewise "buttons-row4.html" is very unlikely to be requested directly, being only a PHP-included section in an already secure area.

I would appreciate further help. Thanks for your patience.

phranque

WebmasterWorld Administrator phranque us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4669640 posted 10:28 am on May 20, 2014 (gmt 0)

Requesting "http://oldexample.com/private/admin/test/links-index.html" yields "http://example.com/401.shtml" in the URL bar and "Not Found".

you need to use a tool to check status codes and response headers so you can describe the response your browser is getting instead of what you see when the browser is finished redirecting.
you should look for something like Live HTTP Headers in firefox or equivalent.
most likely you are getting a 401 status code response from the server for the initial request.
your server is most likely incorrectly configured such that the ErrorDocument directive for your custom 401 document refers to an absolute URL.

http://httpd.apache.org/docs/current/mod/core.html#errordocument
Note that when you specify an ErrorDocument that points to a remote URL (ie. anything with a method such as http in front of it), Apache HTTP Server will send a redirect to the client to tell it where to find the document, even if the document ends up being on the same server. This has several implications, the most important being that the client will not receive the original error status code, but instead will receive a redirect status code. This in turn can confuse web robots and other clients which try to determine if a URL is valid using the status code. In addition, if you use a remote URL in an ErrorDocument 401, the client will not know to prompt the user for a password since it will not receive the 401 status code. Therefore, if you use an ErrorDocument 401 directive then it must refer to a local document.

Carob



 
Msg#: 4669640 posted 4:04 am on May 21, 2014 (gmt 0)

Live HTTP Headers v0.17 ->
http://example.com/401.shtml

GET /401.shtml HTTP/1.1
Host: example.com
...

HTTP/1.1 404 Not Found
Date: Wed, 21 May 2014 03:53:03 GMT
Server: Apache
Content-Length: 326
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
I have not specified custom error documents, relying on the server's default pages.

It appears the 'Host Name Canonicalisation' rewrite is working to establish the correct domain, but the later rule 'Rewrite to HTTPS for secure 'other'.html pages in /private/admin/test/' (the seventh rule)
RewriteRule ^((?:characters|course-data-read|links-index)\.html)$ https://example.com/private/admin/test/$1 [R=301,L]
is not working correctly.
lucy24




 
Msg#: 4669640 posted 5:54 am on May 21, 2014 (gmt 0)

It appears the 'Host Name Canonicalisation' rewrite is working to establish the correct domain, but the later rule 'Rewrite to HTTPS

How the bleep did THAT happen? The hostname-canonicalization redirect should be the very last redirect, bar none. (Last external redirect, that is. All the internal rewrites come afterward.) Or, in your case, the last pair of redirects. Haven't we been over this already? I was about to point you to an adjoining thread on the subject, when I realized this is that thread :(

I have not specified custom error documents, relying on the server's default pages.

Oh, don't do that. Think of your users. A custom 404 page is the place to point people in the likeliest direction, if only by listing the main directories.

It occurs to me that a custom 404 page helps give the impression that your site has been around for ages so it's only to be expected that some pages will disappear, with no blame attaching to anyone. To be safe, attach a noindex meta to your error documents. Or put them in a roboted-out directory, since the chances are pretty minute that anyone else will link to your 404 page by name!

But wait. If you haven't yet named any error documents, where's the "401.shtml" coming from? That's not a server default but the name of a specific document in a specific location. It either exists or it doesn't.

I'm not ignoring your previous long post. It just happened to arrive on the wrong day of the week.

phranque




 
Msg#: 4669640 posted 7:09 am on May 21, 2014 (gmt 0)

the response when you request http://oldexample.com/private/admin/test/links-index.html is a 302?

Carob



 
Msg#: 4669640 posted 8:30 am on May 21, 2014 (gmt 0)

Thanks for your input, 'phranque'.

With the 'Host Name Canonicalisation' as the last external redirect, as Lucy has suggested, Live HTTP Headers v0.17 ->
http://oldexample.com/private/admin/test/links-index.html

GET /private/admin/test/links-index.html HTTP/1.1
Host: oldexample.com
...

HTTP/1.1 301 Moved Permanently
Date: Wed, 21 May 2014 07:45:54 GMT
Server: Apache
WWW-Authenticate: Basic realm="Example - Members Only"
Location: http://example.com/401.shtml
Content-Length: 247
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
http://example.com/401.shtml

GET /401.shtml HTTP/1.1
Host: example.com
...

HTTP/1.1 404 Not Found
Date: Wed, 21 May 2014 07:45:54 GMT
Server: Apache
Content-Length: 326
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
http://example.com/favicon.ico

GET /favicon.ico HTTP/1.1
Host: example.com
...

HTTP/1.1 200 OK
Date: Wed, 21 May 2014 07:45:55 GMT
Server: Apache
Last-Modified: Sun, 18 Sep 2011 08:25:00 GMT
Accept-Ranges: bytes
Content-Length: 824
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: image/x-icon
----------------------------------------------------------
The immediate response is '301' as specified "[R=301,L]", not '302'. It apparently then sees the need for authentication, having been pointed to the target "https://example.com/private/admin/test/$1", but doesn't request it for some reason unknown to me.

It then requests "http://example.com/401.shtml", can't find it - no such file (and I accept Lucy's comment that I should take care of that omission), gets a '404' error, and goes on to find the icon successfully.
