Welcome to WebmasterWorld
Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
Strip index and force www - is this good code?
picnictutorials
msg:4562117, 1:24 pm on Apr 6, 2013 (gmt 0)


RewriteRule ^(.*/)?index\.html?$ /$1 [R=301,L]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule (.*) http://www.%{HTTP_HOST}/$1 [R=301,L]


Does that look ok to you? Or is something missing? Thanks

 

g1smd
msg:4562159, 6:55 pm on Apr 6, 2013 (gmt 0)

Lots missing.

You'll have an unwanted two-step redirection chain for non-www index requests.

Requests for www.example.com:80 won't be redirected.

Never use (.*) at the beginning or in the middle of a RegEx pattern. It creates a storm of "back off and retry" trial matches.

Without a preceding RewriteCond looking at THE_REQUEST you'll have an infinite redirect loop for index requests.

Every redirect should include the canonical hostname in the target.

This is the usual recommendation:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

lucy24
msg:4562186, 8:32 pm on Apr 6, 2013 (gmt 0)

Without a preceding RewriteCond looking at THE_REQUEST you'll have an infinite redirect loop for index requests.

Only if your RewriteRules also contain an explicit rewrite such as to index.php?blahblah. Otherwise it should be enough-- and is obviously preferable from the server-load point of view-- to append the [NS] flag.
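For reference, the [NS] flag mentioned here attaches to the redirect rule like this (a sketch using g1smd's example target, with example.com standing in for the real canonical host):

```apache
# NS: don't apply this rule on internal subrequests, such as
# mod_dir's lookup of /index.html when a directory is requested.
# The external 301 then fires only for true client requests.
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,NS,L]
```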

Dideved
msg:4562265, 4:26 am on Apr 7, 2013 (gmt 0)

> Never use (.*) at the beginning or in the middle of a
> RegEx pattern. It creates a storm of "back off and retry"
> trial matches.

Is this perhaps a micro-optimization that we needn't concern ourselves with? Because the Apache documentation frequently uses (.*) in exactly this way, and presumably the authors of Apache know how to make a good rewrite rule.

> Without a preceding RewriteCond looking at THE_REQUEST
> you'll have an infinite redirect loop for index requests.

I just tested it, but there was no infinite redirect loop...

> Every redirect should include the canonical hostname in
> the target.

Do you mean the substitution string should have the full hostname? Can you explain the reasons behind this suggestion? The Apache documentation includes an example to add "www." for any domain name, and the OP's code matches the documentation almost verbatim.

lucy24
msg:4562273, 5:32 am on Apr 7, 2013 (gmt 0)

Apache is one thing. Regular Expressions are another. And there is seldom an absolute toggle between "works" and "doesn't work". More often it's a continuum from "works" = "it doesn't crash" to "works perfectly" = with never a wasted resource. Or, conversely: there might be one right way and dozens of wrong ways.

Most htaccess files barely scratch the surface of things someone, somewhere, might happen to request. And generally that's fine. For example, you don't need to test for requests in the form "blahblah.html/more-stuff-here" or "example.com//directory" ... until the moment you start seeing those requests. Then you have to deal with them.

Specifically:

Forms with (.*) will work, but they are sloppy. I use them regularly in text editing, where all I have to do is go through a few kilobytes of data in a single pass. If it takes longer than expected I'll go talk to the cat or clean the bathroom. But if you've got hundreds of people all scrambling to load a page at the same time, those nanoseconds add up. And then by and by your host starts muttering about how many resources you're hogging, and pushing you in the direction of a more expensive package.

It's hard to appreciate the problem when you're a human looking at the code, because we're looking in two dimensions and can see what's coming up. The computer can't; it doesn't know that, for example, it was supposed to leave room for ".html". At least not until it gets to the end of its capture and then has to backtrack.

RewriteRule ^(.*/)?index\.html?$ /$1 [R=301,L]

I just tested it, but there was no infinite redirect loop...

Do your actual index pages end in "htm" or "html"? If they end in php or asp or whatnot, there is obviously not a problem here. But otherwise you will go around in circles. This category of error will be caught by the browser, not by the server, because a redirect means a fresh request, and each request is an island.

Edit: It may be possible to set up your server so subrequests are globally ignored in all mod_rewrite activity. That would be handy. Can't do it at my end, though.

Do you mean the substitution string should have the full hostname? Can you explain the reasons behind this suggestion?

In the eyes of a search engine,
www.example.com/
example.com/
example.com/index.html
www.example.com/index.html
are four separate pages. If you are absolutely certain that Duplicate Content will not be an issue for your site, then it doesn't matter.

If you do care about Duplicate Content, then you need to get all your visitors on the same page. Literally. One way to achieve this is by making sure that anyone you redirect ends up in the same place.

Conditions expressed as ^www\. or !^www\. are fine as far as they go, but they don't cover the rarer case of a request that comes in with a port number at the other end. They also don't cover HTTP/1.0, which doesn't include a hostname at all. Granted, very few humans still use 1.0. The exceptions at this point seem to be mostly proxies. But not all proxies are evil. Some humans don't even have a choice.

Bottom line: to capture all non-standard hostnames, express your Condition as "anything other than exactly the-form-I-want or exactly nothing".
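Spelled out as mod_rewrite code, that "the-form-I-want or exactly nothing" rule of thumb looks like this (a sketch, with example.com standing in for the real hostname):

```apache
# Fire for any Host header that is neither exactly the canonical
# name nor empty. A port-suffixed request (www.example.com:80)
# fails the pattern and is redirected; a hostless HTTP/1.0
# request (empty HTTP_HOST) matches the empty alternative and is
# deliberately left alone.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```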

And, as a more general principle: Why do things in a sloppy way when it takes only a few seconds to write a cleaner rule?

Dideved
msg:4562286, 7:40 am on Apr 7, 2013 (gmt 0)

> Forms with (.*) will work, but they are sloppy.

It seems to me that if the goal is to match any string, then matching on any character is the perfect tool for the job.

> those nanoseconds add up.

If the performance difference truly is measured in nanoseconds -- 1 / 1000th of a microsecond -- then this is hands-down a micro-micro-optimization. But I suspect that you used the word "nanosecond" more informally. Has anyone actually run a benchmark to know how big -- or how little -- of a difference it makes? I suspect the difference will be insignificantly small.
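For what it's worth, here is a quick, unscientific way to compare the two index patterns using Python's re engine, which backtracks much like the PCRE engine mod_rewrite uses (absolute timings are machine-dependent; only the comparison between the two is interesting):

```python
import re
import timeit

# The looser pattern from the OP and the stricter alternative.
loose = re.compile(r"^(.*/)?index\.html?$")
strict = re.compile(r"^(([^/]+/)*)index\.html?$")

# Sanity check: both patterns accept and reject the same paths.
paths = ["index.html", "a/b/index.html", "a/b/page.html",
         "seg/" * 20 + "index.htm"]
for p in paths:
    assert bool(loose.match(p)) == bool(strict.match(p))

# Time each against a long non-matching path, where any
# backtracking cost would show up.
miss = "seg/" * 50 + "page.html"
for name, pat in (("loose", loose), ("strict", strict)):
    t = timeit.timeit(lambda: pat.match(miss), number=100_000)
    print(f"{name}: {t:.3f}s")
```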

> Do your actual index pages end in "htm" or "html"? ...
> otherwise you will go around in circles.

They do, and it didn't. Are you saying that when you tried this yourself, you ended up in a redirect loop?

> Conditions expressed as ^www\. or !^www. are fine as far
> as they go, but they don't cover the rarer case of a
> request that comes in with a port name at the other end.

Granted. The only goal of this particular rewrite rule is to check for and enforce "www.". That's what the OP wanted, if I'm not mistaken. If we also want to redirect requests with port numbers, then we can do that too, but we don't need to hardcode the hostname to do so.

> They also don't cover HTTP/1.0 which doesn't include a
> hostname at all.

If we decide that we can't rely on the hostname, then we have much bigger problems. For example, name-based virtual hosts would be out of the question. Yet they're actually quite common and even favored in the documentation. I think we're long past the point where we can safely rely on the host header.

> Bottom line: to capture all non-standard hostnames,
> express your Condition as "anything other than exactly
> the-form-I-want or exactly nothing".

There are, of course, trade-offs. Not hardcoding in the hostname means the rewrite rule will work on any hostname. That portability and flexibility may be desirable.

> And, as a more general principle: Why do things in a
> sloppy way when it takes only a few seconds to write a
> cleaner rule?

Methinks we have different ideas about what makes a sloppy rule. And since the Apache documentation is filled with rules that you would consider sloppy, I feel like the burden of proof is on you to show that the authors of Apache have been doing it wrong all this time.

Dideved
msg:4562303, 8:11 am on Apr 7, 2013 (gmt 0)

Although, if you absolutely must account for the rare HTTP/1.0 request, then the Apache documentation has you covered:

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule (.*) [%{HTTP_HOST}...] [L,R]

g1smd
msg:4562308, 9:00 am on Apr 7, 2013 (gmt 0)

The Apache documentation gives academic examples, but many don't apply to the real world. That example generates a 302 redirect, something you almost never want to do.

The problem with .* is that it reads in the entire input string all the way to the end "match everything, or nothing". The pattern is "greedy", "promiscuous" and "ambiguous". In effect, when you use .* you're saying "capture the remainder of the input, verbatim, there will be nothing to test for after that".

Unless the entire pattern is (.*) or .*, or the pattern ends with (.*)$, then .* is not appropriate. Use a more specific pattern that will parse left to right in one pass.

Dideved
msg:4562310, 9:17 am on Apr 7, 2013 (gmt 0)

I think your description of .* misses an important detail. It will match as many times as possible while still allowing the rest of the pattern to match. That, of course, is why the index.html pattern correctly matches. No doubt .*'s behavior is one of the many facets of regular expressions that a person would need to master, but it's by no means something we need to avoid.

lucy24
msg:4562317, 9:49 am on Apr 7, 2013 (gmt 0)

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} !^$

How silly! They've made two separate Conditions when a single one would do the same job and more.

if the goal is to match any string, then matching on any character is the perfect tool for the job

Sure. But at least 19 times out of 20-- judging by questions asked in this forum-- the goal isn't to "match any string". It's to match some particular text in some particular environment. For example, when someone says

^(.*)/(.*)/(.*)

they're not aiming for
example.com/(abc/def/ghi)/jkl/mno
or
example.com/()/()/(somethinghere)
or
example.com/(abc)/(def.html)/(more-garbage-here)
all of which fit the pattern.

They're aiming specifically for
example.com/abc/def/ghi
which is most efficiently picked up with
^([^/.]+)/([^/.]+)/([^/]+)
or at most
^([^/.]+)/([^/.]+)/(.*)

There's nothing wrong with .* or .+ at the end of an expression. It's only when it is followed by other stuff that you start getting inefficient.

Writing an efficient Regular Expression isn't a matter of investing three hours to save four. It's literally a few more seconds of your own time. And then only the first time you write it out. After that it's cut-and-paste anyway.
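The three-segment example above can be checked directly in Python's re engine, which backtracks the same way (a quick illustration, not from the thread):

```python
import re

# Loose pattern: three dot-star captures.
loose = re.compile(r"^(.*)/(.*)/(.*)$")
# Stricter pattern: one path segment per capture.
strict = re.compile(r"^([^/.]+)/([^/.]+)/([^/]+)$")

# On the intended three-segment input both behave the same.
assert loose.match("abc/def/ghi").groups() == ("abc", "def", "ghi")
assert strict.match("abc/def/ghi").groups() == ("abc", "def", "ghi")

# On a five-segment input the loose pattern still matches, but
# the first capture silently swallows the extra segments...
assert loose.match("abc/def/ghi/jkl/mno").groups() == \
    ("abc/def/ghi", "jkl", "mno")
# ...while the strict pattern simply refuses to match.
assert strict.match("abc/def/ghi/jkl/mno") is None
```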

Are you saying that when you tried this yourself, you ended up in a redirect loop?

It doesn't happen on MAMP or on my current live site, so I have to assume there's some blahblah in both config files that prevents mod_dir output from cycling back through mod_rewrite. This makes me uneasy, because that "index.html" request-- the second one-- should pass through all of htaccess again. In fact something must have changed fairly recently, because I used to be able to write rules constrained to \.html$ and they would apply to all requests including /directory/

It will match as many times as possible while still allowing the rest of the pattern to match.

:: sigh ::
Nobody ever said that it "won't work". The point is that it will only work AFTER mod_rewrite-- or whatever entity is using the Regular Expression-- has captured all the way to the end. It then has to backtrack until it finds a match. If the rule is carefully written, no backtracking is necessary.

It's unusual to meet such a vigorous defense of inefficient rules and careless coding. Can't help but wonder if there's a backstory we're not getting.

:: now wandering off to study MAMP config file ::

Dideved
msg:4562319, 10:42 am on Apr 7, 2013 (gmt 0)

> Sure. But at least 19 times out of 20-- judging by
> questions asked in this forum-- the goal isn't to "match
> any string". It's to match some particular text in some
> particular environment. For example ... ^(.*)/(.*)/(.*)

I actually agree with you here that if the goal is to match a single path segment, then it makes perfect sense to match on non-slash characters. But that's not the case for this thread. This seems to be your 1-in-20 case. We want to match any string that ends in index.html. Since the goal is to match any string, then it makes perfect sense to match on any character.

> There's nothing wrong with .* or .+ at the end of an
> expression. It's only when it is followed by other stuff
> that you start getting inefficient.

If we're going to talk about efficiency, then we again need to remind ourselves that we're talking about micro optimizations. For all practical purposes, there isn't any difference in efficiency, and therefore no good reason to avoid .*. If it's the right tool for the job, then we should use it.

picnictutorials
msg:4562346, 12:51 pm on Apr 7, 2013 (gmt 0)

Wow, I wish I could tell who the winner of this discussion is. I'm inclined to follow Dideved, as I believe he is the one (*the only one*) that gave me a solution for this problem...


# 301 permanent redirect index.html(htm) to folder with exclusion for addon domains
RewriteCond %{HTTP_HOST} !(addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com)
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.maindomain.com/$1 [R=301,L]

# 301 permanent redirect non-www (non-canonical) to www with exclusion for addon domains
RewriteCond %{HTTP_HOST} !(addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com)
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.maindomain.com/$1 [R=301,L]


I have addon domains. The code directly above makes it so links aimed at the addons don't redirect to the root (main) domain. This works, but I have to add the addon domain to the list each time I add an addon domain to my host. Because I have begun adding a lot of them, I was looking for a solution to *auto* exclude the addons without having to add them each time.

I posted this in another thread here, but you guys just ignored it. I assume because there is no solution using this code. The code in the OP is the only solution I have received thus far. But then you guys here in webmaster forums say it is wrong. So I ask: show me/us what's the right way to do what I need without having to add the addons each time?

Thanks
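[One possible approach to the question above, sketched as an assumption rather than a tested recipe (maindomain.com is a placeholder for the real canonical host): invert the test, so that instead of excluding every addon domain, the rules fire only when the Host header is the main domain. Addon domains then fall through untouched with no exclusion list to maintain.]

```apache
# Strip index, only when the request is for the main domain.
RewriteCond %{HTTP_HOST} ^(www\.)?maindomain\.com$ [NC]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.maindomain.com/$1 [R=301,L]

# Force www, again only for the main domain; addon domains
# never match the condition, so they are left alone.
RewriteCond %{HTTP_HOST} ^maindomain\.com$ [NC]
RewriteRule (.*) http://www.maindomain.com/$1 [R=301,L]
```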

lucy24
msg:4562438, 8:06 pm on Apr 7, 2013 (gmt 0)

Excellent. Then we'll stand back and let DivideAndConquer explain why it is necessary, appropriate and desirable to name the element "\.com" six times ;)
RewriteCond %{HTTP_HOST} !(addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com)

g1smd
msg:4562439, 8:13 pm on Apr 7, 2013 (gmt 0)

It might be that on the real site, it isn't six .com TLDs but six different country TLDs.

Dideved
msg:4562489, 9:10 pm on Apr 7, 2013 (gmt 0)

> Excellent. Then we'll stand back and let DivideAndConquer
> explain why it is necessary, appropriate and desirable to
> name the element "\.com" six times ;)

Who's DivideAndConquer?

ewwatson
msg:4562542, 11:06 pm on Apr 7, 2013 (gmt 0)

My other username has been disabled apparently. So now this is me again.

"" Who's DivideAndConquer? ""

I believe that's you

Wow, the quoting tech in the forum is top notch.

"" Excellent. Then we'll stand back and let DivideAndConquer explain why it is necessary, appropriate and desirable to name the element "\.com" six times ;)
RewriteCond %{HTTP_HOST} !(addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com|addondomain\.com) ""

Again you guys deflect the question. That's really weird. FYI, they are all .coms. OK, you guys aren't allowed to piss and moan that you have to answer the same old question again from other newbies, because I am giving you a new one (repeatedly) and you just keep ignoring it. If you don't know/have the answer, then be a man about it and admit it.

Dideved
msg:4562554, 11:28 pm on Apr 7, 2013 (gmt 0)

> > Who's DivideAndConquer?
> I believe that's you

I find that ironic, because the pattern that lucy *thinks* I favor is actually the pattern that we were able to eliminate.

lucy24
msg:4562555, 11:31 pm on Apr 7, 2013 (gmt 0)

If you don't know/have the answer then be a man about it and admit it.

You lookin at me?

Incidentally, the original question-- as reflected in the thread title-- was "strip index and force www". I don't think that fits into any possible definition of "new question".

Wow the quoting tech in the forum is top notch.

Did you try the "quote" button? Or, in the alternative, typing the words [quote] and [/quote] exactly the way one would do in a php/bb forum. You can't do the ="ewwatson" attribution, though; the code here is hand-rolled and dates way back.

There are several ways to disable bracket-coding for individual words. The method I used here also works in php/bb2 but not 3.
