Grrrr... www and non-www should resolve to the same URL!

but currently do not...

         

JackR

9:06 pm on Oct 23, 2011 (gmt 0)

10+ Year Member



A popular canonicalization checker is reporting that:

• http://www.example.com and http://example.com should resolve to the same URL, but currently do not.


But my .htaccess is as follows:


Options +FollowSymLinks
RewriteEngine on

# Redirect if NOT www.example.com (exactly) to www.example.com
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Externally redirect direct client requests for "<any-directory>/index.html" and # "<any-directory>/index.htm" to "<any-directory>/" RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?.*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

# Externally redirect to fix up FQDN and appended port numbers
#RewriteCond %{HTTP_HOST} ^example.com(\.|:[0-9]*) [NC]
#RewriteRule (.*) http://www.example.com/$1 [R=301,L]



What could be wrong?

g1smd

9:12 pm on Oct 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Use the Live HTTP Headers extension for Firefox to investigate the problem.

Request
example.com/index.html
and see that you are redirected to
www.example.com/index.html
and then on to
www.example.com/


It is working exactly as you have coded it. This multiple step redirection chain is a big problem.
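For anyone without the browser extension, the same hop-by-hop view can be approximated with a short script. This is only a sketch using Python's standard library, not a description of how Live HTTP Headers works; the names fetch_once and trace_redirects and the limit of 10 hops are made up for illustration.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def fetch_once(url):
    """Issue one GET without following redirects; return (status, Location or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=fetch_once, max_hops=10):
    """Follow redirects by hand and return a list of (url, status) hops."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)  # the Location header may be relative
    return hops
```

Calling trace_redirects("http://example.com/index.html") returns one (url, status) pair per hop. For the chain described above it should list two 301 hops followed by the final 200, whereas a correctly ordered ruleset should show a single 301 before the 200.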

(There's another flaw that you probably haven't even spotted yet. A www request with appended port number is not redirected at all.)

Your rules are in the wrong order. This is how it should be:

1. Redirect non-www or www request with or without port number and with index.html to the correct www URL without port number and without index.html.

2. Redirect requests where hostname is not exactly www.example.com to www.example.com preserving the requested path name in the redirect.

This avoids the redirection chain.
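Why the order matters can be seen in a toy simulation. The sketch below is not mod_rewrite; it is a simplified Python model in which each rule either returns a new (host, path) or nothing, and the first rule that fires ends the request with a redirect. The function names are invented for illustration, paths are written without the leading slash, and ports and query strings are ignored.

```python
import re

CANONICAL = "www.example.com"


def host_rule(host, path):
    """Non-canonical host -> canonical host, path preserved."""
    if host and host != CANONICAL:
        return CANONICAL, path
    return None


def index_rule(host, path):
    """Any .../index.htm(l) -> canonical host with the bare directory."""
    m = re.match(r"^(([^/]+/)*)index\.html?$", path)
    if m:
        return CANONICAL, m.group(1)
    return None


def count_hops(host, path, rules):
    """Apply the first matching rule per request and count the redirects issued."""
    hops = 0
    while True:
        for rule in rules:
            result = rule(host, path)
            if result is not None:
                host, path = result
                hops += 1
                break
        else:
            # No rule fired: this request is served as-is.
            return hops, host, path


# Host rule first (the order in the original .htaccess): two redirects.
print(count_hops("example.com", "index.html", [host_rule, index_rule]))
# Index rule first, host rule last: one redirect.
print(count_hops("example.com", "index.html", [index_rule, host_rule]))
```

With the host rule first the model reports 2 hops for example.com/index.html; with the index rule first it reports 1, because the index rule already writes the canonical hostname into its target.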



Other problems:

Literal periods in RegEx patterns should be escaped.

One line of code appears to be partially commented out due to a cut and paste error.

^(([^/]*/)*)index\.html?$

The first * should be + here, otherwise you allow the pattern to match this URL:
example.com///folder///folder/folder///index.html


Redirect if NOT www.example.com (exactly) to www.example.com

Your code in that section is not set for "exactly", instead it is only coded as "begins with".
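Both regex points above (the * versus + quantifier, and "begins with" versus "exactly") can be checked outside Apache. The following is a rough sketch using Python's re module as a stand-in for the regex engine; it tests the patterns only and does not model how Apache normalizes a path before matching. The sample paths and the port number 8080 are made-up test inputs.

```python
import re

# RewriteRule path patterns: the original with "*" and the suggested "+".
loose_path = re.compile(r"^(([^/]*/)*)index\.html?$")
strict_path = re.compile(r"^(([^/]+/)*)index\.html?$")

messy = "//folder///folder/folder///index.html"
clean = "folder/sub/index.html"

print(bool(loose_path.match(messy)))   # True: empty path segments are accepted
print(bool(strict_path.match(messy)))  # False: every segment needs at least one character
print(bool(strict_path.match(clean)))  # True

# RewriteCond host patterns: "begins with" versus "exactly".
begins_with = re.compile(r"^www\.example\.com")
exactly = re.compile(r"^www\.example\.com$")


def redirects(pattern, host):
    """Mirror the negated condition: redirect when the pattern does NOT match."""
    return pattern.search(host) is None


for host in ("www.example.com", "www.example.com:8080", "www.example.com.", "example.com"):
    print(host, redirects(begins_with, host), redirects(exactly, host))
```

With the "begins with" pattern, www.example.com:8080 and www.example.com. count as already correct and are not redirected; with the trailing $ both are. That is the port-number flaw mentioned earlier in this post.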

JackR

9:50 pm on Oct 23, 2011 (gmt 0)

10+ Year Member



Live HTTP headers is reporting that all is fine.

www.example.com
www.example.com/
www.example.com/index.html
example.com
example.com/
example.com/index.html

all redirect to /:


http://www.example.com/index.html

GET /index.html HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive

HTTP/1.1 301 Moved Permanently
Date: Sun, 23 Oct 2011 20:43:46 GMT
Server: Apache/2.2.3 (CentOS)
Location: http://www.example.com/
Content-Length: 247
Connection: close
Content-Type: text/html; charset=iso-8859-1





Is this revised .htaccess correct?:


Options +FollowSymLinks
RewriteEngine on

# Externally redirect to fix up FQDN and appended port numbers
RewriteCond %{HTTP_HOST} ^example.com(\.|:[0-9]*) [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Redirect if NOT www.example.com (exactly) to www.example.com
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Externally redirect direct client requests for "<any-directory>/index.html" and # "<any-directory>/index.htm" to "<any-directory>/" RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?.*\ HTTP/
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

g1smd

10:06 pm on Oct 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



http://www.example.com/index.html

GET /index.html HTTP/1.1
Host: www.example.com

Try that again with the non-www index.html URL as I showed in the previous example.


You still have the cut and paste error in your code. The literal periods are still not escaped. The * has still not been corrected to + and the other RegEx pattern that should match "exactly" still only matches "begins with".

Your revised code still has three sets of rules. My instructions call for only two rulesets, both of which must be coded exactly as described above. That means the code you have now must be changed to exactly fit that description, not merely by shuffling the order around. Achieve this by combining the first and third rule and listing it first, and modifying the second rule and listing it last.

JackR

10:22 pm on Oct 23, 2011 (gmt 0)

10+ Year Member



Here's the full header:



http://example.com/index.html

GET /index.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive

HTTP/1.1 301 Moved Permanently
Date: Sun, 23 Oct 2011 21:18:04 GMT
Server: Apache/2.2.3 (CentOS)
Location: http://www.example.com/index.html
Content-Length: 257
Connection: close
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
http://www.example.com/index.html

GET /index.html HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive

HTTP/1.1 301 Moved Permanently
Date: Sun, 23 Oct 2011 21:18:04 GMT
Server: Apache/2.2.3 (CentOS)
Location: http://www.example.com/
Content-Length: 247
Connection: close
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
http://www.example.com/

GET / HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive

HTTP/1.1 200 OK
Date: Sun, 23 Oct 2011 21:18:04 GMT
Server: Apache/2.2.3 (CentOS)
Accept-Ranges: bytes
Content-Length: 20733
Connection: close
Content-Type: text/html
----------------------------------------------------------





Am I right in thinking the error is in the fact that


http://example.com/index.html

redirects to

http://www.example.com/index.html

then to

http://www.example.com



... whereas



http://example.com/index.html
http://www.example.com/index.html

should both redirect to

http://www.example.com

g1smd

10:49 pm on Oct 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, that's one of several errors in the code. You have created a "redirection chain" and that is not a good thing to have.

If you create rules that do "exactly" what the points in the numbered paragraphs describe you will fix that problem.

1. Redirect non-www or www request with or without port number and with index.html to the correct www URL without port number and without index.html.

2. Redirect requests where hostname is not exactly www.example.com to www.example.com preserving the requested path name in the redirect.

Your existing rules 1 and 3 should be combined to create new rule 1. Your existing rule 2 should be edited to create new rule 2.

JackR

11:00 pm on Oct 23, 2011 (gmt 0)

10+ Year Member



I've re-read your replies and had another attempt.

Third time lucky!:



Options +FollowSymLinks
RewriteEngine on

# Externally redirect to fix up FQDN and appended port numbers
# Externally redirect direct client requests for "<any-directory>/index.html" and "<any-directory>/index.htm" to "<any-directory>/"
RewriteCond %{HTTP_HOST} ^example.com(\.|:[0-9]*) [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?.*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

# Redirect if NOT www.example.com (exactly) to www.example.com
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

g1smd

11:16 pm on Oct 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you make the very last RewriteCond pattern "not exactly" instead of "not begins with" then you can delete the very first RewriteCond and very first RewriteRule in their entirety.

JackR

11:16 pm on Oct 23, 2011 (gmt 0)

10+ Year Member



Please tell me I've got it this time!:

Options +FollowSymLinks
RewriteEngine on

# Externally redirect direct client requests for "<any-directory>/index.html" and "<any-directory>/index.htm" to "<any-directory>/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?.*\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

# Redirect if NOT www.example.com (exactly) to www.example.com
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^www\.example\.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]



:)

lucy24

11:41 pm on Oct 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You've got some kind of block on the $ anchor, haven't you?

RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$

(There's a reason, which I have forgotten, for allowing the "or nothing at all" option.)
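One commonly given reason for the "or nothing at all" branch is that an old HTTP/1.0 client may send no Host header, which leaves HTTP_HOST empty; redirecting such a request would just produce another host-less request. Treat that explanation as an assumption, since it is not confirmed anywhere in this thread. What the pattern does can be checked directly, here with Python's re module standing in for the regex engine and a made-up helper name.

```python
import re

host_ok = re.compile(r"^(www\.example\.com)?$")


def needs_redirect(host):
    """Negated RewriteCond: redirect only when the pattern does not match."""
    return host_ok.search(host) is None


for host in ("", "www.example.com", "example.com", "www.example.com:80"):
    print(repr(host), needs_redirect(host))
```

The empty host and the exact canonical host are left alone; example.com and www.example.com:80 are redirected. Because the empty case is handled inside this pattern, a separate "RewriteCond %{HTTP_HOST} ." line in front of it would appear to be redundant.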

Does %{THE_REQUEST} need the full text? Seems like once you've specified REQUEST at all, a simple

index\.html?\ HTTP

would give the needed information without a needless capture. (I say this after establishing that "\.php" alone is enough to block bad robots while permitting auto-indexing via php.)

Is the ? at the end of your first rule a typo or is it intended to wipe any and all query strings?

JackR

11:52 pm on Oct 23, 2011 (gmt 0)

10+ Year Member



I'm not actually sure about the ? to be honest Lucy. I'm assuming it's sensible to remove it?


Options +FollowSymLinks
RewriteEngine on

# Externally redirect direct client requests for "<any-directory>/index.html" and "<any-directory>/index.htm" to "<any-directory>/"
RewriteCond %{THE_REQUEST} index\.html?\ HTTP
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

# Redirect if NOT www.example.com (exactly) to www.example.com
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]




Still gives the same error:

should resolve to the same URL, but currently do not.



ARRRRRGGGHHH!


:)

g1smd

11:59 pm on Oct 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The question mark in the RegEx patterns allows .html and .htm to match. The question mark on the end of the target URL removes appended query string data. Only you know if that is the right thing to do.

The (.*) previously near the end of the first RewriteCond allowed the pattern to match appended parameters. By deleting it, the rule doesn't work when there are attached parameters. Don't use (.*) though, use [^\ ]+ instead:
index\.html?(\?[^\ ]+)?\ HTTP/
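The difference between the shortened pattern and the one above shows up as soon as a query string is attached. A small sketch, using Python's re module in place of Apache's regex engine; the request lines, including the ref=abc parameter, are invented test inputs.

```python
import re

# Shortened pattern versus the pattern that tolerates an appended query string.
short = re.compile(r"index\.html?\ HTTP")
suggested = re.compile(r"index\.html?(\?[^\ ]+)?\ HTTP/")

request_lines = [
    "GET /index.html HTTP/1.1",
    "GET /folder/index.htm HTTP/1.1",
    "GET /index.html?ref=abc HTTP/1.1",
    "GET /about.html HTTP/1.1",
]

# RewriteCond patterns are unanchored unless they say otherwise, hence search().
for line in request_lines:
    print(line, bool(short.search(line)), bool(suggested.search(line)))
```

Both patterns match the first two request lines and reject the last. Only the pattern with (\?[^\ ]+)? matches the request that carries a query string.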

[edited by: g1smd at 12:06 am (utc) on Oct 24, 2011]

JackR

12:05 am on Oct 24, 2011 (gmt 0)

10+ Year Member



I'm even more confused now than when I first posted! After about 50 revisions - all of which fail - I'm almost lost.


I now have this:

Options +FollowSymLinks
RewriteEngine on

# Externally redirect direct client requests for "<any-directory>/index.html" and "<any-directory>/index.htm" to "<any-directory>/"
RewriteCond %{THE_REQUEST} index\.html?(\?[^\ ]+)?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]

# Redirect if NOT www.example.com (exactly) to www.example.com
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]



But it's still failing :(

lucy24

1:33 am on Oct 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: sob ::

Rule #1 as written: grab any request for "index.htm" or "index.html" with optional query string, and redirect to bare directory without query string.

Rule #2 as written: grab any request for hostname beginning in "example.com" and send the whole thing to www.example.com instead.

OK. In each case, what happens when you feed in addresses that are supposed to get redirected? I'm kinda antsy about the naked-directory redirect, because unless you've got a subsequent rewrite, some other module may secretly slip in and reappend an "index.html" to make it work. So then it looks as if nothing has happened when in fact two things have happened. Hence the suggestion for a utility such as "Live HTTP Headers".

Simple experiment, using the name of a nonexistent directory: temporarily comment-out your index-redirect Rule (keep the Condition) and substitute something like

RewriteRule (foobar/)index\.html?$ http://www.example.com/$1? [R=301,L]

You should end up on your 404 page-- but your browser's address bar will tell you where it thinks you are.

JackR

1:41 am on Oct 24, 2011 (gmt 0)

10+ Year Member



All HTTP codes are exactly as expected, so I decided to go back to basics with a textbook-fresh .htaccess as follows:


Options +Indexes +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

RewriteBase /
RewriteCond %{THE_REQUEST} ^[c-t]{3,9}\ /index\.html?(\?[^\ ]+)?\ HTTP/ [NC]
RewriteRule ^(.*)index\.html /$1 [R=301,L]



Same problem!

It's GOT to be an Apache config issue, surely?

JackR

2:39 am on Oct 24, 2011 (gmt 0)

10+ Year Member



Using the Live HTTP headers add-on, I've established the following with 100% certainty:

http://www.example.com/ returns HTTP/1.1 200 OK

http://www.example.com/index.html returns HTTP/1.1 301 Moved Permanently to http://www.example.com/

example.com/ returns HTTP/1.1 301 Moved Permanently to http://www.example.com/

example.com/index.html returns HTTP/1.1 301 Moved Permanently to http://www.example.com/


So the obvious question is this:

How is it possible that http://www.example.com and http://example.com DO NOT resolve to the same URL?



Live version of the .htaccess:

Options +FollowSymLinks
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?.*\ HTTP/ [NC]
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1? [R=301,L]
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

lucy24

3:22 am on Oct 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it possible that the "popular canonicalization checker" from your original post simply isn't working properly? If so, the remedy may be to say the ### with it and move on.

You've got a slightly cleaner htaccess in the process, so you haven't really wasted your time.

g1smd

6:50 am on Oct 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Post: 4378485

Remove the [NC] flag from the first RewriteCond, otherwise the rule doesn't fix incorrectly cased requests.

The second RewriteCond pattern should begin
^[A-Z]{3,9}\ /([^/]+/)*index
otherwise it only works for root index requests.

The RegEx pattern in the second RewriteRule is incorrect. Never use (.*) at the beginning of a pattern. Use
^(([^/]+/)*)index
here.

The target in the second rule should contain the protocol and domain name.

The rules are in the wrong order. Swap the order. Index first. Non-www last.


Post: 4378502

The pattern ([^/]*/)* should be ([^/]+/)* in two places.

The pattern
html?.*\ HTTP/
in the first RewriteCond should be
html?(\?[^\ ]+)?\ HTTP/
here. Never use .* at the beginning or in the middle of a RegEx pattern.

Remove [NC] from the second RewriteCond.

You have [R=301,L] in one rule and [L,R=301] in the other. While they both do exactly the same thing, you should get into the habit of always using one style. This makes typos easier to spot.
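The [NC] advice above can look backwards at first, so here is a rough check. In this sketch Python's re.IGNORECASE stands in for the [NC] flag, and the mixed-case hostname is a made-up test input; the helper name is hypothetical.

```python
import re

pattern = r"^www\.example\.com$"


def redirects(host, nocase):
    """Negated host condition: redirect when the pattern does NOT match."""
    flags = re.IGNORECASE if nocase else 0
    return re.search(pattern, host, flags) is None


for host in ("www.example.com", "WWW.Example.COM", "example.com"):
    print(host, redirects(host, nocase=True), redirects(host, nocase=False))
```

With the case-insensitive flag, WWW.Example.COM counts as a match, so the negated condition fails and the request is never redirected to the lowercase hostname. Without the flag it is redirected like any other non-canonical host.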

JackR

3:17 pm on Oct 24, 2011 (gmt 0)

10+ Year Member



Thank you g1smd,


With your corrections, I now have the following (hopefully perfect) .htaccess:

Options +FollowSymLinks
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?(\?[^\ ]+)?\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1? [R=301,L]
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]




EDIT: The checker still reports that

• http://www.example.com and http://example.com should resolve to the same URL, but currently do not.


So it must be broken!

g1smd

8:05 pm on Oct 24, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can now be sure that your site isn't broken, at least as far as canonicalisation goes.


For clarity, do add a blank line after each RewriteRule and add a # comment before each code chunk describing in plain English what it does.