Mod rewrite, canonicals, and capitalisation in the domain name

Is it possible to force lower case in the domain name and TLD?

         

bcrbcr

3:49 pm on May 8, 2007 (gmt 0)

10+ Year Member



This message was originally posted in another section of WebmasterWorld by mistake:

I have been working through WebmasterWorld pages on getting some of my sites correctly canonicalised. I had some great advice from jdMorgan, and then I picked up a thread on the /index rewrite to start cutting off the enemy at the pass. Best forum around for this stuff.
The following is my current (simple) mod_rewrite code, and I am still confused as to why the capitalisation in the domain doesn't get forced to lower case.

I assumed that www.EXAMPLE.COM would be forced to www.example.com - doesn't seem to work that way.

Rule 1 below deals with the index.htm(l) problem nicely, Rule 2 handles www. vs. non-www very well, and they also work in combination.

But I still have capitalisation issues. I don't mean inside the site or with individual URLs, which is a different issue. I mean the capitalisation of the domain name and TLD.

I have read and think I understand the capitalisation discussion on WebmasterWorld, so I don't think that is the issue. I have also tried adding [nc] to different lines and still can't make headway.

Options +FollowSymlinks
RewriteEngine on
rewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/ [nc]
rewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]
rewriteCond %{HTTP_HOST} !^www\.example\.com$
rewriteRule (.*) http://www.example.com/$1 [R=301,L]

Any ideas from the experts? Have I misunderstood how far you can go here?

Thanks

jdMorgan

8:22 pm on May 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



By definition, domain names are sent by all normal browsers and 'bots as all-lowercase, no matter what you type and no matter what on-page link capitalization is used. The capitalization fix is only needed for rogue user-agents. But is it possible that the browser forcing all-lowercase is interfering with your interpretation of your test results?

I suggest using the Live HTTP Headers extension to Firefox to actually observe the HTTP request and response headers. This is not only a great test tool, but also a good learning tool. It allows you to see the raw requests your browser sends, and to view the server response headers.

Jim

[added]
Do not end-anchor the domain in this line:
rewriteCond %{HTTP_HOST} !^www\.example\.com$
the line should read either
RewriteCond %{HTTP_HOST} !^www\.example\.com
-or-
RewriteCond %{HTTP_HOST} !^www\.example\.com(:[0-9]+)?$
This will prevent the rule from failing if a port number is appended to the domain -- a perfectly-valid possibility.
[/added]
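
As a rough sketch of how the port-tolerant pattern behaves, the comments below walk through a few Host header values. The sample values are illustrative assumptions, not taken from real logs. Because the pattern is negated with "!" and carries no [NC] flag, the condition is true (and the redirect fires) for anything that is not an exact, case-sensitive match.

```apache
# Pattern: !^www\.example\.com(:[0-9]+)?$   (case-sensitive, no [NC])
#
# Host header sent       Pattern matches?   Outcome
# www.example.com        yes                condition false, no redirect
# www.example.com:80     yes                condition false, no redirect
# example.com            no                 condition true, 301 redirect
# WWW.EXAMPLE.COM        no                 condition true, 301 redirect
RewriteCond %{HTTP_HOST} !^www\.example\.com(:[0-9]+)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

Note that adding [NC] to this RewriteCond would make the uppercase host count as a match, so it would no longer be redirected.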

[edited by: jdMorgan at 8:25 pm (utc) on May 8, 2007]

bcrbcr

9:47 pm on May 8, 2007 (gmt 0)

10+ Year Member



Jim
Thanks again for your reply - very useful and I will adjust as you suggest.

I also posted this in error in another area
[webmasterworld.com...]
and had some interaction with AjiNIMC and Tedster.

I have been reading your material
[webmasterworld.com...]
A guide to fixing duplicate content & URL issues on Apache

and was concerned with capitalisation. I now understand (I think) that the main capitalisation issue is after the "/", not in the domain name.

So www.MyFavouriteDomain.com (which looks better in print for humans than www.myfavouritedomain.com) is fine to use, and would resolve with any lower case/uppercase mixture anyway, but

www.MyFavouriteDomain.com/NextPage.htm

needs to be converted to a lower case version such as

www.MyFavouriteDomain.com/nextpage.htm

when designing or naming pages.

My last remaining question (from the other thread) is why, for example, www.google.COM would show a PR of zero, whereas www.google.com shows a PR of 10 in the Google Toolbar. This apparent loss of PR also occurs with my own domain.

Do you have any thoughts? AjiNIMC suggested a possible bug in the google toolbar.

Thanks again
Bryan

jdMorgan

10:30 pm on May 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




Options +FollowSymlinks
RewriteEngine on
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/ [NC]
RewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]
#
RewriteCond %{HTTP_HOST} !^www\.example\.com(:[0-9]+)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Having cleaned this up and corrected the mixed-up case on directives and flags, it can be seen that if any hostname is requested that is not EXACTLY "www.example.com", a redirect will be generated to www.example.com. The only allowed exception is if a port number is appended, which can occur, although it's rare with non-SSL sites. So if a client does manage to request an uppercase or mixed-case hostname, it will be redirected/corrected.

It's good to worry about canonicalization, but the above two rules will take care of 99% of actually-common problems.

Other things to look at:

If (and only if) your site uses static URLs, you may want to remove spurious query strings from requests.
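
A minimal sketch of one way to do this in .htaccess, assuming every URL on the site is static so that no legitimate request ever carries a query string. The trailing "?" on the substitution URL tells mod_rewrite to discard the original query string; the "?" itself is not sent to the client.

```apache
# Sketch only: assumes the site has no URLs that need a query string.
# The condition is true whenever the query string has at least one character.
RewriteCond %{QUERY_STRING} .
# Redirect to the same path with the query string removed.
RewriteRule (.*) http://www.example.com/$1? [R=301,L]
```

Placing this ahead of the hostname rule should let one redirect fix both the query string and the hostname.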

If you see a lot of bad links from forums whose software includes the sentence-ending period in an auto-linked URL, then you might want to address those, too.

One example is [webmasterworld.com....]
Another is [webmasterworld.com...] <- Note that the periods are auto-linked by the forum software, and will appear in the URL.
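
One possible way to handle such links, sketched here under the assumption that no real URL on the site ends with a period, is to redirect any request whose path ends in one or more periods to the same path without them.

```apache
# Sketch only: assumes no legitimate URL on the site ends with ".".
# (.*[^.]) captures the path up to its last non-period character,
# and \.+ matches the stray trailing period(s) that get dropped.
RewriteRule ^(.*[^.])\.+$ http://www.example.com/$1 [R=301,L]
```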

You posted that you'd read the "canonicalization guide" thread in our forum library; these issues are covered in that thread.

Jim