
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
url canonicalization - normalization
is this sufficient?
smallcompany

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3764333 posted 7:00 am on Oct 13, 2008 (gmt 0)

Would this be sufficient, or at least a good start:

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([^.]+\.[a-z]{2,6})$ [NC]
RewriteRule ^(.*)$ [%1...] [R=301,L]

The goal is that the URL always ends up in its www form.

I already saw somewhere that NC would not be a good choice.

Thanks

 

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 7:49 am on Oct 13, 2008 (gmt 0)

Without knowing the format of the incoming URLs, I couldn't possibly comment. Why the {2,6} part? I assume that is for matching the .com part. You shouldn't end-anchor host names, in case there is an appended port number. Your code does not allow for subdomains other than www and does not allow for hyphens in the domain name. Does that fit the specification of what you want it to do?

As for capitalisation, host names can be upper or lower or mixed case and will still refer to the same resource. Case changes in folder or file names are treated as being for a different resource by the HTTP specification and servers such as Apache which follow those specs. (note that M$ servers such as IIS break this rule).

Using ^(.*)$ is over-specified. The (.*) will suffice.

As well as fixing the non-www canonicalisation, don't forget the index canonicalisation too.

smallcompany

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3764333 posted 6:58 am on Oct 16, 2008 (gmt 0)

That was recommended to me to resolve the “non-www” issue, so that everything resolves to the “www” version. There was no reason behind it other than the “www” issue. The code itself is from the Perishable Press site.
Here is what they say about those three lines:

RewriteCond %{HTTP_HOST} !^www\. [NC]
This directive is a condition that checks for the presence of the www prefix in the URL. Processing stops here if the URL already contains the www prefix. The [NC] flag renders the string as case-insensitive.

RewriteCond %{HTTP_HOST} ^([^.]+\.[a-z]{2,6})$ [NC]
This directive is a condition that matches the general pattern of a domain name. The regular expression matches any string of valid characters that is followed by a literal dot ( . ) and an alphabetic string containing two to six characters. For example, the common example of a domain name, domain.tld, will be matched by the regex. Likewise, the condition is designed to match any domain name.

RewriteRule ^(.*)$ [%1...] [R=301,L]
This directive is where the actual URL rewriting takes place. Whenever both of the previous conditions prove true, the RewriteRule directs Apache to rewrite the URL such that it includes the www prefix. The ^(.*)$ pattern matches any valid character string proceeding the domain name (and top-level domain). Finally, the [%1...] serves as the pattern for the rewritten URL. The [R=301,L] flag signals that the change is permanent (i.e., 301), and also that this happens to be the last directive in this sequence of Rewrite rules.

In the case of domain I need this for, hyphens are a must.

I see I’ll have to read more about this.

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 8:36 am on Oct 16, 2008 (gmt 0)

*** RewriteCond %{HTTP_HOST} ^([^.]+\.[a-z]{2,6})$ [NC] ***

This would match example.com but it would not match example.co.uk if I understand this right.

The pattern !^www\. is far simpler and to the point (doesn't begin with "www.").

*** This directive is where the actual URL rewriting takes place. ***

Technically, this is not a rewrite. It is a redirect.

While this code could "work", it only works for certain input formats. If those exactly match what you are doing then you will never see a problem.
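
To illustrate the rewrite/redirect distinction made above, here is a minimal sketch for .htaccess. The paths old-page.html, new-page.html, products/ and product.php, and the host www.example.com, are hypothetical and do not come from this thread; it assumes mod_rewrite is enabled.

```apache
RewriteEngine On

# External redirect: the server answers with a 301 status and a Location
# header, and the browser then makes a second request for the new URL.
# The URL shown in the address bar changes.
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]

# Internal rewrite: the server serves a different file for the requested
# URL without telling the client. The URL shown in the address bar does
# not change.
RewriteRule ^products/([0-9]+)$ /product.php?id=$1 [L]
```

Only the first kind (with the R flag) changes what search engines see as the URL, which is why canonicalisation rules are redirects.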

jdMorgan

WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 1:01 pm on Oct 16, 2008 (gmt 0)

As written, your rule would break if an FQDN was used or if a trailing port number was present. For example, these non-canonical URLs are all valid, but would not invoke the rule:
example.com./foo
example.com:80/foo
example.com.:80/foo

Because you must match the end of the hostname, the best way to fix it is to look for these optional parameters specifically. At the same time, we can allow hyphens within the domain and address the ".co.uk"-type hostnames that g1smd mentioned:

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.)?[a-z]{2,6})\.?(:[0-9]{1,5})?$ [NC]
RewriteRule (.*) http://www.%1/$1 [R=301,L]

This should handle any/all valid domains/hostnames that do not contain a subdomain. If you wish to support adding "www" to subdomains --for example, redirecting foo.example.co.uk to www.foo.example.co.uk-- then you'd need to change the RewriteCond pattern to

RewriteCond %{HTTP_HOST} ^(([a-z0-9][a-z0-9\-]*[a-z0-9]\.)+(co\.)?[a-z]{2,6})\.?(:[0-9]{1,5})?$ [NC]

However, this pattern will be slower to process, so I wouldn't use it unless it's actually needed.

Jim

smallcompany

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3764333 posted 7:39 pm on Oct 16, 2008 (gmt 0)

Thanks!

Now, instead of matching characters, how about putting the domain name into the code?

Something like this:

RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com /$1 [L,R=301]

I know I started the thread with a universal code example, but now I’m thinking it may be easier for me to grasp if I put my domain name into it.

Based on the above, what would the code below look like if we replaced the matching characters with a real domain name:

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.)?[a-z]{2,6})\.?(:[0-9]{1,5})?$ [NC]
RewriteRule (.*) [%1...] [R=301,L]

The point of matching is only to have something universal, right?

Thanks

[edited by: jdMorgan at 7:46 pm (utc) on Oct. 16, 2008]
[edit reason] example.com [/edit]

jdMorgan

WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 8:03 pm on Oct 16, 2008 (gmt 0)

Well, there's a spurious space ahead of "/$1" in that code, so it won't work. But since we need to fix that, we might as well fully canonicalize the "www" version in the same rule:

RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]{1,5})$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Note that the "universal" rule in my previous post does not canonicalize the "www" domain variants, and we'd need an additional rule to do that if we wanted a truly-universal solution. Something like:

# Canonicalize all non-www domain variants
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.[a-z]{2}|[a-z]{2,6}))\.?(:[0-9]{1,5})?$ [NC]
RewriteRule (.*) http://www.%1/$1 [R=301,L]
#
# Canonicalize all www domain variants
RewriteCond %{HTTP_HOST} ^www\.([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.[a-z]{2}|[a-z]{2,6}))(\.|\.?:[0-9]{1,5})$ [NC]
RewriteRule (.*) http://www.%1/$1 [R=301,L]

Note that I also tweaked the regex pattern for country-code matching a bit. Just a different way to do it, but I like it better.

The pipe characters in all code above must be solid pipes ("|"). Posting on this forum changes them to broken pipes ("¦"), so check for that before use.

Jim

smallcompany

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3764333 posted 4:36 am on Oct 17, 2008 (gmt 0)

RewriteCond %{HTTP_HOST} ^www\.example\.com(\.|\.?:[0-9]{1,5})$ [NC]

What’s this for:

(\.|\.?:[0-9]{1,5})$

Also, does the universal code consist of both the non-www and the www rule blocks? In other words, do I put the whole thing into .htaccess, not just one of them?

I guess part of my problem in understanding this is not just getting the regex into my brain, but also having a good idea of which incoming URLs count as non-www and www domain variants.

Thanks

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 8:38 am on Oct 17, 2008 (gmt 0)

The extra parts allow the rule to accept URL requests with a trailing dot and/or a port number and then strip those off as the redirect is performed.

If you didn't do that, then all your content could be indexed both with and without a trailing dot on the hostname like www.example.com/yourfile.html and www.example.com./yourfile.html and again both with and without a port number like www.example.com/yourfile.html and www.example.com:80/yourfile.html for every page of your site.

By allowing for those as "inputs" and then removing them at the same time as you make other fixes to the URL, you eliminate the issue of Duplicate Content indexing for any and all pages of your site.

Nimzovich

10+ Year Member



 
Msg#: 3764333 posted 5:15 pm on Oct 17, 2008 (gmt 0)

smallcompany:

I use this and it works fine for me:

RewriteEngine On

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html? [NC]
RewriteRule ^(([^/]*/)*)index\.html?$ [mydomain.com...] [R=301,L]

[Rewrite /index.html --> / (main page or folders)]

RewriteCond %{HTTP_HOST} ^mydomain\.com [NC]
RewriteRule ^(.*)$ [mydomain.com...] [R=301,L]

[Rewrite mydomain.com --> www.mydomain.com]

I'm not an expert at all. I got this code here in WW.

jdMorgan, g1smd:

Without adding more rules, could I also redirect some wrong links to main page? Links that point, for example, to

www.mydomain.com/default.htm
www.mydomain.com/index.aspx
www.mydomain.com/.
www.mydomain.com/,
www.mydomain.com/%20
www.mydomain.com*/

Also, I have links that point to www.mydomain.com/somefolder/page.htm/ (with /) or www.mydomain.com/somefolder/page.htm#*$!(all kinds of characters after .htm)

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 5:45 pm on Oct 17, 2008 (gmt 0)

What you have commented as [Rewrite...] above, is not a rewrite.

The code is correct, but both rules perform a redirect. That's what the R=301 bit does. Change your notes to say Redirect.
.

You can extend the rule that currently caters for index.html and index.htm and make it work for other names; something like this: (index|default)\.(html?|php[45]?|[aj]spx?|cfm)
Notice the question marks and the character classes in [] that extend the options; it now responds to 20 (count 'em!) different names.
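
As a rough sketch, the extended pattern could be dropped into the index redirect posted earlier in the thread like this. The target http://www.example.com/$1 is an assumption, because the original target URL is truncated in that post; substitute your own canonical hostname.

```apache
RewriteEngine On

# Redirect any directly requested index/default page to its folder URL.
# THE_REQUEST holds the original request line sent by the client
# (e.g. "GET /foo/index.php HTTP/1.1"), so the rule does not fire when the
# server itself maps "/" to an index file internally, which avoids a loop.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*(index|default)\.(html?|php[45]?|[aj]spx?|cfm) [NC]
# $1 is the folder path in front of the file name (empty for the site root).
RewriteRule ^(([^/]*/)*)(index|default)\.(html?|php[45]?|[aj]spx?|cfm)$ http://www.example.com/$1 [R=301,L]
```

Because the redirect target already carries the canonical hostname, a request such as example.com/folder/index.php is fixed in a single hop when this rule is placed ahead of the hostname rule.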
.

Yes. You will need one more rule to fix most of the trailing stuff. Several variants have been posted quite a lot recently.

jdMorgan

WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 6:10 pm on Oct 17, 2008 (gmt 0)

One trailing-stuff fixer was posted five posts before this one...

Jim

Nimzovich

10+ Year Member



 
Msg#: 3764333 posted 4:00 pm on Oct 20, 2008 (gmt 0)

Thank you very much :-) I will test it.

smallcompany

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3764333 posted 6:21 am on Nov 17, 2008 (gmt 0)

Hi,

I wonder about improving this code so that all non-existent subdomains, including misspellings of “www” like ww.site.com or wwww.site.com and so on, get caught and redirected to the main domain www.site.com.

Is that possible via .htaccess, or does it require something at the DNS level?

jdMorgan

WebmasterWorld Senior Member jdmorgan us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3764333 posted 2:19 pm on Nov 17, 2008 (gmt 0)

Your DNS has to be set up to point "wild-card" subdomains to your server's IP address, and then the second rule ("Canonicalize all non-www domain variants") in my post #3767371 above should take care of them.

Jim
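
For a single known domain, one alternative sketch (not the rule referred to above, but a different, commonly used technique) is to redirect every hostname that is not exactly the canonical one. It assumes the canonical host is www.example.com, that wildcard DNS is in place as described, and that the server is configured to answer for those hostnames.

```apache
RewriteEngine On

# Leave requests without a Host header alone (old HTTP/1.0 clients);
# redirecting them could loop, since they would still send no Host header.
RewriteCond %{HTTP_HOST} !^$
# Match any hostname that is not exactly "www.example.com", for example
# example.com, ww.example.com, wwww.example.com, www.example.com. or
# www.example.com:80. The NC flag makes the comparison case-insensitive.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

The trade-off is that the domain name is hard-coded, so this is not a "universal" rule; in exchange, the pattern stays short and misspelled subdomains need no special handling.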

smallcompany

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3764333 posted 5:05 pm on Nov 17, 2008 (gmt 0)

Thanks.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved