Forum Moderators: Ocean10000 & incrediBILL & phranque

url canonicalization - normalization

is this sufficient?

   
7:00 am on Oct 13, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Would this be sufficient, or at least a good start:

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([^.]+\.[a-z]{2,6})$ [NC]
RewriteRule ^(.*)$ [%1...] [R=301,L]

The goal is for the URL to always end up in its www form.

I read somewhere that [NC] would not be a good choice.

Thanks

7:49 am on Oct 13, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Without knowing the format of the incoming URLs, I couldn't possibly comment. Why the {2,6} part? I assume that is for matching the .com part. You shouldn't end-anchor host names, in case there is an appended port number. Your code does not allow for subdomains other than the www and does not allow for hyphens in the domain name. Does that fit the specification of what you want it to do?

As for capitalisation, host names can be upper or lower or mixed case and will still refer to the same resource. Case changes in folder or file names are treated as being for a different resource by the HTTP specification and servers such as Apache which follow those specs. (note that M$ servers such as IIS break this rule).

Using ^(.*)$ is over-specified. The (.*) will suffice.

As well as fixing the non-www canonicalisation, don't forget the index canonicalisation too.

6:58 am on Oct 16, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



That was recommended to me to resolve the “non-www” issue, so that everything resolves to the “www” version. There was no background to it other than “www”. The code itself is from the Perishable Press site.
Here is what they say about those three lines:

RewriteCond %{HTTP_HOST} !^www\. [NC]
This directive is a condition that checks for the presence of the www prefix in the URL. Processing stops here if the URL already contains the www prefix. The [NC] flag renders the string as case-insensitive.

RewriteCond %{HTTP_HOST} ^([^.]+\.[a-z]{2,6})$ [NC]
This directive is a condition that matches the general pattern of a domain name. The regular expression matches any string of valid characters that is followed by a literal dot ( . ) and an alphabetic string containing two to six characters. For example, the common example of a domain name, domain.tld, will be matched by the regex. Likewise, the condition is designed to match any domain name.

RewriteRule ^(.*)$ [%1...] [R=301,L]
This directive is where the actual URL rewriting takes place. Whenever both of the previous conditions prove true, the RewriteRule directs Apache to rewrite the URL such that it includes the www prefix. The ^(.*)$ pattern matches any valid character string following the domain name (and top-level domain). Finally, the [%1...] serves as the pattern for the rewritten URL. The [R=301,L] flag signals that the change is permanent (i.e., 301), and also that this happens to be the last directive in this sequence of Rewrite rules.

For the domain I need this for, hyphens are a must.

I see I’ll have to read more about this.

8:36 am on Oct 16, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



*** RewriteCond %{HTTP_HOST} ^([^.]+\.[a-z]{2,6})$ [NC] ***

This would match example.com but it would not match example.co.uk if I understand this right.

The pattern !^www\. is far simpler and to the point (doesn't begin with "www.").

*** This directive is where the actual URL rewriting takes place. ***

Technically, this is not a rewrite. It is a redirect.

While this code could "work", it only works for certain input formats. If those exactly match what you are doing then you will never see a problem.
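
To make "certain input formats" concrete, here is a minimal sketch that runs the original second condition against a few Host values. It assumes Python's re module behaves like Apache's regex engine for this simple pattern; matches_original and the sample hosts are illustrative names, not from the thread.

```python
import re

# Second RewriteCond pattern from the first post; [NC] maps to re.IGNORECASE.
ORIGINAL = re.compile(r"^([^.]+\.[a-z]{2,6})$", re.IGNORECASE)


def matches_original(host: str) -> bool:
    """Return True if the Host header value satisfies the original condition."""
    return ORIGINAL.match(host) is not None


if __name__ == "__main__":
    for host in ("example.com", "my-site.com", "example.co.uk",
                 "example.com:80", "example.com."):
        print(f"{host:16} matches: {matches_original(host)}")
```

In this sketch only example.com and my-site.com match. The hyphenated name gets through because [^.]+ excludes nothing but the dot, while the two-part TLD, the port number, and the trailing dot all fail the end-anchored pattern.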

1:01 pm on Oct 16, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



As written, your rule would break if an FQDN was used or if a trailing port number was present. For example, these non-canonical URLs are all valid, but would not invoke the rule:
example.com./foo
example.com:80/foo
example.com.:80/foo

Because you must match the end of the hostname, the best way to fix it is to look for these optional parameters specifically. At the same time, we can allow hyphens within the domain and address the ".co.uk"-type hostnames that g1smd mentioned:


RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.)?[a-z]{2,6})\.?(:[0-9]{1,5})?$ [NC]
RewriteRule (.*) http://www.%1/$1 [R=301,L]

This should handle any/all valid domains/hostnames that do not contain a subdomain. If you wish to support adding "www" to subdomains --for example, redirecting foo.example.co.uk to www.foo.example.co.uk-- then you'd need to change the RewriteCond pattern to

RewriteCond %{HTTP_HOST} ^(([a-z0-9][a-z0-9\-]*[a-z0-9]\.)+(co\.)?[a-z]{2,6})\.?(:[0-9]{1,5})?$ [NC]

However, this pattern will be slower to process, so I wouldn't use it unless it's actually needed.

Jim
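
As a rough check of the three-line rule in this post (the version without subdomain support), the sketch below mimics it in Python. It assumes Python's re module is a fair stand-in for Apache's regex engine here, and that the path arrives without a leading slash, as it does in .htaccess context. redirect_target is a hypothetical helper name.

```python
import re
from typing import Optional

# Second RewriteCond pattern from the post above; [NC] maps to re.IGNORECASE.
NON_WWW = re.compile(
    r"^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.)?[a-z]{2,6})\.?(:[0-9]{1,5})?$",
    re.IGNORECASE,
)


def redirect_target(host: str, path: str) -> Optional[str]:
    """Return the 301 target the rule would produce, or None if it is skipped."""
    if host.lower().startswith("www."):       # RewriteCond !^www\. [NC]
        return None
    m = NON_WWW.match(host)                   # second RewriteCond, captures %1
    if m is None:
        return None
    return f"http://www.{m.group(1)}/{path}"  # http://www.%1/$1
```

With these assumptions, redirect_target("example.com.:80", "foo") and redirect_target("example.com", "foo") both give http://www.example.com/foo, and example.co.uk is handled too. A host such as foo.example.com returns None, as does a single-character name like x.com, because the first label in this pattern needs at least two characters.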

7:39 pm on Oct 16, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks!

Now, instead of matching characters, how about putting the domain name into the code?

Something like this:

RewriteCond %{HTTP_HOST} ^example.com [NC]
RewriteRule ^(.*)$ http://www.example.com /$1 [L,R=301]

I know I started the thread with universal code example, but now I’m thinking it may be easier for me to grasp it if I put my domain name into it.

From the above, what would the code below look like if we replaced the pattern matching with the real domain name:

RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.)?[a-z]{2,6})\.?(:[0-9]{1,5})?$ [NC]
RewriteRule (.*) [%1...] [R=301,L]

The point of matching is only to have something universal, right?

Thanks

[edited by: jdMorgan at 7:46 pm (utc) on Oct. 16, 2008]
[edit reason: example.com]

8:03 pm on Oct 16, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Well, there's a spurious space ahead of "/$1" in that code, so it won't work. But since we need to fix that, we might as well fully canonicalize the "www" version in the same rule:

RewriteCond %{HTTP_HOST} ^example\.com [NC,OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com(\.¦\.?:[0-9]{1,5})$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

Note that the "universal" rule in my previous post does not canonicalize the "www" domain variants, and we'd need an additional rule to do that if we wanted a truly-universal solution. Something like:

# Canonicalize all non-www domain variants
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.[a-z]{2}¦[a-z]{2,6}))\.?(:[0-9]{1,5})?$ [NC]
RewriteRule (.*) http://www.%1/$1 [R=301,L]
#
# Canonicalize all www domain variants
RewriteCond %{HTTP_HOST} ^www\.([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.[a-z]{2}¦[a-z]{2,6}))(\.¦\.?:[0-9]{1,5})$ [NC]
RewriteRule (.*) http://www.%1/$1 [R=301,L]

Note that I also tweaked the regex pattern for country-code matching a bit. Just a different way to do it, but I like it better.

Replace the broken pipe "¦" characters in all code above with solid pipes before use; posting on this forum modifies the pipe characters.

Jim
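
For anyone who wants to try the two "universal" rules before touching .htaccess, here is a minimal Python sketch of how they behave together. The patterns are copied from the post with the broken pipes replaced by solid ones, as instructed. It assumes Python's re module matches these patterns the same way Apache does; canonical_redirect is a hypothetical helper, and the path is assumed to have no leading slash.

```python
import re
from typing import Optional

NON_WWW = re.compile(
    r"^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.[a-z]{2}|[a-z]{2,6}))\.?(:[0-9]{1,5})?$",
    re.IGNORECASE,
)
WWW = re.compile(
    r"^www\.([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.[a-z]{2}|[a-z]{2,6}))"
    r"(\.|\.?:[0-9]{1,5})$",
    re.IGNORECASE,
)


def canonical_redirect(host: str, path: str) -> Optional[str]:
    """Return the 301 target from whichever rule fires first, else None."""
    # Rule 1: non-www variants (both conditions must hold).
    if not host.lower().startswith("www."):
        m = NON_WWW.match(host)
        if m:
            return f"http://www.{m.group(1)}/{path}"
    # Rule 2: www variants carrying a trailing dot and/or a port number.
    m = WWW.match(host)
    if m:
        return f"http://www.{m.group(1)}/{path}"
    return None
```

In this sketch, example.com, www.example.com. and www.example.com:80 all end up at http://www.example.com/ plus the path, while the already-canonical www.example.com returns None. That last case matters: the second pattern requires a dot or a port after the hostname, so the canonical form does not redirect to itself.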

4:36 am on Oct 17, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



RewriteCond %{HTTP_HOST} ^www\.example\.com(\.¦\.?:[0-9]{1,5})$ [NC]

What’s this for:

(\.¦\.?:[0-9]{1,5})$

Also, should the universal code cover both the non-www and the www domain variants? In other words, do I put the whole thing into .htaccess, not just one of the rules?

I guess part of my problem in understanding this is not just getting the regex into my head, but also having a good idea of which incoming URLs count as non-www and www domain variants.

Thanks

8:38 am on Oct 17, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



The extra parts allow the rule to accept URL requests with a trailing dot and/or a port number and then strip those off as the redirect is performed.

If you didn't do that, then all your content could be indexed both with and without a trailing dot on the hostname like www.example.com/yourfile.html and www.example.com./yourfile.html and again both with and without a port number like www.example.com/yourfile.html and www.example.com:80/yourfile.html for every page of your site.

By allowing for those as "inputs" and then removing them at the same time as you make other fixes to the URL, you eliminate the issue of Duplicate Content indexing for any and all pages of your site.
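
The tail of that pattern can be looked at in isolation. The sketch below tests only the (\.¦\.?:[0-9]{1,5})$ part, written with a solid pipe and with a leading ^ added so it can be run against the leftover text after the hostname. The leading anchor and the name has_strippable_tail are assumptions made for this illustration.

```python
import re

# Tail of the www rule's host pattern. The leading ^ is added for this sketch
# so the tail can be tested on its own, separately from the hostname.
TAIL = re.compile(r"^(\.|\.?:[0-9]{1,5})$")


def has_strippable_tail(tail: str) -> bool:
    """True if the text after the hostname is a dot, a port, or both."""
    return TAIL.match(tail) is not None


if __name__ == "__main__":
    for tail in (".", ":80", ".:80", "", ":123456"):
        print(f"{tail!r:10} -> {has_strippable_tail(tail)}")
```

The first alternative covers a lone trailing dot; the second covers a port number with or without a dot in front of it. An empty tail does not match, which is why the plain www.example.com host is left alone.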

5:15 pm on Oct 17, 2008 (gmt 0)

10+ Year Member



smallcompany:

I use this and it works fine for me:

RewriteEngine On

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html? [NC]
RewriteRule ^(([^/]*/)*)index\.html?$ [mydomain.com...] [R=301,L]

[Rewrite /index.html --> / (main page or folders)]

RewriteCond %{HTTP_HOST} ^mydomain\.com [NC]
RewriteRule ^(.*)$ [mydomain.com...] [R=301,L]

[Rewrite mydomain.com --> www.mydomain.com]

I'm not an expert at all. I got this code here in WW.
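
For readers trying to follow the index rule above, here is a small Python sketch of its two patterns. It assumes Python's re module matches them as Apache would, and it only shows the captured directory prefix ($1), since the redirect target in the post is truncated. index_prefix and the sample request lines are illustrative.

```python
import re
from typing import Optional

# RewriteCond on THE_REQUEST (the client's original request line), with [NC].
# The Apache version escapes the space because a space separates arguments.
THE_REQUEST = re.compile(r"^[A-Z]{3,9} /([^/]*/)*index\.html?", re.IGNORECASE)
# RewriteRule pattern; no [NC] flag in the post, so it stays case-sensitive.
RULE = re.compile(r"^(([^/]*/)*)index\.html?$")


def index_prefix(request_line: str, path: str) -> Optional[str]:
    """Return the directory prefix ($1) if the rule would fire, else None."""
    m = RULE.match(path)
    if m is None or THE_REQUEST.match(request_line) is None:
        return None
    return m.group(1)
```

With these assumptions, a request for /folder/sub/index.htm yields the prefix folder/sub/, and a request for /index.html yields an empty prefix. If the client asked for /folder/ and the server mapped that internally to folder/index.html, the function returns None, because THE_REQUEST still holds what the client originally sent. That check is what keeps the redirect from looping.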

jdMorgan, g1smd:

Without adding more rules, could I also redirect some wrong links to main page? Links that point, for example, to

www.mydomain.com/default.htm
www.mydomain.com/index.aspx
www.mydomain.com/.
www.mydomain.com/,
www.mydomain.com/%20
www.mydomain.com*/

I also have links that point to www.mydomain.com/somefolder/page.htm/ (with a trailing /) or www.mydomain.com/somefolder/page.htm#*$! (all kinds of characters after .htm).

5:45 pm on Oct 17, 2008 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



What you have commented as [Rewrite...] above, is not a rewrite.

The code is correct, but both rules perform a redirect. That's what the R=301 bit does. Change your note to say Redirect.

You can extend the rule that currently caters for index.html and index.htm and make it work for other names; something like this:

(index¦default)\.(html?¦php[45]?¦[aj]spx?¦cfm)

Notice the question marks and the stuff in [] that extend the options; it now responds to 20 (count 'em!) different names.
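
The count of 20 can be checked with a short Python sketch. The pattern is written here with solid pipes and balanced parentheses, and the ^ and $ anchors are added for this test only. The candidate names and extensions are an arbitrary sample chosen for illustration.

```python
import re
from itertools import product

# Extended index pattern, anchored at both ends for this standalone test.
INDEX_NAME = re.compile(r"^(index|default)\.(html?|php[45]?|[aj]spx?|cfm)$")

# Arbitrary candidates; "home", "php3" and "cgi" are included as non-matches.
NAMES = ("index", "default", "home")
EXTENSIONS = ("htm", "html", "php", "php3", "php4", "php5",
              "asp", "aspx", "jsp", "jspx", "cfm", "cgi")

matched = [f"{name}.{ext}" for name, ext in product(NAMES, EXTENSIONS)
           if INDEX_NAME.match(f"{name}.{ext}")]

if __name__ == "__main__":
    print(len(matched))  # 20
    print(matched)
```

Two base names times ten extensions (htm, html, php, php4, php5, asp, aspx, jsp, jspx, cfm) gives the 20 file names.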

Yes. You will need one more rule to fix most of the trailing stuff. Several variants have been posted quite a lot recently.

6:10 pm on Oct 17, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



One trailing-stuff fixer was posted five posts before this one...

Jim

4:00 pm on Oct 20, 2008 (gmt 0)

10+ Year Member



Thank you very much :-) I will test it.

6:21 am on Nov 17, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Hi,

I wonder about improving this code so that all non-existent subdomains, including misspellings of “www” like ww.site.com or wwww.site.com and so on, get caught and redirected to the main domain www.site.com.

Is that possible via .htaccess, or does it require something at the DNS level?

2:19 pm on Nov 17, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Your DNS has to be set up to point "wild-card" subdomains to your server's IP address, and then the second rule ("Canonicalize all non-www domain variants") in my post #3767371 above should take care of them.

Jim
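
A caveat that may be worth checking before relying on this: when the "Canonicalize all non-www domain variants" pattern is run through Python's re module (an assumption standing in for Apache's regex engine), hosts with an extra label such as ww.example.com do not appear to match, because the first label in the pattern cannot contain a dot. The sketch below shows that, followed by a hypothetical hard-coded comparison in the spirit of the domain-specific rule earlier in the thread. CANONICAL_HOST, rule_fires and needs_redirect are illustrative names, not from the thread.

```python
import re

# Non-www pattern from the earlier post, with solid pipes.
NON_WWW = re.compile(
    r"^([a-z0-9][a-z0-9\-]*[a-z0-9]\.(co\.[a-z]{2}|[a-z]{2,6}))\.?(:[0-9]{1,5})?$",
    re.IGNORECASE,
)

# Hypothetical canonical hostname, hard-coded for this sketch.
CANONICAL_HOST = "www.example.com"


def rule_fires(host: str) -> bool:
    """True if both conditions of the universal non-www rule hold."""
    return (not host.lower().startswith("www.")
            and NON_WWW.match(host) is not None)


def needs_redirect(host: str) -> bool:
    """Hard-coded alternative: anything other than the canonical host."""
    return host.lower() != CANONICAL_HOST


if __name__ == "__main__":
    for host in ("example.com", "ww.example.com", "wwww.example.com"):
        print(f"{host:18} rule_fires={rule_fires(host)} "
              f"needs_redirect={needs_redirect(host)}")
```

In this sketch the universal rule fires for example.com but not for ww.example.com or wwww.example.com, whereas the hard-coded comparison flags all three. So, under these assumptions, a site that wants misspelled subdomains caught may need a rule that names its own domain, in addition to the wild-card DNS entry.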

5:05 pm on Nov 17, 2008 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks.
 
