Forum Moderators: Robert Charlton & goodroi
Why I might not do it
Why I might do it
What does Google do?
They will take the most appropriate URL, which is the one with / almost 99% of the time. I have never seen a duplicate content issue so far due to this. For www vs non-www, Google has an option in the Webmaster Console.
What is your opinion about it?
Thanks,
AjiNIMC
I'll just address one of your points...
Very important, when a person is coming to index.html I am redirecting it to / which is making my customer wait.
You need to do a server side 301 redirect, which is virtually instantaneous. Your customers will never notice.
Possibly, though, you may be thinking of a client side (browser) redirect, perhaps a meta refresh redirect. These are entirely different from proper 301s... not the same thing at all.
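For reference, a server-side 301 of this kind takes only a few lines of Apache config. This is just a sketch, assuming Apache with mod_rewrite enabled and www.example.com as a placeholder for the real host:

```apache
Options +FollowSymLinks
RewriteEngine On

# Only act on direct external requests for /index.html or /index.htm,
# not on the internal subrequest Apache makes when serving "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/
RewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]
```

The %{THE_REQUEST} condition matters: without it, the internal DirectoryIndex subrequest that maps "/" onto index.html would itself get redirected, causing a loop.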
[edited by: Robert_Charlton at 9:01 am (utc) on May 2, 2007]
Very important, when a person is coming to index.html I am redirecting it to / which is making my customer wait.
You don't have to do it. Make index.html as the canonical one and eventually redirect the other way, or forbid duplicate ones with the robots.txt.
The point is that only one uri should be authoritative, regardless of which one.
No, because after it caused a disaster at MSN I've 301'd index.html or index.htm to the site root/ so no one will be able to see it, it'll redirect. You'll just have to take my word for it that exactly that very issue caused a couple of sites to totally bomb because the engine got it wrong (not Google, but why play with fire - MSN Search sends BUYERS). Besides which, I don't lift my skirt in front of strangers, and I don't know you. ;)
>>I have always seen Google taking care of it.
No, it isn't always taken care of, and for very good reasons. index.html may or may not be the root of a site, they are not necessarily the same page and often aren't. The root could be index.php or default.htm or whatever - and don't think for one minute that there haven't been people who spammed the h*ll out of the engines with multiples, using massive numbers of IBLs with different anchor text.
Let's not expect the crawlers to guess what we mean, let's be specific on what the root of a site is so it's made very clear. That's up to us, on our end.
http://www.example.com/ may or may not be the same as http://www.example.com/index.html or
http://www.example.com/index.htm - or any number of variations, which are all different pages.
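One way to be specific about the root, on Apache, is to declare the directory index yourself rather than relying on server defaults. A sketch (the filename is whatever your site actually uses):

```apache
# Tell Apache exactly which file serves a directory request such as "/"
DirectoryIndex index.html
```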
I've had an instance where MSN Search *lost* the root of a site of mine and instead indexed http://www.example.com/index.html, which totally messed things up - there was not one link in the world to index.html, and what I had to do was 301 from that to the actual root. It killed the site, because the internal navigation structure and IBLs were totally skewed on their end because of the mistake. The site has never fully recovered, and it's a huge Q4 traffic loss.
Rule #1: Never confuse the bot or make her wonder, think or second-guess. She is not a mind reader, able to second-guess what the webmaster's intentions are.
[edited by: Marcia at 1:33 pm (utc) on May 2, 2007]
They will take the most appropriate URL, which is with / almost 99% of the times. I have never seen a duplicate content issue so far due to this. For www and non-www google has an option under webmaster console.
I most certainly have seen a duplicate content issue with this, and with good reason. They're not the same URL, and URLs are each indexed with a unique DocID for each URL. Read the white papers and the patents for confirmation.
Additionally, the average Joe Shmoe's webmaster generally knows nothing about Google's Webmaster Console, maybe never even heard of it. They're generally totally clueless and just do the index.htm thing for the sake of the ease of running the site in a development environment and then transitioning it to the live site.
The developer's "convenience" is not Google's concern or problem, it never was, still isn't, and never will be.
[edited by: Marcia at 2:05 pm (utc) on May 2, 2007]
I fixed the problem with a 301 redirect (at the suggestion of several members here), and my traffic came back two months later.
Having gone through that experience, I'm inclined to think that it's better to be proactive than reactive. Fix the problem before it becomes a problem, and you may save yourself a nasty surprise.
Late 2004, and seesawing almost off the map in 2005, left me with zero confidence in any search engine's ability to figure out the mess. They may try in some cases, but they fail more than a small fraction of the time.
sites to totally bomb because the engine got it wrong (not Google, but why play with fire - MSN Search sends BUYERS)
Besides which, I don't lift my skirt in front of strangers, and I don't know you. ;)
No, it isn't always taken care of, and for very good reasons.
and don't think for one minute that there haven't been people who spammed the h*ll out of the engines with multiples, using massive numbers of IBLs with different anchor text.
Rule #1: Never confuse the bot or make her wonder, think or second-guess. She is not a mind reader, able to second-guess what the webmaster's intentions are.
Thanks theBear, europeforvisitors, MrStitch, Marcia, activeco and Robert Charlton for the replies. I am going to do it now :). Thanks for convincing me with scary real life examples. May God keep us all away from search engine bugs.
You are already in trouble in much the same way that www and non-www cause problems.
You should always link back to "/" from within your site. To make things tidy I always redirect (index¦default¦home)\.(html?¦php¦asp¦cfm¦jsp) back to "/" too, with a 301 redirect.
You can change the technology that runs the website at any time without exposing any new URLs to the outside world. Additionally you make it harder for people to know what technology actually runs the site.
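Spelled out as mod_rewrite, the pattern described above would look something like this. A sketch only - the ¦ characters in the pattern are this forum's rendering of the regex pipe |, and www.example.com stands in for the real host:

```apache
RewriteEngine On

# Redirect any direct request for a default-document filename back to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(index|default|home)\.(html?|php|asp|cfm|jsp)\ HTTP/ [NC]
RewriteRule ^(index|default|home)\.(html?|php|asp|cfm|jsp)$ http://www.example.com/ [R=301,L]
```

As with the single-file version, checking %{THE_REQUEST} keeps the rule from firing on Apache's own internal DirectoryIndex subrequest, which would otherwise loop.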
or even: DirectoryIndex SomeOtherDir/Script.pl
So, if you don't know what you're doing, redirecting files back to root could cause loops and additional problems.
IMO, the simplest, fastest and most secure way is to set default root file and exclude it in robots.txt.
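For completeness, the robots.txt approach described here would look something like the following, assuming index.html is the DirectoryIndex file and all internal links point to "/":

```
User-agent: *
# Block direct crawling of the physical filename; "/" itself stays crawlable
Disallow: /index.html
```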
Make index.html as the canonical one...
In my experience, links to your site (ie, to the default canonical) are going to tend to be in the form www.example.com/ no matter what you do, so you are creating a problem by making index.html your canonical.
...So, if you don't know what you're doing, redirecting files back to root could cause loops and additional problems. IMO, the simplest, fastest and most secure way is to set default root file and exclude it in robots.txt.
Agreed that the redirection, as well as the setup of the default root file, most definitely needs to be done by someone who knows what he or she is doing. When I get to index.html rewrites, I have someone else set up the server and do the rewrites for me.
But setting up a default root file, of, say, index.html, continuing to link to it and excluding it in robots.txt, IMO, is going to continue to split your PageRank and isn't going to cure your display problem... and it encourages potential links to the index.html form, further complicating the PageRank split.
But setting up a default root file, of, say, index.html, continuing to link to it and excluding it in robots.txt, IMO, is going to continue to split your PageRank and isn't going to cure your display problem...
No, index.html (or any other DirectoryIndex file) is transparent to robots and is associated with the root directory. All referring links still point to /. To escape the duplicate content issue, straight access to the real file, in this case index.html, is disallowed in robots.txt.
I have not seen any site having two caches, one for index.html and another for /, but I have seen sites having different caches for ?id=1 and ?id=2 (though they go to the same page). I am doing some experiments to confirm it; it might take 5 to 6 days.
To what level of canonicalization will we go? Google is the smartest I have seen here - perfect canonicalization, and no other search engine has done it. Try ?id=1 for Google and try the same for Yahoo or Ask or MSN.
I just saw your recent post under the Google canonicalization thread and saw your rewrite example.
You indicated
(index¦default¦home)\.(html?¦php¦asp¦cfm¦jsp)
I'm pretty rusty on rewrites as well as regular expressions, but shouldn't the rewrite be
(index¦default¦home)\.(htm?¦php¦asp¦cfm¦jsp)
The following is my current (simple) mod_rewrite setup, and I am still confused as to why the capitalisation in the domain doesn't get forced to lower case.
I assumed that www.EXAMPLE.COM would be forced to www.example.com - doesn't seem to work that way.
Rule 1 below deals with the index.htm(l) problem nicely, Rule 2 with the www vs. non-www issue very well, and in combination they also work.
But I still have capitalisation issues - I don't mean inside the site or with individual URLs, which is a different issue. I mean the capitalisation of the domain name and TLD.
I have read and think I understand the capitalisation discussion on webmasterworld, so I don't think that is the issue. I have also messed with adding [nc] to different lines and still can't make headway.
Options +FollowSymLinks
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/ [NC]
RewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
Any ideas from the experts? Have I misunderstood how far you can go here?
Thanks
I assumed that www.EXAMPLE.COM would be forced to www.example.com - doesn't seem to work that way.
These are already considered the same; look at the global definition by ICANN ( icann.org/riodejaneiro/idn-topic.htm ):
example.com
Example.com
eXample.com
exaMple.com
examPle.com
exampLe.com
examplE.com
EXAMPLE.com
ExAMPLE.com
EXaMPLE.com
EXAmPLE.com
EXAMpLE.com
EXAMPlE.com
EXAMPLe.com
etc.
In the languages that utilize Latin characters (e.g., English, Finnish, German, Italian, etc.), each letter has two variants: upper case and lower case. The Internet's basic DNS and hostname specifications provide that the upper-case and lower-case variants of each letter are considered to be equivalent. Thus, all the variant domain names in the above list are treated as the same domain name.
I hope this answers your query.
Regards,
AjiNIMC
If I use the Google toolbar as an approximate measure of relative strength / popularity of a domain (and I know this has weaknesses as an approach, but follow for now...)
www.example1.com has a PR of 10 (let's say, Y for example)
www.example2.com has a PR of 10 (let's say, G for example)
however, with capitals
www.example1.COM has a PR of zero (in Y's case)
www.example2.COM has a PR of zero (in G's case)
Is this significant? Well yes, I think so - because we are all desperately chasing rankings and PR numbers, and any reduction from 3 to 2 or from 5 to 4 would start some sort of seizure amongst most of us - not to mention the hours of emails and queries on this site.
This factor (capitalised tld) causes PR to disappear completely.
In the case of my own sites, the same thing happens - I am talking about Apache served sites here, not IIS.
My question is - bearing in mind that this appears to be a canonical issue - how do you reduce your principal URL to one single canonical URL, given this TLD capitalisation factor?
I am sure this has been covered in these pages, but can't find the solution.
Assuming that the Google toolbar is reflecting SOMETHING about ranking, and about duplication/canonicals etc., this seems to be a situation worth addressing. Or is this some sort of illusion that can be safely ignored?
Perhaps the ICANN definition is fine for general programming - but do all servers follow this code?
I would accept that the domain name variations appear not to affect the immediate PR reported by the toolbar - I suppose my concern is that these are all variants/aliases or whatever, that disperse or share PR.
Thanks for your advice though
Variants in capitalization of the domain name itself are not a canonicalization problem.
Variants in capitalization for the rest of the URL - the part that follows the domain name - those do matter and can cause a canonicalization problem.
Very true. Since lower case and upper case domain names are technically the same, we do not need to do canonicalization there (we are also helpless to do anything about it anyway, as the server variable {HTTP_HOST} is in lower case always). Canonicalization is needed for URLs which can be technically different but are the same for your domain. For example, www.idealwebtools.com can be technically different from idealwebtools.com (without www) but currently both represent the same document.
This factor (capitalised tld) causes PR to disappear completely.
The Google toolbar checks the PR like this:
toolbarqueries.google.com/search?sourceid=navclient-ff&features=Rank&client=navclient-auto-ff&q=info:http%3A%2F%2Fwww.google.com%2F
So here the domain name goes in as a query variable, and Google needs to handle it smartly after that. This is a bug at Google's end; do not worry.
AjiNIMC
[edited by: AjiNIMC at 7:03 pm (utc) on May 8, 2007]
The domain name is probably not relevant because the same thing happens when you compare www.google.com and www.google.COM (at least it does in my Google toolbar). This is interesting.
thanks, guys
[edited by: tedster at 8:51 pm (utc) on May 8, 2007]
I have been worried about using the "camel" type of approach when writing or displaying domain names, even though I favour it, because it makes them eminently more readable than lower case type everywhere.
In my reading of various pages on capitalisation issues on WebmasterWorld, I had been moving towards the position of removing ALL capitalisation, including that in domain names. I had misunderstood that using capitals in domain names could cause canonical issues in SEs. Apparently not true.
What I have NOW understood is that use of capital letters before the "/" is fine - use of capitals or mixed case after the "/" is courting disaster.
For example,
www.TheBestWebsiteInTheWorld.com
(to me) is preferable and much more readable than
www.thebestwebsiteintheworld.com
when placed on business cards, stationery, webpages etc.
What I now understand (please correct me if I am wrong) is that using capitals in domain names in this way is acceptable - as long as you don't go beyond the "/" with capitals or mixed case letters (in an ideal search world).
What would be wrong would be to write
www.TheBestWebsiteInTheWorld.com/NextPage.htm
whereas
www.TheBestWebsiteInTheWorld.com/nextpage.htm
would be preferred (in my ideal, lower case world)
My efforts to achieve fully canonical, lower case nirvana can now focus on lower case URLs, not their parent domain names. So the good stuff from jpMorgan at
[webmasterworld.com...]
A guide to fixing duplicate content & URL issues on Apache
really relates to the back end of the URL.
Thanks for clearing this up (if I have it right ...)
Bryan