Forum Moderators: Robert Charlton & goodroi
Why I might not do it
Why I might do it
What does Google do?
They will take the most appropriate URL, which is the one with / almost 99% of the time. I have never seen a duplicate content issue so far due to this. For www vs non-www, Google has an option in the Webmaster Console.
What is your opinion about it?
Thanks,
AjiNIMC
I'll just address one of your points...
Very important, when a person is coming to index.html I am redirecting it to / which is making my customer wait.
You need to do a server side 301 redirect, which is virtually instantaneous. Your customers will never notice.
Possibly, though, you may be thinking of a client side (browser) redirect, perhaps a meta refresh redirect. These are entirely different from proper 301s... not the same thing at all.
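For reference, a server-side 301 of this kind takes only a few lines of Apache config. This is just a sketch, assuming Apache with mod_rewrite enabled and www.example.com as a placeholder for the real host:

```apache
Options +FollowSymLinks
RewriteEngine On

# Only act on direct external requests for /index.html or /index.htm,
# not on the internal subrequest Apache makes when serving "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/
RewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]
```

The %{THE_REQUEST} condition matters: without it, the internal DirectoryIndex subrequest that maps "/" onto index.html would itself get redirected, causing a loop.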
[edited by: Robert_Charlton at 9:01 am (utc) on May 2, 2007]
Very important, when a person is coming to index.html I am redirecting it to / which is making my customer wait.
You don't have to do it. Make index.html as the canonical one and eventually redirect the other way, or forbid duplicate ones with the robots.txt.
The point is that only one uri should be authoritative, regardless of which one.
No, because after it caused a disaster at MSN I've 301'd index.html or index.htm to the site root/ so no one will be able to see it, it'll redirect. You'll just have to take my word for it that exactly that very issue caused a couple of sites to totally bomb because the engine got it wrong (not Google, but why play with fire - MSN Search sends BUYERS). Besides which, I don't lift my skirt in front of strangers, and I don't know you. ;)
>>I have always seen Google taking care of it.
No, it isn't always taken care of, and for very good reasons. index.html may or may not be the root of a site, they are not necessarily the same page and often aren't. The root could be index.php or default.htm or whatever - and don't think for one minute that there haven't been people who spammed the h*ll out of the engines with multiples, using massive numbers of IBLs with different anchor text.
Let's not expect the crawlers to guess what we mean, let's be specific on what the root of a site is so it's made very clear. That's up to us, on our end.
http://www.example.com/ may or may not be the same as http://www.example.com/index.html or
http://www.example.com/index.htm - or any number of variations, which are all different pages.
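One way to be specific about the root, on Apache, is to declare the directory index yourself rather than relying on server defaults. A sketch (the filename is whatever your site actually uses):

```apache
# Tell Apache exactly which file serves a directory request such as "/"
DirectoryIndex index.html
```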
I've had an instance where MSN Search *lost* the root of a site of mine and instead indexed http://www.example.com/index.html, which totally messed things up - there was not one link in the world to index.html, and what I had to do was 301 from that to the actual root. It killed the site, because the internal navigation structure and IBLs were totally skewed on their end because of the mistake. The site has never fully recovered, and it's a huge Q4 traffic loss.
Rule #1: Never confuse the bot or make her wonder, think or second-guess. She is not a mind reader, able to second-guess what the webmaster's intentions are.
[edited by: Marcia at 1:33 pm (utc) on May 2, 2007]
They will take the most appropriate URL, which is with / almost 99% of the times. I have never seen a duplicate content issue so far due to this. For www and non-www google has an option under webmaster console.
I most certainly have seen a duplicate content issue with this, and with good reason. They're not the same URL, and URLs are each indexed with a unique DocID for each URL. Read the white papers and the patents for confirmation.
Additionally, the average Joe Shmoe's webmaster generally knows nothing about Google's Webmaster Console, maybe never even heard of it. They're generally totally clueless and just do the index.htm thing for the sake of the ease of running the site in a development environment and then transitioning it to the live site.
The developer's "convenience" is not Google's concern or problem, it never was, still isn't, and never will be.
[edited by: Marcia at 2:05 pm (utc) on May 2, 2007]
I fixed the problem with a 301 redirect (at the suggestion of several members here), and my traffic came back two months later.
Having gone through that experience, I'm inclined to think that it's better to be proactive than reactive. Fix the problem before it becomes a problem, and you may save yourself a nasty surprise.
Late 2004, and seesawing almost off the map in 2005, left me with zero confidence in any search engine's ability to figure out the mess. They may try in some cases, but they fail more than a small fraction of the time.
sites to totally bomb because the engine got it wrong (not Google, but why play with fire - MSN Search sends BUYERS)
Besides which, I don't lift my skirt in front of strangers, and I don't know you. ;)
No, it isn't always taken care of, and for very good reasons.
and don't think for one minute that there haven't been people who spammed the h*ll out of the engines with multiples, using massive numbers of IBLs with different anchor text.
Rule #1: Never confuse the bot or make her wonder, think or second-guess. She is not a mind reader, able to second-guess what the webmaster's intentions are.
Thanks theBear, europeforvisitors, MrStitch, Marcia, activeco and Robert Charlton for the replies. I am going to do it now :). Thanks for convincing me with scary real life examples. May God keep us all away from search engine bugs.
You are already in trouble in much the same way that www and non-www cause problems.
You should always link back to "/" from within your site. To make things tidy I always redirect (index¦default¦home)\.(html?¦php¦asp¦cfm¦jsp) back to "/" too, with a 301 redirect.
You can change the technology that runs the website at any time without exposing any new URLs to the outside world. Additionally you make it harder for people to know what technology actually runs the site.
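Spelled out as mod_rewrite, the pattern described above would look something like this. A sketch only - the ¦ characters in the pattern are this forum's rendering of the regex pipe |, and www.example.com stands in for the real host:

```apache
RewriteEngine On

# Redirect any direct request for a default-document filename back to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(index|default|home)\.(html?|php|asp|cfm|jsp)\ HTTP/ [NC]
RewriteRule ^(index|default|home)\.(html?|php|asp|cfm|jsp)$ http://www.example.com/ [R=301,L]
```

As with the single-file version, checking %{THE_REQUEST} keeps the rule from firing on Apache's own internal DirectoryIndex subrequest, which would otherwise loop.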
or even: DirectoryIndex SomeOtherDir/Script.pl
So, if you don't know what you're doing, redirecting files back to root could cause loops and additional problems.
IMO, the simplest, fastest and most secure way is to set default root file and exclude it in robots.txt.
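For completeness, the robots.txt approach described here would look something like the following, assuming index.html is the DirectoryIndex file and all internal links point to "/":

```
User-agent: *
# Block direct crawling of the physical filename; "/" itself stays crawlable
Disallow: /index.html
```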
Make index.html as the canonical one...
In my experience, links to your site (ie, to the default canonical) are going to tend to be in the form www.example.com/ no matter what you do, so you are creating a problem by making index.html your canonical.
...So, if you don't know what you're doing, redirecting files back to root could cause loops and additional problems. IMO, the simplest, fastest and most secure way is to set default root file and exclude it in robots.txt.
Agreed that the redirection, as well as the setup of the default root file, most definitely needs to be done by someone who knows what he or she is doing. When I get to index.html rewrites, I have someone else set up the server and do the rewrites for me.
But setting up a default root file, of, say, index.html, continuing to link to it and excluding it in robots.txt, IMO, is going to continue to split your PageRank and isn't going to cure your display problem... and it encourages potential links to the index.html form, further complicating the PageRank split.
But setting up a default root file, of, say, index.html, continuing to link to it and excluding it in robots.txt, IMO, is going to continue to split your PageRank and isn't going to cure your display problem...
No, index.html (or any other DirectoryIndex file) is transparent to robots and is associated with the root directory. All referring links still point to /. To escape the duplicate content issue, straight access to the real file, in this case index.html, is disallowed in robots.txt.
I have not seen any site having two caches, one for index.html and another for /, but I have seen sites having different caches for ?id=1 and ?id=2 (though they go to the same page). I am doing some experiments to confirm it; it might take 5 to 6 days.
To what level of canonicalization will we go? Google is the smartest I have seen here - perfect canonicalization, and no other search engine has done it. Try ?id=1 for Google and try the same for Yahoo or Ask or MSN.
I just saw your recent post under the Google canonicalization thread and saw your rewrite example.
You indicated
(index¦default¦home)\.(html?¦php¦asp¦cfm¦jsp)
I'm pretty rusty on rewrites as well as regular expressions, but shouldn't the rewrite be
(index¦default¦home)\.(htm?¦php¦asp¦cfm¦jsp)
The following is my current (simple) mod_rewrite setup, and I am still confused as to why the capitalisation in the domain doesn't get forced to lower case.
I assumed that www.EXAMPLE.COM would be forced to www.example.com - doesn't seem to work that way.
Rule 1 below deals with the index.htm(l) problem nicely, Rule 2 with the www vs. non-www issue very well, and in combination they also work.
But I still have capitalisation issues - I don't mean inside the site or with individual URLs, which is a different issue. I mean the capitalisation of the domain name and TLD.
I have read and think I understand the capitalisation discussion on webmasterworld, so I don't think that is the issue. I have also messed with adding [nc] to different lines and still can't make headway.
Options +FollowSymLinks
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html?\ HTTP/ [NC]
RewriteRule ^index\.html?$ http://www.example.com/ [R=301,L]
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
Any ideas from the experts? Have I misunderstood how far you can go here?
Thanks
I assumed that www.EXAMPLE.COM would be forced to www.example.com - doesn't seem to work that way.
These are already considered the same; look at the global definition by ICANN ( icann.org/riodejaneiro/idn-topic.htm ):
example.com
Example.com
eXample.com
exaMple.com
examPle.com
exampLe.com
examplE.com
EXAMPLE.com
ExAMPLE.com
EXaMPLE.com
EXAmPLE.com
EXAMpLE.com
EXAMPlE.com
EXAMPLe.com
etc.
In the languages that utilize Latin characters (e.g., English, Finnish, German, Italian, etc.), each letter has two variants: upper case and lower case. The Internet's basic DNS and hostname specifications provide that the upper-case and lower-case variants of each letter are considered to be equivalent. Thus, all the variant domain names in the above list are treated as the same domain name.
I hope this answers your query.
Regards,
AjiNIMC
If I use the Google toolbar as an approximate measure of relative strength / popularity of a domain (and I know this has weaknesses as an approach, but follow for now...)
www.example1.com has a PR of 10 (let's say, Y for example)
www.example2.com has a PR of 10 (let's say, G for example)
however, with capitals
www.example1.COM has a PR of zero (in Y's case)
www.example2.COM has a PR of zero (in G's case)
Is this significant? Well yes, I think so - because we are all desperately chasing rankings and PR numbers, and any reduction from 3 to 2 or from 5 to 4 would start some sort of seizure amongst most of us - not to mention the hours of emails and queries on this site.
This factor (capitalised tld) causes PR to disappear completely.
In the case of my own sites, the same thing happens - I am talking about Apache served sites here, not IIS.
My question is - bearing in mind that this appears to be a canonical issue - how do you reduce your principal URL to one single canonical URL, given this TLD capitalisation factor?
I am sure this has been covered in these pages, but can't find the solution.
Assuming that the Google toolbar is reflecting SOMETHING about ranking, and about duplication/canonicals etc., this seems to be a situation worth addressing. Or is this some sort of illusion that can be safely ignored?
Perhaps the ICANN definition is fine for general programming - but do all servers follow this code?
I would accept that the domain name variations appear not to affect the immediate PR reported by the toolbar - I suppose my concern is that these are all variants/aliases or whatever, that disperse or share PR.
Thanks for your advice though
Variants in capitalization of the domain name itself are not a canonicalization problem.
Variants in capitalization for the rest of the URL - the part that follows the domain name - those do matter and can cause a canonicalization problem.
Very true. Since lower case and upper case domain names are technically the same, we do not need to do canonicalization there (we are also helpless to do anything about it anyway, as the server variable {HTTP_HOST} is in lower case always). Canonicalization is needed for URLs which can be technically different but are the same for your domain. For example, www.idealwebtools.com can be technically different from idealwebtools.com (without www) but currently both represent the same document.
This factor (capitalised tld) causes PR to disappear completely.
The Google toolbar checks the PR like this:
toolbarqueries.google.com/search?sourceid=navclient-ff&features=Rank&client=navclient-auto-ff&q=info:http%3A%2F%2Fwww.google.com%2F
So here the domain name goes in as a query variable, and Google needs to handle it smartly after that. This is a bug at Google's end; do not worry.
AjiNIMC
[edited by: AjiNIMC at 7:03 pm (utc) on May 8, 2007]
The domain name is probably not relevant because the same thing happens when you compare www.google.com and www.google.COM (at least it does in my Google toolbar). This is interesting.
thanks, guys
[edited by: tedster at 8:51 pm (utc) on May 8, 2007]
I have been worried about using the "camel" type of approach when writing or displaying domain names, even though I favour it, because it makes them eminently more readable than lower case type everywhere.
In my reading of various pages on capitalisation issues on WebmasterWorld, I had been moving towards the position of removing ALL capitalisation, including that in domain names. I had misunderstood that using capitals in domain names could cause canonical issues in SEs. Apparently not true.
What I have NOW understood is that use of capital letters before the "/" is fine - use of capitals or mixed case after the "/" is courting disaster.
For example,
www.TheBestWebsiteInTheWorld.com
(to me) is preferable and much more readable than
www.thebestwebsiteintheworld.com
when placed on business cards, stationery, webpages etc.
What I now understand (please correct me if I am wrong) is that using capitals in domain names in this way is acceptable - as long as you don't go beyond the "/" with capitals or mixed case letters (in an ideal search world).
What would be wrong would be to write
www.TheBestWebsiteInTheWorld.com/NextPage.htm
whereas
www.TheBestWebsiteInTheWorld.com/nextpage.htm
would be preferred (in my ideal, lower case world)
My efforts to achieve fully canonical, lower case nirvana can now focus on lower case URLs, not their parent domain names. So the good stuff from jpMorgan at
[webmasterworld.com...]
A guide to fixing duplicate content & URL issues on Apache
really relates to the back end of the URL.
Thanks for clearing this up (if I have it right ...)
Bryan