Forum Moderators: Robert Charlton & goodroi
Here's a question I haven't been able to find an answer for.
I know from experience that if a link in one place on a site refers to "Index.htm" and a link on another page refers to the same index page as "index.htm", then a site: query on Google will show two pages: Index.htm and index.htm
As a programmer, I find this quite surprising as web servers are not case sensitive, i.e. both Index.htm and index.htm will resolve to the same page, and it's a no-brainer for a programmer to remove all capitalisation from a string of text. That is a long way of saying that Google ought to know that Index.htm and index.htm are the same page.
BUT, I know that it does actually view them as separate pages. My question is this: as a consequence of this, Google has two pages in its index, which are in fact the very same page, so will this lead to the duplicate content penalty?
And the last question I have on this subject is, given Matt Cutts' blog posts about duplicate content, is this something I need to worry about?
The main reason I ask is that I am working on a client site and the original developer was REALLY sloppy about the link text used throughout the site, and when I went to build a Google sitemap using a software tool, some pages were appearing four or five times due to inconsistencies in the capitalisation used.
To compound matters, Google knows about both http://www.example.co.uk and http://example.co.uk (both point to the same content), and to make matters worse, they have a .ie domain as well carrying the same content (both [yyyy.ie...] and [yyyy.ie...]). This adds up to potentially 20 pages in the index, all with identical content.
Do I need to worry about this?
[edited by: tedster at 3:31 pm (utc) on April 20, 2007]
[edit reason] switch to example.co.uk [/edit]
"I find this quite surprising as web servers are not case sensitive"
That is generally true only of Windows servers. Most other servers, such as those running on Unix or Linux, are case sensitive by default. From the beginning of the web, the technical convention for URLs has been that everything after the domain name is case sensitive.
from the W3C website [w3.org]:
URLs in general are case-sensitive (with the exception of machine names). There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive.
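To make the W3C point concrete, here is a rough Python sketch of how you might scan a list of crawled URLs for entries that differ only by capitalisation in the path. The function name find_case_variants and the sample URLs are made up for illustration; only the host name is treated as case-insensitive, and the lowercased path is used purely as a comparison key, not as a claim that the server treats the variants as one page.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_case_variants(urls):
    """Return groups of paths that differ only by letter case.

    Hypothetical helper for illustration: the result maps
    (host, lowercased path) to the sorted list of path spellings seen.
    """
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        # Host names are case-insensitive, so lowercasing them is safe.
        host = parts.netloc.lower()
        # Paths may be case-sensitive; lowercase only to build the key.
        groups[(host, parts.path.lower())].add(parts.path)
    # Keep only keys that were reached through more than one spelling.
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    crawled = [
        "http://www.example.co.uk/Index.htm",
        "http://www.example.co.uk/index.htm",
        "http://www.example.co.uk/about.htm",
    ]
    for (host, _), spellings in find_case_variants(crawled).items():
        print(host, spellings)
```

Running something like this over the output of a sitemap tool should give a list of the links that need to be made consistent. It ignores query strings and ports, which a real site may need to take into account.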
Yes, we have had several discussions about this here. Some recent threads are:
[webmasterworld.com...]
[webmasterworld.com...]
The issue to be concerned with is not a "penalty", because a true penalty is rare for duplicate content. Rather, one version of the URL is likely to be filtered out of search results, and the power of backlinks is "split into two piles" or possibly more. Consolidating onto a single URL could make the difference between being in or out of the supplemental index for reasons of low PR.
And you are also correct: taking care of the "with-www" and "no-www" issue will also be quite helpful in focusing backlink influence on just one version of a URL and getting better rankings. See Why "www" & "no-www" Are Different - the canonical root issue [webmasterworld.com]
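As a sketch of the logic behind that kind of fix, the Python below works out where a request should be permanently (301) redirected so that every variant ends up on one URL. The names CANONICAL_HOST, ALIAS_HOSTS and redirect_target are hypothetical, and the choice of www.example.co.uk as the preferred host is only an assumption. Lowercasing the path is optional and off by default, because it is only safe if every real file on the server has a lowercase name. In practice this is usually done with server rewrite rules rather than application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration: the preferred host, and the set of hosts
# known to serve the same content.
CANONICAL_HOST = "www.example.co.uk"
ALIAS_HOSTS = {"example.co.uk", "www.example.co.uk"}


def redirect_target(url, lowercase_path=False):
    """Return the URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in ALIAS_HOSTS:
        return None  # Not one of our hosts; leave it alone.
    # Only fold the path case when asked, since paths may be case-sensitive.
    path = parts.path.lower() if lowercase_path else parts.path
    canonical = urlunsplit(
        (parts.scheme, CANONICAL_HOST, path, parts.query, parts.fragment)
    )
    return None if canonical == url else canonical


if __name__ == "__main__":
    print(redirect_target("http://example.co.uk/index.htm"))
    print(redirect_target("http://example.co.uk/Index.htm", lowercase_path=True))
```

Whichever host is chosen, the point is that every other variant answers with a single permanent redirect to it, so backlinks pointing at any version are credited to one URL.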
These and other related topics are discussed quite deeply in the Google Hot Topics [webmasterworld.com] area - a collection of important threads that we keep pinned to the top of the Google Search forum's index page.