3 versions of one page?
What does Google think?
If I do a search for a string of unique text from one page, I get 3 results:
Does Google really think these are 3 different pages? If so, why doesn't it remove 2 as duplicates? Most importantly, should I condense these 3 versions down to one? If so, what's the best way to do it?
Google thinks they're different URLs, which they are.
In HTTP and in Google, www.example.com/ and www.example.com are equivalent, but www.example.com/foo/ is different from www.example.com/foo and www.example.com/Foo/. Also, www.example.com/foo/ is different from example.com/foo/
Google also treats www.example.com/index.html as www.example.com, even though the rules of HTTP do not.
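To make the equivalence rules above concrete, here is a minimal Python sketch. The function name same_resource is hypothetical, and the comparison only covers the HTTP-level rules described in this post: host names compare case-insensitively, an empty path equals "/", and everything else in the path is significant. It deliberately does not model Google's index.html shortcut, which is a search engine heuristic rather than an HTTP rule.

```python
from urllib.parse import urlsplit


def same_resource(a: str, b: str) -> bool:
    """Return True if two http(s) URLs are equivalent under the rules above.

    Hypothetical helper for illustration only; it is not how any search
    engine actually compares URLs.
    """
    def key(url: str):
        parts = urlsplit(url)
        # An empty path and "/" name the same resource on a host.
        path = parts.path or "/"
        # Scheme and host are case-insensitive; the path is kept as-is,
        # so /foo, /foo/ and /Foo/ all stay distinct.
        host = (parts.hostname or "").lower()
        return (parts.scheme.lower(), host, parts.port, path, parts.query)

    return key(a) == key(b)


print(same_resource("http://www.example.com/", "http://www.example.com"))
print(same_resource("http://www.example.com/foo/", "http://www.example.com/foo"))
print(same_resource("http://www.example.com/foo/", "http://www.example.com/Foo/"))
print(same_resource("http://www.example.com/foo/", "http://example.com/foo/"))
```

Only the first comparison is treated as equivalent; the trailing slash, the capital letter and the missing www each produce a different URL.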
If the three URLs continue to serve identical content then Google is likely to merge the three, removing two and giving the remaining URL the backlinks from the others. This does not happen instantly, and if you have a part of the page that changes frequently (e.g. latest news or today's date) then Google may find different content when it fetches the URLs at different times.
ciml, thanks for the explanation.
I suppose that my PR will improve if I fix this problem.
It would likely cause problems with some individual pages though. For example, if Google had already winnowed out www.example.com/foo and saved www.example.com/Foo/, my changing it to www.example.com/foo would result in the page essentially being de-indexed. Anyway, Googlebot would eventually come across it and all would be well.
Do you agree with these assumptions? If so, is it always advisable to go ahead and convert a mess like mine to a standard format?
On a similar line, how do search engines treat a mixture of absolute and relative internal links pointing at the same page within the same site? Say there are both a link h*//www.example.com/foo.html and a link foo.html pointing at page foo.html, with everything in the same folder.
Is there a risk of duplicate content penalty due to such inconsistency? Is it worth the trouble cleaning up such a structure? Or maybe it's even necessary? I've asked this before elsewhere on WebmasterWorld, but sadly nobody seems to really know.
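As a small illustration of the question, the sketch below resolves both link forms the way a browser or crawler following the standard URL rules would. The linking page http://www.example.com/index.html is an assumption made for the example, and the code says nothing about how any search engine scores the result; it only shows which URL each link ends up requesting.

```python
from urllib.parse import urljoin

# Hypothetical page in the same folder that contains both links.
page = "http://www.example.com/index.html"

# The absolute link is already a full URL, so the base page is ignored.
absolute = urljoin(page, "http://www.example.com/foo.html")

# The relative link is resolved against the folder of the linking page.
relative = urljoin(page, "foo.html")

print(absolute)
print(relative)
print(absolute == relative)
```

Under these assumptions both links resolve to the same URL, so the mixture by itself does not create a second address for foo.html. A link that names a different file, or that is resolved from a different folder, would.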
geekay, I believe that Google will issue a duplicate content penalty for this. Here is what happened to us:
We have a couple hundred pages. Our homepage file is called default.htm. On a handful of internal pages, links back to the homepage were mistakenly called default.html or worse yet, index.htm.
When I discovered we got hit, I spun my wheels for weeks before figuring this out. I did a search, and sure enough Google had crawled the pages with the incorrect links and had indexed the phantom, non-existent pages as "real", all with identical content to our homepage.
I cannot prove this was the problem, but once I marked those phantom pages as "do not index", we were back in Google.
An inconsistent use of absolute and relative links looks so innocent. An impurity in the coding of a site. Yet the consequences may be very serious.
But can my and Lou's cases be compared? In my case the file name of the page is always "foo.html", although the link varies.
I believe this situation is not uncommon and could therefore be worth sorting out. I sincerely hope more webmasters would care to give their views.
I agree that this seems to be an important problem.
I'm wondering what happens when inbound links vary. For example, one is www.site.com/Foo. Another is www.site.com/foo. If Google can successfully follow both, would it index them as two separate pages? If so, couldn't competitors mess with your site by splitting your PR?
Use a 301 Redirect from non-www to www to fix one of the problems.
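The decision behind such a redirect can be sketched in a few lines of Python. This is only an illustration of the logic, not a server configuration: the host www.example.com, the http scheme and the function name redirect_for are all assumptions, and on a real site the rule would normally live in the web server's own redirect settings.

```python
from typing import Optional, Tuple

# Assumption: the www form is the one you want indexed.
CANONICAL_HOST = "www.example.com"


def redirect_for(host: str, path: str) -> Optional[Tuple[int, str]]:
    """Return (301, target URL) for a non-www request, or None if no redirect is needed."""
    host = host.lower()  # host names are case-insensitive
    if host == CANONICAL_HOST:
        return None  # already on the canonical host
    if "www." + host == CANONICAL_HOST:
        # Permanent redirect that keeps the requested path unchanged.
        return (301, "http://" + CANONICAL_HOST + path)
    return None  # some other host; left alone in this sketch


print(redirect_for("example.com", "/foo/"))
print(redirect_for("www.example.com", "/foo/"))
```

Keeping the path unchanged matters: each non-www URL should redirect to its own www counterpart, not to the homepage.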
The other problem, the capital letter in the folder name, is trickier. Have you recently moved servers, from Linux to Windows (or vice versa)? Linux servers treat all names as unique (Folder is not folder is not FolDER), whilst Windows will treat /folder/, /Folder/ and /FolDER/ as the same folder, all served with a status of 200. If people link to you in different ways, each variant will be seen as a legitimate folder name when really it should not be, if you use Windows servers. The only way to fix this is to get the incorrect incoming links modified.
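To find which incoming links need modifying, one option is to group the URLs seen in your logs or in a site: search by their lower-cased path. The Python sketch below does that; the function name case_variants and the sample URLs are hypothetical, and it assumes you have already collected the list of URLs yourself.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def case_variants(urls):
    """Group URLs that differ only by letter case in the path.

    Returns a list of sorted groups, keeping only groups with more than one URL.
    """
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        # Same host and same path once case is ignored -> same group.
        key = (parts.hostname or "", parts.path.lower())
        groups[key].add(url)
    return [sorted(group) for group in groups.values() if len(group) > 1]


seen = [
    "http://www.example.com/Folder/",
    "http://www.example.com/folder/",
    "http://www.example.com/other/",
]
print(case_variants(seen))
```

Each group returned is a set of addresses that a Windows server would answer with the same content, so all but one of them are candidates for link cleanup.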
If domain/Foo and domain/foo are actually serving the same page, they will:
1: Get treated as 2 pages by Google (unless proper rewrite rules or redirects are in place)
2: Get hit with a dup content penalty (filter)
If not serving the same page they are still treated by Google as 2 pages.
As for PR, if domain/Foo and domain/foo were supposed to be the same page and you have IBLs to each, then those pages' PR would be lower than it would have been if all links went to the same page.
Other PR gotchas:
This can be a huge problem if a site gets split on the www/non-www form of the domain and both forms have IBLs. This is besides the dup content problem.
This one cascades through the sites that this site links to.
Thanks to ciml for some extremely informative posts about the 302 hijack issue here and elsewhere.
I wonder if we are victims of our own complex addressing or of 302 items? We lost almost all Google traffic in Feb after years of being loved by them. We have just started using 301 redirects to resolve multiple URLs, rather than having them all point to the same spot on the server. We thought that setup would not be considered dupe content by Google, but it was getting doubly indexed, at least after Feb 2.
We still have a problem with site:oursite.com, which shows enormous numbers of our *tracking links listed as pages* at our site, even though these resolve to the external sites(!). 6 weeks ago we changed robots.txt to allow bots to follow them, but that has not removed them.
A few days ago, I used a 301 Redirect from non-www to www to fix that problem. I used the Google removal tool to remove the pages with upper-case characters.
This was quite a radical change for my site. So far, my Google traffic has gone from bad to worse. I'm still hopeful that it will improve once the PR gets sorted out.
At the moment, however, I'm having second thoughts. If Google places value on older links - and I think it does - some of those old links that I removed might have been important.
Anyway, thanks for your suggestions everyone.