How Does Google Handle Common Canonical Issues "On Its Own"?
For a long while now, many webmasters here have been paying attention to canonical URL issues [webmasterworld.com] - especially the "www" type and the "index.htm" type. Google communicated a lot about this over the years as well - letting us know that it was a tough problem and giving advice on how webmasters could help the situation. Eventually all three major search engines agreed on the rel="canonical" link element -- and soon Google is even going to allow that tag to work across different domains.
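For reference, the tag in question is a link element placed in the page head, not a meta tag - a minimal sketch, with example.com standing in as a placeholder hostname:

```html
<!-- Placed in the <head> of every URL variant that serves this content,
     pointing at the one canonical spelling of the URL: -->
<link rel="canonical" href="http://www.example.com/widgets/" />
```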
All this is now history, but I've got an observation and a question. It's been months since I analyzed a website and saw evidence of these two canonical problems in any set of Google site: operator results. The problem used to just jump out - and the same thing was also true for the "soft 404" duplicate content problem.
The site: operator results seem untroubled, even for websites that do not handle these canonical issues on their own servers. This is not yet so for the other search engines, by the way.
What I'm wondering is this. Does Google handle these canonical issues only in a cosmetic way -- hiding the problem from view but not really combining link juice for the URL variations? Or are at least the most common versions of the canonical URL issue now handled "for real"?
My impression is that Google now handles these two common problems for real, as well as the soft-404 problem. I'm wondering what others are seeing.
I believe that these problems are handled 'for real' but strongly prefer not to rely on any third party to 'get it right' unless there's no other choice.
It strikes me that working around server configuration errors and the resultant canonicalization problems is an undertaking that does not scale very well. So at some point in the future, as the Web continues to grow, Google (and the others) may 'run out of time' to process all of the sites that have (or may have) canonicalization problems, and some smaller or less-important sites may never be processed at all.
This is obviously a back-end process since all URL variants must be fetched before they can be compared, and at some point it may become too resource-intensive for search engines to 'try' all of the various non-canonical variations of domains and URLs to see if they all resolve to one domain or to one resource within that domain.
IOW, they may eventually ask themselves whether they can continue to afford the resources to fetch 16 or more possible URL variations for every page, compare the results, and note any likely-accidental duplication in their database(s) used for ranking purposes.
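To illustrate the combinatorics behind that "16 or more" figure, here is a small sketch (purely illustrative, not Google's actual process) that enumerates the common accidental spellings of a single resource - www vs. non-www, scheme, index-document suffix, and hostname case; the hostname and paths are hypothetical placeholders:

```python
# Illustrative sketch: how quickly the non-canonical spellings of one
# resource multiply. Not a description of any search engine's internals.
from itertools import product

def url_variants(host, path):
    """Generate common non-canonical spellings of one resource."""
    hosts = (host, "www." + host)                            # www vs. non-www
    schemes = ("http", "https")                              # scheme variants
    paths = (path, path + "index.html", path + "index.htm")  # index-document variants
    cases = (str.lower, str.upper)                           # hostname-case variants
    return sorted({f"{s}://{c(h)}{p}"
                   for s, h, p, c in product(schemes, hosts, paths, cases)})

variants = url_variants("example.com", "/")
print(len(variants))  # 2 schemes x 2 hosts x 3 paths x 2 cases = 24 spellings
```

Even this short list ignores trailing-slash, port-number, and query-string variations, so the real fan-out per page is larger still.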
As a result, I prefer not to "rely on the kindness of strangers" regarding the health and well-being of my sites. I still recommend taking steps in the server configuration to 301-redirect non-canonical URL requests to the corresponding canonical URLs, and using the on-page HTML "canonical" tag as an alternate method if that server-side redirection cannot be accomplished for any reason.
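As a concrete sketch of that server-side approach, here is what the two common fixes might look like in an Apache .htaccess file using mod_rewrite - example.com is a placeholder, and the exact rules will depend on your own server setup:

```apache
RewriteEngine On

# Redirect non-www requests to the www hostname (301 = permanent)
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# Redirect direct requests for index files to the bare directory URL
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]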
My experience is that Google is pretty good these days at handling very obvious cases of duplication - the index document example you mention, for instance. I'm pretty sure there is 301-like handling when Google gets it right - usually for 100% identical documents.
There is a major category of duplication problem that Google will never be able to handle very well - where a site has changeable content, so that when it visits one "duplicate" one day and another the next, the content is somewhat different. I've found that even a small amount of change is enough to stop Google from confidently mapping the variants to one location. In those cases, I think we have the other problem you allude to - the gradual "hiding" of content in Google results. It's never been so hard to track down specific content with specific searches.
I'm with JD Morgan on fixing it yourself. One small example: social bookmarking services rarely (if ever) get even the most basic canonicalisation tasks right - and that can be the difference between a highly visible link and one that no-one ever sees.
If I were to write a hierarchy of the most effective ways to fix duplicate content issues, it would likely come out like this:
- One URI per item of content (100% success ;))
- Multiple URIs with the same content, but redirects to a single location (rarely fails, although occasionally it can - more failures now than ever before)
- Canonical attribute (can be effective, but in some cases - e.g. where duplicates will always be somewhat different when Google retrieves them - it will never work)
- Let Google sort it out (very mixed results)
To clarify what I said and to amplify and expand on your list:
On your own site, link only to canonical URLs. Use 301-redirects or canonical tags only to 'correct' the linking mistakes of others. We certainly don't want to be sending "inconsistent" and "sloppy" linking signals to search engines!
A key clarification, Jim. In the era of dynamic websites, content can be "created" just by the way someone formulates a URL. Perhaps the best practice is:
- Don't create any non-canonical URLs
- Fix the non-canonical URLs others create
There are cases like www/non-www where I'm creating non-canonical URLs as a convenience, so I guess that doesn't cover every eventuality.
One other thing I think might be worth throwing in. For some sites you'll get a message in Google's Webmaster Tools saying "Googlebot found an extremely high number of URLs on your site". These are the types of URLs that Google gives up on even spidering, primarily because of the high risk of duplication. So at one extreme, you will not even get crawled if you don't fix duplication/canonical URL issues yourself.