|A guide to fixing duplicate content & URL issues on Apache |
How to canonicalize all of your URLs with a single redirect
heavy meal for a saturday; i'll dig in on tuesday ;)
|I refuse to do anything specifically for the SE's benefit. |
How is it to the search engines' benefit? Do the search engines care if John Doe's site is screwed in the rankings because of a canonical issue?
It seems to me that the beneficiaries of the optional "canonical tag" are people like John Doe. For the search engines, implementing that option for John and his peers is just another chore and expense.
What is the point of this? You need to place this code on all duplicate URLs! My site produces 1000 duplicate URLs: sort-based URLs, search-based URLs, and all of them dynamic. How am I supposed to place this code on all these pages?
If you can isolate your unwanted URLs through a mix of robots.txt and re-writing, then you don't need to add this tag to those pages. e.g. we put our Search URLs in a virtual folder called /search/ and block it in robots.txt.
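As a sketch of that approach (the /search/ path is just the example from the post above; substitute whatever virtual folder holds your unwanted URLs):

```
# robots.txt at the site root
User-agent: *
Disallow: /search/
```

With the search URLs walled off like this, only the remaining duplicate forms need the tag.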
Has anyone considered how this tag could be exploited by hackers adding a link in the source code that is not visible on the actual page, and what effect this could have in the future, along the lines of an old-fashioned 302 hijack?
Oooh, good point Jake.
from the way i read it, you're not allowed to point the link to a page on another domain -- so the hackers could only mess up your pages, rather than benefit their own.
and the only people who'd bother doing that are your direct competitors, which is pretty unlikely.
|How is it to the search engines' benefit? Do the search engines care if John Doe's site is screwed in the rankings because of a canonical issue? |
You're perhaps newer at SEO and haven't seen the history of SE-introduced tags.
The nofollow tag also started out being promoted as being for 'our own good'. A few years later, it's used heavily, perhaps primarily, for the SEs' benefit, to the detriment of end users.
In fact, based on nothing other than the SEs' treatment of nofollow, webmasters and SEOs would be well advised to be very suspicious of another big wooden horse parked at the front door.
[edited by: tedster at 7:06 pm (utc) on Feb. 15, 2009]
[edit reason] fix quote box [/edit]
Upon further examination I saw that the redirect won't go to another domain; somehow I missed that.
i have a throw-away domain that i may use to test this out on, just to see how the search engines handle it. any particular requests or suggestions on how it should be done?
I am curious how the search engines will handle it if the canonical tag on a url has one canonical form of the url but ALL the on-site links use a different form for the url.
In other words, the sentence says "Search engines will consider this URL a “strong hint” as to the one to crawl and index." and I'm wondering how strong that means. This might be helpful to understand for some very large sites where policing all the inline anchors would be unwieldy.
it would also be interesting to know if they would then treat the form of the url in the site navigation differently from the url form used in "real content" on the site.
The implications of the tag's permitted use across subdomains could be... interesting - what would its effect be on sites offering subdomain hosting (eg. blog networks)? What about on a .uk.com or similar "pseudo-TLD" domain?
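For illustration, a cross-subdomain use might look like this, with example.com standing in for a real domain (the hosted-blog scenario is exactly where it gets interesting):

```html
<!-- On http://blog.example.com/some-post, pointing search engines
     at a copy on a sister subdomain of the same registered domain -->
<link rel="canonical" href="http://www.example.com/blog/some-post" />
```

On a pseudo-TLD, every "site" is technically a subdomain of the same registered domain, which is what makes the question worth asking.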
|...very suspicious of another big wooden horse parked at the front door |
Just like nofollow, this is another ill-thought-out patch where once again webmasters have the responsibility for the search engines' own algo failings foisted upon them.
|Just like nofollow, this is another ill-thought-out patch where once again webmasters have the responsibility for the search engines' own algo failings foisted upon them. |
Nope, it's a rescue tool to help site owners clean up the messes they've made by having multiple URLs for each page.
What's more, nobody is required to use it.
I don't need it myself, but I can see why some people will be grateful for it, and I don't see why they should be deprived of it because a few other people have nightmares about Trojan horses.
|Nope, it's a rescue tool to help site owners clean up the messes they've made by having multiple URLs for each page. |
Rather like being handed a hatchet so you can repair the hole in the bottom of your boat...
In my opinion, the example of nofollow speaks for itself. Originally offered as a tool to help cut back on blog spam, it is now a blunt force instrument used for everything from "PR" flow to those dastardly paid links.
I'm all for industry standards. SE's do not (yet) own the net.
|Let's take the third example, items.php?sortby=name&page=2&query=foo |
That's a unique representation of the data. Never mind that the items in the list are repeated in other views with different sorting and filtering - it's still unique and in my books, it's canonical.
I agree with your statement, but question the wisdom of presenting the search engine with a potentially huge number of views on the same data set.
I would be tempted to (mis?)use the tag to reduce the number of combinations offered for indexing, in the hope of all items being visible in search results at least once.
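A sketch of that temptation, using the items.php URL from the quote above (example.com and the choice of default view are assumptions for illustration):

```html
<!-- On items.php?sortby=name&page=2&query=foo (a non-default view) -->
<!-- offer only one default view of each query for indexing -->
<link rel="canonical" href="http://www.example.com/items.php?query=foo" />
```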
|it's a rescue tool to help site owners clean up the messes they've made by having multiple URLs for each page. |
Many site owners didn't write the code of their web platform and are not aware they have multiple URLs.
|I'm all for industry standards. |
I'd say this tag is needed mostly because some server software and some hosting environments have been ignoring standards for a long time. I just did a quick audit for a new client who is on shared Windows hosting. Unless they move to a new host, this tag is going to be their only real hope to control a combination of ten different canonical issues.
Yes this tag will also benefit the search engines. For one, they'll have a better shot at ranking some websites whose content deserves it but has been crippled through url problems.
I've often observed that websites+search engines creates a competitive/cooperative environment. If you only see one half of that, and I'm talking about either half, then you're missing the real picture and can make some poor decisions. This environment is a common kind of "game" in game theory or ecology. The warfare model misses the boat, and so does the sweetness-and-light model.
Anyone with technical questions may find some help from Matt Cutts' new blog post [mattcutts.com]. He also links to slides, an instructional video, and some new plug-ins for WordPress, Drupal, and Magento that Joost de Valk created for the canonical tag.
[edited by: tedster at 8:48 am (utc) on Feb. 16, 2009]
I've been thinking about the kind of protection this offers from competitive disruption for sites that get linked to with log-spam style strings such as ?link=example.com.
Or how about sites that allow some query strings but have trouble scripting a rule that wipes out the crazy variations that sometimes appear. If this new canonical tag gets used as advertised, then many webmasters will have an easier time of it. And "as advertised" includes combining link juice. That's a promise that remains to be seen in practice.
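For the query-string case, a hedged .htaccess sketch (assumes Apache with mod_rewrite; the idea that only an empty query string or a single page=N parameter is legitimate is purely an assumption for illustration):

```apacheconf
RewriteEngine On
# if the query string is non-empty...
RewriteCond %{QUERY_STRING} !^$
# ...and isn't a recognized parameter like page=2...
RewriteCond %{QUERY_STRING} !^page=[0-9]+$
# ...301 to the same path with the query string stripped
# (the trailing ? tells Apache to drop the query string)
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The advantage over the tag is that spam strings like ?link=example.com never get crawled as separate URLs at all.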
I do think this is good for e.g. a forum where you might have dozens of ways (all with good reason) to get to the same content or subsections of it (think pagination). But it's something that those who make the software will have to implement.
|If this new canonical tag gets used as advertised, then many webmasters will have an easier time of it. |
Most webmasters don't even know what canonical is or even how to spell it ;).
This is going to go right over the head of 99% of webmasters. Of the remaining 1%, three-quarters of them are going to screw up the implementation.
Guilty! *hangs head in shame*
|Many site owners didn't write the code of their web platform and are not aware they have multiple URLs. |
A long-time competitor that does their website in-house finally implemented database/asp on their product line. I know through spying on their forum posts that they hired the work out-of-country, that it's a mess, and they have no idea where to begin. But they know enough to properly utilize this tool.
Perhaps that's one of the reasons for some of the (mostly old-school it seems) backlash: it empowers the semi-clueless competitor instantly, with what took some of us many hours ($$) to properly implement.
IF the only 'clean-up' of potential duplicate content needed is to have only the www URL and remove the non-www, is an htaccess redirect a better option than a tag at the top of every page?
|The one thing I can think of that no one else here has mentioned at all -- this might slow down some scrapers, who scrape your content verbatim. This tag being so new, the scraper will not know to remove it. Thus the search engines will not give them credit for the page when they get crawled. |
So in reference to the quote above, here's my question: Should a site use the new tag -- at least on the home page -- as a "preemptive strike" against scrapers, even if that site does not have a problem with duplicate content?
Or, just to play it safe, should it be used on every page, for the same reason?
|IF the only 'clean-up' of potential duplicate content is to only have www url... |
That would be a rare bird indeed. In a recent thread, we identified around 30 canonical problems [webmasterworld.com] that can occur in combination. Just because you don't see them in your WMT account or site: operator results doesn't mean they aren't affecting you. Google has been working to combine the various forms of urls for a while, and the effects can be seen in site: results especially. But it is a huge challenge, given the variety that is represented across the entire web.
I'm not about to drop the no-www with-www redirects, and yes, if that truly is the only problem then this tag will not offer your site anything much.
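For anyone landing here without those rules, the standard non-www-to-www redirect looks roughly like this in .htaccess on Apache (example.com standing in for the real domain):

```apacheconf
RewriteEngine On
# 301 any request for the bare domain to the www hostname
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

A server-side 301 like this consolidates the URLs for all user agents, not just the search engines that honor the tag.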
In its Webmaster Guidelines, Google advises:
'Copy this link into the <head> section of all non-canonical versions of the page, such as http://www.example.com/product.php?item=swedish-fish&sort=price.'
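Written out, that guideline produces a tag like this in the <head> of each duplicate version (the target URL here just follows Google's own example):

```html
<!-- in the <head> of product.php?item=swedish-fish&sort=price -->
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
```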
If you rely on dynamically drawing URLs from a DB, which can generate a range of URLs that create duplicate content issues, this is a nice solution that doesn't require a deeper technical fix.
However, as these pages all use a single header, will it affect the canonical page to have this tag in its own header too? Will this create a spidering loop with negative implications?
Two questions from an 'Apprentice' at SEO
1. If a website has Google Analytics and Webmaster Tools and there is NO duplicate content showing in the Content Analysis, does that mean Google doesn't see any duplicate content problems? If duplicate content problems do arise, is it fair to say they would first appear in Webmaster Tools? Or is that too simplistic? If they aren't appearing there, as per tedster's post, then how can we find out whether they are affecting a site?
2. Following on from the post above by jonny0000: can the canonical tag be added to the head section of a Dynamic Web Template used in Microsoft Expression? Would that work? (Although all pages are .html, so I cannot see the value of the tag apart from making sure the spiders include the www or not; all pages are viewed fine with their full extension and can only be viewed that way.)
[edited by: Gemini23 at 11:59 am (utc) on Feb. 17, 2009]
There are definitely other uses for this attribute aside from fixing duplicate content issues, especially for more grey hat link building, and I disagree that using this tag is only for those unable to implement 301s. This is another seemingly useful tool that is well worth a bit of testing.
I have set up a basic test which I hope will confirm which other factors (if any) Google alludes to that influence the implementation of this type of redirect. Foremost in my mind are:
Does the referred-to page get passed value if it is not linked to on the site at all, but the referring page is?
Does the referred-to page get passed value if both pages are linked to on the site?
Well, I have two questions relating to this issue.
If any savvy expert can help me, please.
Let's say my site URLs are like this
and so on
and so on
Well, my site generates dynamic links; for posts it generates links like this
and the same for categories
So will you please help me with adding this code in the header?
And my second question: I blocked these dynamic URLs with their extra strings using the robots file a few days back, and I hope it will remove the duplicate URLs from Google soon.
So should I remove the robots rule and add this new tag,
or leave the robots rule as it is and go for the new tag? Or what?
My site has lost ranking over the past few months and I suspect this could be the reason, so how should I proceed?
Thanks for your time.
Heh! After 5 pages of discussion, what have we learned?
That the introduction of this Canonical Tag has created more confusion than it was meant to solve. I could go back through this topic and probably come up with a good 50 questions so far. That would lead me to believe that there are going to be some challenges in the implementation of this tag for many. If you are not sure, leave it alone. The worst thing you can do at this point is add insult to injury.
And yes, an update on my questions: I can add separate link tag code in the header for categories and posts.