Canonical Tag vs. Block in Robots.txt

Forum Moderators: Robert Charlton & goodroi

Planet13

3:48 pm on Apr 23, 2011 (gmt 0)

Hi there, Everyone:

The product pages on my ecommerce web site are (by default) available via multiple versions of the URL (namely, a long query string version, and a short version).

For years, I have simply blocked the long query string URLs via the robots.txt file (The long query string URLs have a "virtual" directory in the URL, so I just block that virtual directory).

But with "trust" being such an important issue after the Panda updates, I wonder if it might be better to unblock those URLs in robots.txt and just let the canonical tag take care of it.

In Webmaster Tools, under Crawl Diagnostics, it lists something like 700 URLs blocked by robots.txt, and if that is something Google is measuring, I can't help but think they are somehow using that information for something.
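To make the trade-off concrete: a crawler that obeys robots.txt never fetches a blocked URL, so it also never sees a canonical tag placed on that page. The sketch below uses Python's standard urllib.robotparser to check which of two URLs would be fetchable. The /catalog/ virtual directory, the example.com URLs, and the helper name is_crawlable are made-up stand-ins, not the poster's actual setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the "virtual" directory used by the long
# query-string URLs (the real directory name is not given in the thread).
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed example URLs: one long query-string version, one short version.
long_url = "https://www.example.com/catalog/product_info.php?cPath=1_2&products_id=42"
short_url = "https://www.example.com/blue-widget.html"

print(is_crawlable(long_url))   # blocked: any canonical tag on this page goes unread
print(is_crawlable(short_url))  # allowed
```

If the long URLs were unblocked instead, the crawler could fetch them and read a canonical tag pointing at the short version.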

enigma1

3:45 pm on Apr 28, 2011 (gmt 0)

For the initial request I will always do a redirect on a mismatch. The only question is where to redirect, and that depends on the query. I think a 404 passes no juice in any case, and for retired items I would prefer the visitor to go to a similar one if possible. If that's not possible, I redirect to the home page.
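A minimal sketch of that redirect-on-mismatch rule, under several assumptions: the lookup tables, the paths, the function name resolve, and the choice of a 301 status are all illustrative and not taken from the post.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical catalogue: product id -> canonical path; None marks a retired item.
CANONICAL_PATHS = {42: "/blue-widget.html", 43: None}
# Hypothetical mapping from a retired item to a similar live product.
SIMILAR_ITEM = {43: 42}


def resolve(request_url: str, product_id: int) -> Tuple[int, Optional[str]]:
    """Return (status, location); a status of 200 means serve the page as requested."""
    target = CANONICAL_PATHS.get(product_id)
    if target is None:
        # Retired or unknown item: prefer a similar product, else the home page.
        similar = SIMILAR_ITEM.get(product_id)
        fallback = CANONICAL_PATHS.get(similar) if similar is not None else None
        return 301, fallback or "/"
    parts = urlsplit(request_url)
    requested = parts.path + ("?" + parts.query if parts.query else "")
    if requested != target:
        # Mismatch between the requested URL and the canonical one: redirect.
        return 301, target
    return 200, None


print(resolve("https://www.example.com/catalog/product_info.php?products_id=42", 42))
print(resolve("https://www.example.com/blue-widget.html", 42))
print(resolve("https://www.example.com/old-widget.html", 43))
```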

Linking products into multiple categories, or articles into multiple topics, I have found harder to manage in various cases. That applies when the requirement is to expose both categories and products, or brands and products, in the product URL. I would have to pick the first brand or category and prefix the URL one way or another to get around the duplicate generation. Using IDs, the challenge is going to be the same.

And yes, it is way simpler to generate the URLs for a single entity than to combine parameters.

Adding a hash would mean another encoder/decoder; although it would be tiny, I try to simplify the code as much as possible. You also need a prefix for the IDs in order to differentiate the entity type.

Back to the OP's post a bit: I have rarely used robots.txt to filter URLs because it doesn't work. Google will still access them because it won't read the robots.txt before every single request. If for some reason you need to modify the URL structure or add/remove parameters, make sure the old structure is not regenerated in some way.

I think lots of the canonical problems arise because of it. In some cases hard-coded links are forgotten, hidden within the content, and the store owner tries to block the old URL but Google still sees it. There is no detailed information about it in WMT, in other words all the steps the search engine followed to get to the duplicate page. That would be really helpful.

Years ago, when shared hosting was popular, I remember seeing cases where URLs would be inserted into the content and would include the session ID. Needless to say what the consequences are. I am sure a similar thing happens today with tracking IDs, referrers etc., and just one mistake in the content can cause a great number of duplicated pages and security issues.
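One way to guard against session or tracking IDs leaking into links is to normalize every URL before it is written into content. The following is only a sketch: the parameter names in STRIP_PARAMS and the function name strip_tracking are assumptions, and a real site would list its own parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that should never appear in a stored link.
STRIP_PARAMS = {"sid", "sessionid", "ref", "utm_source", "utm_medium", "utm_campaign"}


def strip_tracking(url: str) -> str:
    """Drop session/tracking parameters and the fragment, keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in STRIP_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_tracking("https://www.example.com/product.php?id=42&sid=abc123&ref=newsletter"))
```

Running links through one such function before publishing means a single forgotten parameter is less likely to spawn a family of duplicate URLs.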

Vimes

8:51 am on May 5, 2011 (gmt 0)

Hi,

Here's another scenario that I've just inherited.

Basically, a web site is syndicating its content to power other websites' products. They've placed a canonical element on the external domains pointing back to their original URL, thinking this would be enough.

When I use the site: operator on the external domains, there are thousands of pages out there with the same content. Searching for the content of any given page returns multiple URLs across multiple domains.

I've been thinking of placing a robots.txt file on the domains that they host, but reading Google's "best practices", this isn't what they are suggesting:

"One item which is missing from this list is disallowing crawling of duplicate content with your robots.txt file. We now recommend not blocking access to duplicate content on your website, whether with a robots.txt file or other methods. Instead, use the rel="canonical" link element"

Due to existing contracts I can't just stop delivering the content. I'm working on ways to make the content different for each domain, but I need a quick solution. So what other ways can I get these pages removed? If I place an additional meta noindex,follow on them, would that remove the duplicate issue and still allow any link juice these pages are getting to be passed?

Vimes
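Before changing anything on the syndicated domains, it can help to confirm what each page currently declares. The sketch below uses Python's standard html.parser to pull out the rel="canonical" target and the robots meta directives from a page's HTML. The class name HeadSignals, the function audit, and the sample markup are illustrative assumptions; it only reports what a page says, not how a search engine will treat it.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collects the rel="canonical" href and the robots meta directives."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = set()

    def handle_starttag(self, tag, attrs):
        attr = {key: (value or "") for key, value in attrs}
        if tag == "link" and "canonical" in attr.get("rel", "").lower().split():
            self.canonical = attr.get("href")
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            directives = attr.get("content", "").split(",")
            self.robots |= {d.strip().lower() for d in directives if d.strip()}


def audit(html: str):
    """Return (canonical_href_or_None, set_of_robots_directives) for one page."""
    parser = HeadSignals()
    parser.feed(html)
    return parser.canonical, parser.robots


# Made-up markup standing in for one syndicated page.
SAMPLE = """<html><head>
<link rel="canonical" href="https://www.original-example.com/widget.html">
<meta name="robots" content="noindex,follow">
</head><body>Syndicated copy</body></html>"""

canonical, robots = audit(SAMPLE)
print(canonical, sorted(robots))
```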

tedster

2:44 pm on May 5, 2011 (gmt 0)

That would do it - but the de-indexing effect may be more gradual than quick.

Are those syndicated domains showing as results for regular query phrases, not just site: operator queries? If the canonical link has been in place for any amount of time, that would surprise me.

Vimes

1:36 am on May 6, 2011 (gmt 0)

Tedster,

I'm told the canonical has been on the pages for 6-8 months. When searching for some of the terms I see some of the external domains ranking; they do rank below the original pages, but not by much. They partner with some authority web sites, and those are the pages I see in the SERPs. From what I can tell, they are challenging the ownership of the content even though the pages carry the canonical element.
This site took a large hit during Panda, and with the amount of content syndication they have done, I'm surprised that it hadn't dropped way before. In Google's eyes it has lost its trust, it would appear.

So adding the meta noindex on these external sites should start the process of regaining this?

Vimes.
