Forum Moderators: Robert Charlton & goodroi
We currently block two types of URLs with robots.txt:
1) The individual auction pages
2) URLs that are refinements of the product search.
The general concerns are that the auction pages are short-lived and probably not suitable for indexing, that auctions for the same product raise duplicate content issues, and that some URLs are just refinements of the primary search results. Increased load on our webservers is also a reason.
However, we're considering dropping the robots exclusion and instead using the canonical link tag on the individual items or refinement URLs, pointing them to the product-level page that the auctions revolve around.
Plusses:
- Take advantage of the inbound links that currently point to robots.txt-blocked URLs
- Letting Google get a better feel for our site by exposing it to the item-level and refinement-level content we have. There is the idea that Google would rather you open everything up and let them decide what's important or not.
Minuses:
- This will substantially increase the number of pages available for Google to crawl. Google could waste crawl resources on pages that are daughter URLs instead of crawling distinct canonical ones. If we only get X crawls per Y time, do we really want Google crawling what we're telling them are non-canonical URLs anyway? Or maybe Google increases the # of crawls we get when it sees the pages and deems them interesting enough?
- Server load and bandwidth issues. Do we want to be delivering a potentially large increase in pages, images, etc. for pages that aren't going to be indexed anyway? If it leads to new traffic, that's fine. If not, then it's just bandwidth costs out the door.
Any suggestions? I'm leaning towards the canonical link implementation and opening up the robots.txt, but want to see if I'm missing anything.
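For concreteness, here's roughly what the change would look like - all paths and URLs below are hypothetical, just to illustrate the idea:

```html
<!-- Today, robots.txt carries rules along the lines of:
         User-agent: *
         Disallow: /auction/
         Disallow: /search/refine/
     The proposal: drop those Disallow rules and instead add,
     in the <head> of each auction or refinement page, a canonical
     link pointing at the parent product page: -->
<link rel="canonical" href="http://www.example.com/product/blue-widget">
```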
GRiz
One of the stumbling blocks to the canonical tag is that the content will need to be substantially the same, or else Google will just ignore the tag. If they do follow the tag's suggestion, then only the canonical version will be indexed - so there would be little if any added traffic. If they ignore it, then you will be into that kind of URL pile-up you were originally hoping to avoid, no?
The potential added power from whatever back links accrue for those short-lived versions of the page might be a ranking help for the rest of the site. Then what happens when the special pages expire - will you serve a 301 to the canonical URL at that point?
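For what it's worth, that expiry-time 301 could be as simple as a one-line server rule - this is an Apache mod_alias sketch with hypothetical paths, not your actual config:

```apache
# when the auction expires, 301 its URL to the parent product page
Redirect 301 /auction/blue-widget-123 http://www.example.com/product/blue-widget
```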
However, Google Product Search won't allow item-level URLs that are denied by a robots.txt file. So, I have to get the individual items out of the robots.txt disallow section. I suppose I could use a meta noindex tag (but keep the follow) at the item level, instead of a canonical link, to avoid having them indexed.
For the other types of forbidden URLs that are more permanent in nature, I can use the canonical link and if Google thinks they are distinct enough to ignore it, that's fine.
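If it helps, the noindex-but-follow variant I'm describing would look something like this in the head of each item page (a sketch, not our actual template):

```html
<!-- keeps the short-lived auction page out of the index
     while still letting Googlebot follow its links -->
<meta name="robots" content="noindex, follow">
```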
Thanks for the feedback!
GRiz
Let us say you have a blue widget to auction. For as long as this is a valid item, I would have it crawled and indexed. As soon as the item is sold or no longer relevant set that page to noindex. This could lead to problems if you have hundreds of thousands (or millions!?) of items though...
If you list a similar blue widget later, you would have that newer page set to index. If you have fewer than, say, 50,000 to 100,000 pages, I would simply link from the old "noindex" page to the newer "index" page.
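So an expired item page handled that way might look like this - page structure and URLs are made up for illustration:

```html
<head>
  <!-- item has sold: drop this page from the index, keep its links crawlable -->
  <meta name="robots" content="noindex, follow">
</head>
<body>
  <!-- hand visitors (and bots) on to the newer, indexable listing -->
  <a href="http://www.example.com/auction/blue-widget-2">This item has sold - see the current blue widget auction</a>
</body>
```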
What I have noted from G is that if you let the bot go bonkers and have virtually everything crawlable, but only a small amount tagged with "index", G will up the crawl rate (and vice versa). At the end of the day though,
relevant (index) pages crawled / total pages crawled = X
Where X stays pretty much the same in the long run. That said, my tests have shown that "less is more". In the spirit of full disclosure though - these tests have been run on domains with page counts in the millions and multiple languages.
Personally I would only change your robots.txt if you are confident that you are getting 'real' links to a minimum of about 5%-10% of your "longtail" product pages, and 'social media' / 'bookmark' / 'nofollow' links to about 10-15% of them.
A question that might help: is Google trying to crawl the blocked pages, and is Google finding links pointing to those pages from outside your domain?
Best advice I can give without more specifics...
GRiz