Cloaking to remove duplicate content?

Would Google consider this to be acceptable use?

MovingOnUp

10:43 pm on Jun 9, 2004 (gmt 0)

I have one site that has many different pages that Google and other search engines would consider duplicate content. The duplicate pages are useful to users as they narrow down selections, but I would rather have just the main page indexed.

I've tried controlling Google and other search engines with robots.txt. When I exclude the other pages, the engines don't spider them, but they do include them in the index with just the URL. Alternatively, I've tried META ROBOTS NOINDEX tags. That works a little better, but it wastes a lot of time and bandwidth because the spiders still crawl the entire site. Also, the pages still appear as bare URLs for a while.

The only solution I see is to cloak. I can fairly easily detect the major spiders and remove the links on the page I serve them. For instance, I could change <a href="/bluewidgets.htm">blue widgets</a> to blue widgets.
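To make the idea concrete, here is a rough Python sketch of that kind of link removal. It is only an illustration under assumptions: the spider token list, the function names, and the set of hidden paths are all made up for the example, and real spider detection would need more than a user-agent substring check.

```python
import re

# Assumed, incomplete list of user-agent substrings for major spiders.
SPIDER_TOKENS = ("googlebot", "slurp", "msnbot")

# Matches simple anchors such as <a href="/bluewidgets.htm">blue widgets</a>.
LINK = re.compile(r'<a\s+href="(/[^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def is_spider(user_agent):
    """Return True if the user agent contains one of the assumed spider tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in SPIDER_TOKENS)


def render_for(user_agent, page_html, hidden_paths):
    """Serve the page unchanged to browsers; for spiders, replace links
    to hidden_paths with their plain link text."""
    if not is_spider(user_agent):
        return page_html

    def unlink(match):
        # Keep the link unless its target is one we want to hide.
        return match.group(2) if match.group(1) in hidden_paths else match.group(0)

    return LINK.sub(unlink, page_html)
```

Called with a Googlebot user agent and hidden_paths containing /bluewidgets.htm, the blue widgets link comes back as plain text while every other link is left alone.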

Would Google frown on this? Is there any other solution without resorting to cloaking?

TIA!

robotsdobetter

11:02 pm on Jun 9, 2004 (gmt 0)

I don't think it would, because Google says to create your site for your visitors, not for Google or other search engines. Many sites use cloaking for their visitors; the cloaking that Google doesn't like is the kind used to get a higher ranking.

MovingOnUp

1:21 pm on Jun 10, 2004 (gmt 0)

This certainly wouldn't be to increase PageRank. Also, it would be the same content, just with some of the links changed to just text. The page would look the same but some of the text wouldn't be clickable.

Let me explain a little more about the pages that I'm wanting to hide from Google. My site sells several thousand widgets. Also available are several bundles where shoppers can choose a set number of widgets from a list of several hundred.

When shoppers look at the page for a specific bundle, I show the most popular widget available in the bundle first. This is the page I do want indexed.

If the shopper is looking at a specific widget that is available in a bundle, I provide a link to the bundle and try to upsell it. If they follow the link from a specific widget to a bundle, I assume that they're interested in that specific widget and change the display order of the widgets in the bundle to weight both the popularity and the relevance to the selected widget. As a result, there are several hundred variations on each bundle page. I don't want these variations indexed, as they are essentially just duplicate content (in a different order).

Similarly, I would like to hide the "add to cart" links from search engines. On each product page, I have several dozen different "add to cart" links for various configurations of each widget. I exclude those in robots.txt, but Google still adds them to the index as just URLs.

I would love to hear more opinions and perhaps even a more official answer from GoogleGuy.

volatilegx

1:43 pm on Jun 10, 2004 (gmt 0)

Hi MovingOnUp and welcome to WebmasterWorld :)

This would be a very benign use of cloaking indeed. However, I can think of a better way to accomplish what you intend.

Why not simply deny access to bots on the pages you wish to exclude from the indexes?

You could do this via Mod_Rewrite. Test for the user agents you want to deny, then give them a Forbidden error whenever they try to access certain files. If you put a "marker" in the filename of the files you want excluded, such as a special string of characters you could write a regex for, it would be very easy to write your Mod_Rewrites.
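The decision logic is simple enough to sketch. The snippet below is Python rather than actual Mod_Rewrite directives, so treat it as an illustration of the test only. The bot pattern and the "-nobots" filename marker are assumptions invented for the example.

```python
import re

# Assumed user-agent pattern for the bots to deny.
BOT_PATTERN = re.compile(r"googlebot|slurp|msnbot", re.IGNORECASE)

# Hypothetical filename marker identifying pages to keep out of the indexes.
MARKER_PATTERN = re.compile(r"-nobots\.htm$")


def status_for(user_agent, path):
    """Return 403 (Forbidden) when a listed bot requests a marked file,
    and 200 otherwise."""
    if BOT_PATTERN.search(user_agent) and MARKER_PATTERN.search(path):
        return 403
    return 200
```

With these assumptions, a listed bot asking for /bundle-nobots.htm gets a 403, while ordinary visitors and unmarked pages are unaffected.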

The advantages of doing this are that it isn't cloaking at all (no possible penalties), use of bandwidth is limited (bots won't be spidering cloaked pages), and the pages you don't want indexed won't appear in the index at all. The disadvantage is... well, I can't think of any.

Dan

MovingOnUp

2:12 pm on Jun 10, 2004 (gmt 0)

Thanks Dan, but here's the problem with that. Even if they can't get to a page, if they see a link to the page they'll add it to the index. It'll show up in the index with just a URL--no title and description--until they spider (or attempt to spider) the page.

The things I've tried:

1) Excluding the pages in robots.txt. This is about the worst. They get added to the index as URL-only pages. Google never spiders the page, so it never gets removed.

2) Blocking the pages with <META NAME="robots" CONTENT="NOINDEX">. This is not too bad, but is very wasteful of bandwidth. When Google finds a link to the page, they add it as a URL-only page. When they get around to spidering it and see NOINDEX, they remove it.

3) Blocking access to the page through mod_rewrite (as you suggested) or similar methods. This is better on bandwidth, but still wastes the spider's time. When Google finds a link to the page, they add it as a URL-only page. When they get around to spidering it and are unable to access it, they remove it.

All these URL-only pages really add up for me. I have thousands of pages that I want indexed, but I have tens of thousands of other pages that are a waste for Googlebot to try to index. I have to believe that at some point Google will say "90% of the pages we try to index on his site aren't indexable. We'll focus our spidering efforts on other sites." I'd rather turn Google loose on just the pages I know they would want to see and not have them waste processing time, bandwidth, index storage space, etc.

volatilegx

3:21 pm on Jun 10, 2004 (gmt 0)

If that's the case, you might want to consider using a JavaScript link to the pages you don't want indexed. You should be able to construct a JavaScript link that bots won't follow.
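As a rough sketch, a server-side helper could emit that kind of link as a clickable element with no href for a bot to pick up. This is Python, and the function name, the "jslink" class, and the example URL are all hypothetical. It also assumes the bots in question don't execute script.

```python
import html
import json


def js_link(url, label):
    """Build a clickable <span> that navigates via JavaScript instead of
    an <a href>, so there is no plain link in the markup."""
    # json.dumps yields a correctly quoted JavaScript string literal.
    handler = "window.location.href=" + json.dumps(url) + ";"
    return '<span class="jslink" onclick="{}">{}</span>'.format(
        html.escape(handler, quote=True), html.escape(label)
    )
```

For example, js_link("/bundle.htm?widget=42", "see the bundle") produces a span with an onclick handler and no anchor tag.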

MovingOnUp

3:41 pm on Jun 10, 2004 (gmt 0)

Excellent suggestion. The only downside to JavaScript is that a small percentage of users won't have it available.