I've tried controlling Google and other search engines with robots.txt. When I exclude the pages I don't want indexed, the engines don't spider them, but they still include them in the index as bare URLs. Alternatively, I've tried META ROBOTS NOINDEX tags. That works a little better, but it wastes a lot of time and bandwidth because the engines still spider the entire site. Also, the pages still appear as URL-only entries for a while.
The only solution I see is to cloak. I can fairly easily detect the major spiders and remove the links on the page I serve them. For instance, I could change <a href="/bluewidgets.htm">blue widgets</a> to blue widgets.
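A minimal sketch of the link stripping described above, assuming crawlers are recognized by User-Agent substring. The token list and the names is_spider and render_for are made up for illustration, and this version strips every anchor rather than only the links to be hidden.

```python
import re

# Assumed, non-exhaustive list of crawler User-Agent substrings (hypothetical).
SPIDER_TOKENS = ("googlebot", "slurp", "msnbot")

# Matches an anchor element and captures its inner text.
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def is_spider(user_agent):
    """Return True if the User-Agent contains one of the assumed crawler tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in SPIDER_TOKENS)


def render_for(user_agent, html):
    """Replace each link with its plain text for spiders; leave the HTML alone otherwise."""
    if is_spider(user_agent):
        return ANCHOR_RE.sub(r"\1", html)
    return html


page = '<a href="/bluewidgets.htm">blue widgets</a>'
spider_view = render_for("Mozilla/5.0 (compatible; Googlebot/2.1)", page)
shopper_view = render_for("Mozilla/4.0 (compatible; MSIE 6.0)", page)
```

Here spider_view is the bare text and shopper_view is the unchanged link, which is exactly the difference that makes this cloaking.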
Would Google frown on this? Is there any other solution without resorting to cloaking?
TIA!
Let me explain a little more about the pages that I want to hide from Google. My site sells several thousand widgets. Also available are several bundles where shoppers can choose a set number of widgets from a list of several hundred.
When shoppers look at the page for a specific bundle, I show the most popular widget available in the bundle first. This is the page I do want indexed.
If the shopper is looking at a specific widget that is available in a bundle, I link to the bundle and try to upsell it. If they follow the link from a specific widget to a bundle, I assume that they're interested in that specific widget and change the display order of the widgets in the bundle to weight both the popularity and the relevance to the selected widget. As a result, there are several hundred variations on each bundle page. I don't want these variations indexed, as they are essentially just duplicate content (in a different order).
Similarly, I would like to hide the "add to cart" links from search engines. On each product page, I have several dozen different "add to cart" links for various configurations of each widget. I exclude those in robots.txt, but Google still adds them to the index as just URLs.
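To make the robots.txt exclusion concrete, here is a small sketch using Python's standard urllib.robotparser. The /cart/ prefix and the rule set are assumptions for illustration, not the site's actual layout.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /cart/ prefix is an assumed URL layout.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(path, user_agent="Googlebot"):
    """Return True if the rules above allow this user agent to fetch the path."""
    return parser.can_fetch(user_agent, path)


cart_allowed = may_fetch("/cart/add?widget=42")
product_allowed = may_fetch("/bluewidgets.htm")
```

The rule only governs fetching. It says nothing about whether a URL discovered through a link may be listed, which matches the URL-only entries described above.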
I would love to hear more opinions and perhaps even a more official answer from GoogleGuy.
This would be a very benign use of cloaking indeed. However, I can think of a better way to accomplish what you intend.
Why not simply deny access to bots on the pages you wish to exclude from the indexes?
You could do this via mod_rewrite. Test for the user agents you want to deny, then give them a Forbidden error whenever they try to access certain files. If you put a "marker" in the filename of the files you want excluded, such as a special string of characters you could write a regex for, it would be very easy to write your rewrite rules.
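The real rules would live in Apache configuration. The Python sketch below only models the decision they would make: a bot user agent plus a marked filename yields 403 Forbidden. The "-nobot" marker, the token list, and the function name status_for are hypothetical.

```python
import re

# Assumed crawler User-Agent substrings (hypothetical, not exhaustive).
BOT_TOKENS = ("googlebot", "slurp", "msnbot")

# Hypothetical filename marker: any path containing "-nobot" is off limits to bots.
MARKER_RE = re.compile(r"-nobot")


def status_for(user_agent, path):
    """Return 403 when a known bot requests a marked file, otherwise 200."""
    ua = user_agent.lower()
    is_bot = any(token in ua for token in BOT_TOKENS)
    if is_bot and MARKER_RE.search(path):
        return 403
    return 200


bot_on_marked = status_for("Googlebot/2.1", "/bundle7-nobot.htm")
shopper_on_marked = status_for("Mozilla/4.0 (compatible; MSIE 6.0)", "/bundle7-nobot.htm")
bot_on_normal = status_for("Googlebot/2.1", "/bundle7.htm")
```

Shoppers see every page as usual, and bots see unmarked pages as usual. Only the combination of bot and marker is refused.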
The advantages of doing this are that it isn't cloaking at all (no possible penalties), bandwidth use is limited (bots won't be spidering the excluded pages), and the pages you don't want indexed won't appear in the index at all. The disadvantage is... well, I can't think of any.
Dan
The things I've tried:
1) Excluding the pages in robots.txt. This is about the worst. They get added to the index as URL-only pages. Google never spiders the page, so it never gets removed.
2) Blocking the pages with <META NAME="robots" CONTENT="NOINDEX">. This is not too bad, but is very wasteful of bandwidth. When Google finds a link to the page, they add it as a URL-only page. When they get around to spidering it and see NOINDEX, they remove it.
3) Blocking access to the page through mod_rewrite (as you suggested) or similar methods. This is better on bandwidth, but still wastes the spider's time. When Google finds a link to the page, they add it as a URL-only page. When they get around to spidering it and are unable to access it, they remove it.
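On item 2, the bandwidth cost follows from where the directive lives: a crawler has to download and parse the page before it can see NOINDEX. A rough sketch of that check with Python's standard html.parser (the class and function names are made up for illustration):

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots meta tag containing 'noindex' was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        is_robots = attr.get("name", "").lower() == "robots"
        if is_robots and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def has_noindex(html):
    """Return True if the fetched page body carries a robots NOINDEX directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


blocked = has_noindex('<html><head><META NAME="robots" CONTENT="NOINDEX"></head></html>')
plain = has_noindex("<html><head><title>blue widgets</title></head></html>")
```

The input to has_noindex is the full page body, so the page has already been transferred by the time the answer is known.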
All these URL-only pages really add up for me. I have thousands of pages that I want indexed, but I have tens of thousands of other pages that are a waste for Googlebot to try to index. I have to believe that at some point Google will say "90% of the pages we try to index on his site aren't indexable. We'll focus our spidering efforts on other sites." I'd rather turn Google loose on just the pages I know they would want to see and not have them waste processing time, bandwidth, index storage space, etc.