It goes something like this. The original URL is
example.com/example/ab.html
The URL generated for the euro version is
example.com/exampleproduct.asp?example=example&id=ab&AB1=EUR
I tried blocking in robots.txt with
Disallow: /exampleproduct.asp
But no luck. Google is still finding all of those currency pages and reporting them as duplicates. There are 20 currencies on a site with over 100,000 products, so do the math.
Any suggestions?
[edited by: Receptional_Andy at 3:42 pm (utc) on Oct. 30, 2008]
[edit reason] Exemplified URLs [/edit]
Additionally, you could use the robots.txt checker within Webmaster Tools to verify that Google has a current copy of your robots.txt, and that it is blocked from accessing those URLs.
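For reference, a minimal version of the block (using your placeholder path) would be:

User-agent: *
Disallow: /exampleproduct.asp

Robots.txt matching is by path prefix, so that single line should catch all of the query-string variations - just make sure the Disallow sits under a User-agent group that actually applies to Googlebot.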
I checked Webmaster Tools prior to making this post. Everything looks like it should be working. It says the URLs are blocked, yet Webmaster Tools still reports a zillion currency URLs as having duplicate meta descriptions. That tells me that Google is still seeing those pages.
I think what you're seeing is the result of different things happening at different times, especially as the content may have been spidered as a result of Google's more "creative" crawling processes [webmasterworld.com] - meaning there are no actual links to the URLs themselves.
The WMT reporting of duplicates operates on content within the index - regardless of whether that content is due to be excluded. It doesn't "know" that you've added lines to robots.txt. I suspect that because there are no links to the content, it will take Google much longer to get rid of the URLs, and even then, many of them will remain in the index as "URL only" listings, perhaps indefinitely. So, I think it's a waiting game.
The other consideration is that the WMT error doesn't necessarily matter - it doesn't indicate content that cannot perform in search results, nor does it imply that there will be a negative effect on site performance. I think it's safe to ignore the warning while waiting for the content to disappear. The only other possibility is that Google is not correctly interpreting the robots disallow, or is not correctly applying the rules to content discovered by methods other than standard links, but personally I don't believe that to be the case.
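If you want to rule that last possibility out independently of WMT, you can sanity-check the rule locally. This is just a rough sketch using Python's standard urllib.robotparser with the placeholder URLs from above - it only does simple prefix matching, so it won't reflect Google's wildcard extensions, but it will tell you whether that Disallow line matches the parameterized URLs:

import urllib.robotparser

# The rule as posted, under a catch-all User-agent group (placeholder path)
rules = """User-agent: *
Disallow: /exampleproduct.asp
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A generated currency URL and the original page (placeholders from the example above)
currency_url = "http://example.com/exampleproduct.asp?example=example&id=ab&AB1=EUR"
original_url = "http://example.com/example/ab.html"

print(rp.can_fetch("Googlebot", currency_url))  # expect False - blocked by the rule
print(rp.can_fetch("Googlebot", original_url))  # expect True - still crawlable

If that prints False for the currency URL and True for the original page, the rule itself is doing its job, and it really is just a waiting game.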
I've been watching these for several months, hoping that they would fall out, but so far, like visiting relatives, they don't know that it's time to leave!
And, yep, these issues started popping up about the time Google got "creative" in its crawling.
WMT has lots of issues - some because it's considered an authoritative resource for Google information, and perhaps expectations are too high. But the duplicate detection is a very simple stat, and nothing like the process that goes into evaluation of pages for relevance in SERPs. Not that it isn't useful and usually worth fixing, but it doesn't necessarily have any real world impact.
And yep, it can take a long time for certain types of content to drop out. The only way to speed that up would be URL removal, but I'm no fan of that, and I don't believe it would actually do anything useful other than tidying up site: search SERPs anyhow - I doubt you're getting any visitors landing on these pages.