joined:Apr 26, 2012
Recently our robots.txt file was deleted by accident. The file included a directive blocking an entire folder of e-commerce-related pages. This directive was put in place before my time, but without the robots.txt file, those pages are now flooding into the index.
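To illustrate what that directive was doing, here is a minimal sketch using Python's standard-library robots.txt parser. The folder name `/shop/`, the `/products/` path, and the `example.com` URLs are placeholders I made up for the example, not our real paths. It compares a robots.txt containing a folder-level Disallow rule against an empty (effectively deleted) one.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rule: block every URL under the /shop/ folder for all crawlers.
WITH_RULE = """User-agent: *
Disallow: /shop/
"""

blocked = build_parser(WITH_RULE)
deleted = build_parser("")  # an empty file behaves like "no rules at all"

# Placeholder URLs standing in for the two URL styles.
old_style = "https://www.example.com/shop/item-123"
new_style = "https://www.example.com/products/item-123"

print(blocked.can_fetch("Googlebot", old_style))  # rule applies to this folder
print(blocked.can_fetch("Googlebot", new_style))  # outside the blocked folder
print(deleted.can_fetch("Googlebot", old_style))  # no rules, so crawling is allowed
```

With the rule in place only the second URL style is crawlable; once the file is gone, both are, which matches what we are seeing now.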
The folder was still blocked because a few years ago we implemented a system that better optimizes the URLs for those types of pages, whereas the URLs in the blocked folder are not that well optimized. It worked, and visits and conversions increased greatly. Long story short, we still use both URL styles on the site, but due to the way things are set up, only one of them was being crawled and indexed.
Well, now the blocked pages are showing up in the index and are actually ranking and converting very well, which was surprising but welcome. It seems that when it has a choice, Google will more often rank the previously-blocked page rather than the optimized URL.
Because both versions of each page are now being crawled, this presents a huge duplicate content issue. We now want to nix the optimized URL pages, but this will take some time due to limited resources and internal processes. In the meantime, I don't really want to re-block the pages in robots.txt, but I also don't want to deal with a duplicate content penalty in the long run.
I know Google's help section says that Google is capable of picking out the best page to index in this situation, and that duplicate content doesn't incur a penalty unless it's a deliberate attempt at manipulation (which this wasn't, since it was a mistake), but I'm not sure how far to trust that. I've tried to block the non-optimized URLs, but due to the way our CMS is currently set up, doing so blocks both versions of the page. Changing how this works would also take time.
So my question is, is it better to block the non-optimized URLs and deal with a potential conversion hit for now, or to leave them, and hope Google will just decide which page to use and not penalize us for it?