|Can Duplicate Content in E-Commerce Pages Lead to a Penalty?|
Recently our robots.txt file was deleted by accident. The file included a directive blocking an entire folder of e-commerce-related pages. That directive was placed there before my time, but without the robots.txt, the pages are now pouring into the index.
The reason the folder was still blocked is that a few years ago we implemented a system that generates better-optimized URLs for those types of pages, whereas the URLs in the blocked folder are not as well optimized. It worked, and it increased visits and conversions greatly. Long story short, we still use both URL styles on the site, but because of the way things are set up, only one of them was being crawled and indexed.
Well, now the previously blocked pages are showing up in the index and are actually ranking and converting very well, which was surprising but welcome. It seems that when it has a choice, Google will more often rank the previously blocked page rather than the optimized URL.
Because both sets of pages are now being crawled, this presents a huge duplicate content issue. We want to nix the optimized-URL pages, but this will take some time due to limited resources and internal processes. In the meantime, I don't really want to re-block the pages in robots.txt, but I also don't want to deal with a duplicate content penalty in the long run.
I know Google's help section says it is capable of picking the best page to index in this situation, and that duplicate content doesn't trigger a penalty unless it's a deliberate attempt at manipulation (which, as a mistake, this wasn't), but I'm not sure. I've tried to block the non-optimized URLs, but because of the way our CMS is currently set up, it blocks both pages. Changing how this works would also take time.
So my question is: is it better to block the non-optimized URLs and take a potential conversion hit for now, or to leave them and hope Google will just decide which page to use and not penalize us for it?
It would likely be best to 301-redirect them if possible. If it's not, I think I'd probably leave them, put the "friendly" URLs in an XML sitemap but not the URLs that were previously blocked [Google takes XML-sitemap inclusion of one version of a page, but not another duplicate of it, as an indication of the canonical location for the information], and then set a Link: <http://www.example.com/friendly/url.ext>; rel="canonical" header via PHP or .htaccess.
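To illustrate the .htaccess route, the header could be set for a known duplicate page something like this (a sketch only -- the domain, file name, and paths are placeholders, and mod_headers must be enabled):

```apache
# Hypothetical .htaccess sketch: send a canonical Link header
# for one duplicate page, pointing at its friendly URL.
<Files "product.ext">
    Header set Link "<http://www.example.com/friendly/url.ext>; rel=\"canonical\""
</Files>
```

In practice you'd need some way to map each non-friendly page to its friendly counterpart, which is why PHP may be the more realistic option for a large folder.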
Yes, as JD_Toims said, rel="canonical" is your friend here. You can set it up as explained above or you can add <link rel="canonical" href="full-URL-I-want-indexed"> in the head section of the page.
It does not matter if this declaration appears in the <head> of both URL versions of the page; a rel="canonical" that points to a page's own URL (a self-referencing canonical) is fine.
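A minimal sketch of that tag, assuming the friendly URL is the one you want indexed (the URL here is a placeholder):

```html
<head>
  <!-- Same tag on both the friendly and the non-friendly version of the page;
       on the friendly version it is simply self-referencing. -->
  <link rel="canonical" href="http://www.example.com/friendly/url.ext">
</head>
```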
Could it be that the search engine sees them as new pages because it hadn't seen them before? And that's why they rank now?
JD_Toims and aakk9999:
Thanks. Unfortunately, the way things are set up now, I can't easily choose one type of URL or the other for the canonical; otherwise I'd have already used this to meta-noindex the non-friendly URLs. I also cannot redirect, because I still need the other type of page (it's a complicated setup).
That crossed my mind. Perhaps the pages will rank lower once they're no longer considered fresh?
|(it's a complicated setup) |
That's why I suggested .htaccess or PHP to set the canonical -- There has to be a correlation somewhere between the two different URLs for both to be populated with the same information, so there has to be a way to "draw the same conclusion" with [most likely] PHP and "find the friendly URL" to put in a header or <link>.
As far as noindexing goes, if you can block the non-friendly URLs in robots.txt you can easily noindex via .htaccess in the directory you had disallowed by robots.txt.
Header set X-Robots-Tag "noindex"
Edited: Forgot <Location> is not available in .htaccess, so FilesMatch with an implicit catch-all should do the trick.
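Putting the two pieces together, the .htaccess placed in the formerly disallowed folder might look like this (a sketch; the explicit catch-all pattern is an assumption about how you'd match every file there):

```apache
# .htaccess in the folder that robots.txt used to disallow:
# send a noindex header for every file served from it.
<FilesMatch ".*">
    Header set X-Robots-Tag "noindex"
</FilesMatch>
```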
Thanks again. I'll have to see what we can do. We're actually on an IIS setup (ASPX site), so I know .htaccess isn't supported.
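I'm guessing the web.config equivalent would be something like this sketch, assuming the non-friendly pages live under one folder we can target with a <location> element (the path is a placeholder):

```xml
<!-- web.config at the site root: send X-Robots-Tag: noindex
     for everything under the /nonfriendly folder. -->
<configuration>
  <location path="nonfriendly">
    <system.webServer>
      <httpProtocol>
        <customHeaders>
          <add name="X-Robots-Tag" value="noindex" />
        </customHeaders>
      </httpProtocol>
    </system.webServer>
  </location>
</configuration>
```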
Ah, gotcha, and my sympathies lol -- What might be easiest is adding a new col [say "canonical"] to a table in your database that both versions of your pages access, then writing a simple bot that runs through a list of your friendly URLs and requests each one, but rather than displaying the info, inserts the current URL into the row it got the information out of.
Then you'd have the canonical location readily available for both versions of the page and could just grab it when you grab the other info and drop it on the page or stick it in a header.
Was your XML sitemap updated with the new URLs, and were the old URLs removed? Perhaps you're accidentally sending both URLs in the sitemap, and/or it has too many URLs in it.
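If the sitemap turns out to be the issue, it should list only the friendly version of each page -- a minimal sketch (the URL is a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Friendly URL only; omit the previously blocked duplicate -->
  <url>
    <loc>http://www.example.com/friendly/url.ext</loc>
  </url>
</urlset>
```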