|Trying to get rid of dupe content, but Google won't drop it|
I've got another random Panda problem that I'd like to get your advice on. We had a whole bunch of pages that contained duplicate intro copy. For years it was fine, and Google did a great job of ensuring the correct page ranked, but since we've been Pandalized, we've taken the opportunity to clean up. So first up, we removed that duplicate intro copy from all pages apart from the top-level page where it should appear. Following the removal of the dupe content, we realised that the pages were actually pretty weak, so we've just blocked them using the NoIndex meta tag.
This has resulted in some random weirdness. We didn't leave it long enough before implementing the robots.txt block, so Google has stopped crawling these pages but has kept the last copy it saw (i.e. the versions with the dupe content) in its index. Therefore, if you do a site search looking for the dupe content, it's still in Google's index, even though we removed it and blocked those pages over 7 weeks ago.
I'm now really confused as to what to do. We've removed the dupe content and blocked these pages, but this could potentially still be seen as duplicate content if Google is checking its indexed data. Should I unblock these pages to allow Google to recrawl them without the dupe content, before blocking them again? Seems counter-intuitive, but maybe I just need to flush this dupe content from Google's cache. Thoughts?
Have you blocked the dupe pages using robots.txt as well? If so, that could be a problem, since Googlebot won't actually be able to reach these pages and read the "noindex" tag, and so these pages could stay in the index for a very long time.
It's best to just leave the noindex in place, *don't* block in robots.txt, and then let Googlebot do its job, however long that takes. Basically, only use robots.txt for new pages that you never want crawled (and I would also put the noindex tag on those pages just in case), but don't use it for existing indexed pages that you want gone.
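To make the distinction concrete: the noindex tag lives in each page's head (and only works if Googlebot can still crawl the page to see it), while robots.txt stops crawling altogether. Paths below are just placeholders:

```html
<!-- On each page you want dropped from the index. Googlebot must be
     able to crawl the page in order to see this tag, so don't also
     block the page in robots.txt. -->
<meta name="robots" content="noindex">
```

```
# robots.txt - only for sections you never want crawled in the first place
User-agent: *
Disallow: /new-private-section/
```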
Unfortunately, I've found that Google has been quite slow in removing noindexed and newly 404'd pages lately. I've had some 404s in place since late April, and some are still in the index, as are a few nofollows that were added in mid-May. For some reason, de-indexing seems to have stopped or slowed around the end of May/beginning of June, for my site anyway (despite seeing two crawl spikes in WMT since then).
A smart way to resolve this is to use a server-side 301 redirect to the actual page. I'm having a similar problem with a few pages under the URL rewriting scheme I have going, which allowed spaces in place of hyphens. I've fixed that with PHP by taking the page's URL parameters, comparing them with the currently requested URI, and redirecting if they don't match.
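For anyone wanting to do something similar, here's a minimal sketch of that check; the parameter and path names are made up, so adapt to your own rewriting scheme:

```php
<?php
// Hypothetical sketch: build the canonical path from the page's URL
// parameter, replacing any spaces with hyphens.
$slug      = isset($_GET['page']) ? $_GET['page'] : '';
$canonical = '/articles/' . str_replace(' ', '-', $slug);

// Compare against what was actually requested (path only, no query string).
$requested = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

if ($requested !== $canonical) {
    // Permanent redirect to the canonical URL.
    header('Location: ' . $canonical, true, 301);
    exit;
}
```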
I somehow don't think noindex works as well as some people suggest. 301s have had the best outcome in my experience.
If the total number of pages is fairly limited, I would have to agree with Andem - redirection is best - as long as the duplicate content does not involve the homepage.
I'm slightly surprised it's even still there, to be honest. Goog's gotten pretty good at delisting duplicate content as of late.
AG4Life could be onto something there. Also, have you considered adding a canonicalisation tag? It's the 'official' solution, and may preserve more value from any backlinks to the duplicate content than a 301 will. (Then again, it may not. I actually don't know for sure. Has anyone else got an opinion on this?)
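For reference, the tag goes in the head of each duplicate page and points at the version you want Google to treat as the original (URL here is made up):

```html
<link rel="canonical" href="http://www.example.com/category/intro-page/">
```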
The pages are required for pagination, so we can't 301 them unfortunately :(
I just need to get them indexed without the dupe content
almighty_monkey is right - the canonical link tag for deep pagination will preserve most of the backlink power.
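Worth noting that Google also supports rel="next"/rel="prev" link tags specifically for paginated series, which can be used alongside the canonical tag; a sketch for a hypothetical page 2 (URLs are made up):

```html
<!-- In the <head> of /widgets/page2.html -->
<link rel="prev" href="http://www.example.com/widgets/page1.html">
<link rel="next" href="http://www.example.com/widgets/page3.html">
```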
|Therefore, if you do a site search looking for dupe content, it's still in Google's index, even though we removed it and blocked those pages over 7 weeks ago. |
It takes far too long for Google to "clean up" its index. At the end of March, a website that copied more than 100 pages from my website got its account suspended by its hosting provider. The infringer gave up, and since then the website has been offline. The only thing Google sees now when checking those pages is "account suspended" messages. This has been going on for months, but that website is still ranking for a lot of search terms. I think that says a lot about the quality of Google's index... Bing removed the non-existent site from its index. Why can't Google do the same?
This is a classic case of putting the cart before the horse. Something isn't quite right when it's you scrambling to re-organize pages to suit the search engines, when in fact it's their problem to organize your rankings.
Anyway, slow down. Make minor changes and allow time for Google to pick up on them and let them propagate. Multiple changes in rapid succession cause unforeseen issues when some pages drop but not all. Most important is to keep working on content and ideas; in the end Google will do whatever they please.
I'm watching Google spider tens of thousands of pages now returning 410 Gone, and seeing the indexed numbers go down slowly - not on a day-to-day basis, but in discrete clumps, often more than a week apart.
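For anyone wanting to serve 410s the same way, one simple option under Apache is mod_alias's "gone" directive (the path is illustrative):

```apache
# .htaccess - signals the page is gone for good, not just temporarily missing
Redirect gone /old-section/removed-page.html
```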
@Sgt_Kickaxe Agreed - it feels really counter-intuitive, having to block pages to appease Panda