This is related to the prior thread about RSS feeds causing a site not to rank, but since the moderator was specific about RSS there, I started a new thread because this does NOT involve RSS.
In our case we have posted a weekly original article by a journalist friend for the last 12 years. It used to be syndicated in parts in national print papers, which was fine as it rarely made it to the internet in its entirety except on our site.
Well recently, without our knowledge, she went and started sending the same content and photos to a number of other websites for them to post as well. Coincidentally (or was it?) she apparently started right around April 2011, and we didn't realize it until this week. We got hit really hard by Panda in April and again later and believe it may be directly tied to this. I don't know if these other websites are even aware of SEO and Panda details. I'd say one may be but the other definitely is not.
Anyway, we are now in the same boat as others wherein searching for a complete quoted sentence puts us in the absolute last supplemental result below the others that recently started posting her articles. In our case we add a lot of editing of the text and processing of the photos and formatting of the textual content to make it better for the reader, whereas the other sites just slap it up with ten font changes on the page as she writes it, misspellings, dynamically re-sized photos, etc. So it probably takes them 5 minutes while we take 2-3 hours per column.
So given that she sends it out at the same time to everyone, we can often be a couple of hours or more behind the others in posting the OFFICIAL version. Some of them are using WordPress and I believe maybe even claiming authorship via its options, etc. In our case, we have an agreed set weekly posting date, but she sends it to us (and them) a few days early so we have time to process it, but they slap it up as soon as they get the raw content!
But not always. And even when we are the absolute first to post, and we show up in the SERPs first and earlier, they always eventually come out on top when G finally does crawl theirs. Our home page is a PR4 while none of theirs top PR2. Our column pages are all at least one PR rank higher than theirs, and our Alexa rank is 1/4 of any of theirs, yet somehow we are being outranked. (Yes, I read the column hijacking article linked in another recent thread.)
It just does not make sense. We also have the domain for her column and are considered the primary source, although for historical purposes it points to a subdirectory on our main site. BTW this only occurs on G, the others treat our version far more fairly.
For various reasons we can't make her understand the impact of this practice. Being old school she still sees it as the more places she is posted/read/exposed, the better - and in part she is right - but it could be KILLING us in the process. Nor can we really ask her to change the practice of sending it out to others at the same time. And from the looks of it I don't think getting it earlier than the others would help much either. We can't DMCA the other sites because she is giving them permission to use it. And at the same time we are friends of friends going back a long ways and don't want to dump her site for her altogether (although if this doesn't work we probably will have to.)
So, that all explained, rather than play this apparently unbeatable speed/PR race, we've decided to take another approach to attempting to get rid of our duplication penalty issue.
We don't want to 301 or 404 all the old columns for the last 2 years in case this turns out NOT to be the cause of our Panda penalty, or in case some day we talk some sense into her, get rid of the other sites, and later want to restore them and some semblance of their ranking. Plus the older pages (older than 2 years) currently rely on the newer ones' backlinks to be found by the search engines (which we will be fixing by adding archive indexes for each year, down the road).
So the only solution I've been able to come up with is to insert a googlebot meta tag, <meta name="googlebot" content="noindex,follow">, in the <head> of each column from the last 2 years and going forward. This way the official version of the content will still be available to our readers and to her e-mail list readers but, I'm assuming, should not penalize us for duplicate content anymore. We don't really care if these pages are found in the SERPs as all of OUR readers come weekly to the primary entry page for her site to access the latest column.
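Since this means touching two years of static column pages, here is a rough sketch of how the tag could be inserted by script rather than by hand. It assumes the columns are plain HTML files that each have a <head> element; the function name and the regex approach are my own illustration, not a tested tool, so try it on copies first.

```python
import re

# The standard robots meta tag form, addressed to Googlebot only.
GOOGLEBOT_TAG = '<meta name="googlebot" content="noindex,follow">'

# Matches the opening <head ...> tag but not <header ...>.
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
# Detects an existing meta tag whose name attribute starts with "googlebot".
_HAS_TAG = re.compile(r"<meta\b[^>]*\bname\s*=\s*[\"']?googlebot\b", re.IGNORECASE)


def add_googlebot_noindex(html: str) -> str:
    """Return html with the googlebot meta tag inserted right after <head>.

    Pages that already carry a googlebot meta tag, or that have no <head>
    element at all, are returned unchanged.
    """
    if _HAS_TAG.search(html):
        return html
    match = _HEAD_OPEN.search(html)
    if match is None:
        return html
    end = match.end()
    return html[:end] + "\n" + GOOGLEBOT_TAG + html[end:]
```

To apply it, one could loop over the column files (for example with pathlib.Path.glob on whatever directory holds them), read each file, pass the text through add_googlebot_noindex, and write it back only if it changed. Because pages that already have the tag are skipped, running it twice should be harmless.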
Before we undertake this large and scary task, my questions are:
1. Are my assumptions all correct or am I missing something?
2. If we do a noindex,follow in the page head, will this definitely stop the duplicate penalty? I've seen instances where G is still indexing pages, or at least still knows about them even if it is not including them in the SERPs.
3. If later we remove the noindex will we be able to get these re-indexed without much penalty?
4. Each column links back to the main title page and to our home page as well as the prior week's column - with the noindex,follow will we retain PR pass-through from other sites linking to the individual columns?
5. Should I canonicalize the weekly columns to the main title page (they are not duplicates of it but ARE older versions - with totally different content)?
6. Is there any way that the columns can be made to come up in our on-site G Adsense search since they will not be in the general database?
7. Will G Adsense still be able to determine what the page is about in order to serve appropriate ads?
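Related to questions 2 and 5: whatever the answers turn out to be, it seems worth being able to check what each column page actually declares before and after the change. Below is a minimal sketch using Python's standard html.parser that reports any robots/googlebot meta directives and the canonical link found on a page. The class and function names are made up for illustration, and it only reads the markup; it says nothing about how G will treat the page.

```python
from html.parser import HTMLParser


class RobotsAudit(HTMLParser):
    """Collects robots/googlebot meta directives and the canonical link."""

    def __init__(self):
        super().__init__()
        self.directives = {}   # e.g. {"googlebot": "noindex,follow"}
        self.canonical = None  # href of <link rel="canonical">, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            name = (a.get("name") or "").lower()
            if name in ("robots", "googlebot"):
                content = (a.get("content") or "").lower().replace(" ", "")
                self.directives[name] = content
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")


def audit(html):
    """Return (directives, canonical) for one page's HTML text."""
    parser = RobotsAudit()
    parser.feed(html)
    parser.close()
    return parser.directives, parser.canonical
```

Running audit over every column file and listing the pages where "googlebot" is missing from the directives would show which pages the change has not reached yet.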
Is there a better way or other solution (other than just forgetting about the column altogether and short of knocking some sense into the writer)? Something simple I haven't thought of?
I thought about breaking out her domain name to a separate account and letting IT get Pandalized on its own if she is this stupid, instead of us, but that opens another whole can of worms (linking to bad neighborhoods, for instance) and loses us a LOT of the PR from the older pages.
Thanks to anyone who has read this far down :)
[edited by: tedster at 4:30 am (utc) on Nov 15, 2012]
[edit reason] added line breaks [/edit]