--- At this point, the most common theme I see in wrongly Pandalyzed sites is that their original content is too widely syndicated and sometimes scraped. The signals that Google uses to identify the canonical source of the content are being swamped, and the result is that the original source starts to look like a scraper. ---
This is very relevant to my case, and I would appreciate all the conversation and feedback possible on the following. Sorry this is so long, but the devil is in the details (and the historic path) here.
I have been reading for days now about syndication, scraping, and duplication.
The topic seems to have first arisen around 2008, and the consensus at the time was that it was perfectly okay, and even encouraged... by everyone! Supposedly, duplication only penalized sites if the content was duplicated within their own domain, because it was assumed they were trying to game the system. In fact, this is STILL posted on the Google policy pages (I just read it today). Coming to the present (especially with Panda), it appears that position has, for the most part, done a 180!
We have one friend/writer/PR person who does two weekly columns. She has been doing them since at least the mid-90s. Originally it was a newsletter e-mailed to major and minor newspapers nationwide. Many newspapers carried the entire column under her name as a syndicated column, or were allowed to pick it apart and use parts as interest or space permitted. As I understand it, most writers DREAM of getting syndicated and paid in this manner. Her topic was MADE for our website, and in 2000 we, with her full cooperation, created a section of our (then extremely high-traffic) website for her columns, bio, contact info and archives.
I must mention that her column consists mainly of a collection of the most important and targeted press releases sent to her (as a recognized authority in her field) by various businesses in the field: new product announcements, events, photos, etc. They send them hoping to be included in her column and thus get much wider exposure than they possibly could on their own, or even by posting on their own websites. Many are rejected. When one is included, it generally runs about 92% word for word the same as submitted, to ensure accuracy (usually at their request).
Until Panda this had worked great for everyone. The businesses efficiently get their info out to interested parties, and the media outlets get a periodic, TIMELY source of targeted information without having to search all over the net for it. Some of the info is EXTREMELY time critical (about 1/3 is useless within 2 weeks) and simply couldn't be found quickly or effectively on Google even if you knew what to look for, while other parts are still of interest many years later for historical reasons. Her columns have always ranked highly when keywords from them were searched, her photos are used by many as avatars, and her pages are incredibly heavily linked by other sources. She has hundreds of KNOWN followers (auto-notified) and possibly many more, based on direct accesses.
So to emphasize: the column has always been substantially copied from other sources, but never from a single source. It usually draws on about a dozen per week, all of whom contributed their content willingly, along with ~35% additional original hand-written commentary. The full column runs about 15-25K in file size weekly. She also still syndicates the column by e-mailing the weekly content to the same print media, as well as to a couple of websites that may use more or less of it as they decide, by permission. We have always been the recognized "source" of the FULL column, and a few other sites have recently been permitted to run (syndicate) the column in full on their websites (none of these - at least until lately - had a ranking anywhere near our original site). We USUALLY have it up first, but since we get it at about the same time as the others, we can't always guarantee it. Plus, at least lately, it appears we are not always the first crawled by G anymore. Most of the archives (from say 2000-2009) contain information found virtually nowhere else, because either the info expired or it was never on the net to begin with.
Since Panda, it APPEARS this is now frowned upon. For one thing, many of the original sources (the businesses themselves) are adding the same info to their own websites (sometimes AFTER her columns come online). Secondly, we have a number of sites running the same column in full. In most cases (until this last week) her original column came up first in the SERPs, but not so much anymore.
The rest of our site is totally original, written by us, although we have lately found whole pages copied on 80K other sites. The site has lost at least one point of PR sitewide and has been pushed further out of G with each major Panda update. As of today we don't even show up for a paragraph of our home page text unless we type in site: first, although over 80,000 spam sites that copied it do. And G insists there is no manual penalty, so there is nothing to ask to have lifted.
So what to do? Do we tell the original companies they can't post their own information on their own websites, or send it to anyone else who might? Do we tell our friend/writer of many years that she has been banned/obsoleted by G? ("You are insignificant, you shall be exterminated." Is that a Dalek quote?)
We have pretty much decided to move her OFFICIAL COPIES of the columns to a separate "burnable" domain and 301 redirect to them from their old location. Of course that loses us all those legacy links and content, but what can you do? Will that work to get any perceived duplication off our site? Or will we still be penalized by association with the myriad of other website links out there?
We considered having her link inline to each individual business's "original" version, with her title and commentary and original or supplied photos (if the business has a version online at the time of publication, or else linking to someone else who has already posted that segment). But since we archive the columns for many years, that becomes an ever-growing battle of checking for broken links and spammitized domains.
We also thought of keeping the column on the main domain and asking her to rewrite it with links to the segments on the burned domain, but then we would have all these links to a "bad neighborhood".
Finally, we considered a piecemeal method where we keep the archives in place (since nothing there is elsewhere on the net) and place the newer columns (~12 months) on the burned domain, then move them back later once the companies and other sources remove their copies. But how do we determine "when the coast is clear" for avoiding a duplication penalty?
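For what it's worth, here is one rough way to put a number on "when the coast is clear": compare the text of a column against the text of another page and see how many of its word sequences still appear there. This is only a sketch in Python. The function names, the 5-word window, and the sample strings are my own assumptions, and it is certainly NOT how G measures duplication; it only gives a consistent yardstick for comparing a column to copies we already know about.

```python
import re

def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap_ratio(ours, theirs, k=5):
    """Fraction of OUR shingles that also appear in the other text (0.0 to 1.0)."""
    a = shingles(ours, k)
    b = shingles(theirs, k)
    if not a:
        return 0.0
    return len(a & b) / len(a)

# Hypothetical sample texts, standing in for a column and a copy of it.
column = "one two three four five six"
other_page = "one two three four five seven"
ratio = overlap_ratio(column, other_page)  # half of the column's shingles match
```

A ratio near 1.0 means the other page still carries almost all of the column, and a ratio near 0.0 means the copy is gone or has been rewritten. What threshold counts as "clear" is a judgment call that this sketch cannot answer.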
So what do you all think? Personally I'm of mixed opinion.
SHOULD such usage be considered duplication when the providers WANT us to do it? IS it considered duplication by the latest Panda? If so, any ideas on keeping her content from dragging down our prior rankings while retaining incoming traffic, links and PR? Should G be far better at determining who is the originator? Do we need to collect permission from all the contributing sources and submit it? Most importantly, is there anyone who has similar content who has NOT been Pandalized at this point?
Has Google effectively abolished writers' hopes of net syndication? If publishers know they will be punished, why would they ever accept syndicated outside content again? (Unless you are a TRUSTED top 10 media outlet, I guess... good luck getting syndicated there.) I hope they realize this will ultimately increase the cost and reduce the quality of current info articles for the major media, since writers paid by only one source for completely original content will not be able to spend as much time on research (they will have to write 10 times as much) and will be looking for higher wages.