This 53 message thread spans 2 pages.
|How do we tell Google we were wrongly Pandalized?|
In Matt's YouTubed keynote speech from Pubcon he specifically says that if we feel we were unfairly demoted in Panda, to "LET THEM KNOW" about it so they can refine the algorithm. Does anyone know of the effective procedure to do this, or DID you have to pay the big bucks to attend Pubcon in order to do this?
--- At this point, the most common theme I see in wrongly Pandalyzed sites is that their original content is too widely syndicated and sometimes scraped. The signals that Google uses to identify the canonical source of the content are being swamped, and the result is that the original source starts to look like a scraper. ---
This is very interesting in my case. And I would appreciate all the conversation and feedback possible on the following. Sorry so long, but the devil is in the details (and the historic path) here.
I have been reading for days now about Syndication, scraping, and duplication.
The topic seems to have first arisen around 2008, and ALL the consensus around that time was that it was perfectly okay, and even encouraged... by everyone! Supposedly, duplication only penalized sites if the content was duplicated within their own domain, because it was assumed they were trying to game the system. In fact this is STILL posted on the Google policy pages (I just read it today). Proceeding into current times (especially with Panda), it appears that has for the most part done a 180-degree reversal!
We have one friend/writer/PR person who does two weekly columns. She has been doing them since the mid 90's at least. Originally it was a newsletter e-mailed to major and minor newspapers nationwide, and many newspapers carried the entire column under her name as a syndicated column or were allowed to pick it apart and use parts as interest or space permitted. As I understand it, most writers DREAM of getting syndicated and paid in this manner. Her topic was MADE for our website, and in 2000 we, with her full cooperation, created a section of our (then extremely high-traffic) website for her columns, bio, contact and archives.
I must mention that her column consists mainly of a combined collection of the most important and targeted press releases sent to her (as a recognized authority in her field) by various businesses in the field with new product announcements, events, photos, etc., with the desire to get included in her column and thus get much wider exposure than they possibly could on their own or even by posting on their own websites. Many are rejected. When they are included, the text is generally 92% the same word for word, to ensure accuracy (usually at their request).
Until Panda this had worked great for everyone. The businesses efficiently get their info out to interested parties, and the media outlets get a periodic, TIMELY source of targeted information without having to search all over the net for it. Some of the info is EXTREMELY time critical (about 1/3 is useless within 2 weeks) and simply couldn't be found quickly or effectively on Google even if you knew what to look for, while other parts are of interest many years later and historically. Her columns have always ranked highly when keywords from them were searched, her photos are used by many as avatars, and her pages are so heavily linked by other sources it is incredible. She has hundreds of KNOWN followers (auto-notified) and possibly many more based on direct accesses.
So to emphasize: the column has always been substantially copied from many sources, but has never been created from a single source; it usually draws on about a dozen contributors per week on average, who contributed their content willingly, along with ~35% additional original hand-written commentary. The full column runs about 15-25K in file size weekly. She also still syndicates the column by e-mailing the weekly content to the same print media as well as a couple of websites, who may use more or less of it as they decide, by permission. We have always been the recognized "source" of the FULL column, and a few other sites recently have been permitted to run (syndicate) the column in full on their websites (none of these, at least until lately, had a ranking anywhere near our original site). We USUALLY have it up first, but since we get it about the same time as the others, we can't always guarantee it. Plus, at least lately, we are not always the first crawled by G anymore, it appears. Most of the archives (from say 2000-2009) have virtually no information contained anywhere else, because either the info expired or it was never on the net to begin with.
Since Panda, it APPEARS this is now frowned upon. For one thing, many of the original sources (the businesses themselves) are adding the same info to their own websites (sometimes AFTER her columns come online). Secondly we have a number of sites running the same column in full. In most cases (until this last week) her original column always came up first in the SERPS, but not so much any more.
The rest of our site is totally original, written by us, although we have lately found whole pages copied on 80K other sites. The site has lost at least one PR sitewide and has been obliterated from G more with each major Panda update. As of today we don't even show up for a paragraph of our home page text unless we type in site: first, although over 80,000 other spam sites who copied it do. And G insists there is no manual penalty to reinstate.
So what to do? Do we tell the original companies they can't post their own information on their own websites or send it to anyone else who might? Do we tell our friend/writer of many years that she has been banned/obsoleted by G ("You are insignificant; you shall be exterminated" ... is that a Dalek quote?)?
We have pretty much decided to move her OFFICIAL COPIES of the columns to a separate "burnable" domain and 301 redirect to them from their old location. Of course that loses us all those legacy links and content, but what can you do? Will that work to get any perceived duplication off our site? Or will we still be penalized by association with the myriad of other website links out there? We considered having her link each business's "original" inline version with her title and commentary and original or supplied photos (if they have one at the time of publication, or linking to someone else who has already posted that segment), but since we archive them for many years, that becomes an ever-growing battle of checking broken links and spammitized domains. We also thought of keeping it on the main domain and asking her to rewrite the column with links to the segments on the burned domain, but then we would have all these links to a "bad neighborhood". We considered a piecemeal method where we keep the archives in place (since nothing there is elsewhere on the net) and place the newer (~12 months) on the burned domain, then redirect them back later when the companies and other sources remove their copies of the column. But how do we determine "when the coast is clear" for avoiding a duplication penalty?
So what do you all think? Personally I'm of mixed opinion.
SHOULD such usage be considered duplication when the providers WANT us to do it? IS it considered duplication by the latest Panda? If so, any ideas on keeping her content from dragging down our prior rankings while retaining incoming traffic links and PR? Should G be far better at determining who is the originator? Do we need to collect permission from all the contributing sources and submit them? Most importantly, is there anyone who has similar content who has NOT been Pandalized at this point?
Has Google effectively abolished writers' hopes of net syndication? If publishers know they will be punished, why would they ever accept syndicated outside content again? (Unless you are a TRUSTED top 10 media outlet, I guess... good luck getting syndicated there.) I hope they realize this will ultimately increase the cost and lower the quality of current info articles to the major media, since writers being paid by only one source for completely original content will not be able to spend as much research time to produce it (since they will have to write 10 times as much) and will be looking for higher wages.
I want to emphasize that my ideas about correlation between widespread syndication and being wrongly Pandalyzed are my own conjecture, nothing proven and nothing officially communicated. It's just what seems to make the most sense for the cases that have me scratching my head. The idea comes from noticing a number of factors - the most important of which is what sites are now ranking for searches where the original source used to rank. And if it's accurate, it is NOT what Google would intend... instead it's what they still struggle with.
Remember that one month before the first Panda release Google rolled out their "Scraper" update [webmasterworld.com]. At the time, I considered it an essential fix, and until Panda launched it seemed pretty darned good.
Since Panda 1.0 and up to right now, it really hasn't looked all that good to me. The situation has improved over the year, but scrapers (and syndicated sources) still outrank the original too often.
What I'm trying with one site is to ramp up every "we are the canonical source" signal I can muster, including authorship tagging, pubsubhubbub, delayed RSS, no more full RSS feeds, etc, etc. I'll let the forum know if it works.
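For anyone trying the same thing, here is a minimal sketch of two of those "canonical source" signals as they stood at the time. The URLs, file names and profile are placeholders of my own, not a guaranteed recipe, and the cross-domain canonical only helps if syndication partners agree to carry it:

```html
<!-- On the page you consider the one true copy (and, if the syndicating
     sites cooperate, in the head of their copies too), declare the
     canonical URL. example.com and the path are placeholders. -->
<link rel="canonical" href="http://www.example.com/columns/current.htm">

<!-- Authorship tagging as Google supported it then: tie the page to the
     writer's Google Profile with rel="author". -->
<a href="https://profiles.google.com/examplewriter" rel="author">Our Columnist</a>
```

Neither tag is a directive Google must obey; they are hints that feed the same canonical-source guesswork discussed above.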
|Remember that one month before the first Panda release Google rolled out their "Scraper" update [webmasterworld.com] At the time, I considered it an essential fix, and until Panda launched it seemed pretty darned good. |
Agree here! Jan/Feb were the best in ~9 months, then... down the drain.
If you write about things other people write about it's not impossible for Panda to get confused and think you copied it (identical) or plagiarized it (very similar).
The tragedy is this can happen to the writer who didn't look at other sites' pages and the writer who did.
It is more likely to happen when:
a) the page is short;
b) the subject matter lacks details;
c) the writing style is very basic; and/or,
d) somebody else wrote about the subject before you did.
Some sites still Pandalized have already made changes related to a) and b); it may be worth looking at c) and d).
Topics already covered or even beaten to death aren't of much interest to Google. Panda hates copied text and perhaps also copied topics.
I know very little about anything so my future on the web doesn't look great right now. I have no idea how I'm going to come up with lots of unique content and original analysis written in a distinct style.
Vince and Panda are each huge burdens but together can virtually put you in the gulag camp. You can do tons of work with very little if anything to show for it. For me Vince and Panda created a huge chilling effect on starting new projects.
A year ago everything looked so different. Then the web looked like my future; now it looks like my past. I used to have the confidence that if I did good, hard work, I'd get results. That confidence has eroded each month since Panda was introduced.
|SHOULD such usage be considered duplication when the providers WANT us to do it? |
The thing is, it's not really the providers that are the concern here, it's the users. In a situation like yours, why would Google (or any search engine) want to give equal or near-equal weight to multiple copies of the same information? I can tell you that when I'm searching, if I type in a string and see 10 or 20 results that are exactly the same thing residing on different domains, I am not a happy camper.
Someone needs to come up with a new model for syndication. I don't know what it is (and I am actually on both sides of the issue on various sites I own and/or run for clients), so I can see the issues both ways. And I can even understand where Google is coming from about not wanting to show 200 copies of the exact same thing (although they haven't figured out how to get it right yet either).
It's been a major problem for me that people steal my pages and then syndicate them. It's one of the reasons I didn't bother trying very hard to clean up the mess with DMCA's last spring, the problem seemed insurmountable.
But I finally wised up and stopped submitting DMCA complaints to site owners, which takes forever and is very unsatisfying. Now I use WebmasterTools DMCA dashboard and after a few iterations, I'm beginning to see some of my most infringed pages cleaning up nicely. Maybe that's why Sunday was looking up after the Panda update. But in any case, I'm putting in a couple hours a day filing with Google now until all of my pages come back clean (except for scrapers).
Netmeg: |why would Google want to give equal or near-equal weight to multiple copies of the same information? |
OK, say you wanted to know what is happening THIS coming weekend (OR in the next 2 months, or last week) in town A, which you are going to visit from out of town, OR in the nearby area. Once you GET there, of course, you pick up a free rag from the street racks that lists everything. But what would YOU search for on G beforehand? What if you wanted to keep up with what was happening in town A, where you visit frequently (or are originally from), while you are not there?
So you might type "townA event" or "townA news". Of course you probably have to include "November 22", "November 23", etc. in separate searches, then of course you would have to do separate searches for "townA party", "townA concert", etc. Oh, and don't forget townB, townC, etc. a couple miles away. Of course 75% of these entertainment businesses are not SEOs and don't have the time or money to hire one, yet don't they deserve to get the word out about their offerings, and don't you think people on the net want to know about them? But if no one compiles all the upcoming events for the coming week(s), how would anyone know to search to find that Joe'sChickenHut (who may not even have a website, only an online press release at an obscure PR company site or an ad in the local newspaper website) is having a 5th-anniversary free shindig, or a starting-out local indie band is performing free at the MetroStadium, or WellKnownGroup is having a concert after-party at LilMomsBar?
A site page which collects all these submitted press releases on a weekly basis for townA, adds comments and minor info missing from the original press releases, adds photos, etc., and which comes up under a simple search for "townA current events" or "townA news" would be extremely useful, no? Keeping an archive of them allows people to catch up with past events (when/where was the last time TheBandBand played in townA? How much were tickets last time? Wish I had a photo of them. Uh-oh, Old Joe died; I wonder if there is a picture/story about him when he was alive. Oh no, HIS own website has been shut down...). Many newspapers, in fact, do just that online as well. But does G recognize EVERY online newspaper or media outlet, or e-zine? Maybe they need a registry?
That is just one so-so example, I can think of many others, where the collected and ARRANGED SUM is worth FAR more than the otherwise hard to find parts.
Remember, not many AVERAGE SEARCH ENGINE USERS out there are a "search whiz" like you and me, or WANT to spend the time doing 50 searches. I'm not sure G realizes this either. Most over-40s haven't a clue how to search properly. If someone IS willing to collect and compile to make life simpler for them, why shouldn't they be found? There MAY even be a valid argument that collections of like information should be MORE IMPORTANT than the pieces. I agree: if someone searches on "Joe'sChickenHut 5th anniversary free shindig", by all means Joe's original press release should definitely come up #1; it may even have more/newer details too long for the compilation site.
How about weekly approved patent title/summaries in a particular niche? Who has time to search through every new patent online to categorize them or know all the new terms they may need to search under?
How about the weekly submitted bills from the congressional register dealing specifically with animal control? (well that might be searchable at the congress site... if you knew how or all the various terms to look for).
I know I personally find sites with collections of copies of old user manuals extremely valuable, since frequently the original manufacturers remove them from their own sites, or go out of business. If you don't collect them and save them they'll be gone before you know it.
Anyway, I think there are quite a few examples, and I'm not sure the Google Alg is really qualified to tell which are useful and which are not, and thus should not penalize for authorized copy of fragments creating a useful whole.
|I'm not sure the Google Alg is really qualified to tell which are useful and which are not, and thus should not penalize for authorized copy of fragments creating a useful whole. |
It's been clear in the various leaked quality rater documents that Google does acknowledge the potential value-add that various kinds of mash-ups can offer. Whether this is accurately accomplished via algorithm in every case or even most cases, is of course up for dispute.
If I were offering that kind of website, I would take care to include a little something more than just the reprints of the excerpts - as a kind of insurance policy, yes, but also as a kind of contextual orientation for the visitor.
|I would take care to include a little something more than just the reprints |
What happens when someone extensively quotes another online publication, but also gives prominent credit acknowledging that website and/or author immediately before or after the quote? Is there any indication that Google takes that into consideration, and thus would not penalize the site with the quote?
There's evidence that Google "wants to" credit the original source in the SERPs, but many times a more authoritative source who is quoting in full or syndicating (even with full acknowledgement) will still rank higher.
The fact that Google does not want to return two copies of the same content on page one is certainly understandable. A lower ranking for any particular page isn't the same as a penalty on the website, it just means it doesn't rank as well for that bit of content.
I think the public communications we've seen this year - including the "scraper update" in January, plus authorship mark-up, plus the scraper site reporting form - all add up to a confession that Google still struggles with identifying the "canonical source" for any specific bit of content.
tedster: agree 100%
In our case this is only one portion of a large site, but G has indicated that with Panda, relatively small misbehaving sections can affect the ranking of an entire site.
After studying the forums here and in a couple other places (Google, SEO--) it seems agreed (me as well) that a lot of people who think they have been pandalized aren't being so for the reason they think. I also think a lot of the fault is G's, at least I'm pretty sure it was in our case. We had dropped from pg 1 to pg 30+ for ALL our primary keywords which previously G had us 6-packed for. After a week of research I determined the reason and we are now on our way back up higher daily and up to pg 3-6 in only a week.
In our case we discovered G had indexed duplicate copies of some of our pages, but it wasn't our fault in all the cases I could find. In one they had indexed example.com/abc.htm AND example.com/abc.htm/ pointing to the same page, so of course they were identical. As far as I know the latter is not even a valid URL, and it certainly didn't come from our site links. I 301 redirected it in the .htaccess and resubmitted it, so it got removed.

In another case they were indexing a copy of our home page, which we use as a landing page just for AdWords, and which IS an exact copy of the home page (minus AdSense) but has had <META NAME="robots" content="noindex"> in it (Google's own documented preferred way of removing a page from the index) from the first day it was uploaded! They apparently picked it up from our sitemap, since it is not linked anywhere else except in AdWords searches, but I had to remove this through the Webmaster Tools page.

In most cases these were nearly impossible to find, since even when you do a site: search G literally HIDES them way at the end in the supplemental results, and I only managed to find most of them by sheer luck. They often don't even appear when you search for a full unique paragraph, and then only on a site: search! I'm a little afraid of these new G-generated titles being mistaken as duplicate pages by their own algo, as I have seen our same URL coming up twice in the same search, under both titles, on different pages. G REALLY, REALLY needs a duplicated-content report.
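For reference, the trailing-slash fix described above can be done with a single rewrite rule. This is a sketch assuming Apache with mod_rewrite enabled (the poster doesn't say which server they actually run):

```apache
# Collapse the spurious trailing-slash variant (example.com/abc.htm/)
# back to the real page with a 301, so only one URL stays in the index.
RewriteEngine On
RewriteRule ^(.+\.htm)/$ /$1 [R=301,L]
```

Placed in the site's .htaccess, any request for a .htm URL with a stray trailing slash is permanently redirected to the slash-less original.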
Anyway, my point is the syndication/copying issue may not be as big an issue as thought, and what G was referring to as "duplicating content" may in fact apply ONLY to "on the same site", which we CAN control (although G doesn't make it easy with errors like this). I DO believe that Panda has drastically tightened the penalties on internal copying, which is why we never saw any declines in Jan/Feb on our site (in fact we went up 35% then), but we (and likely others affected by accidental or deliberate internal duplication) have seen drastic declines with each reported Panda algo change. The difference being: in the past, if you had a duplicate page by accident, only that one page pretty much dropped out of sight. Now one or two key pages duplicated in the index can put your ENTIRE SITE off the SERPs, with no easy way to figure out which page did it! At least in my experience. If anyone can't figure out why they have been Pandalized to page 30+, I would recommend G's indexing errors as the first place to look, along with their own directories for accidentally miscopied duplicate files.
Also, for the sake of possibly helping someone else stymied by their own stupid mistakes: we also realized on another affected domain that when we removed expired information, we were routinely keeping the URL (which was much linked from other sites) and replacing its content with a 90% identical template stub, each pointing to the same primary info page... Uh-Duh! We weren't intentionally trying to spam/cheat, just trying to help other webmasters who had linked, to retain backlinks, and in general NOT thinking (they add up over time)! Now we 301 redirect all the expired pages in that area to one single "Expired.htm" page, to retain the backlinks while avoiding duplication.
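A sketch of that expired-page redirect, again assuming Apache; the paths are illustrative, not the poster's real structure:

```apache
# Send individual retired pages to the single catch-all page, keeping
# inbound links resolving with a 301 instead of a 404.
RedirectPermanent /events/old-listing.htm /Expired.htm

# Or retire a whole directory of expired stubs at once with mod_rewrite.
RewriteEngine On
RewriteRule ^expired-events/ /Expired.htm [R=301,L]
```

Either way, the hundreds of near-identical template stubs disappear from the index while the backlink equity consolidates onto one page.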
Good Luck all, will report if/when we get back to prior rankings
|What happens when someone extensively quotes another online publication |
Well, one of the things that happens is that the online publication worries that it has been Pandalized because of copyright infringements and spends a month filing DMCA complaints in the WT dashboard.
I'm not going to go deep into fair use here. I'll just say that if you aren't getting permission first, "extensive" quotation will always be seen as a copyright infringement by the law unless it's true editorial use, i.e., you're breaking it down line by line for expert critique or ridicule.
In the past couple days, I've had to deal with multiple instances of schools gathering my work into PDFs for course materials and then letting it get online, with sites built from nothing but spreading sections of my pages around a template, and TWO clowns who published eBooks with pages from my website, and were selling them through Amazon and Google Books.
If you're in the journalism business, you've probably had training or trained yourself on fair use. If you're not in the journalism business, don't copy from other people's websites and you won't have to worry about it.
As to duplicate content being an onsite or offsite issue, as you can guess from my little rant above, we have no duplicate content onsite, so if it's been the cause of our Panda penalty, it's from offsite. And we did see an uptick with the Panda update a few days ago, which may just be the result of cleaning up a chunk of the infringements.
As far as duplicate content is concerned, do people think that gathering small snippets of customer reviews from Amazon would result in a Panda slap?
Most of the reviews on my sites are routinely 1000 words or more but maybe 50-150 words is copied and pasted from Amazon customer testimonials.
I am trying to find a common theme among sites that were penalized but the only problem is there are some sites in my portfolio with the same characteristics that have taken a meteoric rise in the SERPS since October.
|I am trying to find a common theme among sites that were penalized |
The only thing I can identify as a way out of Panda hell, at least from all the reports of people that got out of it, is to fix every single error reported in G's WMT. Along the way it will turn you on to problems you never thought of.
I just started doing that this week. Since my original Panda (Feb.) flag I have fixed literally 10,000 problems that I thought caused my flag. After logging in and digging deeply into WMT, I found things that I never thought of. Things only a bot would see are the problem.
I've had numerous errors pop up in WMT recently, mostly as a result of actually trying to fix my sites.
Most of the time they are 404 errors due to merging content and deleting low-quality stuff. I'd probably take more of an interest in fixing those sorts of errors if I knew how to fix them.
I guess it is prudent to learn now.
|That is just one so-so example, I can think of many others, where the collected and ARRANGED SUM is worth FAR more than the otherwise hard to find parts. |
Are you sure this is what is causing your Panda problems? Because I have a number of sites like this in multiple niches and have had no Panda issues.
That's not really what I was talking about with regards to duplicate content; I was talking more syndicated articles, where the content was word-for-word (or close) exactly the same.
|I am trying to find a common theme among sites that were penalized |
I have searched too, but I didn't find things in common. My conclusion was it has to do with user satisfaction, or the user's first impression of the site, as has been discussed in previous threads.
|I am trying to find a common theme among sites that were penalized |
That is old Google, not new Google ~ Panda was created to prevent the discovery of common themes because Google knows all too well that if that were still possible, they'd be right back where they started, which is to say, certain websites or SEOers could game the algo and undeservedly benefit. Any "common theme" that anyone thinks they discover will be countered by a thousand examples where it does not apply. In PandaWorld what works for one website will not work for another; what does not work for one website will work perfectly for another ... it's like falling down the rabbit hole, nothing is as it seems.
Actually we have BOTH issues simultaneously in that section of our site: the writer is conglomerating a paragraph or two from 15-20 announcements from various other published and unpublished public-relations sources and then publishing the collection on a subdirectory of our site, AND then syndicating the whole collection to other sources, who use all or part of it for online and offline re-publication.
And NO, I'm no longer sure this is what was causing our largest issue. Since we have removed the other mentioned "(not-so) obvious" Google-duplicated pages (mentioned a few posts above), we are steadily re-surfacing daily. Progress report is looking up although our home page is so-far only half-way back in the index. I expect it will take time, as it took 7 months to drop to current levels. But pages which had dropped to pg 25+ in the SERPS are now up to pg 3-6 and still rising daily. I think it just takes time for the algo to reapply the lost PR re-iteratively across all the pages.
I'm currently of the opinion that onsite duplication is the primary issue in Panda, and it could be duplications that have been around for years (or newly PERCEIVED duplications, as in our case with G mistakenly re-indexing pages; see prior posts) which are only now being given far stronger leveraged weight. Previously, if you had a page accidentally copied to the wrong directory years ago, which was being picked up by your sitemap.xml and indexed by G, it only penalized THAT page in the rankings and you scarcely noticed it. Nowadays, it seems they've multiplied the effect, and as few as a handful of those on a large site can cause the entire site to be multiply-penalized across the board. My guess is they at least eliminate most or all PR pass-through from both such pages. Make it, by chance, the high-PR home page and one or two other high-PR pages, and suddenly you're leveraged into the dumpster.
My recommendation: if you are intentionally (or unintentionally, or stupidly, like us in one instance) generating redundant pages with only a word or two different on each to (intentionally or unintentionally) gain "content" and +1 PR juice each - STOP! Combine them if possible. Then go through your site with a fine-toothed comb and look for any pages 90% or more duplicated in the same or any other directory on the SAME DOMAIN, before worrying about other-domain syndication. Once you find them, remove or noindex the copies and re-submit the highest-ranked ones using the crawl-as-Googlebot/URL-submission and URL-removal features of Webmaster Tools for the fastest turnaround. Then be patient; it could take a couple weeks.
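The fine-toothed-comb step can be partly automated. Below is a minimal sketch in Python (my own illustration, not anything Google publishes) that flags pairs of pages whose text is at least 90% similar; the page names, texts and the 90% threshold are all made-up examples:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how similar two page texts are."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def find_near_duplicates(pages: dict[str, str], threshold: float = 0.9):
    """Return pairs of page names whose text is >= threshold similar."""
    names = sorted(pages)
    dupes = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if similarity(pages[x], pages[y]) >= threshold:
                dupes.append((x, y))
    return dupes

# Illustrative site: one accidental copy, one genuinely distinct page.
pages = {
    "/abc.htm": "Weekly column: events, press releases and photos for Town A.",
    "/backup/abc.htm": "Weekly column: events, press releases and photos for Town A.",
    "/about.htm": "A completely different page describing the site and its writer.",
}
print(find_near_duplicates(pages))  # -> [('/abc.htm', '/backup/abc.htm')]
```

Pairwise comparison is O(n^2), so for a large site you'd fetch the rendered text of each URL and compare hashes or shingles first; this sketch just shows the idea of surfacing the 90%+ pairs to review by hand.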
|That is old Google, not new Google ~ Panda was created to prevent the discovery of common themes because Google knows all too well that if that were still possible, they'd be right back where they started, which is to say, certain websites or SEOers could game the algo and undeservedly benefit. |
I'm not sure that is correct within a specific niche. Armed with the knowledge that Panda is looking for certain aspects of user experience, it is possible to see which sites have benefited from Panda in your niche and what the things are that seem to have helped: perhaps video content, or really well-structured, well-written, illustrated articles with links to good citations, etc.
Another thing to consider is the non-Panda parts of the algo. Backlinks and backlink anchor text are still incredibly important and, depending on how competitive your niche is, it may be possible to outweigh the Panda effects with really good offsite promotion.
I'm trying to do a mix of both.
Let me be clear in saying that, as always, it can be very helpful to do the things you suggest, plus improve the quality of overall writing, aim for faster page downloads, etc. But I see all that as good general advice for everyone, as opposed to finding a "common theme", which to me implies a kind of recipe for success. At this point I tell people to make their site to the best of their ability, as opposed to "knocking it out" in some template fashion, and keep making genuine improvements as they seem useful. That may or may not help with Google ~ with Panda nothing is assured ~ but it certainly will help with the site visitors, and that in and of itself is enough of a reason.
|Perhaps video content or really well structured and written illustrated articles with links to good citations etc...Backlinks, and backlink anchor text are still incredibly important |
|Agree here! Jan/Feb were the best in ~9 months, then... down the drain. |
Ditto. In January and February it appeared we were finally climbing out of the hole that had been dug for us in their last update, and we were overjoyed. Then wham... into the toilet went our ecommerce site. Meanwhile our competitors are reaping the benefits of their templated multiple websites, and we're getting trashed in the results by Amazon.
|I'm currently of the opinion, that onsite duplication is the primary issue in Panda |
I don't really think that's the case, but I hope your strategy works.