| This 98 message thread spans 4 pages. |
|Panda Loss Because of Scraped Content?|
My site was hit by Panda in April 2011. The site was created in 1999 - and all content is original, written by me.
The site ranked very well until April 2011. My question is - after improving the site for 18 months, I have seen no recovery. Zilch. As a matter of fact - after all the improvements - the site got hit again recently by Panda 20 on September 28th.
So I continued digging around and trying to figure out the issue since Panda is all about duplicate content and low-quality content.
This is what I found - and what I'm wondering could be the issue:
1. I started checking content in Google for every page on my site. The whole copy and paste with quotations - using a unique sentence on each page and then doing a search.
I've done around 45 pages and the results are mind blowing. My content has been copied SO many times - it's incredible. Especially my really old content - like anything written 1999-2006. But not exclusively.
Some posts/articles have been copied 20+ times.
My site does NOT rank at all or ranks at the bottom for my OWN content when I do these searches.
I submit DMCAs on everything I find and I am having some success. But I still have 400+ pages left to check.
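For anyone repeating the quoted-sentence check across hundreds of pages, a minimal Python sketch of the query-building step follows. It only builds the exact-match search string and a URL to paste into a browser; it does not fetch results. The function names, the word-count thresholds, and the "longest sentence is the most distinctive" heuristic are assumptions made for illustration, not anything described in the thread.

```python
import re
from urllib.parse import quote_plus

def pick_unique_sentence(text, min_words=8, max_words=20):
    """Return the longest sentence in text, trimmed to max_words, or None.

    Assumption: a long sentence is more likely to be distinctive than a short one.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = [s for s in sentences if len(s.split()) >= min_words]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: len(s.split()))
    words = best.split()[:max_words]
    return " ".join(words).rstrip(".!?")

def exact_match_query(sentence):
    """Wrap the sentence in double quotes so the search is an exact-phrase match."""
    return '"' + sentence.replace('"', "") + '"'

def search_url(sentence):
    """Build a search URL for the quoted sentence (open it manually in a browser)."""
    return "https://www.google.com/search?q=" + quote_plus(exact_match_query(sentence))

page_text = (
    "Short intro. This is a considerably longer sentence that nobody else "
    "would write by chance. The end."
)
sentence = pick_unique_sentence(page_text)
print(exact_match_query(sentence))
print(search_url(sentence))
```

Running the same URL pattern against another search engine gives the side-by-side comparison several posters recommend.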
My question: IS THIS SOMETHING PANDA WOULD HIT MY SITE FOR?
I honestly don't know - because ultimately it's MY content - I didn't copy it. And I don't know if this falls under the Panda penalty's actions.
My second question: IF IT IS CAUSING A PANDA ISSUE - CAN MY SITE COME BACK FROM IT?
Thank you in advance - I appreciate your time.
|You might like to read the two posts Lisa Barone put together from Pubcon. |
Ok, an interesting read basically meaning that for Google to try and attribute the original we have to have Google+ accounts whereas at the moment the other search engines are getting it correct without that necessity.
Therefore is the question now to be for those affected "will Google recognise me as the originator if I use +?"
|when a site is hit with Panda, its PageRank, the web page's standing in the algo, is diminished. With the diminution comes its inability to rank for phrases. If a site is unable to rank for phrases, it follows that other pages will come in to take its place. |
Yep, I understand what you are writing however just HOW can the originator be replaced by a scraper if that page is an authority?
It's nonsensical, it's like you painting the Mona Lisa and hanging it on a wall, then I make a copy of it and hang it on my wall and for whatever reason Google's determined my wall is better than your wall therefore your painting is no longer the original.
No matter that the scraper has used my coding, my text, my imaging, my layout in precisely the same format 7 years after the original went live and all of a sudden they are the authority site even though they have absolutely nothing else different on their site for ALL my work except a different contact address and domain name.
If that beats my internal PageRank then there is more than a little thing seriously broken at Google.
I think Matt once said that Panda has nothing to do with duplicated content
Okay, this is a hot discussion. I think first and foremost everyone needs to realize one thing.
Google hasn't sent anyone a notification that their site has been affected by *insert various algo updates here*.
In other words, it's all speculation about whether your site was hit by Panda, Penguin, ATF update, EMD, Freshness Algo or any one of the 10,000 monthly algo updates that happen.
So given that we don't know EXACTLY the algo, we can be assured that it's one of them. Why? Because the other search engines (I think) are getting the proper authority or ownership of the content. Or at least better than GooG.
So, to summarize, the kiss of death is to make an assumption that it's Panda etc. We should start talking about the how's and why's of evaporated organic traffic.
The reason I mention this is because before people go blowing up a site, there must first be a loss of ranking in Google. My personal 2012 guide book is to see if one of these black holes out there sucked up the ownership of my own content. If that didn't happen, then perhaps I look into what my site might be doing wrong.
These are asinine discussions ultimately. I for one assumed Panda. Actually twice. Once? Well jeez it was a penalty (manual) where I swore that it happened during a Panda update. Wrong and about 12 months of misery that I had because of misdiagnosis. My second nightmare has turned out to be because a site actually took ownership of my site/content and now I'm relegated to "omitted results" only in GooG. Again, I thought Panda. Is it really Panda?
The point is nobody here, no experts or self proclaimed SEO gods can really say for sure what you're suffering from. Part of diagnosing is determining whether this is a GooG only issue. That might help you decide what or whether to pursue.
The fact that I will stand by is that there is an algo issue. This is a biggie and in my opinion, it goes far beyond Panda. It's at the core of what we do. It's about who owns what and that your creations can be owned by somebody else and they can cash in on your own creations and writings. It's worse when the 90% market share algo is getting this wrong. It's a bad situation made much worse for the average webmaster. It's called anarchy. What I see is this as a GooG exclusive. Whatever happened with all the tweaks is that more than ever copied content is outranking or simply replacing original works. Yes it's always happened. It however is predominant in GooG and not the other search engines. This isn't the way it's always been. It's the way it is now in 2012.
The only remedy here is to first be aware. Then second, fill out that damn GooG scraper outranking form. Third is to voice what you are seeing.
This is a full circle mess. People are now suggesting that your site lost authority, and therefore the scrapers outrank you. That's to suggest that you are in the wrong and that you have something to fix. It's to assume that the algo isn't at fault here. That's a damn big assumption to make. Dare I say ignorant. Let's at least consider the fact that an authority site/high PR site can take a lesser site's content, post it (or all of it) and take the rankings from the originator. It appears as simple as this. It's the only way I can explain my situation in dealing with a black hole that took away ownership from me on my own site. One example and perhaps unique, but a real example of just how bad the algo can work right now.
No one-person webmaster operation is going to spend their life submitting DMCA notices and fighting scrapers. That's endless, tiring and futile. If the other search engine is doing something better on this, then the fact is they are more webmaster friendly. No other way to say it. When an algo can incorrectly let other people widely outrank original content, you have a serious and horrible situation for a lot of us. Almost a closed shop situation.
|Right now it appears that if a good PR site decides to copy, they will pretty much outrank any smaller, lesser PR site. |
Yes. If you go back to the original Panda threads from Feb/March 2011, you can see this being discussed hotly - and I recommend people re-read those threads - so many people were sharing data that it was possible to deduce patterns (these days those who get hit by Panda only have a few anecdotes to go by), some of which were on the money, e.g. we spotted the above the fold issue way before G admitted they were targeting ads, and some of us were arguing that people should switch off their RSS feeds (and if I recall correctly, Tedster tested this out with one of his clients a few months later and they reported a partial recovery).
It's also possible to disable hotlinking in your cPanel - do it, it's the only way to deal with images being stolen by Pinterest.
What I think happened, was that G was so focused on taking down the article directories like Ezinearticles (who allowed syndication) they felt that anyone syndicating their stuff would be a low quality site and therefore it would be "safe" to use the presence of duplicates as a signal that the originator was engaged in syndication. I guess they didn't think about the scraping issue - that only surfaced after they released Panda, and though they've had several gos at trying to fix it, they have been unsuccessful.
It happens with Penguin hit sites also. The Panda/Penguin/EMD penalty comes first with the direct result that you lose Authority. Only now can scrapers outrank you.
Confusingly, one of the reasons a site will be hit by Panda can be because of Duplicate Content. However, this duplicate content penalty can come from within your own site or from external sites, but those external sites will have better Authority than you. We're generally not talking about scraper sites when listing Duplicate content as a cause of a Panda penalty.
|Let's at least consider the fact that an authority site/high PR site can take a lesser site's content, post it (or all of it) and take the rankings from the originator |
Apparently this can be done, as was mentioned earlier in this thread. It was also discussed here on Webmaster World [webmasterworld.com...]
I hold out hope that it does get sorted out. It's never boring webmastering! At times it seems futile but I'm confident that this will get addressed. I'm just not sure if there are enough people out there voicing the issue of losing out to scrapers.
I think nobody so much cares about scraping until it's at the point that you lose rankings and traffic because of it. It's probably never going to be perfect. I accept that.
I'm pretty passionate about this subject now. I still am baffled that on this forum, so few people are aware of the Google doc which you can fill out to communicate searches where a scraped page outranks an original page. How is it that it's not a priority situation right now? Either I am in a minority with losing rankings to a scrape job, or I'm actually more enlightened than most.
As webmasters we need to use whatever means there are to communicate that the algo is having issues. That doc is one such (and very rare) communication tool where we can collectively say there is a problem here. Enough examples submitted and there might be some extra effort or acknowledgement of possible issues. If nobody says anything, then there is no issue. I can also accept that I'm one out of 1,000,000 webmasters seeing my site tank in GooG but not in the other search engine. I just hope that more people consider this option first (investigate) before deciding to chase the pot of gold at the end of the Panda rainbow. Before blowing up your site, at least do a GooG vs. the other search engine comparison to see who is ranking for your own content. It's simple and may save you a LOT of stress and frustration.
Added: I simply don't buy into this theory that my site had/has an issue and that the issue caused loss of authority, and thus the scraper outranks me. Let me hear that from GooG officially that their algo wants to work this way and that's a desired outcome. Sounds pretty f'up to me if you create something where scammers can flourish. Wow, that's scary and no, I don't believe that they would knowingly accept that as being part of their outcome. How could that not be considered a failure? Um, GooG QA team says site A is bad, we are Pandalizing them, but seeing scraper site B, C and D take those rankings is an appropriate outcome? And so if site A can't figure out what's wrong, then having the "theft" is "working as planned"? I can barely get my head around this thought. Which comes first the chicken or the egg. Yeah.
|So in other words, people here are suggesting that if your site tanks, simply create a scrape of your own site, launch it, then enjoy what your original site enjoyed? |
OMG - finally a solution to my problem :-)
It almost blew my mind when I read that.... because it almost seems plausible. Ridiculous.... and yet - plausible.
|The toolbar is something used by Google in a punitive manner to manipulate web publisher behavior. |
|, re-write your article/post and have a small mourning moment for the beautiful, personal, well thought out content you originally wrote that no longer belongs to you |
This. Get a thesaurus and rewrite all your content. Make it bigger and better than the copy that was stolen from your sites. Let scraper thieves have your 10-year-old castoffs, and watch with delight as your sites ascend the ranks once again. It takes a long time, but it's worth doing even for content that has not been scraped.
|Get a thesaurus and rewrite all your content. |
I have been constantly updating my sites since 1994 and all the information is evergreen, it's not April 1st is it?
Besides, ALL the other search engines provide better results and do not seem to have this issue...as MrSavage suggested, don't go blowing-up your site(s) unnecessarily because it's probably not you who is wrong.
|Have you added your Author Profile in Google? |
Thanks. I waded through that. Tried it and will see.
One serious flaw in their author verification is the email method of verification.
No web site owner, with brains, would ever put an email link such as: email@example.com on their site. Hasn't Google ever heard of spam harvesters? Why not a verification code in the header? Oh! We've already done that.
We were hit big in most Panda versions, but not, as far as I can tell, Penguin at all. We dropped from all page 1 competitive keywords to 300+.
Well, I recognized the effect being discussed here a couple weeks ago and even mentioned aspects of it along the way on here, but in the meantime I have been running experiments. In multiple cases now I have posted new articles and have submitted them via various methods: WMT get as googlebot/submit, pubsubhubbub, Sitemap submit, etc.
In each test situation I was able to get our copy IN THE G INDEX, ALONE, and confirmed as the SOLE result (based upon excerpt search). Other copiers who submitted theirs as much as a day or even two later STILL eventually supplanted us as the top source and relegated us to the supplemental results at the bottom. There was in one case even one idiot copier who posted later who had 7 copies indexed with different tags (lack of canonicalization) whose all 7 duplicate pages on his site were ranking higher!
I think the term "loss of authority" is very accurate. Our PR hasn't changed and in most cases is higher than that of the pages we're listed below. I suspect it could be if your content is copied a page or two each by too many DIFFERENT sources, as many of ours are, YOU start to appear to be the copier, compiling content from many different sources.
The other aspect that you apparently have missed in the articles mentioned, is that when you are suddenly determined NOT to be the "authority"/original source, that the PR from links to YOUR copy (and all the other copiers' pages presumably) gets transferred to that of the recognized authority copy! Therefore, if this is true, not only do you lose credit for the page itself, but you leak additional PR from ALL the pages on your site that you yourself link to this content. Basically you now link to the copier. These pages on your site effectively become toxic hotspots with leeches sucking away your lifeblood. Get enough of them and you die, while the top copier, like The Blob, gets exponentially bigger and better PR with each copied page they munch on! Sounds to me like an out of control algo aspect with insufficient dampening factors. Eventually G, by virtue of a PR10, will be the only one ranking for ALL the copied content on the net (which I'm sure is their intent anyway.)
My current desperate experiment is to simply de-index (rel=noindex)(JUST on Googlebot) all the pages that I can find that we rank below others for, to see if we can get out from under the effect and reverse our being seen as a copier.
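A note on mechanics for anyone trying the same experiment: as far as I know there is no rel=noindex attribute. The documented ways to keep a page out of one crawler's index are a robots meta tag named for that crawler, or an X-Robots-Tag response header prefixed with the crawler name. Below is a minimal Python sketch of adding the Googlebot-only tag to selected pages. The helper names and the idea of keeping a set of "outranked" paths are assumptions for illustration; how the tag actually gets into your templates depends on your CMS.

```python
def googlebot_noindex_meta():
    """Meta tag addressed to Googlebot only; other crawlers ignore it."""
    return '<meta name="googlebot" content="noindex">'

def googlebot_noindex_header():
    """Equivalent HTTP response header as a (name, value) pair."""
    return ("X-Robots-Tag", "googlebot: noindex")

def add_noindex(head_html, path, outranked_paths):
    """Append the Googlebot-only tag to head_html when path is in outranked_paths.

    outranked_paths is a hypothetical, hand-maintained set of URL paths
    where a copy currently outranks the original.
    """
    if path in outranked_paths:
        return head_html + "\n" + googlebot_noindex_meta()
    return head_html

outranked = {"/articles/old-post.html"}
print(add_noindex("<title>Old post</title>", "/articles/old-post.html", outranked))
print(add_noindex("<title>New post</title>", "/articles/new-post.html", outranked))
```

The page must stay crawlable for the tag to be seen, so it should not also be blocked in robots.txt.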
It almost seems hopeless for the original author? I feel like I should stop writing any kind of content for the time being, because I am only spoon feeding the thieves.
This doesn't even seem possible? Why would Google create an algorithm that purposely punishes original authors that are following their guidelines?
It's almost like instead of improving their algorithm, they've just tweaked it enough to push their own agenda (making more money) with CYA labels like Panda and Penguin.
Definitely change those images to something really HUGE and let them keep linking...
change it to say something like:
IMAGES AND CONTENT STOLEN FROM WWW.EXAMPLE.COM
Then let google know about it, blast those sites with DMCA violations. Having your images still hot linked and stating that they are yours will surely give google the information needed to wipe the walls with them.
Sadly, Google has a tendency to like freshly found content over anything else... not fresh as in new, but fresh as in they just found it.
The sad part is that you will likely need to rewrite your content and put them on a different page, leave the original so that Google can verify it's yours, I wouldn't touch the original page until Google has had time to assess the situation.
When you rewrite the content, embed your url somewhere in the middle of it. Most scrapers don't take the time to look for that ... they just scrape and go.
If it were me I'd lose the RSS feed or somehow find a way to embed your url into the middle of it as well. We ditched ours a couple of years ago because of a similar situation.
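The "embed your url in the middle" idea can be automated at publish time. Here is a minimal Python sketch, assuming the article is already split into a list of paragraph strings; the function name, the wording of the attribution note, and the example URL are all made up for illustration.

```python
def embed_source_link(paragraphs, url, site_name):
    """Return a copy of paragraphs with an attribution link appended to the middle one."""
    if not paragraphs:
        return []
    note = f'This article first appeared on <a href="{url}">{site_name}</a>.'
    middle = len(paragraphs) // 2
    result = list(paragraphs)  # copy so the caller's list is left untouched
    result[middle] = result[middle] + " " + note
    return result

article = ["Intro paragraph.", "Main point.", "Closing paragraph."]
for paragraph in embed_source_link(article, "https://www.example.com/post", "Example"):
    print(paragraph)
```

A scraper that copies the markup verbatim then carries a link back to the original, which at least gives a search engine one more hint about the source. It will not stop a scraper that strips links.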
|So - what you're telling me - is Panda hit me because my site lost authority? |
No. Panda might be the reason for your loss of authority. Of course no one other than Google's algos knows how "Authority" is arrived at. One traditional element is "links". So if the algorithmic update looked at links differently or devalued certain kinds of links, it might result in loss of authority. But again, don't jump on me saying that someone is suggesting links were the cause of loss of authority and hence Pandalization. I am just saying it is just one element (and not necessarily the only element) used in arriving at "Authority". Well, this is something beyond the visible PR. I know examples of sites that still have the same good visible PR, but have lost authority since Panda. It could also be that with these algorithmic updates, several new elements are now contributing to "Authority" beyond just links. Google will definitely not let anyone know what all elements contribute to "Authority".
|This is a full circle mess. People are now suggesting that your site lost authority, and therefore the scrapers outrank you. That's to suggest that you are in the wrong and that you have something to fix. It's to assume that the algo isn't at fault here. That's a damn big assumption to make. |
No, no, no. I never suggested that Frost_Angel or the webmaster of any other site affected by algo changes is in the wrong. It is a completely wrong interpretation. What I am saying is scrapers outranking your site for your content is a symptom of "loss of authority". So pls. don't create a mess.
For simplicity, I visualize Authority as something like this.
Total Site Authority = Link Score + UE Score + Brand Score + ....+...+...
I use the words "Score" and "Authority" interchangeably in the above formula. It might even be that Google is still using only the Link score but computing it differently. Or they might even be assigning different weights to those scores.
As we all know Google was using links in determining PR in the early days. They could now be using various other scores in determining what I now call "Authority". You could call it "internal PR" or whatever is more convenient for you. I believe there is no universally agreed dictionary on this. But what is more important is that links or just good content might not help. One needs to market the brand, improve UE and do all things good for Google in the name of users ;)
I've this problem too. People who copy our content and end up ranking higher than us.
I'm in the process of submitting DMCAs and checking 2000 pages for scraped content. We have been ranking for 7+ years on our original content and since last year our competitors are ranking higher than us.
To make things worse, we are an ecommerce site and we write the descriptions of the products we sell painstakingly. To end up having our text stolen by our competitors and having their content rank higher than ours is really rubbing salt in the wound.
Don't understand why Google can't compare a new page against their existing index for duplicate content before ranking them. "Don't be evil" is really a joke.
Had a 10 year old site that was hit with Panda on the first international rollout. All "original" content - i.e. stuff that had been written specifically for the site. Most of the content ranked pretty well and was scraped to hell.
The key part is in bold there. "Original" doesn't mean "good", and it's that quality angle that Panda is all about.
I left the site for the better part of a year to see if anything would change (it didn't). Early January I worked on a fix (removed about 10 articles which I deemed to be of a low quality and didn't really deserve the rankings they previously had). Site recovered fully during the next refresh and has been fine since.
Duplicate content is a standard go-to solution for SEOs for a lot of problems, and certainly may be a factor in Panda.
But what could also be quite likely with the OP's situation is that there is absolutely nothing "wrong" with the site in terms of bad SEO stuff (dup content, canonicalisation, keyword stuffing, etc etc). I.e. nothing that "triggered" a penalty.
It could just be that Google has shifted the goalposts in terms of what they deem to be of a good quality and regardless of "authority" or "white hat SEO" of the OP's site, it is still lacking on that front.
The real big disruption that Panda caused is that even if you've been purely "white hat" over the years, junk content (even well-written junk content) that's purely and blatantly for SEO purposes can ruin you.
Time to re-assess what constitutes "quality content" IMO, and for a Panda solution, you need to be brutal about what content you keep and what you purge.
To put it into context, I'd label some (the minority) of guest posts on SEL as being of a low quality. They may be well written, sometimes by seasoned professionals, but they can be thinly veiled SEO / marketing efforts at times.
On the other hand, Barry over at SER does pretty succinct short posts - sometimes less than 50 words.
"Traditional" SEO would probably say that the more lengthy articles are better - it's something we've all believed and advocated. But in a Panda world, quality outranks (or at least, can "override") length, source, authority and so on.
PS - I'm not saying dup content isn't an issue BTW; just that it might be less of a cause of problems than people generally imagine.
Hey Marketing Guy, So how do you think Google is measuring quality? How do the algos know that an article is of good or bad quality?
Though it might be your perception that guest posts on SEL, though well written by seasoned professionals, are of low quality because they are thinly veiled SEO / marketing efforts, how do you think google is determining them as bad for users and why? Yours, if I am not wrong, is like saying anything done for marketing is bad though I believe it is the opposite of what you say i.e. you need to put in a lot of marketing effort to succeed in a Panda world.
[edited by: indyank at 10:05 am (utc) on Nov 21, 2012]
Marketing Guy: "removed about 10 articles which I deemed to be of a low quality"
Here is the key question of the day... 10 articles out of HOW MANY total?
I'm not saying marketing (as an industry) is bad or even writing content for marketing purposes is bad. Think more in terms of those articles you've read and thought, "hang on - has that actually said something useful?". I.e. articles that are just thinly veiled pitches.
Re: measuring quality. Can but speculate, but I don't think it would be too tough for Google to pay thousands of quality raters over the years to produce data that could be folded into search results. Less about "quality" as we may think of it - perhaps "relevance" is a better term to use? I speculated as much earlier in the year (http://www.fusednation.com/search-engines/google/google-panda-getting-your-rankings-down/), although that article is *very* speculative and still has stuff in it that has since been disproven (so pinch of salt if you read it).
10 out of 100ish, so about 10%.
To elaborate on what I was saying above, the articles that were removed were thinly veiled SEO efforts, compared to the rest which (while, not particularly well written), contained well researched, useful information.
Working on the theory that Google at some point reviewed my site using quality raters, the articles removed were (in the context of competing results) slightly relevant / useless, whereas the rest of my site may be considered useful / relevant for its targeted terms.
I think that right there is the important distinction to make. Marketing intention, etc aside, if you want to rank for "webmaster forum" you need to be an actual webmaster forum, and not just have an article (however well written) on the topic of "webmaster forum".
Going off topic slightly, I think the integration of quality rater information is why we've seen brands getting more weight over the years. It's less about Google favouring corporations and simply just normal users favouring brands when rating SERPs.
How is Google measuring quality?
Who says it's Google doing the measuring? I think we all might change the way we approach websites if we assumed one person out of one hundred to visit our site is actually submitting feedback directly to Google that impacted our rankings. Just a thought. :)
|10 out of 100ish, so about 10%. |
What was your traffic like before and after recovery?
March 2011 (before Panda) - 104k pageviews
May 2011 (after) - 35k pageviews
March 2012 (after fix) - 96.5k pageviews
Figures are for first full month after event. Stats have fluctuated with some Panda updates since, but not by much. Past 4 month average has been around 70k, but that's normal for the time of year.
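To put those three monthly figures in proportion, here is a small Python calculation. It uses only the pageview numbers quoted above (rounded to the nearest 500, as posted); the variable and function names are just for illustration.

```python
def pct_change(before, after):
    """Percentage change from before to after (negative means a drop)."""
    return (after - before) / before * 100

before_panda = 104_000  # March 2011
after_panda = 35_000    # May 2011
after_fix = 96_500      # March 2012

drop = pct_change(before_panda, after_panda)
recovered_share = after_fix / before_panda * 100

print(f"Change after Panda: {drop:.1f}%")                                     # -66.3%
print(f"Pageviews after fix vs. before Panda: {recovered_share:.1f}%")        # 92.8%
```

So the hit removed roughly two thirds of pageviews, and the first full month after the fix was back to a little under 93% of the pre-Panda level.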
Most rankings returned or are stronger. Some didn't recover but I kept the content as I thought it was decent. New content resulted in some of the new traffic.
I should also add, while some of the rankings didn't return specifically - i.e. a previous top 5 is now second page, most of the long tail traffic the page brought in did return fully.
For example, one word has 17k+ variations over a 12 month period. The traffic in total returned to about 80% of its previous level (the other 20% can be attributed to a single ranking - the main use of the word).
I'd theorise my 10 month(ish) absence from SERPs for this site gave competing sites some time to advance, so when the site recovered, competition was tougher, hence the lower ranking for the competitive term, but same rankings for long tail terms.
|I feel like I should stop writing any kind of content for the time being, because I am only spoon feeding the thieves. |
The trouble is, plain text written content is so easy to scrape. A lot of people talk about big brands getting a free pass, but they have the resources to create types of content that are far more difficult to copy: widgets, tools, and games that work with server-side technologies. Scrapers won't bother to recreate these, they'll just leave them out or put up versions that don't work.
Brands are probably doing more pro-active blocking of scraper bots because they can afford to employ experts, and they benefit from offline marketing efforts that lead to online links. But these head starts aren't anything that smaller outfits can't emulate in some way.
Look what the scrapers do now. This text is above the contact form on one of the sites scraping the whole internet:
All our content is for informational use only. We can use information found publicly on websites due to the Fair-Use clause of the Copyright Act of 1976, 17 U.S.C. § 107.
So please refrain from contacting us with threatening messages regarding copyright infringement, because no such infringement is taking place. We are providing an informative website about the ranking, meta-data, and publicly available records for websites. Information about your site is considered "public-domain" according to the laws of the United States.
|A lot of people talk about big brands getting a free pass, but they have the resources to create types of content that are far more difficult to copy: widgets, tools, and games that work with server-side technologies. |
That's a good point. Some big brands have a lot of widgets on their sites and that makes it harder for their content to be scraped and if some of their content is scraped, with the authority that the sites have, it may not affect them a lot.
[edited by: gouri at 3:14 pm (utc) on Nov 21, 2012]
I don't hear anyone talking about how to stop the content thieves. Yes, content scrapers are the scum of the web, but there are ways to stop them. First, you should <not> deliver your entire article through RSS, only a portion of it. This will encourage your readers to visit your site anyway and take away the easiest avenue for people to steal your content.
If you are still having trouble with people stealing your content, then you should try one of the services specifically meant to stop content scrapers... <snip>
You should absolutely fight back, don't let the thieves win.
Mod's note: My adding "not" to the above post to fix an apparent typo doesn't suggest that I necessarily endorse this approach. I can see how partial pings or fat pings can each create their own problems.
[edited by: Robert_Charlton at 11:55 pm (utc) on Nov 21, 2012]
[edit reason] added "not" and removed specifics [/edit]
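For reference, the partial-feed suggestion in the post above amounts to publishing a truncated summary plus a link instead of the full text. A minimal Python sketch follows; the function name, the 60-word default, and the "Read the full article" wording are assumptions for illustration, and most blog platforms already expose this as a "summary" feed setting.

```python
def feed_excerpt(article_text, url, max_words=60):
    """Return the first max_words words of the article plus a link to the full version."""
    words = article_text.split()
    if len(words) <= max_words:
        summary = " ".join(words)
    else:
        summary = " ".join(words[:max_words]) + " ..."
    return f"{summary} Read the full article at {url}"

print(feed_excerpt("one two three four", "https://www.example.com/a", max_words=2))
# one two ... Read the full article at https://www.example.com/a
```

As the mod's note says, this is a trade-off: a partial feed gives scrapers less to take, but it also gives feed readers and crawlers less to work with.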
> Frost_Angel - re your "they are using images from my site and using MY bandwidth to run them!"
Search for "image hotlinking" and you'll find tips on blocking it. Add your CMS name (e.g., WordPress, Joomla, etc.) to the search for tips which apply to your tools.
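Most hotlink blocking, whether done through cPanel, a server rewrite rule, or a CMS plugin, comes down to the same test: look at the Referer header on an image request and refuse it when the referring page is on someone else's domain. Here is a minimal Python sketch of that decision only. The function name and the choice to let requests with no Referer through are assumptions; real setups usually apply the rule in the web server configuration rather than in application code.

```python
from urllib.parse import urlparse

def is_hotlink(referer, allowed_hosts):
    """Return True when the Referer header names a host outside allowed_hosts."""
    if not referer:
        # Many browsers and privacy tools omit the header; let those requests through.
        return False
    host = urlparse(referer).hostname
    if host is None:
        # Header present but not a parseable URL: treat as a hotlink (a judgment call).
        return True
    return host.lower() not in allowed_hosts

my_hosts = {"www.example.com", "example.com"}
print(is_hotlink("https://www.example.com/post", my_hosts))   # False
print(is_hotlink("https://scraper.example.net/x", my_hosts))  # True
print(is_hotlink(None, my_hosts))                             # False
```

When the check returns True, the server can return a 403 or serve a replacement image instead of the requested one.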
Thank you - doing that right now.
At least that will be something preventative that I can do.
A service to help would be great. Will look into it - but for a one-person show like me, that's been hurt so bad income-wise - "services" become unaffordable luxuries. No matter how bad you need them.
Yep, I'm seeing this exact same thing with my site.
I just checked one of my articles in google using the " " method
I find 3 pages of my copied content, all directly copied from my site, some using my images as well.
I have to click the "omitted results" to see me right at the end in last place and I'm the actual author of the work!
How many years has Google been working on their algorithm? In my opinion they are absolutely clueless.
[edited by: engine at 12:38 pm (utc) on Nov 29, 2012]