|Google, Panda and scraped content--what are you doing?|
One constant refrain in the complaint threads about Panda is that scraped content is outranking the original, and Google has no way of determining which is the original (or doesn't care).
My question is: what are you doing about it, if you're doing anything? Are you re-writing your content? Filing DMCA complaints? Doing nothing?
Where I have scraped content, I'm re-writing, as I don't think that complaining to Google will do anything. This is especially true since many of my scraped pages have been so completely trashed by Google that they don't show up when doing a search in quotes for the scraped content. The scrapers do, though.
What I am doing is taking any content that I've spent time researching and writing and doing bulk submissions to the US Copyright Office. I'll put 25 to 30 pages in one submission at a cost of $35. I'll get a copyright number as each submission is processed.
I figure Google may take infringements on registered copyrighted material more seriously. Also, with the registered material, I can threaten lawsuits with punitive as well as compensatory damages for any infractions. This can't be done if the copyright notice is just slapped on after the page is written but never registered. Also, if I ever get 100% fed up with Google and do a total robots.txt block for G, maybe I can sue them for including copyrighted material in their index. ;)
How about you?
|Also, with the registered material, I can threaten lawsuits with punitive as well as compensatory damages for any infractions. |
And what can you do when the thief is located in Russia, Indonesia or Vietnam? A lot of my content has been copied by people in those countries.
|And what can you do when the thief is located in Russia, Indonesia or Vietnam? A lot of my content has been copied by people in those countries. |
I'm not particularly concerned about foreign websites, as they're not as much a threat as domestic. I've had my writings scraped by sites whose owners I do business with. They didn't know their webmaster was ripping off content.
Rewriting it to make it better is fine, but rewriting it just because it has been scraped is not a good solution, as it can be scraped again. Will you be rewriting it every time it gets copied?
Moreover, by rewriting the page you lose the original content to those scrapers, and what is the guarantee that Google will not consider you the scraper, since you now appear to be rewriting the content?
But I do agree that this is ridiculous and frustrating.
I am also now worried that we might have lost ownership of content that we de-indexed using the "noindex" meta tag. If we plan to index it again, we should probably be rewriting every sentence of it, as Panda is demoting pages and sites even if there are only a few duplicate sentences.
This will involve a lot of time and effort, and by the time we rewrite, we might have lost the ability to rank the page in the SERPs.
I'm leaving my original content up in the blind faith that one day Google will realise that we are the original source. What makes me laugh is that most of the scrapers actually link back to us as the original source, but Google still ignores this and puts the scraped content above our original piece.
Yes, DMCA may not work in other countries. Knowing the kind of violation that Google permits on platforms like YouTube, I know that Google is now becoming a major threat to copyrights.
You can watch almost any movie on YouTube the same day it is released in theaters. If it is an Indian movie, Google just makes sure that the video is blocked in India, but I could still watch it in Australia. Google doesn't block it in all countries, despite knowing that the video was uploaded in violation of copyright.
Google seems to be extending this style of functioning to search!
|Google, Panda and scraped content - what are you doing? |
Not as much as I could be doing. We've used NoArchive for years. I believe that accounts for at least 50% of scraped content. We block Internet Archive. We block specific IP ranges.
My goal for a U.S. based ecommerce website shipping to U.S. addresses only would be to block all but a few trusted and surrounding countries from accessing the site.
If that isn't an option for you and there is a global audience, then your work is cut out for you. There are a handful of folks that I know from WebmasterWorld that are blocking thousands of IP ranges from accessing their site in an attempt to minimize the scraping, bandwidth abuse, etc.
Once you've done the above, now you get to deal with the serious scrapers using a proxy. I can imagine this whole process being quite a large task for anyone to undertake.
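For anyone considering the IP-range blocking described above, here is a minimal sketch using Apache 2.4's Require syntax (an assumption; the CIDR ranges shown are reserved documentation ranges, not real scraper networks, and would need to be replaced with the ranges you actually want to block):

```
# Hypothetical example: allow everyone except two sample CIDR ranges.
# 203.0.113.0/24 and 198.51.100.0/24 are TEST-NET documentation ranges.
<RequireAll>
    Require all granted
    Require not ip 203.0.113.0/24
    Require not ip 198.51.100.0/24
</RequireAll>
```

At the scale described (thousands of ranges), most people move this out of .htaccess and into the main server config or a firewall for performance.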
The first thing you should do is get NoArchive implemented; that's my suggestion based on years of using it as a global directive. I've had my dev implement it at the X-Robots-Tag level, and every site we host is NoArchive by default.
<meta name="robots" content="noarchive">
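To apply the same directive server-wide at the X-Robots-Tag level, as mentioned above, you can send it as an HTTP header instead of a meta tag. A sketch assuming Apache with mod_headers enabled:

```
# Send a noarchive directive on every response this server generates
Header set X-Robots-Tag "noarchive"
```

This has the added benefit of covering non-HTML files (PDFs, images) that a meta tag can't reach.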
Same goes for that pesky Internet Archive. You block that via robots.txt.
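The Internet Archive's crawler identifies itself as ia_archiver, so the robots.txt block looks like this:

```
User-agent: ia_archiver
Disallow: /
```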
|I'm not particularly concerned about foreign websites, as they're not as much a threat as domestic |
You should be. They are hosted on American servers, and Google doesn't know it is a "foreign" website. They can outrank you with your own content ...
|What makes me laugh is that most of the scrapers actually link back to us as the original source. |
Oh, you can do some Referrer Jacking in that instance. Teach them not to link to you. Try to remove that linked citation. Take it and do with as you wish using something like...
RewriteCond %HTTP_REFERER .*example.*
RewriteRule (.*) /jack.asp [I,L]
ISAPI_Rewrite .ini example shown above.
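For those on Apache rather than IIS, a roughly equivalent mod_rewrite sketch (jack.asp and the "example" referrer string are placeholders carried over from the snippet above):

```
# If the visitor arrived via a referring URL containing "example",
# serve the referrer-jacking page instead of the requested one
RewriteEngine On
RewriteCond %{HTTP_REFERER} example [NC]
RewriteRule .* /jack.asp [L]
```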
I've had links removed by doing the above during an Online Reputation Management campaign. There are just some inbound links you want to take control over. :)
We are extremely deep in the e-commerce space of health and several other sectors. Sometimes we write the content; other times, our clients write it. We encourage clients to secure HON certification as a reputation badge.
Copied content, regardless of Google Panda, is yours if it's original and you add a canonical tag:
<link rel="canonical" href="http://www.[domain name].com/page.aspx"/>
To monitor copycats, we use Copyscape on a regular basis. We (or our attorney) write to the copiers (in the US) and request immediate cease and desist.
Even being diligent, one can't totally stop the thieves, though the canonical tag is pretty sweet when you know your stuff is original and differentiated, and it appears page 1, above the fold!
|Copied content, regardless of Google Panda, is yours if original and you add a canonical tag: |
MarketingVictory, how can copied content become original by adding a canonical tag?
Just like you have a canonical tag on your page, the copier has a canonical tag on his page. How will this tag help then?
If the scraper forgets to change it, the canonical tag points to your site ... Of course, if they are good at it, they probably do a bulk find/replace, which makes it totally ineffective, but it's possible it could help in some cases.
TheMadScientist, hmm... possible, but I am not finding any scraper doing it wrongly.
Most of the guys that I encounter do it perfectly. They even take care to remove anchor tags, and some even copy the pages to which those anchor tags link.
In the meantime, I noticed Google introducing a new DMCA form for web search. When you report a scraper, they keep track of it using an online tracker linked to your Google account.
But I did not receive any confirmation email so far.
Yeah, indyank, if they're any good they probably take care of it, so by including it you almost make it easier for them to add it to their version of the content, but you might catch a few of the 'amateurs' if you have it ... IDK if the pros (getting the tag onto the pages the amateurs scrape) outweigh the cons (having it on the pages the pros scrape, so a find/replace is super easy and they don't even need to add it to have it pointing to their site on their stolen copy) ... I guess that's one of those questions people will have to answer for themselves.
Well, I step away to get a bit for lunch ... and everyone has replied to you, indyank.
If the canonical tag doesn't do it, gotta go the cease/desist, DMCA route ...
1) If Panda did have an aspect where it penalized scraped content.
2) And if Google did misidentify the source of content as instead being the scraper.
3) And if some "bad" pages on your website can affect your site as a whole in the new Panda-ized SERPs.
Then that would explain why essentially nobody is coming back after being hit by Panda. No one has yet rewritten enough pages with 100% fresh content (or happened to noindex the right mix of pages; who would noindex high-quality pages they thought they were the obvious source for?) to get the Panda penalty lifted off their site.
Possibly it's not a slow crawl or timed penalty, it's simply one that takes an incredible amount of time and effort to get from underneath and few can do it.
I am rewriting pages, but it's impossible that the new content will ever be 100% new.
I should be finished next week in doing everything I can possibly think of to do to repair my site. (The problem being, as I've said before, it just wasn't that dirty.) I'm not expecting what I've done to really help.
You can't just keep throwing time at something that produces no results. I'm to the point where my site's rankings are going to have to heal themselves. It's time to move on to other websites that have greener pastures.
By the way, on the copyright registration. I'm pretty sure that (in order to have legal standing over a violator), you have to register your copyright within 90 days of publication, or else before the violation is discovered (or takes place, I'm not sure which).
There is also an option where you can register (or pre-apply for registration, or something) before publication but I have no experience with that.
I agree with indyank, rewriting and rewriting will be a never ending battle that we can't win.
But what is causing this?
Is it the scraping itself that causes it?
Or is being outranked by a scraper a symptom of something else?
Surely this is something that Google wants to correct. As soon as the mainstream press understands and starts reporting to the public what is taking place, it will have a hugely negative impact on the Google brand.
With all those PhDs they must be able to work out who wrote the content on the internet first. Archive.org has the history of sites going back for some time. Surely Google has something similar in their archives.
In the real world, this is like someone going into someone else's shop, stealing their inventory, and selling it at a better location.
Microsoft have obviously caught wind of what Google are doing, and you would think they would take any opportunity to take down Google.
Off and on, I do DMCA. I also do spam reporting. Surprisingly, I am finding Google to be very responsive to my reporting. I've only had one or two rejections among at least a hundred complaints I must've made already, and those were for cases where only an excerpt had been copied. I also receive notes that seem rather more personal than usual, and quite polite. So on the DMCA front, I've been successful. I don't do it religiously, though, as it is so time consuming, but when I send them, I've had success.
Unfortunately, despite my DMCA strategy, I have not gained much improvement in terms of traffic. I have cleaned out a lot of pages from my site and still cleaning things out at this time. A lot of rewriting, noindexing, DMCA. That is all I'm doing at present. In the future, I am redesigning the site for better user experience and just to make it "prettier". Maybe add extra features. But I am taking it one day at a time and am hoping one day I will tip the balance in my favor.
Well, they were able to get it right in the past, with my site anyway. I've been scraped and scraped for years and years and I've always been first, until now...
Most of the people who scrape me are mainstream media websites. When I call them out on it, they typically blame it on "the intern." I haven't bothered with DMCAs as yet; usually I can make an obnoxious enough noise on twitter and other places to get them to either pull it down or attribute it to me (and unlike most, my republishing terms allow that anyone can republish the site, as long as they attribute it to me with a link back)
I have created a Republishing page that lists my terms and conditions, and includes this statement:
|I reserve the right to make as much a nuisance of myself as I have time to become. In past years, this has included (but is not limited to) phoning your place of business to discuss the matter, firing off emails to your management, issuing DMCA takedown requests to your internet service provider to have your offending pages removed, issuing DMCA takedown requests to Google to have your offending pages removed from the search engine (they’ve just put the whole process online; it’s amazingly easy now), calling you out as a douchebag on Facebook and Twitter, and posting screenshots of your theft alongside screen shots of my pages... Note to newspaper, television station, magazine, radio and other media websites: I am no longer accepting “Sorry, the intern did it” as a valid excuse. I got no fewer than SIX of those last year, which finally clued me in that it’s the industry standard “dog-ate-my-homework” response. If the first thing you teach your interns isn’t “Don’t Steal Content” then you don’t deserve to have interns in the first place. Period. |
I was going to remove the technical term "douchebag" but was persuaded to leave it in.
That's what I'm doing. For now.
Well, since 99% of our scraped content is being used in foreign countries, we've completely blocked China, most of Asia, all of India and most of Russia so far.
@Netmeg - Nice. I like it!
+2 netmeg :-) (just hope no-one copies it verbatim)
Netmeg, do you mind if I scrape that...
It really shouldn't be that hard for Google to figure out which is the original and which is the copy, provided the original's URL doesn't change. All Google needs to record is who was indexed first. This would probably solve 99% of scraped-content issues.
If I have pages who were indexed at their current URL in 1999 (and I do) and Google finds the same content on another site that didn't come into existence until say 2011, it should be really easy for Google to clue in on which is the original and which is the scrape.
(judging from my logfiles, it's already being scraped)
(my standard republishing terms therefore apply)