This is one area where Google does seem to be more challenged than they used to be. It may be because there is more scraping going on (to me it seems like there's been an explosion). Also, with the recent Scraper and Panda updates, Google may have made some kind of change on their back end that is buggy.
Matt Cutts released a video today that looks at the topic "How can I make sure that Google knows my content is original?" [youtube.com]
One of the ideas Matt puts forward is using the pubsubhubbub extension to Atom and RSS [code.google.com]. I haven't tried that myself, but it does sound intriguing - especially because Matt says Google may be using that kind of data more in the future.
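For anyone curious what using PubSubHubbub actually involves on the publisher side: per the spec, you add a `<link rel="hub" ...>` to your feed, then POST a short "publish" ping to the hub each time new content goes up, so subscribers (potentially including search engines) learn about your post immediately. Here is a minimal sketch; the hub and feed URLs are placeholders, not anything from this thread.

```python
import urllib.parse
import urllib.request

def build_publish_ping(feed_url):
    """Form-encoded body for a PubSubHubbub 'publish' notification."""
    return urllib.parse.urlencode({
        "hub.mode": "publish",
        "hub.url": feed_url,
    }).encode("utf-8")

def ping_hub(hub_url, feed_url):
    """POST the ping to the hub; a 204 No Content response means it was accepted."""
    req = urllib.request.Request(hub_url, data=build_publish_ping(feed_url))
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (placeholder URLs):
# ping_hub("https://pubsubhubbub.appspot.com/", "https://example.com/feed.atom")
```

The point of the ping is timestamped evidence: the hub hears about your post the moment it exists, before any autoblog gets a chance to copy it.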
Matt made two interesting points:
1) Filing a false DMCA counter-notice brings additional penalties if you're caught lying. I have experienced a few instances where the content thief lied (who would have guessed that thieves would lie, right?). So it's good to know that additional penalties will come down from Google for wasting everyone's time and resources.
2) A spam report will help eradicate auto-scrapers. I experienced this first hand a week or two ago, when I found a scraper taking portions of my content, as well as portions of content from hundreds or thousands of other pages, to create auto-generated content montages that made no sense. They had Adsense in every corner of the site. The Adsense Team disabled the ads within about 12 hours of my report. I was very impressed.
I am unfamiliar with the technology of scraping, but I recently learned about the scraping programs that enable scrapers to create this junk. I'm sure Google will eventually learn some of the signals.
Matt has probably never filed a DMCA complaint. Who do you think enforces penalties?
The only thing you can do if websites or ISPs ignore your DMCA complaints is to take them to Federal Court. All criminal sites and many semi-legit sites just ignore DMCA complaints.
On the Spam reports, I agree that they are the most effective Google feedback, but I suspect the guy or team in charge looks at the first ten that come in every day and flushes the next ten thousand.
|On the Spam reports, I agree that they are the most effective Google feedback, but I suspect the guy or team in charge looks at the first ten that come in every day and flush the next ten thousand. |
Yeah, that's why I didn't file a traditional spam report. I reported an Adsense violation (a different form that goes to the Adsense Team) and checked the box "in violation of Google Webmaster Guidelines". It seemed to get noticed faster than I expected.
|Matt Cutts released a video today that looks at the topic |
This isn't a new thing at all; it's been this way forever. I usually defend his vague messages about what the spam team is doing over there, but he's really not even close with this one.
He talks a lot about how they can get some idea of who the original publisher is and how sometimes a copy might get indexed faster than the original. In reality, the autoblog plugin is running every X minutes, so there's only a short window between your new post going up and the scraped copy going up.
To take it a step further, you can copy content that was indexed 5 years ago and rank it pretty easily. The key is just more link popularity, authority, domain age, etc. than the legitimate site. The mistake that most people make is they look for super popular feeds to load into their autoblog plugin. What you want to do is pick a site that has good content but maybe not the strongest link profile.
(your link profile + their content) > (their link profile + their content)
|The only thing you can do if websites or ISPs ignore your DMCA complaints is to take them to Federal Court. |
It isn't always made clear to newcomers that DMCA is a remedy under United States law only. I am located in the UK, my site is hosted in the UK and I have a .uk domain. In the unlikely event of receiving a DMCA notice (remotely possible I suppose if one of our reviewers submitted the same reviews elsewhere as well) I might just about be bothered to contact the originator with the words "go forth and multiply".
I might add also...just for fun one Saturday night (yes, I have no life lol), I went through MC's blog and searched snippets from some of his posts to see how many scrapers he had. There were THOUSANDS of copies of older posts from earlier this year. Heck, just search a sentence from his April 1st post (5 days ago)...over 800 copies now. In a few instances, for some of his posts back in 2010, a couple of sites outranked him. Surely this is an indicator of the magnitude of the scraping problem.
Granted, he has a lot of PageRank, so his originals rank in the #1 spot in most cases. But what about smaller sites that are getting scraped to death and are being surpassed for their own content, especially after Panda? Filing dozens of DMCAs is impractical (and expensive if you solicit the help of an internet attorney, who will have better luck with hosting companies -- after getting ignored for years by hosting companies, I started using an attorney, and the cost adds up).
DMCA and Spam reports aren't scalable time-wise, unfortunately.
It's a pretty serious problem, but luckily it's not nearly as widespread as it could be. People have a hard time getting the link profile they need while avoiding the sandbox. Unfortunately I think this will stick around as long as there are easy ways to fake authority.
Here is the point though: before Panda, scrapers did not outrank my original stories. Now they do. That is all that matters. Google lost a piece of functionality. I do not even mind legit scrapers. They aggregate news by topic, take the headline and a short snippet, and link to my original story. Nothing wrong with that. What is wrong is that these snippets now outrank all our original posts under Panda. All is still fine on the Panda-free Google indexes around the world.
|Before Panda scrapers did not outrank my original stories. Now they do. That is all that matters. Google lost a functionality. |
I have the same problem. It is completely unacceptable that the algo is doing it. Panda favors thieves, copyright violators and scrapers.
You can compare this algo behavior with a police officer taking the money out of your wallet and giving it to a criminal.
That actually might provide some good insight as to what actually happened with Panda.
|This is one area where Google does seem to be more challenged than they used to be. |
In another thread incrediBill makes the relevant point about a company having too many PhDs on staff, and to me, the originator-of-content problem is one important example where all their combined brainpower has apparently rendered them close to impotent. I would say, in part, that is the case because most of them do not originate content and thus are unaware of its central importance ~ it's simply a quirky right-brain concept and obviously not as much a front-burner issue (as it is for the rest of us). And it's probably not as "sexy" as +1 (yeah, now there's a high five!).
I'm waiting for the day where they call a press conference to make a public announcement saying that defending content origination is Priority #1 at Google ~ THAT is a worthy goal ... not just more bells & whistles.
incrediBill's Posting #4292806 [webmasterworld.com]
Is it possible that you are weaker, and hence these scrapers outrank you? I am sure it is not lost on Google that this is happening. I was of the mind that this was a mistake, but now I am wondering whether it was intentional!
So Google may be sending the signal that we fix our site's quality or suffer the humiliation of being outranked by scrapers...?
Incidentally, I've been regaining traffic over the weeks, gradually. Around 10% a week. I wonder if it will stick. I've been fixing everything that jumps out at me that can remotely qualify as a quality signal.
Weird one, this. In the really short term the copiers will probably have more Adsense and will make Google more money.
In the longer term it's nice to show the original copy - whatever revenue system the writers have they'll stop writing if they don't get anything out of it.
Seems a fairly important issue to work on, and it's odd that Google don't always see the signals. I can normally see them, so if they'd like to offer me a job (particularly an overpaid one) then I'd be up for it. Obviously I'd not be able to do algorithms - they have PhDs for that sort of thing - but I reckon I'd do a fairly good job at guessing the originals.
How can you expect a computer to know the difference between the original source and the copycat? Seriously, how can a machine tell the difference? You guys are expecting too much of Google. If you are going to write for the web, YOU have to deal with issues like these.
|Here is the point though: Before Panda scrapers did not outrank my original stories. Now they do. |
Yes, that is exactly the point. Actually I think the problem worsened about a month before Panda, right around the time they released an algo they said was designed to hurt scrapers.
|I'm waiting for the day where they call a press conference to make a public announcement saying that defending content origination is Priority #1 at Google ~ THAT is a worthy goal |
I'm totally behind you on that. If they can roll up their sleeves and take on rating something as subjective as quality, I'm sure Google can dramatically improve at not ranking vile scraper sites higher than the originator of the content. Scraping is not just against the guidelines, it is illegal in most countries. And here's the kicker - they USED TO BE BETTER AT THIS.
Sites that legitimately quote or syndicate content do make this job difficult. But it's not impossibly difficult if they take dead aim at pure scraping. The algo usually nails those sites after an extended period, but I'm sure it can be done faster. The main issue is that this is not currently a top priority, because as long as the user finds the information they want, Google's top priority is met.
If the thief ranks higher because they have a fancier template and looks like a magazine and more trustworthy, then Panda worked. Suck it up people.
The problem is really that there's no good way to detect copied content. Obviously which one is indexed first isn't accurate. You can look at HTTP headers, but those are ridiculously easy to forge. If you implement something based on that, it's like a black hat free for all.
What they really need to target to get rid of this stuff are the techniques these guys use to get the domain stronger than the original without sandboxing. You've got link pyramids, aged domains and a few other tricks that make it possible.
|If the thief ranks higher because they have a fancier template and looks like a magazine and more trustworthy, then Panda worked |
I was just thinking about this earlier. Funny you mention it. It occurred to me that the well-ranking sites look like magazines, with colorful templates and lots of images and widgets. I optimized my sites for speed (for the user experience: few images, none of the widgets that other sites are already running, few banner ads that take forever to load...my site is very fast). Oh well, I'll put it in a fat template and find some widgets and gadgets. It will load slower, but despite Google's declaration that load time was a ranking factor, maybe a magazine-style design will improve my site's ranking.
Somewhat ridiculous how ideas are put forward that are too technical for many people to naturally understand and implement, and are created to help address a problem that originates in the engines of a 100 billion dollar company.
Even more challenging might be that updates are now rolling out which are based on the assumption of unique and honest content, even amidst this very obvious elephant sitting in the living room.
|I optimized my sites for speed ... Google's declaration that load time was a ranking factor |
The same mistake that so many of us have made, in direct violation of Wheel's Law:
"do nothing for Google's benefit"
Village Idiot SEO [webmasterworld.com]
As anyone who writes software knows, it's easy to write bugs and more time is spent fixing bugs than writing new stuff (or should be).
Perhaps the algo detects an identical text segment on sites A and B, sees a link from B to A, and boosts site B (instead of A). Could be a typo in there somewhere, because there are quite a few reports of the opposite-to-expected effect being noticed.
The key to detecting a scraper site is surely to note that there are identical text segments shared between site X and sites A, B, C, D, E ... (many), where the matches are either small in length but without a linkback, or large in length. One might not be able to tell between A and B by direct comparison, but one can by noting that B also has content matches with C and D and E ...
A s/w review is in order - a second pair of eyes to check through the code, and some unit testing with sample data.
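The detection idea sketched in this post - flag a page whose text segments match many distinct sites, rather than just one - can be illustrated with simple w-shingling (overlapping runs of w words). This is purely an illustrative toy, not anything Google has described; the function names and the 0.2 threshold are made up for the example.

```python
def shingles(text, w=5):
    """Set of overlapping w-word shingles from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def overlap(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def scraper_score(candidate, corpus, threshold=0.2):
    """Count how many distinct source texts the candidate heavily overlaps.

    A high count across many unrelated sites suggests an auto-generated
    montage; a single strong match with a linkback looks like a quote.
    """
    cand = shingles(candidate)
    return sum(1 for site_text in corpus
               if overlap(cand, shingles(site_text)) >= threshold)
```

A montage page pasted together from sites A and B would score 2 against a corpus containing A, B, and C, while an unrelated original would score 0 - which is exactly the "matches with C and D and E" signal described above.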
I have some pages that beat Wikipedia to #1, although they take close to a minute to load (because of many large images). So if speed is taken into account, it's largely divided by bytes downloaded, I think; so I wouldn't worry about this too much. Uniqueness is weighted much more heavily than speed.
One can begin reading and scrolling my pages quickly. Sites which force you to wait while they download whatever can be irritating though - those with all the ads at top, where one can't scroll down, and then the browser crashes - aaaagh! But though my pages might be 10Mb in size, they don't have this blocking effect.
Ted, your comments were great in this thread. That PubSubHubbub thing sounds like a pinging program. Those have been out for a long time. I wonder what the difference is.
|So Google may be sending the signal that we fix our site's quality or suffer the humiliation of being outranked by scrapers...? |
In another thread I was talking about what happened to me on a set of keywords I have followed for years. I looked back to January 2009 and found the LA Times outranked me, along with a few other sites. A couple days ago I looked at those keywords again. Two new sites popped up - one scraped my content and the other cited me as a source and basically rewrote what I wrote. The scraper actually linked to my site.
Both sites were now above me and the LA Times. Previously, neither had outranked either of us.
I am not sure what happened - if it was Panda or before that. Something strange is happening.
BTW, I don't think that particular set of keywords cost me much. Maybe nothing.
|If the thief ranks higher because they have a fancier template and looks like a magazine and more trustworthy, then Panda worked. Suck it up people. |
If your neighbour has a fancy leather wallet and yours is only a cheap plastic version, then you don't deserve to have the money. If this algo would organise things in real life, the first thing it would do is take all the money you have been working for out of your wallet and put it in your neighbour's wallet.
Stop concentrating on creating good content. Concentrate on a fancy design to make the Google "quality" team happy. There's enough good content written by others that you can copy and use. Is this the signal Google wants to give us?
I'm not convinced that the template is a major algorithm factor - unless it buries the start of the actual content down below the fold. I see too many totally basic templates that still outrank their scrapers by a mile. But something more nuanced does seem to be on the loose about who should get the top ranking for content - something that is more defective than it used to be.
Most of the scrapers that outrank me have no authority, and are even totally off topic. I would agree that the NY Times could outrank me while scraping my content, but not a Joe the Plumber site.