Yes, scraping is sometimes a big problem. And as soon as one scraping site is dealt with, five others seem to pop up.
I've found the best approach (outside of reports and DMCA take-downs when the violations are extreme) is to strengthen your site's authority and trust. Scrapers usually cannot outrank a strong site that is seen as a trusted authority.
Why do people insist on calling this process "scrapping"?
Once in a post might be a typo, but six times wrong?
It's "scraping". One P. </endOfRant>
Mod's note: I have changed the spelling in the thread title.
[edited by: Robert_Charlton at 11:59 pm (utc) on Jan 12, 2012]
Or find the way they are getting your content and 403 them. Big volume scrapers do it automatically so there is almost always a way to block them. After that you just have to take them down from the SERPs, which is the real tricky part.
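For anyone wondering what the 403 approach looks like in practice, here's a rough .htaccess sketch (Apache 2.2-era syntax). The IP range and user-agent string below are placeholders, not real scrapers — you'd substitute whatever your logs actually show:

```apache
# Tag requests whose User-Agent matches a hypothetical scraper bot
SetEnvIfNoCase User-Agent "BadScraperBot" bad_bot

Order Allow,Deny
Allow from all
# 203.0.113.0/24 is a reserved documentation range, standing in for
# a server-farm block you have identified in your own logs
Deny from 203.0.113.0/24
Deny from env=bad_bot
```

Anything matching a Deny line gets the 403; everyone else passes through.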
THANK YOU! That has bothered me so much, for SO long. Glad someone else finally mentioned that :)
Well, maybe that's why google has been having such problems with this... maybe they've put all their resources into curbing "scrapping" instead...
On a more serious note, when I saw the thread title, "Looks Like This Site is Scraping the Whole Internet," I thought the site in question would be google itself.
Just think about it: if all webmasters blocked GBot, we could kill Goog that way. But that will unfortunately never happen; Goog will always scrape your site.
"Fighting scrappers is a loosing battle."
That's definitely what is happening :)))
That's not entirely true. You can't stop them completely, but you can reduce it considerably.
Er, I think that was what is called in the vernacular a joke. The kind that's told around the dinning room table.
I liked the original spelling of the thread title. Who needs an internet when you've got google? (Same principle as: who needs malls, shopping centers or downtowns when you've got Walmart?) Or we could look on the SERPs as a sort of virtual scrapbook. Clip and save the best parts.
>>Or we could look on the SERPs as a sort of virtual scrapbook. Clip and save the best parts.
interesting how people think differently about words...
i didn't think of scrap books ...
what comes to mind for me is scrapping old silver, e.g. selling it for the base value to be melted down.
Double "p" is combat (scrapper)
Single "p" is chipping away (scraping)
Oddly enough, both apply!
... and the opposite of 'winning' is 'losing', not 'loosing'.
>>... and the opposite of 'winning' is 'losing', not 'loosing'.
glad you said that, i had thought it was some kind of pun or joke that i didn't understand - either it wasn't or you didn't get it either!
ehh, as far as I can tell this is an international forum... and English isn't the only language... I'd say get over it... take the content for what it is.... if people want to sit around the dinning room table, talking about loosing traffic to scrappers... does it honestly make your coffee taste better if the forums are full of well written prose?
|And its not just me, thousands of websites have been scrapped by this scrapper site and Google is happily sending them millions of hits. |
This side discussion regarding spelling/meaning has not answered your question. Sadly, there is no real answer other than expenditure of time/effort/money to DMCA, perhaps block, perhaps rewrite and attempt to rank again... Copyright infringers (scrapers) have been with us since day one (I go back to Ug and Ugette and their little Uglies around the prehistoric campfire hundreds of thousands of years ago) and it is not likely to change any time soon.
Current consensus is to become an authority site with better (more wholesome) backlinks than the scrapers... but that is also more work/effort/expense.
As for preemptive action... how can one know who will be the next scraper? It's playing Whack-a-mole. This is ongoing, and certainly frustrating.
|the opposite of 'winning' is 'losing', not 'loosing' |
I am as pedantic about spelling as anyone, but I know a joke when I see one.
The quote was attributed to Noah Webster, who deliberately spelled English words incorrectly in his dictionary, and who died about 150 years before the internet was invented.
Back on topic, I agree with the suggestions that scrapers can largely be thwarted by 403 blocking (though this requires a certain expertise) and that newer sites without "authority" are particularly vulnerable if such defences are absent.
Google has indexed 6.1 million pages from this scraper site, and it has only PR2; Compete shows 200,000 visitors a month. Does Google no longer care about duplicate content?
Maybe it is easier for Google to list the one site holding all the copied content and filter out the thousands of sites where the content originally resided? Grrr.
Do these scrapers copy the headers? If so, you could use a canonical tag. I wonder whether canonical tags inside the body work (invalid, but I'd be curious to see if Google still accepted them).
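For what it's worth, a canonical tag only works in the document head, and it only helps if the scraper copies the full document rather than just the body text. A sketch, with example.com standing in for your own domain:

```html
<head>
  <!-- If a scraper republishes this page verbatim, head and all,
       the canonical still points back at the original URL -->
  <link rel="canonical" href="http://www.example.com/original-article.html">
</head>
```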
Scrapers may be lazy, and busy creating domains and G accounts, but don't think they can't learn... they are getting better at what they do by reading what we say here.
Sad thing is that there are so many IPs, and so few of us tracking abuse from them, that by the time we report them [webmasterworld.com...] they've moved on to the next set of IPs.
And they are cleaning up the scraped copy; if nothing else, they are breaking the interior links (an ordinary search and replace), so that it becomes even more difficult to combat.
There are 3 primary sources of site scrapers:
1. server farms and clouds - these farms inhabit known IP ranges which can be permanently blocked, perhaps drilling the occasional hole for a known good bot.
2. botnets - these can inhabit server farms (see above) or ADSL (broadband) IP ranges (see below).
3. home/business IPs - basically dynamic/static broadband IPs that are still under the control of their owners (i.e. not compromised by trojans). These usually have faulty "credentials" which can be detected.
Most bots, particularly the high-speed scraper types, CAN be dealt with. It just takes a bit of dedication and time.
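Most of that dedication and time goes into log analysis. As a toy illustration of catching the high-speed type, here's a minimal sliding-window request counter — the thresholds and the flagging policy are invented for the example, and a real setup would feed flagged IPs into whatever blocking layer you use:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # look-back window
MAX_REQUESTS = 50     # more than this per window looks like a bot

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def is_scraper(ip, now=None):
    """Record a request from this IP and return True if it has
    exceeded the threshold within the sliding window.
    Thresholds here are illustrative only."""
    now = time.time() if now is None else now
    q = _hits[ip]
    q.append(now)
    # drop timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```

A flagged IP could then be served a 403 directly, or appended to the deny list in .htaccess.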
If you do not want scrapers you have two choices:
1. learn to run your web site properly, installing blocking software as relevant (Apache on Linux/Unix, for example, has .htaccess capability, as do later versions of IIS: learn to use it).
2. buy in expertise from someone who knows how to manage blocking properly - it's cheaper than losing revenue to scrapers.
And, of course, check out WebmasterWorld's own "Search Engine Spider and User Agent Identification" and Apache (htaccess) forums.
One should really differentiate between a crawler and a scraper. For example, Google is not a scraper.
What's the difference?
A scraper "crawls" the web and then publishes documents to the web that other crawlers can crawl. This is where the harm comes in. Google crawls the web, BUT, Google does not publish new web pages with stolen content, that other crawlers can crawl and index.
I put a nonsense string in all my titles on one of my sites, something like "kkljghik". When I search for my unique nonsense string, many, many scraper pages/sites pop up, BUT none of them have the domain www.google.com, because Google does not publish what it crawls! (Well, except for groups.google.com, where some scraper idiot keeps inventing new group names and publishing copies of my content! Which it appears Google takes down fairly quickly.)
Anyway, the unique string trick ("kkljghik") at least makes it easy to find all the scrapers! For a 150-page site, Google returns 58,000-plus scraped pages; perhaps 2 to 3 percent "might" be considered legitimate. Virtually all of these results have clearly extracted, and republished, some content from my site.
|For a 150-page site, Google returns 58,000-plus scraped pages; perhaps 2 to 3 percent "might" be considered legitimate. Virtually all of these results have clearly extracted, and republished, some content from my site. |
Have you successfully combated these scrapers? Have you been able to deal with them proactively?
I am sure that google would NOT want to have scraped sites in their index (although it may seem like they don't care either way).
Maybe if we created a thread on the Google Webmaster Tools forum called "Has Your Site Been Scraped? List It Here!" we could get webmasters to list all the scraper sites they have come across in one location, and possibly it would be easier than having to file a DMCA request for EVERY different scraper site.
Maybe it won't help, but maybe we could at least raise awareness among other webmasters who have no idea what scraping is.
There are 2 huge scrapers of our main site. Each of them has over 1.5 million indexed pages in Google and huge traffic (~17k Alexa rank). All the theft is automatic, and the sites look totally like low-quality MFA with 2 AdSense square blocks above the fold. They have myriad backlinks pointing at them, and all they do is add bits and pieces of content from zillions of sites.
These guys have been online forever, and we've reported them to G many times without success. Please note that I am talking about the main 2 scrapers, but there are actually about 100 more thieves taking content from our site boasting the same "business model".
I haven't really tried. I don't have the problem of the OP, of scraped content outranking my original content.
Google's new rel=author photo id thing may help?
It may help Google sort out original author?
Some have said they've had success with something as simple as making sure they have a claim of Copyright on the page. Apparently some scrapers will honor a copyright notice, believe it or not. Make sure the copyright notice is in the same source (html) file as your primary content.
But I'm sure in most cases it's much more difficult to actually eliminate the scraped content.
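If you want to try the copyright-notice route, the key point above is that the notice sits inline in the same HTML file as the article, not in a separately loaded footer that a scraper's extractor might drop. A sketch — the names and URL are placeholders:

```html
<div class="article-body">
  <p>...your content...</p>
  <!-- Inline notice in the same source file as the content itself -->
  <p class="copyright">
    &copy; 2012 ExampleWidgets.com. All rights reserved.
    Original: http://www.example.com/widgets/blue-widgets.html
  </p>
</div>
```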
And the scraper can't change rel-author to his own authorship?
I haven't yet seen a single Google "this is a good idea to prevent stolen content..." suggestion that actually helps against scraped/stolen content. All the ideas can so easily be circumvented in some way.
The only successful one, as far as I can see, would be via Webmaster Tools, but last time I looked there was nothing there to show that content tied to YOUR WMT verification ID was contested when found with either no ID or someone else's verified ID. Which should be easy to do, yes? I mean, if they really cared?
Basically: secure your site or lose it.
My fear is some future requirement where we'd have to alert G (or another search engine) that we have new content, wait for them to confirm they've indexed it, and only then post it online, so that they won't credit anything the scrapers do.
Yuck but not really far-fetched. Screenplays operate on exactly this system: they must be registered with the appropriate governing body before any reputable filmmaker or studio will look at them. (It's to prevent annoying and costly "You stole my script!" lawsuits-- or, ahem, actual script-stealing.) But the "appropriate governing body" is, of course, not any of those studios; it's an independent third party.
The website analogy would call for some independent body to look at web pages and verify that they don't match any existing page before any reputable search engine will index them.
This wouldn't stop you from changing a page after it's got the seal of approval. But then, scripts get rewritten too.
Do you think the importXML function in Google docs has contributed to more scraping?