
Should I Be Concerned With Scrapers?

LostOne

8:49 pm on Feb 16, 2015 (gmt 0)

10+ Year Member



I've been fighting a battle against content theft over the last few months, and fortunately it's been about 95% successful. Still, scrapers have always been in the back of my mind.

I update content now and then, except for many opening paragraphs, which is exactly where scrapers love to play and steal. It didn't really hit me that this was having an effect until I started grabbing the first few sentences of a page and checking the Gorg... my pages (some, but not all) don't appear. Some actually rank quite well... strange, eh?

Any thoughts? I have been rewriting opening paragraphs, but it's beginning to feel like a temporary fix, and the scrapers will no doubt return. After seeing 500 to almost 4,000 identical strings of text matching some of my originals, it seems like the Gorg algo could be looking at that as a red flag too.

I may as well add that I've been in Panda jail for almost four years now, so as anyone who has been Panda-stricken realizes, we try every conceivable action to get out of the trap.

blend27

12:03 am on Feb 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can't escape that.

What you can do is monitor. Monitor incoming traffic.

We have a forum for that. Start here: [webmasterworld.com...]. Create a table in your DB, add the offending ranges, and block them.
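A rough sketch of what I mean, in Python with SQLite; the table and column names are just examples, use whatever fits your setup:

```
import ipaddress
import sqlite3

# A table of CIDR ranges you have decided to block.
conn = sqlite3.connect("blocklist.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS blocked_ranges (
        cidr   TEXT PRIMARY KEY,  -- e.g. '192.0.2.0/24'
        reason TEXT,              -- why you blocked it
        added  TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.commit()

def is_blocked(ip: str) -> bool:
    """True if the visitor IP falls inside any stored range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr)
               for (cidr,) in conn.execute("SELECT cidr FROM blocked_ranges"))
```

Call is_blocked() early in your request handling and 403 on a hit. A linear scan over the table is fine until the list gets big.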

What you can also do is, on a weekly basis, search for strings you already know are duplicated in Gorg Search, get the IP of the domain where the dupe content is hosted (btw, the scraper usually does not scrape from that IP, but if the hosting provider allows scraped content to be hosted... well, it is a hosting range), get the IP range from that, and block it. 2-3 hours a week, no more.
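You can semi-automate the lookup part. A sketch, assuming the third-party ipwhois package for the range lookup ('scraper-example.com' is a placeholder for a dupe domain you found in search):

```
import socket
from ipwhois import IPWhois  # pip install ipwhois

def range_for_host(hostname: str) -> str:
    """Resolve the host serving the dupe content, then ask RDAP/whois
    which CIDR the hosting provider announces for that IP."""
    ip = socket.gethostbyname(hostname)
    return IPWhois(ip).lookup_rdap(depth=1)["asn_cidr"]  # e.g. '203.0.113.0/24'

# Placeholder domain; feed the result into the blocked_ranges table above.
print(range_for_host("scraper-example.com"))
```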

If you are e-com, block countries that you don't deal with.
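One way to do that at request time is a GeoIP lookup. A sketch assuming MaxMind's geoip2 package and a local GeoLite2 database; the allow-list is obviously yours to define:

```
import geoip2.database  # pip install geoip2
import geoip2.errors

ALLOWED_COUNTRIES = {"US", "CA", "GB"}  # example: countries you deal with

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def country_ok(ip: str) -> bool:
    """True if the IP geolocates to a country you do business with."""
    try:
        return reader.country(ip).country.iso_code in ALLOWED_COUNTRIES
    except geoip2.errors.AddressNotFoundError:
        return False  # unknown location: your call, I'd refuse
```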

Bad headers, bad UAs, fake referrer pages, bot traps.
They want robots.txt? Sure, but nothing else from that last octet for a while until you verify it, sorry... etc., etc.
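To make that concrete, a toy vetting function; the UA fragments are examples only, and the request shape is simplified rather than any particular framework:

```
BAD_UA_FRAGMENTS = ("curl", "python-requests", "scrapy")  # examples only

def vet_request(ip: str, path: str, headers: dict, verified_ips: set) -> int:
    """Return 200 to serve or 403 to refuse. Unverified IPs may
    fetch robots.txt and nothing else, per the rule above."""
    ua = headers.get("User-Agent", "").lower()
    referer = headers.get("Referer", "")

    if not ua or any(frag in ua for frag in BAD_UA_FRAGMENTS):
        return 403  # missing or known-bad user agent
    if referer and not referer.startswith(("http://", "https://")):
        return 403  # malformed/faked referrer
    if path == "/robots.txt":
        return 200  # they want robots.txt = sure
    if ip not in verified_ips:
        return 403  # but nothing else until you verify them
    return 200
```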

If you can do rDNS on the incoming IP and the hostname has one of the following in it (see the sketch after the list):

.amazonaws.com
.your-server.de
.server4you.net
.hosteurope.de
.softlayer.com
.theplanet.com
.ovh.net
.xlhost.com
.serverloft.com
.fastwebserver.de

403, but not just a plain 403: record and capture all possible info you can, and learn from it.
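For the "not just a plain 403" part, something like this; the field choice is mine, capture whatever you find useful:

```
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("blocked")

def refuse(ip: str, path: str, headers: dict) -> int:
    """Log everything useful about the request, then 403 it."""
    log.info(json.dumps({
        "when": datetime.now(timezone.utc).isoformat(),
        "ip": ip,
        "path": path,
        "ua": headers.get("User-Agent"),
        "referer": headers.get("Referer"),
        "accept": headers.get("Accept"),
    }))
    return 403
```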

btw, this does not have to be your main domain :)

Point is, if you rank, then they will try to scrape. The more hosting ranges you've got covered, the better your defense is. The more you know your internal linking structure, the better you will be able to say 403. The better you understand how normal browsers work... you get the point, I hope.

Think about it this way: where is the weakest point in your defense if you had to scrape it yourself?

Blend27

Robert Charlton

7:02 am on Feb 17, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



...grabbing the first few sentences of a page and checking the Gorg... my pages (some, but not all) don't appear. Some actually rank quite well... strange, eh?

At a certain point during the MayDay update, if not before, it hit me that this isn't necessarily strange, depending on how you did the search.

Since then, Google has been returning results for concepts, not just word strings. If you search for the first few sentences of a page without quotation marks, just the plain sentences, Google will currently try to parse a 'meaning' from those sentences, and that meaning is very likely going to be different from the focused vocabulary the entire page is optimized for.

So, put any exact matches you're searching for in quotes.

Scrapers, though, will sometimes spin or split content, add extra words, etc... so generally I'll try to find a unique sequence of four or five words which I think is likely to have been left alone. It may take several tries to locate some scraped content using Google.
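If you want to semi-automate picking those sequences, something along these lines could pull a few quoted five-word candidates out of a page to paste into Google (purely illustrative):

```
import random
import re

def candidate_phrases(text: str, n_words: int = 5, samples: int = 3):
    """Pick a few n-word sequences from body text, quoted for
    exact-match searching."""
    words = re.findall(r"[A-Za-z']+", text)
    samples = min(samples, max(len(words) - n_words, 0))
    starts = random.sample(range(len(words) - n_words), samples)
    return ['"' + " ".join(words[s:s + n_words]) + '"' for s in starts]

# e.g. candidate_phrases(open("article.txt").read())
```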

Copyscape, which is available for limited free use and is inexpensive overall, will list the most prominent pages/URLs that have scraped you, and it identifies discontinuous as well as continuous duplicate content.

netmeg

1:20 pm on Feb 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My sites are very heavily scraped (and in some cases, by sites with a lot more authority than mine). As far as I can tell, while it drives me nuts, it hasn't hurt me. I'm still pretty much outranking them all.

If you were Panda'd, I'd guess that it's not the scrapers that are causing the problem. But they may be benefiting by outranking you for your own content.

I don't think there really is an effective way to battle scrapers. I just try not to think about it.

blend27

5:37 pm on Feb 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't think there really is an effective way to battle scrapers.

There is: block them.

netmeg

6:52 pm on Feb 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



block them


Doesn't scale.

aristotle

7:25 pm on Feb 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There is: block them.

You can block every bot in the world, but that won't stop a real human from seeing your content and deciding to manually make a copy of it to put on their own site. I'm pretty sure that's how most of my articles get scraped.