How to prevent scraper sites . . .

Is there anything you can do to block scrapers?

quizprep

7:56 pm on May 17, 2005 (gmt 0)

Short of sending out DMCA letters and contacting hosting companies, is there anything one can add to their websites/html code to prevent scrapers from duplicating your site?

I run a network of websites that are copied wholesale so frequently (15 to 20 times a month or so), mostly by foreign entities, that writing DMCA letters has become a part-time job, and a mostly pointless one: 7 out of 10 site owners never respond, and the ones that do often simply republish the site under a different domain a week later. The offending sites merely replace my logo with one of their own and change the AdSense code (though they rarely change the affiliate links).

Is there a prophylactic solution to this that I'm missing, or are scrapers unavoidable for folks with quality sites and original content?

larryhatch

9:09 pm on May 17, 2005 (gmt 0)

All of us (scrapers excluded of course) would love a magic bullet for such practices.
I have never found one. There are lots of threads on this list with suggestions which help,
but I see no strong single preventive method. -Larry

ownerrim

1:11 am on May 18, 2005 (gmt 0)

it's so depressing to read stuff like this. it makes you wonder, what is the point of working hard to develop a good website if you pretty much know that thieves will be copying it on a routine basis?

ownerrim

1:13 am on May 18, 2005 (gmt 0)

you may not want to give away your url, but can you broadly reveal the type of content it is? is it something that is perceived to be lucrative by overseas thieves?

quizprep

1:29 am on May 18, 2005 (gmt 0)

It's not one url, it's many. I run an online publishing company that hires employees and freelancers with expertise to write online content, to which we then add advertising (AdSense, affiliates, CPM ads, etc.). The content covers a variety of topics, both niche and broad, and some of the sites are actually somewhat well known. Apparently it is quite lucrative for folks overseas to duplicate our sites, though the duplication isn't exclusive to foreign scrapers; they are just more difficult to stop (about 1/3 of it is done by U.S. website owners, but they are generally more receptive to DMCA complaints, or at least their hosting providers are). I understand that, in some cases, our sites are duplicated and then sold to other website owners en masse, who then republish them, often creating 5 - 10 copies of the same website on the Internet at once (which, as you can imagine, causes all sorts of dup content penalties).

At any rate, it's become increasingly frustrating, and I'd just like to find a magic bullet for future websites. I wonder how larger, commercial sites avoid duplication by scrapers?

BigDave

1:32 am on May 18, 2005 (gmt 0)

Their host may not be in the US, but Google, Yahoo and MSN are. Get them removed from the SERPs.

jdMorgan

2:59 am on May 18, 2005 (gmt 0)

The really greedy ones can be easy to prevent. Some common methods come to mind:

Manually block well-known download tools by user-agent and/or IP address (black-list).

Allow only known-good user-agents to access your site (white-list).

Automatically block all-at-once page requests (block based on access rate).

Automatically block visitors that disobey robots.txt (traps, honeypots).

Block well-known open proxies; Detect and screen proxied requests.

In addition, some Webmasters may block entire class A networks because their site will not benefit from allowing access from those IP address ranges. For example, some sites may not benefit from out-of-country traffic at all, but might suffer site-scraping from other countries. I'm not suggesting or condemning such a wide-reaching access restriction, just mentioning it; The choice is up to the individual Webmaster.

Many of these solutions have been discussed in the technical/scripting forums here. There are also many Web sites about IP address and open-proxy blacklisting. These methods can help to reduce the number of successful scrapes, and thereby reduce your legal costs, time spent on DMCA filings, worry, etc. None are foolproof, but they can discourage those who might take your site for easy prey.
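
As one illustration of the access-rate method listed above, here is a minimal sketch of a sliding-window limiter. The class name, the thresholds, and the in-memory ban set are all assumptions made for the example, not anything described in this thread; a real site would tune the limits and persist the ban list.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Flag clients that make more than max_requests within window_seconds."""

    def __init__(self, max_requests=30, window_seconds=10.0):
        # Both thresholds are illustrative guesses, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned = set()              # in-memory only; lost on restart

    def allow(self, ip, now=None):
        """Record one request; return False once the client is over the limit."""
        if ip in self.banned:
            return False
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.banned.add(ip)
            return False
        return True
```

Calling `allow(ip)` once per request and answering 403 when it returns False would block all-at-once downloads while leaving slower visitors alone. Search-engine spiders would probably need an exemption or a generous limit.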

Jim

ownerrim

3:50 am on May 18, 2005 (gmt 0)

"often creating 5 - 10 copies of the same website on the Internet at once (which, as you can imagine, causes all sorts of dup content penalties"

Have any of your urls noticeably suffered in their rankings due to duplicate content issues, or have you managed to avoid that because you have more pagerank on your urls?

Also, I wonder how you first noticed this was happening. And, for that matter, I wonder how they began to notice your urls? Was it because the sites were prominent for certain keyword rankings?

I have a new site I am working on that will be loaded with hundreds of pages of in-depth, original content. Because the subject matter has broad appeal, I am somewhat hesitant to optimize it so it will surface in the serps because of this very issue. I've thought about taking EFV's advice and getting a copyright. I've also considered using the *opyscape service.

The thought of someone stealing the content makes me ill. The thought of someone stealing it and causing ME duplicate content problems makes me want to hurl.

ownerrim

3:55 am on May 18, 2005 (gmt 0)

"In addition, some Webmasters may block entire class A networks because their site will not benefit from allowing access from those IP address ranges. For example, some sites may not benefit from out-of-country traffic at all, but might suffer site-scraping from other countries."

How do you block whole countries? I had a list of ip ranges once and it didn't seem all that easy to do. Many countries seemed to be assigned portions of quite a few ip ranges and due to that it seemed very difficult to sort out. Ideally, to be honest, I would block everyone but north america and europe.

jdMorgan

4:39 am on May 18, 2005 (gmt 0)

> How do you block whole countries

There are paid services which provide updated/maintained IP address ranges by country.

I didn't mention 'easy' or 'cheap' above, because these are both subjective terms. But the question was asked, "how can we prevent scraping?"
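
Once you have such a list, the check itself is simple. The sketch below uses Python's standard `ipaddress` module. The ranges shown are reserved documentation ranges standing in for a real country list, and the names `BLOCKED_RANGES` and `is_blocked` are made up for the example.

```python
import ipaddress

# Placeholder ranges (reserved for documentation); a real list would come
# from a maintained per-country feed and be far longer.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("198.51.100.0/24", "203.0.113.0/25")
]


def is_blocked(ip, ranges=BLOCKED_RANGES):
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)
```

With thousands of ranges, a linear scan like this gets slow, so a sorted list with binary search or a server-level rule set would likely be a better fit.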

Jim

digitalv

5:15 am on May 18, 2005 (gmt 0)

Here's an easy way to block scrapers. First, you have to be able to block access to your site by IP address: through .htaccess if you're using a Linux/UNIX server, or with a database or text file read through FileSystemObject if you're using a Windows server. Banned users are then redirected to a dead-end page.

The next step is to make a page on your site that will add the IP address of ANYONE WHO VISITS IT to the ban list, instantly blocking them from your entire website.

Then you update your robots.txt to disallow access to this page, so no LEGITIMATE spiders will ever visit.

Then the final step is to make a link to this page (the one that bans any visitors). Hide it, for example in a table with its style set to display:none, so that no person will ever click on it but a spider will still see the link and follow it.

If the spider is obeying robots.txt it will know not to follow that link - but if the spider ignores robots.txt, as all of the "scrapers" do, it will spider this link and immediately ban itself from your site.
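
The steps above can be sketched as a tiny WSGI application. This is only an illustration under stated assumptions: the path `/trap.html`, the in-memory `BANNED` set, and the function name `app` are hypothetical, and a real site would store the ban list in .htaccess, a database, or a file as described above.

```python
BANNED = set()            # in-memory ban list; a real site would persist this
TRAP_PATH = "/trap.html"  # hypothetical path, also listed under Disallow in robots.txt


def app(environ, start_response):
    """WSGI app: ban any client that requests TRAP_PATH, then refuse banned clients."""
    ip = environ.get("REMOTE_ADDR", "")
    path = environ.get("PATH_INFO", "/")
    if path == TRAP_PATH:
        # Anyone who gets here ignored robots.txt and followed the hidden link.
        BANNED.add(ip)
    if ip in BANNED:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Access denied.\n"]
    start_response("200 OK", [("Content-Type", "text/html")])
    # Hidden link that people will not click but a link-following bot will see.
    return [b'<p>Welcome.</p><a href="/trap.html" style="display:none">.</a>']
```

The matching robots.txt would carry a `Disallow: /trap.html` line so that well-behaved spiders never request the page.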

jim_w

7:06 am on May 18, 2005 (gmt 0)

I noticed that a lot of people are using broadband and PHP to spider sites. Since we are a BtoB site, we can look at those DSL addresses and block the whole range covered by the last octet. So if the IP is 111.222.333.444, we would deny 111.222.333. This means they have to log off their DSL to get a new IP, but chances are they will still get one in the denied range. Since there are only so many providers in any given area, this seems to work well for us. But as I said, we are a BtoB site, so when someone does visit from a DSL line, they are usually trying to learn or research something, and the few that get blocked unintentionally do not hurt our bottom line.

Sometimes there are no "good" ways of doing something, just any way.
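
Denying everything that shares the first three numbers of an address amounts to blocking its /24 network. A small sketch with Python's standard `ipaddress` module follows. The function names are made up for the example, and it should be tried with a valid address such as one from the 203.0.113.x documentation range, since the placeholder address in the post is not a real IP.

```python
import ipaddress


def enclosing_block(ip, prefix_len=24):
    """Return the network that contains ip, by default its /24."""
    # strict=False lets us pass a host address and have the host bits masked off.
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def in_denied_blocks(ip, denied):
    """Return True if ip belongs to any network in the denied list."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in denied)
```

For example, banning the block around one offending address would be `denied.append(enclosing_block(offender_ip))`, after which every address in that /24 is refused.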

ownerrim

11:03 am on May 18, 2005 (gmt 0)

Hey, would this be helpful?

[okean.com...]

quizprep

1:37 pm on May 18, 2005 (gmt 0)

Thanks for the advice, folks. I suppose I'll need, now, to learn how to block IP addresses (I'm a born-print publisher with limited web skills, unfortunately), and many of your suggestions sound doable (though nothing is as easy as frame-busting code, a successful tool against the many thieves who simply stick my sites into their frames).

As for the dup content penalties ... there is absolutely nothing more infuriating to me than when scrapers are capable of not only duping an entire site of mine, but SEOing it to death, and ultimately getting my site delisted while the scraper's site achieves decent PR, which is, unfortunately, an all too common experience because I'm more interested in drawing traffic through content than SEO.

I will reiterate, however, that the webmasters who copy my sites are not exclusive to China or South America - I have actually tracked one or two thieves to members of WebmasterWorld, an even more infuriating experience.

ownerrim

3:18 pm on May 18, 2005 (gmt 0)

WebmasterWorld members? very sad

quizprep,

not to beat a dead horse, but did you begin to see ripoff pages start appearing above your own pages in the serps?

quizprep

3:53 pm on May 18, 2005 (gmt 0)

I honestly don't pay attention to serps, so I don't really know how well the other sites fared in relation to mine. I discover the duplicate sites when I run google searches on my sites' content. I have no doubt that there are scores of sites copying my content that I simply haven't found yet because they aren't indexed by google.

Freedom

4:29 pm on May 18, 2005 (gmt 0)

I got an interesting email from a WebmasterWorld member awhile back who wanted to buy my websites because he read my posts on AS. Never heard of him before.

So I wrote him back asking what kind of sites he wanted and some other specs. Told him I wasn't about to give out my urls until I knew more about who he was and what he wanted.

In retrospect, a good idea because I safely assume he was just fishing for profitable AS ideas and content to steal.

If someone you don't know sticky's you here about wanting to buy your sites, he just might be fishing for ideas and content to steal. After you've talked about daily income, site specs, and details, he has all he needs to know to beat you at your own game.

jbgilbert

4:39 pm on May 18, 2005 (gmt 0)

Ownerrim asked the following, but got no responses...

"Have any of your urls noticeably suffered in their rankings due to duplicate content issues, or have you managed to avoid that because you have more pagerank on your urls?"

I have the same question. I have a couple of informational sites that have been scraped to death -- normally I would not care, but some of the "scrapers" have MANY, MANY, MANY links to me from their site and my rankings in Google have dropped "significantly" (no other meaningful changes have been made to the site to cause this).

Does anybody know if Google could be penalizing a site because a "scraper" has MULTIPLE, MULTIPLE, MULTIPLE links to a site it scraped?

jdMorgan

6:52 pm on May 18, 2005 (gmt 0)

My opinion is that there is no 'penalty' per se. But you lose your implicit 'ownership' of the contents, and now you have multiple pages --yours and theirs-- competing in the SERPs.

This is similar to the situation --often called a 'duplicate content penalty' for the sake of simplicity-- where you have one page on your site that resolves under two or more URLs; It competes with itself for incoming links and PageRank, and competes with itself for eyeballs in the SERPs if both copies are listed. The 'penalty' is the loss of content-uniqueness, and your loss of control over which URL will be used/shown/linked-to. I've never tried massively-duplicating a page to see if it actually trips a filter or garners a real penalty.

However, in the case at hand, this is scrapers competing with you using your content. If their page gets higher-PR incoming links, your page may well disappear.

----

An ounce of prevention is worth...
an occasional good laugh

Here are two typical sequences of events on a site that takes a few steps to head off problems from scrapers, harvesters, and/or stealth 'bots:


128.***.124.24 - - [18/May/2005:07:06:24 -0800] "GET / HTTP/1.1" 403 867 "-" "-"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "GET / HTTP/1.1" 403 867 "-" "http://www.***.com/ <more UA text>; Mozilla/4.0 compatible crawler"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "HEAD / HTTP/1.1" 200 0 "-" "-"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "GET /favicon.ico HTTP/1.1" 403 867 "-" "http://www.***.com/ <more UA text>; Mozilla/4.0 compatible crawler"
128.***.124.24 - - [18/May/2005:07:06:25 -0800] "GET /403info.html HTTP/1.1" 200 8043 "-" "-"

The bot came in with no user-agent and no referrer, and was denied access.
It then came back with a user-agent, but an unknown user agent, and so was denied.
Next, it tried a HEAD request, all of which we allow except in the most egregious cases.
Next, it tried fetching our Favicon, and was denied (unknown user-agent).
Finally, it fetched the 403 info page, which explains our site access policies, in case a human might actually read it.
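
The decisions visible in that log can be approximated by a short function. This is a guess at the policy reconstructed from the log lines alone, not the actual rule set: the whitelist tokens, the always-allowed paths (including robots.txt), and the name `status_for` are assumptions made for the example.

```python
# Hypothetical whitelist: substrings that mark a user-agent as known-good.
ALLOWED_UA_TOKENS = ("Mozilla/5.0", "Googlebot")
# Pages served to everyone; robots.txt is an assumption, it does not appear in the log.
ALWAYS_ALLOWED_PATHS = {"/403info.html", "/robots.txt"}


def status_for(method, path, user_agent):
    """Return the HTTP status this sketch of a policy would give a request."""
    if path in ALWAYS_ALLOWED_PATHS:
        return 200  # the policy page stays readable even for denied clients
    if method == "HEAD":
        return 200  # HEAD requests are allowed in this sketch
    if not user_agent or user_agent == "-":
        return 403  # blank user-agent
    if any(token in user_agent for token in ALLOWED_UA_TOKENS):
        return 200  # known-good user-agent (white-list)
    return 403      # unknown user-agent
```

Fed the five requests from the log, this returns 403, 403, 200, 403, 200, matching the status codes shown.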


64.**.136.196 - - [18/May/2005:08:49:50 -0800] "GET / HTTP/1.0" 200 21356 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"
64.**.136.196 - - [18/May/2005:08:49:56 -0500] "GET /accessv/mail.cgi?id=user HTTP/1.0" 200 162 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"
64.**.136.196 - - [18/May/2005:08:49:56 -0800] "GET /accessv/user_auth.html HTTP/1.0" 200 162 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"
64.**.136.196 - - [18/May/2005:08:49:56 -0800] "GET /accessv.html HTTP/1.0" 403 670 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021204"

This poor harvester came in, grabbed the home page, analyzed it, and then came back and immediately tried to request three 'poison' files in a row! On the last of three simultaneous requests, it discovered it was already banned by IP address, and crawled away to die.

So much pain and sorrow could have been avoided if they had only fetched and obeyed robots.txt... Requiescat in pace.

Resources:
Modified "bad-bot" script blocks site downloads [webmasterworld.com]
Blocking Badly Behaved Bots [webmasterworld.com]
A Close to perfect .htaccess ban list [webmasterworld.com], the WebmasterWorld classic in four(?) parts

Jim

jim_w

7:11 pm on May 18, 2005 (gmt 0)

Jim;

>>But you lose your implicit 'ownership' of the contents

In our case it is more than that. They "hot link" to our software after they scrape download.com. This costs us bandwidth and we have to constantly perform maintenance to keep them out, and time is money. So it is costing us X 2. I have even seen humans from these sites come in after a block, take the URL to the software, and then fix their web pages.

After banning referrals from scrapers-r-us.com from getting the downloads, I have seen them then hot link directly to download.com. I told their ISP about it, and the ISP said that if we file a complaint about copyright, and if we were wrong, then we would have to pay attorney's fees. So the ISPs, apparently Google, and the scrapers themselves are the only people that support these unethical activities. I don't have a clue why the ISPs and G don't do something about it. In the long term, they are indeed hurting themselves due to their lack of respect for the original content providers.

ownerrim

10:07 pm on May 18, 2005 (gmt 0)

"The 'penalty' is the loss of content-uniqueness, and your loss of control over which URL will be used/shown/linked-to"

I'd love to get input on this. I know of a few sites that occupy their positions high in the serps and this has little to do with their on-page content or page rank. Essentially, they do have good onpage content AND high pagerank----but to place high in a crowded field of pages (millions of other pages to compete with), they rely mainly on the strength of incoming links that have good anchor text.

So, even if a copier copies their sites word for word, it won't matter. That site won't make it onto page one of the serps because the copier won't have the thousands of incoming links from other sites with good anchor text. If a copier copies the site and has good pagerank on the copied page, it won't matter for the same reason.

These sites are where they are due to 1. pagerank, 2. content (good keyword relevance and density) and 3. lots and lots and lots of anchor-text-keyword-relevant links from lots of different websites on different IPs.

Seems to me, the best way to defeat scrapers and copiers is 1. get as much pagerank as you can get (and never stop accumulating), 2. get as many inbound links as you can (and never stop accumulating), 3. optimize the site well, and 4. never stop adding content.

jim_w

11:09 pm on May 18, 2005 (gmt 0)

ownerrim;

On the other hand, that just gives the scrapers more stuff to scrape, doesn't it? ;-))

BigDave

1:04 am on May 19, 2005 (gmt 0)

"I told their ISP about it, and the ISP said that if we file a complaint about copyright, and if we were wrong, then we would have to pay attorney's fees."

Actually, if that is what they told you, then they are not completely correct. The US is not Britain.

In copyright cases, awarding attorney's fees is not automatic. In fact there are some fairly famous cases where they were not awarded.

Fogerty v. Fantasy, Inc. is one that comes to mind. <added> I believe he eventually won the fees, but they were never ruled to be automatic</added>

The key word in section 505 is "may".

mblair

1:09 am on May 21, 2005 (gmt 0)

digitalv,
I have heard of these spider traps before and have considered implementing one, but I have a question -- I have read in the Google forums here that some have reported that Google has actually been following links despite their being disallowed in robots.txt, but then not indexing them. I suppose the same concern could apply to Google's mediabot, as it might do the same.
In these cases are you whitelisting known Google IP ranges, or has this just not been a problem thus far?

stapel

3:06 am on May 21, 2005 (gmt 0)

I've used a spider trap for a few years now, and I'm still listed high in Google. It shouldn't be a problem, I don't think.

Eliz.

digitalv

9:18 pm on May 27, 2005 (gmt 0)

"digitalv,
I have heard of these spider traps before and have considered implementing one but have a question -- I have read in the Google forums here that some have reported that Google has been actually following links despite being disallowed in robots.txt but then not indexing them. I suppose the same concern could run to Google's mediabot as this might do the same.
In these cases are you whitelisting known Google IP ranges or has this just not been a problem thus far?"

I think the issue was that Google stored the robots.txt file at the time of spidering and didn't retrieve a fresh copy before its next visit. So if Google spiders you today (without this "trap" in place), they have your current robots.txt on file. If you change your robots.txt, the next time they spider you they're going off of the one they have stored on their end, not your most current one. So Google would fall into your trap and get banned.

The solution was to set up robots.txt with the Disallow rule for the trap page in there, but NOT to make the page do the automatic banning right away. After you've been spidered by all of the major search engines with your new robots.txt in place, THEN you add the auto-ban code to the disallowed page.
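
Before switching on the auto-ban, it may be worth confirming that the robots.txt you are serving really does keep compliant spiders away from the trap. A minimal sketch with Python's standard `urllib.robotparser` follows. The trap path `/trap.html`, the sample file contents, and the helper name are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents; /trap.html is a hypothetical trap page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap.html
"""


def compliant_bot_may_fetch(robots_txt, path, user_agent="*"):
    """Return True if a robots.txt-obeying bot would be allowed to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

If `compliant_bot_may_fetch(ROBOTS_TXT, "/trap.html")` comes back False, the rule is in place. This does not tell you whether the search engines have picked up the new file yet, which is why the waiting period described above still matters.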

kgun

3:18 pm on Jun 29, 2005 (gmt 0)

Look at the cPanel and compile your own list.