How to protect my content from scrapers

Forum Moderators: phranque

ecommerceprofit

7:36 am on Mar 11, 2008 (gmt 0)

Can anybody enlighten me on how to protect my content from scrapers/crawlers? More importantly, is there a list of "good crawlers"? I am hoping to let the good crawlers in and keep the bad content stealers out.

vol7ron

2:59 am on Mar 13, 2008 (gmt 0)

In your robots.txt

User-agent: *
Disallow: /

This should stop any search engine that obeys robots.txt.
Your .htaccess file should also be modified to block the troublemakers.

The real way would be to use cookies and/or require a login with anti-bot sign-up methods.

lammert

8:11 am on Mar 13, 2008 (gmt 0)

User-agent: *
Disallow: /

This is the worst advice I have ever seen! All "good" bots will obey the robots.txt and stop crawling which causes your site to fall out of Google, Yahoo and Live, but all the bad bots and scrapers will still come and rip your site.

maximillianos

4:49 pm on Mar 14, 2008 (gmt 0)

You are asking a tough question. There are "lists" maintained around the web that share the good bots' IP addresses, but this does not mean a bad bot can't mimic one of those IPs and still get past your filter. It is a tricky business.

One tactic you might consider is to have programs monitor your log files for any bots that seem to behave rudely, like pulling your pages too fast, or not crawling properly. You can then shut them down temporarily and send a warning to yourself to check it out. Things like that might help.
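A minimal sketch of that kind of log monitor, in Python. It assumes an Apache common/combined-format access log with entries in chronological order; the function name and the thresholds (30 hits in 10 seconds) are made up for illustration and would need tuning for a real site.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Start of an Apache common/combined log line: IP, identd, user, [timestamp], "request
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "')

def find_fast_clients(lines, max_hits=30, window_seconds=10):
    """Return the set of IPs that made more than max_hits requests
    inside any span of window_seconds. Both limits are assumptions."""
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, stamp = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits = recent[ip]
        hits.append(when)
        # Drop timestamps that have fallen out of the window.
        while when - hits[0] > window:
            hits.popleft()
        if len(hits) > max_hits:
            flagged.add(ip)
    return flagged
```

You would feed it the lines of your access log and then review the flagged IPs (or mail them to yourself) before blocking anything, since a busy proxy can look like a rude bot.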

I also make sure to plug my content with a copyright notice and a link back to my site that is only visible to robots. I don't mind that search engines see it, not a big deal, but if someone else scrapes my site, they have to do some tricky programming to get rid of all my randomized (and randomly placed) copyright notices... ;-)
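As a rough illustration of the randomized notices, here is a Python sketch that drops a few notice variants at random positions among a page's paragraphs. The notice wording, the example.com URL, and the function name are all placeholders, and it leaves out the part about making the notices visible only to robots.

```python
import random

# Hypothetical notice variants; the wording and URL are placeholders.
NOTICES = [
    "Copyright example.com. Original article at http://www.example.com/",
    "This text first appeared on example.com (http://www.example.com/).",
    "(c) example.com, reproduced without permission if seen elsewhere.",
]

def add_notices(paragraphs, count=2, rng=random):
    """Return a copy of paragraphs with count notices inserted at random positions."""
    result = list(paragraphs)
    for _ in range(count):
        position = rng.randint(0, len(result))  # randint includes both ends
        result.insert(position, rng.choice(NOTICES))
    return result
```

Because both the wording and the position change on every page view, a scraper cannot strip the notices with one fixed search-and-replace.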

Hope this helps.

Good luck!

HRoth

11:18 pm on Mar 14, 2008 (gmt 0)

maximillianos, are you putting the copyright notices in the metatag description? Or is it actually in the middle of the content?

ecommerceprofit

7:08 am on Mar 20, 2008 (gmt 0)

Excellent advice maximillianos! Thanks!

Lorel

8:21 pm on Mar 20, 2008 (gmt 0)

You can also put a lot of links to other pages within your content using FULL URLs, but reference your images with relative URLs. Then if someone clicks on the links they will still end up on your site, but your images won't show up on the scraper site, causing it to look unfinished. Lazy scrapers won't bother copying your pages because too much work is involved in fixing your code.

Also add base href tags in case they scrape the whole page. This will help the search engines determine where the original content came from.

[edited by: Lorel at 8:22 pm (utc) on Mar. 20, 2008]

karlaredor

2:31 am on Mar 28, 2008 (gmt 0)

For WordPress there is the AntiLeech plugin.

Ocean10000

11:33 pm on Mar 30, 2008 (gmt 0)

The following is a post I did in Search Engine Spider Identification, which I think answers your question.
Quick primer on identifying bot activity. [webmasterworld.com]

vol7ron

5:32 am on Mar 31, 2008 (gmt 0)

By Lammert:

User-agent: *
Disallow: /
This is the worst advice I have ever seen! All "good" bots will obey the robots.txt and stop crawling which causes your site to fall out of Google, Yahoo and Live, but all the bad bots and scrapers will still come and rip your site.

This will stop both good and bad bots, depending on whether they obey the robots.txt file.

The best way is to search for the robots.txt file hack, whereby you throw off the bad robots by listing a file in your robots.txt that isn't in use. Block any crawler that tries to access that file. A similar trick is to set up a hidden email address and block any IP that tries to email that account.
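A rough sketch of the robots.txt trap in Python, assuming an Apache common/combined-format access log. The /bot-trap/ path and the function names are hypothetical, and the output uses the Apache 2.2-style "Deny from" syntax for .htaccess.

```python
import re

# Hypothetical trap path. It would be listed in robots.txt as
#   User-agent: *
#   Disallow: /bot-trap/
# and never linked anywhere a human visitor would click.
TRAP_PATH = "/bot-trap/"

# IP and requested URL from an Apache common/combined log line.
REQUEST = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (\S+)')

def trap_visitors(lines, trap_path=TRAP_PATH):
    """Return the set of IPs that requested anything under trap_path."""
    caught = set()
    for line in lines:
        match = REQUEST.match(line)
        if match and match.group(2).startswith(trap_path):
            caught.add(match.group(1))
    return caught

def deny_rules(ips):
    """Format the caught IPs as Apache 2.2-style 'Deny from' lines."""
    return "\n".join("Deny from " + ip for ip in sorted(ips))
```

Well-behaved crawlers read the Disallow line and stay away, so anything that shows up under the trap path has either ignored robots.txt or mined it for paths to visit.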

Next time read my post lammert, it was stated correctly, which was answering the more general question that was being asked, "How to limit crawlers/spiders, in general, from viewing the site." Not only the bad.

The second question was inquiring about the list of good sites. But the htaccess is the best way to block the access to bad sites only, answering the second question. How you find these sites and set that up is in a post that was deleted.

[edited by: phranque at 11:01 am (utc) on Mar. 31, 2008]
[edit reason] please see TOS #24 [webmasterworld.com] [/edit]

Eric in Tennessee

8:05 pm on Apr 2, 2008 (gmt 0)

This will stop both good and bad bots, depending on whether they obey the robots.txt file.
The best way is to search for the robots.txt file hack, whereby you throw off the bad robots by listing a file in your robots.txt that isn't in use. Block any crawler that tries to access that file. A similar trick is to set up a hidden email address and block any IP that tries to email that account.
Next time read my post lammert, it was stated correctly, which was answering the more general question that was being asked, "How to limit crawlers/spiders, in general, from viewing the site." Not only the bad.
The second question was inquiring about the list of good sites. But the htaccess is the best way to block the access to bad sites only, answering the second question. How you find these sites and set that up is in a post that was deleted.

I was thinking the same thing about the bad advice, but I have been out of this so long, I thought maybe I was missing something.

[edited by: Eric_in_Tennessee at 8:05 pm (utc) on April 2, 2008]

mikhaill

6:55 am on Apr 4, 2008 (gmt 0)

I guess one day I'll get around to writing this script, but this is what I would do based on my observations. Most of these scrapers only request the text page and never any images or CSS. I'd parse my log files for clients that only ask for the pages and nothing else, and then add them to a block list by IP and by user agent string (assuming they are not cloaking themselves as a regular browser).
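The script described above might look something like this in Python. It is only a sketch: it assumes an Apache common/combined-format access log, the list of asset extensions and the min_pages cutoff are guesses, and the function name is made up.

```python
import re
from collections import defaultdict

# IP and requested URL from an Apache common/combined log line.
REQUEST = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')

# File extensions treated as page assets; this list is an assumption.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")

def pages_only_clients(lines, min_pages=20):
    """Return IPs that fetched at least min_pages pages and no assets at all."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        match = REQUEST.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, url = match.groups()
        path = url.split("?", 1)[0].lower()  # ignore the query string
        if path.endswith(ASSET_EXTENSIONS):
            assets[ip] += 1
        else:
            pages[ip] += 1
    return {ip for ip, count in pages.items()
            if count >= min_pages and assets.get(ip, 0) == 0}
```

Legitimate search engine crawlers can show the same pages-only pattern, so the result is better treated as a list of suspects to check than as a ready-made block list.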

ecommerceprofit

11:07 pm on May 3, 2008 (gmt 0)

Even more great advice, thanks! I just found this old discussion, which has a ton of great info, in case any of you are interested:

[webmasterworld.com...]

youfoundjake

3:11 am on May 5, 2008 (gmt 0)

ecommerceprofit, if you are using Apache, you can take a look at how I set up a bad bot trap using PHP.
[webmasterworld.com...]

ecommerceprofit

4:01 am on May 7, 2008 (gmt 0)

Thank you!