Stopping scrapers from the get-go


wheel - 1:23 am on Feb 16, 2011 (gmt 0)


I'm putting a *huge* number of pages of content online. I'm looking to stop the scraping/copying/bots from the outset and I need bandwidth kept to a minimum. I've never done this before, so I'm not quite sure where to start.

Most of the content is on static HTML pages. My preliminary reading suggests that may be problematic, since the pages aren't generated programmatically.

Can anyone suggest details as to what I should be doing? Here are the areas I have in mind:
1) In .htaccess, block a list of IPs from Spamhaus.
2) In .htaccess, block a large list of IPs from other countries?
3) In .htaccess, block a lot of user agents (get the code from WebmasterWorld)?
4) Whitelist Google, Yahoo and MSN in robots.txt.
5) Block Google and the other bots from crawling my images. I think this will block all robots from crawling GIFs at any level of my site?
User-agent: Googlebot
Disallow: /*.gif$

6) Then I think I'd like to block IPs from hosting companies. Is there an easy-to-use list of those IPs?
7) After that, I think I should do some IP blocking dynamically, such as triggering a block if someone is crawling too many pages too fast. But since I'm serving static HTML, how do I do that? Set up a cron job that runs a script every minute to read the log and take action? This seems complex and burdensome.
8) Since the content is static, Google and the rest don't need to download the HTML 8 times a month. Once a year is fine. What's the best way to tell the bots that a page hasn't changed, so there's no need to re-crawl it? ETags? I think that requires changing the response headers, and that seems tough to do with static HTML pages.
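
For reference, the cron idea in 7) could be sketched roughly as below. This is only an illustration, not a tested setup: it assumes the server writes Apache's common/combined log format with entries in time order, the thresholds (30 hits in 60 seconds) are made-up numbers, and the function names find_fast_crawlers and deny_lines are hypothetical. The output uses Apache 2.2-style "Deny from" lines.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Start of an Apache common/combined log line:
# IP, identity, user, [timestamp], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|HEAD|POST) (\S+)')


def find_fast_crawlers(lines, max_hits=30, window_seconds=60):
    """Return the set of IPs with more than max_hits requests inside
    any sliding window of window_seconds. Thresholds are assumptions."""
    window = timedelta(seconds=window_seconds)
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    offenders = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, stamp, _path = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hits = recent[ip]
        hits.append(when)
        # Drop timestamps that have fallen out of the window.
        while when - hits[0] >= window:
            hits.popleft()
        if len(hits) > max_hits:
            offenders.add(ip)
    return offenders


def deny_lines(ips):
    """Format offending IPs as 'Deny from' lines for .htaccess."""
    return [f"Deny from {ip}" for ip in sorted(ips)]
```

A cron script would feed the access log's lines to find_fast_crawlers and append the deny_lines output to .htaccess. Known good crawlers would need to be excluded first, otherwise a busy search engine bot could trip the same threshold.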

Anything else I missed?
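
On 8), as far as I know Apache already sends Last-Modified and ETag headers for static files and answers conditional requests with 304 Not Modified by itself, so the HTML files would not need touching. The sketch below only illustrates the check a server makes when a bot sends If-None-Match or If-Modified-Since. The function names make_validators and response_status are hypothetical, and the ETag format is an assumption, not Apache's exact one. Note that a 304 saves bandwidth but does not by itself control how often a bot comes back.

```python
from email.utils import formatdate, parsedate_to_datetime


def make_validators(size, mtime):
    """Build an ETag and a Last-Modified value from a file's size and
    modification time (ETag layout here is an illustrative assumption)."""
    etag = f'"{size:x}-{int(mtime):x}"'
    last_modified = formatdate(int(mtime), usegmt=True)
    return etag, last_modified


def response_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still current, else 200."""
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = []
        for tag in inm.split(","):
            tag = tag.strip()
            if tag.startswith("W/"):
                tag = tag[2:]  # weak comparison ignores the W/ prefix
            candidates.append(tag)
        return 304 if "*" in candidates or etag in candidates else 200
    ims = request_headers.get("If-Modified-Since")
    if ims is not None:
        try:
            if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(ims):
                return 304
        except (TypeError, ValueError):
            pass  # unparseable or incomparable date: ignore the header
    return 200
```

For example, a bot that cached the page with its ETag and later sends that value back in If-None-Match gets a 304 with no body, so only the headers cross the wire.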


Thread source: http://www.webmasterworld.com/search_engine_spiders/4267704.htm