dstiles - 9:39 pm on Feb 28, 2011 (gmt 0)
> Does anyone have any evidence of scraping via ISP or other public caches?
Two points, probably minor (they are for me):
1. TalkTalk (UK ISP) are sending a Chinese bot to re-read a file after one of their customers has loaded it. They claim it is for anti-virus purposes, but the bot arrives 30 minutes after the original page read, which rather kills that excuse. Also illegal under UK/EU law.
2. UK schools often use a common proxy service, and some of these are a bit scrapy. A few UK ISPs also use caches, but I haven't seen so much misconfiguration in recent years.
Apart from that there is some good news:
If you block the first access attempt, usually to the default home page, then unless the "scraper" already knows about your complete site, that is the only page it will attempt. This alone will diminish the bandwidth overhead. Feed the scraper a bare-minimum 403 and that reduces it further. You need some kind of blocking mechanism for this; if you have htaccess then that should do it.
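For anyone without htaccess, the same "minimal 403" idea can be sketched in Python as a small WSGI wrapper. This is only an illustration, not a drop-in solution: the names minimal_403, site and BLOCKED_ADDRESSES are made up for the example, and the addresses are placeholders from the reserved documentation ranges.

```python
# Hypothetical blocklist; these addresses are documentation-range placeholders.
BLOCKED_ADDRESSES = {"203.0.113.7", "198.51.100.23"}

def minimal_403(app):
    """Wrap a WSGI app so that blocked clients get an empty 403 response."""
    def wrapper(environ, start_response):
        if environ.get("REMOTE_ADDR") in BLOCKED_ADDRESSES:
            # Empty body and a single header: the refusal costs only a few bytes.
            start_response("403 Forbidden", [("Content-Length", "0")])
            return [b""]
        # Everyone else is passed through to the real site untouched.
        return app(environ, start_response)
    return wrapper

def site(environ, start_response):
    """Stand-in for the real site: serves one small home page."""
    body = b"<html>home page</html>"
    headers = [("Content-Type", "text/html"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]

application = minimal_403(site)
```

The point of the design is that a blocked client never reaches the real page code, so the only cost per attempt is the status line and one header.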
Block any country you don't want by IP range. There are public databases available for download; look into rbldnsd (wrbldnsd on Windows), which is meant for mail servers but could probably be adapted for web servers.
Block all server farms as discovered.
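Blocking by range, whether for a country or a server farm, comes down to testing an address against a list of CIDR blocks. A rough Python sketch using the standard ipaddress module follows; BLOCKED_RANGES and is_blocked are made-up names, and the ranges are documentation placeholders rather than real country or hosting allocations, which you would load from whichever database you trust.

```python
import ipaddress

# Hypothetical ranges (reserved documentation networks), standing in for
# country allocations or server-farm blocks taken from a downloaded database.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("198.51.100.0/24", "203.0.113.0/24")
]

def is_blocked(address):
    """Return True if the address falls inside any blocked CIDR range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in BLOCKED_RANGES)
```

With a long list a linear scan like this gets slow, so a real setup would more likely push the lookup into the firewall or a DNS-based list, but the membership test is the same.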
Again, if Windows: add serious IP offenders to IIS Directory Security. From reading this thread, though, you probably can't, as it's not your server. :(