Efforts To Control Scraper Access

More and more each day I'm understanding and admiring the power of Apache and so I'm digging deeper into the Docs to try to understand the full extent of control that is available. I came across this following simple directive (LimitRequestLine) but as is the case with Apache Docs they aren't very good at giving in depth understanding or good examples. I think the Docs are geared more towards a knowledgeable user and are provided as a reference rather than to try to help the uninitiated.

So anyway here's my scenario:
I've been trying to build as good an environment as possible to limit or completely eliminate a scraper's ability to grab the whole site in a minute or less as they sometimes succeed in doing from some obscure IP. My efforts have been restricted to firewalling blocks of CIDR ranges but that method is like closing the barn door after the horses find freedom...and they're off! In the long run it will be good for future sites that I take onto the server but it's doing nothing for mine that is now there :(

I found this...

#LimitRequestLine Directive
LimitRequestLine 4094

My question is would including that simple line in my httpd.conf file be an effective deterrent by crippling a scrappers ability to "smash and go"? I'm guessing I would have to evaluate the domains, determine the maximum file size including all elements of the largest page and maybe then adjust that default value of 4094 (plus fudge factor x2?) accordingly to prevent any one connection or client from downloading the whole site in one scoop (on a good day, for them, they usually grab about 2 or 3 pages per second according to log files). Another thought that occurs as I write this is that those scrapers typically do not grab images so maybe a high enough value for legitimate visitors will be enough to still allow nasties to get the html anyway? I use the mod_deflate so all outgoing html is compressed, would that be a negating factor to render this effort useless?

Overall would this be an ineffective method of control or can anyone foresee that it may cause unwanted effects? Or yet still, am I totally off mark in understanding the purpose of this directive.

If it is a good deterrent what is the likelihood that it will discourage a scraper, or is the software or scripts used by scrapers programmed to have "patience" to wait out such restrictions imposed on them?

I'm approaching this post as the only stupid question is the one not asked! Many thanks to all opinions to follow here...

Efforts To Control Scraper Access

Optimizing Requests Through http.conf

SevenCubed

SevenCubed

jdMorgan

SevenCubed

Join The Conversation

Moderators and Top Contributors

Hot Threads This Week