Forum Moderators: phranque


Efforts To Control Scraper Access

Optimizing Requests Through httpd.conf


SevenCubed

4:20 pm on Jul 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



More and more each day I'm understanding and admiring the power of Apache, so I'm digging deeper into the docs to try to understand the full extent of control that's available. I came across the following simple directive (LimitRequestLine), but as is often the case with the Apache docs, they aren't very good at giving in-depth explanations or good examples. I think the docs are geared more towards the knowledgeable user and are meant as a reference rather than as help for the uninitiated.

So anyway, here's my scenario:
I've been trying to build as good an environment as possible to limit, or completely eliminate, a scraper's ability to grab the whole site in a minute or less, as they sometimes succeed in doing from some obscure IP. My efforts so far have been restricted to firewalling blocks of CIDR ranges, but that method is like closing the barn door after the horses find freedom...and they're off! In the long run it will be good for future sites that I take onto the server, but it's doing nothing for the one that's already there :(

I found this...

#LimitRequestLine Directive
LimitRequestLine 4094

My question is: would including that simple line in my httpd.conf file be an effective deterrent by crippling a scraper's ability to "smash and go"? I'm guessing I would have to evaluate the domains, determine the maximum file size including all elements of the largest page, and then adjust that default value of 4094 (plus fudge factor x2?) accordingly, to prevent any one connection or client from downloading the whole site in one scoop (on a good day, for them, they usually grab about 2 or 3 pages per second according to the log files). Another thought that occurs as I write this: those scrapers typically do not grab images, so maybe a value high enough for legitimate visitors would still allow the nasties to get the HTML anyway? Also, I use mod_deflate so all outgoing HTML is compressed; would that be a negating factor that renders this effort useless?

Overall, would this be an ineffective method of control, or can anyone foresee it causing unwanted effects? Or, yet still, am I totally off the mark in understanding the purpose of this directive?

If it is a good deterrent, what is the likelihood that it will discourage a scraper? Or are the software and scripts used by scrapers programmed with the "patience" to wait out such restrictions?

I'm approaching this post on the principle that the only stupid question is the one not asked! Many thanks for all opinions to follow here...

SevenCubed

4:24 pm on Jul 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oops, I just noticed that the actual default value is 8190, not 4094 as I posted...same difference, it's the question and answer that I'm more interested in. Thanks.

jdMorgan

5:25 pm on Jul 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The LimitRequest directives may not help at all, because "loading a page" involves multiple (sometimes hundreds of) separate HTTP requests -- the 'page' itself is loaded, the browser parses it, and then issues an additional, separate HTTP request for each and every included object on the page.

Download and install the Live HTTP Headers add-on for Firefox, clear your browser cache, and then load one of your pages to see how this works.

The LimitRequest directives apply to each of these HTTP transactions separately -- and they limit the size of the incoming *request* (LimitRequestLine, for example, caps the length of the request line: the method, URI, and protocol version), not the size of the response, so they can't meter how much a client downloads.

In fact, without using scripting to create a cookie and track a 'session' with it, the server has no idea that these HTTP requests are related. Each one is handled separately and forgotten as soon as it has been handled; the server has no 'memory' of past transactions, and no expectations or requirements for future transactions based on the current one. It has no native concept of anything beyond a single client HTTP request for a single object.
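For reference, here's a minimal httpd.conf sketch of what the LimitRequest family actually bounds (the values shown are the stock defaults or illustrative numbers, not recommendations). Every one of them constrains a single incoming request; none of them meters how much, or how often, a client downloads:

# Each directive caps one aspect of a *single* incoming request.
# None of them limits the size of the response, or the number of
# requests a client may make.
LimitRequestLine      8190      # max bytes in the request line (method + URI + protocol)
LimitRequestFields    100       # max number of request header fields
LimitRequestFieldSize 8190      # max bytes in any one request header field
LimitRequestBody      1048576   # max bytes in a request body (e.g. a POST); 0 = unlimited (the default)

Their real job is to protect the server from malformed or over-sized requests, which is why they won't slow down a scraper that sends perfectly ordinary small GET requests -- just lots of them.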

Scrapers can be controlled by user-agent, by IP address (or IP address range) using a firewall, or by server-side scripting such as 'bad-bot' traps that detect robots.txt violations. You can also exploit the fact that many scrapers are badly written -- they 'make mistakes,' such as incorrectly spoofing a user-agent, or failing to reproduce the HTTP request-header behaviour normally seen with the 'real' user-agent they claim to be. If you see a request from "Googlebot," does the IP address resolve to Google? Most don't, but they certainly should... :)
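To make that concrete, here's a bare-bones httpd.conf sketch of user-agent and IP-address blocking using mod_setenvif (the "EmailSiphon" string and the 192.0.2.* range are just examples -- substitute whatever your own logs show):

# Flag a known-bad user-agent string (case-insensitive match)
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
# Flag an example address range seen scraping in the logs
SetEnvIf Remote_Addr "^192\.0\.2\." bad_bot

<Directory "/var/www/htdocs">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Directory>

Flagged clients get a 403 Forbidden instead of the content.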

There are bad-bot scripts in both our PHP and Perl forum libraries here at WebmasterWorld (they do different things, and so should be evaluated on function rather than solely on scripting language). User-agent and IP-address blocking are much discussed in our Search Engine Spider and User Agent Identification forum.
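The scripts differ in detail, but the core of a robots.txt-violation trap can be sketched in a few lines. Assume (hypothetically) a trap URL of /bot-trap/ that no legitimate visitor would ever request; first disallow it in robots.txt:

User-agent: *
Disallow: /bot-trap/

Then, in httpd.conf, refuse and log anything that requests it anyway -- a real implementation, like the library scripts, would go further and add the offending IP to a block list automatically:

# Any client requesting the trap URL has ignored robots.txt;
# log it distinctly so the IP can be added to a deny list.
SetEnvIf Request_URI "^/bot-trap/" bot_trap
CustomLog /var/log/apache/bot-trap.log combined env=bot_trap

<Location "/bot-trap/">
    Order Allow,Deny
    Deny from all
</Location>

Well-behaved robots never hit the trap, because they read robots.txt first and honour the Disallow.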

Unfortunately, there is no simple or easy solution; you must consider the impact of each method, and of your implementation of it, in light of your own site's needs. So there is no 'one size fits all' solution, either. One webmaster might block access from all of the Asia-Pacific and Africa regions and parts of Europe as an effective expedient, while doing that on a different web site might put it out of business in less than a week...

Jim

SevenCubed

5:39 pm on Jul 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok thanks for the feedback. Back to the drawing board...

"...It has no native concept..." -- ummmmm yes it does because it's in its DNA - Apache :p