Making Keep-Alive -- a killer. Doable?

ISO alternative to throttling modules (Apache/1.3.x)

Pfui

8:03 pm on Mar 11, 2006 (gmt 0)

Because I'm unable to install throttling modules [webmasterworld.com], I'm wondering whether asking my SysAdmin to tweak the Keep-Alive directive(s) would help slow down too-rapid accesses by unauthorized bots, scrapers, and unintentional Save-As whackers.

Currently the setting* / returned header is:

Keep-Alive: timeout=20, max=200

(From "Apache Keep-Alive Support [httpd.apache.org]" -- "Set max-requests to the maximum number of requests you want Apache to entertain per connection. A limit is imposed to prevent a client from hogging your server resources...")

The server primarily exists for a growing, text-intensive site with a 250,000-post archive (the heavily hit non-CGI pages total ~150-200), so server-wide config changes are A-OK.

The "Apache Core Features [httpd.apache.org]" docs for 1.3 describe KeepAlive, KeepAliveTimeout, and MaxKeepAliveRequests directives so I'm not sure if, say...

Keep-Alive: timeout=10, max=20

...would strike a good balance between runaway hits by some and server performance for all. Thoughts?

Also, what happens when max=X is set low? Would the effect be akin to a per-request "sleep 5" or some such in browsers that support Keep-Alive?

TIA for your assistance!

*REF (from current "httpd.conf"):

# 
# KeepAlive: Whether or not to allow persistent connections (more than
# one request per connection). Set to "Off" to deactivate.
#
KeepAlive On

#
# MaxKeepAliveRequests: The maximum number of requests to allow
# during a persistent connection. Set to 0 to allow an unlimited amount.
# We recommend you leave this number high, for maximum performance.
#
MaxKeepAliveRequests 200

#
# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
#
KeepAliveTimeout 20
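
For comparison, the tightened example above would presumably look something like this in "httpd.conf" (illustrative values only, not a tested recommendation):

#
# Hypothetical tightened settings: drop an idle persistent connection
# after 10 seconds and force a fresh TCP connection after every 20
# requests.
#
KeepAlive On
MaxKeepAliveRequests 20
KeepAliveTimeout 10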

incrediBILL

6:42 pm on Mar 12, 2006 (gmt 0)

Try using AlexK's throttling code in PHP
[webmasterworld.com...]

You might have to read through a bit to find the installation instructions, but it adds a PHP call to the beginning and end of all your pages, and the link to the current script on his server is in there somewhere.

Basically, when they hit too fast he just gives them a "server busy" error, and the faster they hit, the longer his timeout gets, so the bad bots put themselves out of business.
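
The gist, in very rough form, is something like this -- my own off-the-cuff sketch of the idea, not Alex's actual code, and the file path and limits are made up:

<?php
// Rough sketch of an escalating throttle -- NOT AlexK's actual script.
// Counts hits per IP in a flat file and answers 503 once a visitor
// exceeds the limit; the advertised Retry-After grows the longer
// they keep hammering.

$dir    = '/tmp/throttle';   // hypothetical state directory
$limit  = 30;                // max hits allowed per window
$window = 60;                // window length, in seconds

$ip   = preg_replace('/[^0-9a-fA-F:.]/', '', $_SERVER['REMOTE_ADDR']);
$file = "$dir/$ip";
if (!is_dir($dir)) { mkdir($dir, 0700); }

$data = @file_get_contents($file);
list($count, $start, $penalty) = $data ? explode(' ', $data) : array(0, time(), 0);

if (time() - $start > $window) {   // new window: reset the counter
    $count = 0;
    $start = time();
}
$count++;

if ($count > $limit) {             // hitting too fast: punish and bail out
    $penalty = $penalty ? $penalty * 2 : $window;
    @file_put_contents($file, "$count $start $penalty");
    header('HTTP/1.1 503 Service Unavailable');
    header('Retry-After: ' . $penalty);
    exit('Server busy -- please slow down.');
}
@file_put_contents($file, "$count $start $penalty");
?>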

Pfui

8:00 pm on Mar 12, 2006 (gmt 0)

Thanks for the link, Bill. (Aside: Like you, I block all non-Moz hits and then weed out the varietals.) If only Alex's "bot-block.php" magic came in .cgi or .pl form; PHP isn't installed here and I'm reluctant to request it for just one script...

Then again, seeing as how some dolt hit the same page 1091 times last night, I just might.

But for right now, here's hoping SysAdmin-easy Keep-Alive tweaking proves to be an option. Ever tried that?

incrediBILL

8:30 pm on Mar 12, 2006 (gmt 0)

Nope, never tried it, but there are some spider traps / bot blockers written in Perl.

Search Google; I ran across a few yesterday while looking for other information on blocking spambots or some such. I think I searched "htaccess bot blocker" and hit all sorts of stuff, so they exist.
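
The .htaccess flavor is usually just a handful of mod_rewrite lines, roughly like this (the user-agent strings here are only placeholder examples):

# Hypothetical .htaccess bot blocker -- UA patterns are placeholders.
RewriteEngine On
# Bounce empty user-agents and a few well-known rippers/downloaders.
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} (wget|HTTrack|WebZIP|WebCopier) [NC]
RewriteRule .* - [F,L]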

Pfui

12:19 am on Mar 13, 2006 (gmt 0)

Thank you again! Actually, I have bot traps in place (blank gifs and blank spaces, the latter courtesy of jdM's expertise), but I rarely catch anything worth getting excited about, probably because I'm already blocking hundreds and hundreds of bots via an obsessive combination of access_log eyeballing and grepping, and mod_rewrite tweaking :)

Hands down, my biggest problem is throttling individual visitors from non-suspect ISPs who use normal browsers and do what I'm guessing is some variation of Save-As, offline archiving, or who-knows-what, ripping everything in a matter of minutes. At least they look like individuals...

I've even rigged site maps in frames, and those do make it harder for savvier rippers, but most of my pages are carefully intra-linked for visitor convenience, so pretty much any page is as good a starting point for a rip as any other.

Heck, when I used a microwave link to test a number of Firefox extensions and see how easy it might be to intentionally and accidentally rip my own stuff, the hit rate not only upped my load averages but just about set my hair on fire.

=====8-0)

So anyway, sorry to digress. This throttling thing is just driving me more than a little nuts and you made the mistake of replying -- twice:) So again, thanks!

incrediBILL

2:08 am on Mar 13, 2006 (gmt 0)

I feel your pain, which is why I wrote my own bot blocker; the idiots knocked me offline for 90 minutes once, which was the worst ever, and I knew I had to take action.

AlexK's script does throttle them but doesn't make decisions about blocking them permanently.

I'm tracking them in real time and blocking bad offenders permanently. Since blacklisting bots is a no-win scenario, I only let in whitelisted bots (Google, Yahoo, MSN, etc.); everything else is bounced by default except Mozilla/Opera. Even then, I profile those, since many scrapers masquerade as browsers, and I watch for the stupid things they do: excessive speed, SGML/HTML errors in URIs, bad neighborhoods like a couple of server farms harboring scrapers, etc.
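
Grossly simplified, the whitelist-by-default part boils down to something like this -- a sketch only, not my production code, and the bot names and patterns are just placeholders:

<?php
// Whitelist-by-default sketch -- not the real blocker.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

$whitelistedBots = '/(Googlebot|Slurp|msnbot)/i';  // verify via rDNS in practice
$browserish      = '/^(Mozilla|Opera)/';           // let through, but profile/throttle

if (!preg_match($whitelistedBots, $ua) && !preg_match($browserish, $ua)) {
    header('HTTP/1.1 403 Forbidden');   // everything else bounced by default
    exit('Access denied.');
}
// ...speed profiling, URI sanity checks, etc. would follow here...
?>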

It's getting quite complex, and I think I'm stopping 95% or more at this point, as I rarely see anything sneak past, especially with the max-page throttle.

If it goes public, it'll probably start in PHP, though.