Forum Moderators: phranque
Anyway, here's what I tried.
First of all I decided to do something about those bots that request dozens of pages at the same time. I installed mod_limitipconn (it limits the simultaneous HTTP connections from a single IP), and after dealing with some flaws I finally got it to deny only real bots and leave innocent surfers alone. The module, however, just blocks access by returning a 503 error; it doesn't have the ability to add IPs to a permanent ban list, so it merely refuses the excess connections and, for the rest, the spambots get a free ride.
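For reference, this is roughly the kind of httpd.conf setup involved (directive names as in the mod_limitipconn docs; the module path and the connection limit here are illustrative, not necessarily what I actually run):

  # mod_limitipconn relies on the extended status info from mod_status
  ExtendedStatus On
  LoadModule limitipconn_module modules/mod_limitipconn.so

  <IfModule mod_limitipconn.c>
      <Location />
          # refuse (503) any client holding more than 3 simultaneous connections
          MaxConnPerIP 3
          # don't count images against the limit, so normal browsers stay unaffected
          NoIPLimit image/*
      </Location>
  </IfModule>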
Speaking of ban lists: the next thing I did was set up that infamous spider trap where you place a hidden link on your site and disallow it via robots.txt, in order to catch all bots that don't bother to read that file. The collected IPs would then be added to your .htaccess files (obviously via a world-writable ban.txt file, as making .htaccess itself world-writable isn't much fun). The problem is that in a single-user environment like my server, I generally have .htaccess files disabled for performance reasons, and enabling them just to disarm a few bots (thus slowing down performance) would in itself taste like defeat.
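In case anyone wants to try it, here's a bare-bones sketch of such a trap (the path and file names are made up; adjust to taste). The hidden link on your pages points at the trap URL, robots.txt disallows it, and the script just records the offender's IP:

  #!/usr/bin/perl
  # trap.pl -- hypothetical spider-trap CGI (sketch only).
  # Pages carry an invisible link to this URL, and robots.txt says:
  #   User-agent: *
  #   Disallow: /cgi-bin/trap.pl
  # so anything that still requests it has ignored robots.txt.
  use strict;
  use warnings;
  use Fcntl ':flock';

  my $banfile = '/var/www/ban.txt';             # assumed location
  my $ip      = $ENV{REMOTE_ADDR} || 'unknown';

  # append the IP to the ban list; flock because the file is world-writable
  if (open my $fh, '>>', $banfile) {
      flock($fh, LOCK_EX);
      print {$fh} "$ip\n";
      close $fh;
  }

  print "Content-type: text/plain\n\n";
  print "Nothing to see here.\n";

Whatever then acts on ban.txt (a cron job, an access handler, or rewritten .htaccess files) is the part I haven't settled on, for the reasons below.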
So, to avoid shooting myself in the foot, I took a look at Perl modules. Carfac (from this site) was very helpful in trying to get Apache::BlockAgent and his modified Apache::BlockIP going for me, but we had no luck - something on my machine (either rh9 or Apache2) just didn't like it. Are there alternative modules? Not that I know of - but if there are, I expect them to be old and to cause similar problems.
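For the record, this is the general shape of what we were trying to get running - not carfac's actual code, just a minimal access-handler sketch written against the later mod_perl 2 API (Apache2:: namespace); whether modules like these behave on an rh9/Apache2 box is exactly the part that failed for me:

  package My::BlockIP;
  # Hypothetical PerlAccessHandler (sketch): refuse any request whose
  # client IP appears in the plain-text ban list filled by the trap above.
  use strict;
  use warnings;
  use Apache2::RequestRec ();
  use Apache2::Connection ();
  use Apache2::Const -compile => qw(FORBIDDEN OK);

  my $banfile = '/var/www/ban.txt';   # assumed location

  sub handler {
      my $r  = shift;
      my $ip = $r->connection->remote_ip;

      # re-reading the file on every request is crude; a real module
      # would cache it and watch the file's mtime
      open my $fh, '<', $banfile or return Apache2::Const::OK;
      while (my $banned = <$fh>) {
          chomp $banned;
          next unless length $banned;
          return Apache2::Const::FORBIDDEN if $ip eq $banned;
      }
      return Apache2::Const::OK;
  }
  1;

Hooked in with "PerlModule My::BlockIP" and "PerlAccessHandler My::BlockIP" in httpd.conf, something like this would do the job of the .htaccess deny lines without .htaccess being enabled at all.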
And then there's the problem of banning innocent dialup users just because some moron ran an email harvester from the same ISP. I've seen a solution that blocks an IP for just a minute and doubles that interval every time the same IP is caught again. That system uses MySQL (throw some resources into the war to regain others), and I'm less than sure whether such a relatively short temporary ban really punishes bad bots, unless they're dim enough to request the same useless trap page again and again. Is it better to sacrifice your dialup users? Of course not. Are there approaches to distinguish between static and dynamic IPs for this purpose? Not that I know of.
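As far as I understand that doubling scheme, it boils down to something like this (my own rough sketch with an assumed table layout, not the actual system I saw):

  #!/usr/bin/perl
  # Sketch of an escalating ban: first offence = 1 minute, then 2, 4, 8...
  # so a recycled dialup IP shakes the ban off quickly while a persistent
  # bot locks itself out for longer and longer. Table assumed:
  #   CREATE TABLE bans (ip VARCHAR(15) PRIMARY KEY,
  #                      minutes INT NOT NULL,
  #                      banned_until DATETIME NOT NULL);
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('dbi:mysql:database=bots', 'user', 'pass',
                         { RaiseError => 1 });

  # called from the trap page: double the previous penalty
  sub punish {
      my ($ip) = @_;
      my ($minutes) = $dbh->selectrow_array(
          'SELECT minutes FROM bans WHERE ip = ?', undef, $ip);
      $minutes = $minutes ? $minutes * 2 : 1;
      $dbh->do('REPLACE INTO bans (ip, minutes, banned_until)
                VALUES (?, ?, DATE_ADD(NOW(), INTERVAL ? MINUTE))',
               undef, $ip, $minutes, $minutes);
  }

  # called on each request: is this IP still serving time?
  sub is_banned {
      my ($ip) = @_;
      my ($hit) = $dbh->selectrow_array(
          'SELECT 1 FROM bans WHERE ip = ? AND banned_until > NOW()',
          undef, $ip);
      return $hit ? 1 : 0;
  }

Whether hitting MySQL on every request is really cheaper than just serving the pages to the bots is, of course, the whole question.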
Possible conclusion: it may just as well be that fighting bots is a nice sport for the webmasterly ego (so you can hang their heads over your fireplace), but it isn't really worthwhile from a practical point of view. If I add up the resources I have invested in the problem so far, plus the resources I might invest in the future, it might make more sense to just go and order a faster server that can handle both legitimate users and bad bots without getting a hiccup. It seems like other webmasters realized this long ago, which may be why there are just a couple of half-baked solutions floating around out there for what I perceive to be a serious problem.
Older WebmasterWorld threads for reference:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
...and dozens more.
Just a note: this problem, like your problem with carfac's system, is likely due to a configuration problem. It is not necessary to make .htaccess world-writable in order for this to work. With the server set up correctly, owner read/write, world read (ow:rw, wo:r, i.e. 604) is sufficient.
Jim
It may just as well be that fighting bots is a nice sport for the webmasterly ego

Most quality hosts in the US are generous with bandwidth, and I've been able to ignore this problem for the most part: no "overage" charges. But for someone who is paying for bandwidth it's a serious issue.
With everyone from Commission Junction to script kiddies launching spiders, the problem will only get worse. When it becomes as big an issue as email spam, governments may eventually intervene. Until then, someone who develops an effective solution could make a lot of money! :)
Just a note: this problem, like your problem with carfac's system, is likely due to a configuration problem. It is not necessary to make .htaccess world-writable in order for this to work. With the server set up correctly, owner read/write, world read (ow:rw, wo:r, i.e. 604) is sufficient.
Just for reference: the problems with carfac's system were that certain commands in the script were not available or caused errors.