Forum Moderators: phranque

Banning Bots Revisited

Been there, banned that - no clean solution in the war on spiders?

yosmc

8:27 pm on Aug 27, 2004 (gmt 0)

10+ Year Member



I thought it might be about time to bring up an age-old topic again - how to get rid of those nasty spambots and spiders that don't do anything meaningful other than suck up site resources. I've been looking into this for the last couple of weeks, and to my astonishment I couldn't find a solution that seemed decent enough to do the job. The reason I was surprised is that this is a real and obvious threat to any webmaster, and while countless (and pretty good) solutions exist to fight email spam, the bot wars seem to remain the dedicated hobby of a few.

Anyway, here's what I tried.

First of all I decided to do something about those bots that request dozens of pages at the same time. I installed mod_limitipconn (it limits the simultaneous HTTP connections from a single IP), and after dealing with some flaws I finally got it to deny only real bots and leave innocent surfers alone. The module, however, just blocks excess connections by returning a 503 error; it doesn't have the ability to add IPs to a permanent ban list, so beyond the denied excess connections the spambots get a free ride.
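
In case anyone wants to try it, this is roughly the kind of httpd.conf block involved (the limit of 5 and the image exception are just examples; as far as I understand, the module needs mod_status loaded and ExtendedStatus On to do its counting):

  ExtendedStatus On

  <IfModule mod_limitipconn.c>
    <Location />
      # allow at most 5 simultaneous connections per client IP
      MaxConnPerIP 5
      # don't count image requests against the limit
      NoIPLimit image/*
    </Location>
  </IfModule>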

Speaking of ban lists: next thing I did was set up that infamous spider trap where you place a hidden link on your site and disallow it via robots.txt, in order to catch all bots that don't bother to read that file. The collected IPs would then be added to your .htaccess files (via a world-writable ban.txt file, obviously, as making .htaccess itself world-writable isn't much fun). The problem is that in a single-user environment like my server, I have .htaccess files generally disabled for performance reasons, and enabling them just to disarm a few bots (thus slowing down performance) would in itself taste like defeat.
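
To sketch what I mean - the directory name, the markup and the IP are all made up:

  # robots.txt - well-behaved bots never request this directory
  User-agent: *
  Disallow: /trap/

  # hidden link somewhere in the page templates
  <a href="/trap/"><img src="/images/blank.gif" alt="" width="1" height="1" border="0"></a>

  # .htaccess - one Deny line gets appended for every IP caught in the trap
  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.15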

So to avoid shooting myself in the foot I took a look at Perl modules. Carfac (from this site) was very helpful in trying to get Apache::BlockAgent and his modified Apache::BlockIP going for me, but we had no luck - something on my machine (either RH9 or Apache 2) just didn't like it. Are there alternative modules? Not that I know of - but if there are, I expect them to be just as old and to cause similar problems.

And then there's the problem of banning innocent dialup users, just because some moron ran an email harvester from the same ISP. I've seen a solution that blocks IPs for just a minute and doubles that interval every time the same IP is caught again. The system uses MySQL (throw some resources into the war to regain others), and I'm less than sure whether a relatively short temporary ban really punishes bad bots, unless they're dim enough to request the same useless trap page again and again. Is it better to sacrifice your dialup users? Of course not. Are there approaches to distinguish between static and dynamic IPs for this purpose? Not that I know of.
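
Something along these lines is how I imagine that approach works (untested; the table name, columns and credentials are made up, intervals are in seconds):

  <?php
  // trap.php - the hidden, robots.txt-disallowed page itself.
  // Each visit doubles the ban: 60s, 120s, 240s, ...
  $db = mysql_connect('localhost', 'user', 'pass');
  mysql_select_db('site', $db);

  $ip = mysql_real_escape_string($_SERVER['REMOTE_ADDR']);
  $r  = mysql_query("SELECT strikes FROM bot_bans WHERE ip = '$ip'");

  if ($r && mysql_num_rows($r) > 0) {
      $strikes = (int) mysql_result($r, 0, 0) + 1;
      $until   = time() + 60 * pow(2, $strikes - 1);
      mysql_query("UPDATE bot_bans SET strikes = $strikes,
                   banned_until = $until WHERE ip = '$ip'");
  } else {
      mysql_query("INSERT INTO bot_bans (ip, strikes, banned_until)
                   VALUES ('$ip', 1, " . (time() + 60) . ")");
  }
  header('HTTP/1.0 503 Service Unavailable');
  exit;
  ?>

Every regular page would then start with a quick lookup of the same table and return a 503 as long as banned_until is still in the future.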

Possible conclusion: it may well be that fighting bots is a nice sport for the webmasterly ego (so you can hang their heads over your fireplace), but it isn't really worthwhile from a practical point of view. If I add up the resources I have invested to tackle the problem so far, plus the resources I might invest in the future, it might make more sense to just go and order a faster server that can handle both legitimate users and bad bots without getting a hiccup. It seems like other webmasters realized this long ago, which may be why there are just a couple of half-baked solutions floating around out there for what I perceive to be a serious problem.

Older WebmasterWorld threads for reference:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
...and dozens more.

jdMorgan

8:55 pm on Aug 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> The collected IPs would then be added to your .htaccess files (obviously via a world-writable ban.txt file,
> as making .htaccess itself world writable isn't much fun).

Just a note: this problem, like your problem with carfac's system, is likely due to a configuration issue. It is not necessary to make .htaccess world-writable in order for this to work. With the server set up correctly, owner-read/write plus world-read (owner rw, world r, i.e. mode 604) is sufficient.
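
For example, if the script that appends the banned IPs runs under the same account that owns the file (a suEXEC-style setup), something like this is all it takes:

  chmod 604 .htaccess    # owner read/write, world read - Apache can read it, nothing needs world-write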

Jim

DaveAtIFG

9:01 pm on Aug 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It may just as well be that fighting bots is a nice sport for the webmasterly ego

Most quality hosts in the US are generous with bandwidth and I've been able to ignore this problem for the most part, no "overage" charges. But for someone who is paying for bandwidth it's a serious issue.

With everyone from Commission Junction to script kiddies launching spiders, the problem will only get worse. When it becomes as big an issue as email spam, governments may eventually intervene. Until then, someone who develops an effective solution could make a lot of money! :)

yosmc

9:25 pm on Aug 27, 2004 (gmt 0)

10+ Year Member



Just a note: This problem, like your problem with carfac's system, is likely due to a configuration problem. It is not necessary to make .htaccess world-writable in order for this to work. With the correct server set up, owner-read/write, world-read (ow:rw,wo:r or 604) is sufficient.

Jim, you're right of course if I run a Perl script, but as it is I have CGI generally disabled on my server, and I'm not sure I want to enable it for just a single script. PHP, however, typically seems to need the file to be world-writable. From what I've heard that's not carved in stone, but changing that common setup isn't all that simple either, and trying to might get me out of the frying pan and into the fire. :)
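
If I understand it right, the cleaner way would be to make the web server user the owner of ban.txt, so PHP can append to it without 666 - something like this, with the user name depending on the distro:

  chown apache ban.txt   # "www-data" or "nobody" on other setups
  chmod 644 ban.txt      # PHP (running as apache) can append; nobody else can write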

Just for reference, the problems with carfac's system were about certain commands in the script not being available or causing errors.

yosmc

9:42 pm on Aug 27, 2004 (gmt 0)

10+ Year Member



It may just as well be that fighting bots is a nice sport for the webmasterly ego


Most quality hosts in the US are generous with bandwidth and I've been able to ignore this problem for the most part, no "overage" charges. But for someone who is paying for bandwidth it's a serious issue.

To elaborate on my remark: I've seen a lot of nifty scripts and slick coding solutions, but not nearly as many thoughts on how many resources go into a solution versus what kind of result it achieves. I came up with the "head over the fireplace" example because the actual goal sometimes seems to get forgotten - e.g. isn't it crucial whether a bot can be caught before, during or after the crime (i.e. how many resources can actually be SAVED), or is that all unimportant as long as there's just another IP to add to the ban list? I know I'm exaggerating, but you get the idea. ;) As it is, it's nearly impossible for someone like me (I'll celebrate my first dedicated-server birthday next month ;) to get even a vague idea of which approach may (or may bot - sorry, NOT) be worthwhile.

ogletree

10:21 pm on Aug 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are other reasons for bot banning. With all these scraper sites out there, I would love to get rid of those bots. Also, I don't want people downloading my site and copying it, or using that data to figure out my SEO. I would pay a monthly fee for somebody to take care of this for me. All we do is bot traps, but I clear out the ban list every once in a while to avoid banning real people. My site is easier to ban bots on because I don't expect anybody to stay on it for more than a minute, and I never expect them to come back.
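
A cron entry along these lines would automate the clearing (the path is just an example):

  # wipe the ban list every Sunday at 3am
  0 3 * * 0 cat /dev/null > /home/site/ban.txt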