Forum Moderators: phranque


Blocking badly behaved spiders

Such as IE Offline content or those that ignore robots.txt

         

sublime1

5:42 pm on Dec 16, 2004 (gmt 0)

10+ Year Member



Last night, we had something like 10 distinct IPs, all with IE user agents hitting our site as fast as they could. None obeyed robots.txt. And for a short time, our site was overwhelmed.

I would like to identify these automated user agents and temporarily block them. But I need to make sure I don't block the real search engine spiders.

I have a rewrite rule that Apache uses to look up IPs to block in a file, sending back a Forbidden status right away for any match.

My thought is to maintain a cache of addresses that are hitting us frequently but whose user agents are not ones we know we want. Once an IP exceeds some threshold, we write its address to the list that Apache looks at and it is banned.
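To make the idea concrete, here is a rough sketch of the kind of cache I have in mind, in Java since that is what we run. It is only an illustration under assumptions: the class name `RequestRateTracker`, the trusted user-agent substrings, and the threshold values are all made up for the example, and the step that writes a banned address to Apache's block file is left out. User-agent strings can be forged, so the allowlist check here is weak on its own.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sliding-window request counter keyed by IP address (illustrative sketch only). */
class RequestRateTracker {
    // Hypothetical allowlist: user-agent substrings of crawlers we never want to ban.
    private static final List<String> TRUSTED_AGENTS =
            Arrays.asList("Googlebot", "Slurp", "msnbot");

    private final int maxRequests;    // requests allowed per window before banning
    private final long windowMillis;  // length of the sliding window
    private final Map<String, Deque<Long>> hits = new HashMap<>();

    RequestRateTracker(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one request and returns true if this IP has now made more than
     * maxRequests requests within the last windowMillis milliseconds.
     * Requests from trusted user agents are never counted.
     */
    synchronized boolean record(String ip, String userAgent, long nowMillis) {
        if (isTrusted(userAgent)) {
            return false;
        }
        Deque<Long> times = hits.computeIfAbsent(ip, k -> new ArrayDeque<>());
        // Drop timestamps that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.pollFirst();
        }
        times.addLast(nowMillis);
        return times.size() > maxRequests;
    }

    /** True if the user agent contains one of the allowlisted substrings. */
    static boolean isTrusted(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        for (String agent : TRUSTED_AGENTS) {
            if (userAgent.contains(agent)) {
                return true;
            }
        }
        return false;
    }
}
```

A servlet filter could call `record(request.getRemoteAddr(), request.getHeader("User-Agent"), System.currentTimeMillis())` on each request and add the address to the block list when it returns true. Note that this sketch never evicts idle IPs from the map, so a real version would need some cleanup.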

I would prefer not to reinvent this wheel -- does anyone know of some existing strategies or software that will help me with this problem? Our code is Java running in Tomcat, and our web server is Apache, all running on Linux.

Thanks in advance!

Sublime1

kpaul

5:52 pm on Dec 16, 2004 (gmt 0)

10+ Year Member



mod_throttle for apache. while i haven't tried it myself yet, i have similar problems and am thinking it will help. lemme know if you install it and it works - i haven't had time to try yet ;)

-kpaul

sublime1

8:19 pm on Dec 16, 2004 (gmt 0)

10+ Year Member



Thanks --

I looked at a mod_throttle I found via Google and it appears to work the other direction: specifying how much bandwidth a given site (virtual host, I guess) on a given web server can use.

I am looking for something that would detect patterns of incoming requests that tend to identify a spider, for example an excessive rate of requests from a given host.

kpaul

8:34 pm on Dec 16, 2004 (gmt 0)

10+ Year Member



i'm pretty sure it does what you're looking for. you can set it up with rules (from what i've read) so that if a particular IP or useragent hits your site more than x times per second or uses more than x megs in bandwidth, it will be stopped, delayed, etc.

or maybe i'm misunderstanding your request?

-kpaul

edit to add from the first goog site: [[Also mod_throttle can track and throttle incoming connections by IP address or by authenticated remote user.]]

The Contractor

8:44 pm on Dec 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mod_throttle is exactly what it sounds like (it throttles resources). Do not use it if you run any scripting on your site, because many Perl scripts etc. will not run (at least the ones I have tried). It is NOT what you want for controlling bots etc.

Use .htaccess to block bad bots or visitors you don't care to have. You can find many useful posts on this subject by searching this site, like here [google.com] and here [google.com].

kpaul

8:51 pm on Dec 16, 2004 (gmt 0)

10+ Year Member



thanks contractor.

sorry for the bad advice ;)

-kpaul

sublime1

9:01 pm on Dec 16, 2004 (gmt 0)

10+ Year Member



Thanks Contractor. I was mostly having problems coming up with the right search terms to use to search WebmasterWorld :-)

The ban-bot script was the kind of thing I was looking for, and I am sure there are other nuggets in there that I can use.