I would like to identify these automated user agents and temporarily block them. But I need to make sure I don't block the real search engine spiders.
I have a rewrite rule that Apache uses to look in a file for IPs to block (and then send a Forbidden status back right away).
My thought is to maintain a cache of addresses that are hitting us frequently but whose user agents are not ones we know we want. Once an IP exceeds some threshold, we write its address to the list that Apache looks at, and it is banned.
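A minimal sketch of that cache idea in Java, to make the threshold logic concrete. Everything named here is hypothetical: the class `SpiderThrottle`, its parameters, and the one-line-per-IP `ip blocked` file format are assumptions for illustration, not the format my actual rewrite rule expects. It counts requests per IP inside a sliding time window, skips user agents containing an allowed token, and appends an IP to the block file once it goes over the limit. Note that a user agent string is easy to forge, so the allow-list check alone does not prove a request comes from a real search engine spider.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: tracks recent hits per IP and records offenders in a block list. */
class SpiderThrottle {
    private final int maxRequests;          // requests allowed per window
    private final long windowMillis;        // length of the sliding window
    private final List<String> allowedTokens = new ArrayList<>();
    private final Path blockFile;           // may be null to skip file output
    private final Map<String, Deque<Long>> hits = new HashMap<>();
    private final Set<String> blocked = new HashSet<>();

    SpiderThrottle(int maxRequests, long windowMillis,
                   List<String> allowedAgentTokens, Path blockFile) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
        this.blockFile = blockFile;
        for (String token : allowedAgentTokens) {
            allowedTokens.add(token.toLowerCase(Locale.ROOT));
        }
    }

    /** Records one request; returns true if the IP is blocked after this request. */
    synchronized boolean record(String ip, String userAgent, long nowMillis) {
        if (isAllowedAgent(userAgent)) {
            return false;                   // never count agents we want
        }
        if (blocked.contains(ip)) {
            return true;
        }
        Deque<Long> times = hits.computeIfAbsent(ip, k -> new ArrayDeque<>());
        times.addLast(nowMillis);
        // Drop timestamps that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.removeFirst();
        }
        if (times.size() > maxRequests) {
            blocked.add(ip);
            hits.remove(ip);
            appendToBlockFile(ip);
            return true;
        }
        return false;
    }

    synchronized boolean isBlocked(String ip) {
        return blocked.contains(ip);
    }

    private boolean isAllowedAgent(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : allowedTokens) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }

    private void appendToBlockFile(String ip) {
        if (blockFile == null) {
            return;
        }
        try {
            // Assumed format: one "ip blocked" line per banned address.
            Files.write(blockFile, (ip + " blocked\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a servlet filter this could be called once per request, for example `throttle.record(request.getRemoteAddr(), request.getHeader("User-Agent"), System.currentTimeMillis())`. A real version would also need to expire old entries from the block list, since I only want to block temporarily.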
I would prefer not to reinvent this wheel -- does anyone know of some existing strategies or software that will help me with this problem? Our code is Java running in Tomcat, and our web server is Apache, all running on Linux.
Thanks in advance!
Sublime1
I looked at mod_throttle, which I found via Google, and it appears to work in the other direction: it specifies how much bandwidth a given site (virtual host, I guess) on a given web server can use.
I am looking for something that would detect patterns of incoming requests that tend to identify a spider, for example an excessive rate of requests from a given host.
Or maybe I'm misunderstanding your request?
-kpaul
edit to add from the first goog site: [[Also mod_throttle can track and throttle incoming connections by IP address or by authenticated remote user.]]
Use .htaccess to block bad bots or visitors you don't care to have. You can find many useful posts on this subject by searching this site, like here [google.com] and here [google.com].