I am currently developing a script for automatic spider detection and blocking. It compares user agents against known malicious spiders, tracks client behaviour (e.g. number of requests per second / minute), uses hidden links to trap spiders, and so on.
Once a malicious spider is detected, it automatically blocks its IP.
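Roughly, the behaviour-tracking and trap part works like this (a simplified Python sketch; the path, thresholds, and names here are just placeholders, my real script is a bit more involved):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60              # sliding window for the request counter
MAX_REQUESTS = 120               # more than this per window looks automated
TRAP_PATH = "/hidden-link.html"  # hidden link only a spider would follow

request_log = defaultdict(deque)  # ip -> timestamps of recent requests
blocked_ips = set()

def record_hit(ip, path):
    """Return True if this client should be blocked after this request."""
    if ip in blocked_ips:
        return True

    # Trap: the hidden link is invisible to humans, so any hit on it is a bot.
    if path == TRAP_PATH:
        blocked_ips.add(ip)
        return True

    # Behaviour: count requests inside the sliding window.
    now = time.time()
    hits = request_log[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS:
        blocked_ips.add(ip)
        return True

    return False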
I think what I have so far works pretty well. What I haven't figured out yet is how to detect spiders that spoof their user agent (i.e. pretend to be MSIE or Netscape). I have thought about using HTTP_ACCEPT and checking for image support, but I don't know how reliable that would be. Also, I found that if I refresh a page in MSIE 6.0, the Accept header changes to */* instead of listing all the supported media types.
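For reference, the Accept-header check I had in mind is basically this (sketch only; I'm not sure the assumptions behind it hold, which is exactly my question):

def looks_like_spoofed_browser(user_agent, accept_header):
    """Heuristic: a client claiming to be a browser whose Accept header
    never mentions images might be a spider spoofing its user agent.
    Unreliable, because MSIE 6 sends Accept: */* on a refresh."""
    claims_browser = "MSIE" in user_agent or "Mozilla" in user_agent
    if not claims_browser:
        return False
    if accept_header.strip() == "*/*":
        return False   # can't tell: MSIE does this on refresh
    return "image/" not in accept_header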
Btw, how reliable is the robots meta tag? Is it supported / respected by most of the major SEs? I'd like this script to work on shared hosts, so robots.txt is not an option.
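To clarify what I mean: the trap page would carry the robots meta tag instead of relying on robots.txt, something like this (hypothetical sketch of how I'd generate it):

def trap_page_html():
    # The robots meta tag asks well-behaved crawlers not to index the trap
    # page or follow its links; only misbehaving spiders should end up here.
    return (
        "<html><head>"
        '<meta name="robots" content="noindex,nofollow">'
        "</head><body></body></html>"
    )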
Any input and ideas greatly appreciated. Thanks