Detecting spoofed user agents

Best method?

Demeter

3:47 pm on Oct 7, 2003 (gmt 0)

10+ Year Member



Hi all,

I am currently developing a script for automatic spider detection and blocking. It works by comparing user agents against a list of known malicious spiders, tracking client behaviour (e.g. number of requests per second or per minute), and using hidden links to trap spiders. Once a malicious spider is detected, the script automatically blocks its IP.
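
Roughly, the user-agent part of it looks like this (a simplified sketch, not my exact code - the agent list, block_ip() and the file name are placeholders):

    # Minimal sketch of a user-agent blacklist check in a CGI-style
    # environment; BAD_AGENTS and block_ip() are illustrative stand-ins.
    import os

    BAD_AGENTS = ("emailsiphon", "webzip", "teleport", "httrack")

    def is_known_bad(user_agent):
        ua = user_agent.lower()
        return any(bad in ua for bad in BAD_AGENTS)

    def block_ip(ip):
        # On a shared host this might instead append a "deny from" line
        # to .htaccess; shown here as a simple log of blocked addresses.
        with open("blocked_ips.txt", "a") as f:
            f.write(ip + "\n")

    ua = os.environ.get("HTTP_USER_AGENT", "")
    ip = os.environ.get("REMOTE_ADDR", "")
    if is_known_bad(ua):
        block_ip(ip)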

I think what I have so far works pretty well. What I haven't figured out yet is how to detect spiders that spoof their user agent (i.e. pretend to be MSIE or Netscape). I have thought about checking HTTP_ACCEPT for image support, but I don't know how reliable that would be. I also found that when I refresh a page in MSIE 6.0, HTTP_ACCEPT changes to */* instead of listing all the supported media formats.
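
A rough version of that HTTP_ACCEPT idea, for what it's worth - given the */* behaviour above, it could at best feed a suspicion score rather than trigger a block on its own:

    # Rough Accept-header heuristic: a client claiming to be MSIE but
    # sending no recognisable Accept header is slightly suspicious.
    # Unreliable on its own, since MSIE 6.0 sends "*/*" on a refresh.
    import os

    def accept_looks_like_browser(accept_header):
        # Browsers either list media types (image/gif, text/html, ...)
        # or fall back to "*/*"; an empty header is the odd case out.
        return "*/*" in accept_header or "image/" in accept_header

    accept = os.environ.get("HTTP_ACCEPT", "")
    ua = os.environ.get("HTTP_USER_AGENT", "")
    suspicious = "MSIE" in ua and not accept_looks_like_browser(accept)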

Btw, how reliable is the robots meta tag? Is it supported and respected by most of the major SEs? I'd like this script to work on shared hosts, so robots.txt is not an option.
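
(By the robots meta tag I mean the standard <meta name="robots" content="noindex,nofollow"> element in the page head. I understand it is advisory only, so a badly behaved spider would simply ignore it - my question is whether the major engines respect it.)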

Any input and ideas greatly appreciated. Thanks

bcolflesh

3:54 pm on Oct 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You can't detect that a UA has been spoofed if the spoofer uses a valid UA string - the other methods you have in place (behavior and hidden-link tracking) will hopefully catch the bad ones.
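
The hidden-link trap is the strongest of those: a link human visitors never see (e.g. behind a 1x1 transparent image), so anything requesting it is presumed to be a spider. A minimal sketch of the idea - the trap URL and block-list file are just examples:

    # Minimal hidden-link trap: any client requesting the trap URL is
    # presumed to be a spider, since the link is invisible to humans.
    # TRAP_PATH and the block-list file name are illustrative.
    import os

    TRAP_PATH = "/trap/do-not-follow.html"

    def handle_request(path, ip):
        if path == TRAP_PATH:
            with open("blocked_ips.txt", "a") as f:
                f.write(ip + "\n")
            return True  # caught a spider
        return False

    handle_request(os.environ.get("REQUEST_URI", ""),
                   os.environ.get("REMOTE_ADDR", ""))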

Demeter

7:09 am on Oct 8, 2003 (gmt 0)

10+ Year Member



Hmmm ...

What about using hit rates to distinguish "good" from "bad"? Can I rely on good spiders crawling a site more slowly than bad ones? How many hits per minute would you expect from a well-behaved SE spider?
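
To make the question concrete, this is the sort of per-IP sliding-window counter I have in mind - the 60-second window and 30-hit threshold are arbitrary placeholders, and picking sensible values is exactly what I am unsure about:

    # Simple per-IP sliding-window hit counter. The window size and
    # threshold are arbitrary illustrations, not established norms.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_HITS = 30

    hits = defaultdict(deque)

    def too_fast(ip, now=None):
        now = time.time() if now is None else now
        q = hits[ip]
        q.append(now)
        # Drop timestamps that have fallen outside the window.
        while q and q[0] < now - WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_HITS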

Thanks