Filtering bots from traffic reports

Agent name and/or IP address?

aspdaddy

3:37 pm on Mar 31, 2003 (gmt 0)

To filter out bots, spiders, harvesters, etc., is it best to use a list of known user agents or known IP addresses/hosts? Are there any other strategies?

Is it common for a bot to use many IP addresses? And if I filter an IP address out, could I be filtering real traffic too?

Any decent articles on this subject?

Thanks.

nonprof webguy

4:11 pm on Mar 31, 2003 (gmt 0)

Identifying bots can be tricky. There are always new user-agent names, so maintaining a list of bot user agents is a never-ending job. Bots also sometimes use fake user-agent names to fool you into thinking they are a standard browser.

They also sometimes come from IP addresses that are owned by broadband ISPs that are also used by human browsers.
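For reference, the list-based approach being discussed can be sketched roughly as below. This is only an illustration: the names `BOT_UA_FRAGMENTS`, `BOT_IPS` and `is_listed_bot` are made up for the example, the list entries are placeholders rather than a vetted list, and the IP is a documentation-only address.

```python
# Illustrative entries only (assumption): a real list needs constant upkeep.
BOT_UA_FRAGMENTS = ("googlebot", "slurp", "emailsiphon")
BOT_IPS = {"192.0.2.10"}  # placeholder address from a documentation range


def is_listed_bot(ip, user_agent):
    """Return True if the IP or the user-agent string matches either list."""
    ua = user_agent.lower()
    return ip in BOT_IPS or any(fragment in ua for fragment in BOT_UA_FRAGMENTS)
```

A bot that sends a browser-like user agent from an unlisted address passes straight through this check, and listing a shared ISP address would drop human visitors along with the bot, which are the two weaknesses described above.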

Recently I've been exploring identifying bots by behavior. Using several honeypots that lure bots but not humans, I've begun going through my recent logs in my spare time, putting each user-agent/IP combination through a series of tests to identify whether that combo shows various kinds of bot behavior:

Did they request more than 10 pages within 20 seconds? More than 10 pages within 30 seconds?

Did they request robots.txt?

Did they request any of the honeypot pages, and if so, how many?

Did they NOT request style sheets?

Did they NOT request image files?

I don't yet know how many tests a given user-agent/IP combination has to pass or fail to be identified as a likely bot.
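The tests listed above could be run over parsed log entries along these lines. This is a minimal sketch under stated assumptions, not the poster's actual script: the honeypot paths, the file-extension lists, and the names `burst` and `score_visitors` are all hypothetical, and the log is assumed to be already parsed into (ip, user_agent, datetime, path) tuples. It only reports each test result and deliberately sets no pass/fail threshold.

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical values (assumptions): substitute your own trap URLs and extensions.
HONEYPOTS = {"/trap/a.html", "/trap/b.html"}
PAGE_EXT = (".html", ".htm", ".asp", "/")
IMAGE_EXT = (".gif", ".jpg", ".jpeg", ".png")


def burst(times, limit=10, window=20):
    """True if more than `limit` timestamps fall within `window` seconds."""
    times = sorted(times)
    span = timedelta(seconds=window)
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans at most `window` seconds.
        while t - times[start] > span:
            start += 1
        if end - start + 1 > limit:
            return True
    return False


def score_visitors(hits):
    """hits: iterable of (ip, user_agent, datetime, path) tuples."""
    by_visitor = defaultdict(list)
    for ip, ua, when, path in hits:
        by_visitor[(ip, ua)].append((when, path.lower()))

    report = {}
    for key, reqs in by_visitor.items():
        paths = [p for _, p in reqs]
        page_times = [w for w, p in reqs if p.endswith(PAGE_EXT)]
        report[key] = {
            "burst_20s": burst(page_times, 10, 20),
            "burst_30s": burst(page_times, 10, 30),
            "robots_txt": "/robots.txt" in paths,
            "honeypot_hits": sum(p in HONEYPOTS for p in paths),
            "no_css": not any(p.endswith(".css") for p in paths),
            "no_images": not any(p.endswith(IMAGE_EXT) for p in paths),
        }
    return report
```

Each user-agent/IP combination ends up with a small dictionary of flags, which can then be compared by eye or counted once a sensible threshold is known.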

Over time, I'm hoping to discern patterns that can be associated with different kinds of bots. I'm also looking at the behavior of known bots like e-mail siphon to see if any other user-agent/IP combinations in my logs exhibit similar behavior.

I don't, as yet, have any useful results, but I wanted to share the approach.

wilderness

4:25 pm on Mar 31, 2003 (gmt 0)

Try a search on "UMBC Agent Web". There's plenty of reading there, although it's not specific to your inquiry.

I don't recall (of course, I haven't looked in a while) a specific page giving lengthy explanations, at least not gathered together in one document.

Bots change rapidly, with constant new innovations. It is never-ending. :-(

aspdaddy

4:55 pm on Mar 31, 2003 (gmt 0)

Thanks for the feedback guys.

nonprof, id'ing robots by behaviour is a nice idea - check out research papers by Tan & Kumar.

Andrue

6:40 pm on Mar 31, 2003 (gmt 0)

I found tons of info just by putting a few keywords in google.com:

spam
robot
bad bot
UA list

A combination of those, or even just putting the IP or the UA of a suspected bot along with "bot" or "spam" in the keywords, will turn stuff up...

Actually, that's how I found this board; most of the information I have found useful came from here.

Macguru

6:45 pm on Mar 31, 2003 (gmt 0)

Hi aspdaddy,

Knowing what kind of stats software you are using could help a bit. For some packages, you can get fairly recently updated files.