Forum Moderators: DixonJones


Tracking tools for malicious bots


casimir1234

6:22 pm on Aug 4, 2005 (gmt 0)

10+ Year Member



All: can anyone tell me how to identify bad bots? Are there any bot-tracking tools out there, ones that would analyze logs and report back on bot/crawler activity only? This stuff can be developed, and I have ideas for algorithms, but no time right now. Does such a tool exist, one that would take into account the time between GET requests on the server(s), etc.? I work for a huge company (40,000 employees), and one of my duties is to build reports with WebTrends. We have many external sites and a huge .com site.
If I were a malicious bot, I would definitely use an inconspicuous UA string.
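
For illustration, a minimal sketch of the inter-request-time idea above: flag IPs whose median gap between GET requests is suspiciously short. It assumes the Apache/Nginx combined log format; the access.log filename, the 10-request minimum, and the 2-second threshold are all placeholder assumptions, not recommendations.

    import re
    from collections import defaultdict
    from datetime import datetime
    from statistics import median

    # Matches the start of a combined-format log line: IP, ident, user,
    # [timestamp], then a GET request.
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET ')
    TIME_FMT = "%d/%b/%Y:%H:%M:%S %z"

    times = defaultdict(list)            # IP -> timestamps of its GET requests
    with open("access.log") as f:        # assumed filename
        for line in f:
            m = LOG_LINE.match(line)
            if m:
                ip = m.group(1)
                ts = datetime.strptime(m.group(2), TIME_FMT)
                times[ip].append(ts)

    for ip, stamps in sorted(times.items()):
        if len(stamps) < 10:             # too few requests to judge
            continue
        stamps.sort()
        gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        if median(gaps) < 2.0:           # humans rarely sustain this pace
            print(f"{ip}: {len(stamps)} GETs, median gap {median(gaps):.1f}s")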

ronburk

8:51 pm on Aug 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



can anyone tell me how to identify bad bots

I expect there is a lot of variation in what people feel is a "bad" bot, so it's not completely clear what you're after.

If you just want to locate crawlers who fail to identify themselves or disobey robots.txt, a couple of honeypot links accompanied by standard web log reporting software is probably adequate.

a) on your home page, install a link that is not visible to humans (e.g., hotlink a space character, using no color or underline). Ask your web logging software to report on all fetches to that URL.

b) in your robots.txt, disallow access to /secretstuff (it can exist or not, according to your needs, but make sure no human-clickable links to it exist). Ask your web logging software to report on all accesses thereto; a short log-scan sketch covering both traps follows below.
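
For illustration, a minimal log-scan sketch covering both traps. The robots.txt side of b) is just "User-agent: *" followed by "Disallow: /secretstuff". The script below assumes the combined log format; the access.log filename and the hidden-link path are placeholder assumptions.

    TRAPS = ("/hidden.html", "/secretstuff")   # assumed trap paths

    with open("access.log") as f:              # assumed filename
        for line in f:
            parts = line.split('"')
            if len(parts) < 2:
                continue
            request = parts[1]                 # e.g. 'GET /secretstuff HTTP/1.1'
            fields = request.split()
            if len(fields) >= 2 and any(fields[1].startswith(t) for t in TRAPS):
                ip = line.split(" ", 1)[0]
                # combined format puts the user-agent in the last quoted field
                ua = parts[5] if len(parts) > 5 else "-"
                print(ip, fields[1], ua)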

If your logging software is up to it, you can also then just pick out the IP addresses associated with rude crawlers, and ask for a display of everything else they did that day.
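
If it isn't, a tiny follow-up sketch does the same thing: once the trap report above has produced a set of offending IPs, dump everything else those clients requested. The addresses here are examples only.

    BAD_IPS = {"192.0.2.7", "198.51.100.23"}   # example addresses only

    with open("access.log") as f:              # assumed filename
        for line in f:
            if line.split(" ", 1)[0] in BAD_IPS:
                print(line.rstrip())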

If you're the NSA, of course, then you might be dealing with more sophisticated attackers using distributed techniques. This eventually gets pretty hard to deal with, but one way to at least notice it's happening is to identify (create if necessary) some pages that are rarely, if ever, the initial landing page of a new visitor. Just have your web logging software report on anybody who landed on one of those pages as the initial GET of their session.
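
A rough sketch of that landing-page check, under heavy assumptions: it treats the first request seen from each IP in the log as the start of that client's session (real session logic would use time windows or cookies), and the deep-page path is a placeholder.

    DEEP_PAGES = {"/archive/old-notes.html"}   # assumed rarely-landed-on pages

    first_seen = set()
    with open("access.log") as f:              # assumed filename
        for line in f:
            parts = line.split('"')
            if len(parts) < 2:
                continue
            ip = line.split(" ", 1)[0]
            if ip in first_seen:
                continue                       # not the first request of the session
            first_seen.add(ip)
            fields = parts[1].split()          # ['GET', '/path', 'HTTP/1.1']
            if len(fields) >= 2 and fields[1] in DEEP_PAGES:
                print(f"suspicious landing: {ip} -> {fields[1]}")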