lucy24 - 11:52 pm on Jan 20, 2013 (gmt 0)
One of the things I do to check for unidentified robots in raw logs is to pull out all the lines with "GET /robots.txt", another file with all the "GET / HTTP/1.1" lines, and a third with all the lines that drew a 304 response. With these I can go look at the activity in context and see what was going on. It also makes it easier to check against my long list of bad IPs. It's slow and done manually, but I only do this for an update audit. If I find I need more data, I look at more logs for the same site. I know there are lots of people here who could do these things dynamically, but I've decided I'll never have the time available to learn all that. Just adding them as ideas.
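A minimal sketch of that three-way pull, assuming combined-format Apache logs; access.log and the three output filenames are placeholders, not anyone's actual setup:

```python
import re

# Split a raw access log (combined format assumed) into the three
# review files described above: robots.txt fetches, front-page hits,
# and everything that drew a 304.
STATUS = re.compile(r'" (\d{3}) ')  # status code follows the quoted request

with open("access.log") as log, \
     open("robots-requests.log", "w") as robots, \
     open("front-page.log", "w") as front, \
     open("status-304.log", "w") as notmod:
    for line in log:
        if '"GET /robots.txt' in line:
            robots.write(line)
        if '"GET / HTTP/1.1"' in line:
            front.write(line)
        m = STATUS.search(line)
        if m and m.group(1) == "304":
            notmod.write(line)
```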
About the same here.
Logs of course have one advantage over real-time activity: you can see what the next request will be. So f'rinstance if I get a request for robots.txt followed by a 403 from the same IP, then both lines get chopped out of the log-wrangling routine and I don't have to think about them. The 403 may not even be IP-based; all that matters is that this source has already been Dealt With. The only ones that need brain-and-eyeball attention are the robots.txt requests from unknown sources.
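A sketch of that pairing step, under the same assumptions as above (combined log format; the field layout is a guess, and ordering isn't checked because, as noted, all that matters is that the source has been Dealt With):

```python
import re
from collections import defaultdict

# IP, then request path, then status code, from a combined-format line.
LINE = re.compile(r'^(\S+) .*?"[A-Z]+ (\S+) [^"]*" (\d{3}) ')

# Group lines by IP so a robots.txt fetch can be judged by what the
# same source did next. Any IP that drew a 403 is already handled,
# so both its robots.txt line and the 403 drop out of review.
by_ip = defaultdict(list)
with open("access.log") as log:
    for line in log:
        m = LINE.match(line)
        if m:
            by_ip[m.group(1)].append((m.group(2), m.group(3), line))

needs_review = []
for ip, requests in by_ip.items():
    if any(status == "403" for _path, status, _raw in requests):
        continue  # already Dealt With; skip the whole source
    if any(path == "/robots.txt" for path, _status, _raw in requests):
        needs_review.extend(raw for _p, _s, raw in requests)

# What's left is exactly the brain-and-eyeball pile:
# robots.txt requests from unknown sources.
print("".join(needs_review), end="")
```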
In my case it also helps that mine isn't a front-driven site. Most one-off robots go no further than the front page, which no human ever visits except on the way to somewhere else, and those go in the "no skin off my nose" category.
Another check I've added recently is for any large number of page requests from the same IP. It might be someone spreading out and looking at lots of your pages, which is gratifying when it happens :) but it may also be a robot harvesting everything in sight.
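The counting part is easy; a sketch, again assuming combined logs, where the extension test is just a stand-in for however you tell pages apart from support files:

```python
from collections import Counter

# Tally page requests per IP. A tall count is either an engaged human
# reader or a harvester; only the surrounding log context can say which.
page_hits = Counter()
with open("access.log") as log:
    for line in log:
        parts = line.split()
        # parts[0] is the IP, parts[6] the request path in combined format
        if len(parts) > 6 and parts[6].endswith((".html", "/")):
            page_hits[parts[0]] += 1

for ip, count in page_hits.most_common(20):
    print(f"{count:6d}  {ip}")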
And then there are the auto-referers: requests that give the requested URL itself as the referer. mod_rewrite pigheadedly refuses to let me block these up front (that is, it's syntactically not possible), but in your own log-wrangling script it's trivial. So when someone comes by and asks for robots.txt, giving robots.txt as referer... well, that's what linguists call Double Markedness.
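Trivial indeed; one way to flag them in a log-wrangling script, with the same assumed combined-format layout as the other sketches:

```python
import re

# Pull the request path and the quoted referer from a combined-format
# line: request, status, bytes, then the referer in quotes.
LINE = re.compile(r'"[A-Z]+ (\S+) [^"]*" \d{3} \S+ "([^"]*)"')

# An "auto-referer" names the very URL being requested, which no real
# browser sends on a fresh request for that URL.
with open("access.log") as log:
    for line in log:
        m = LINE.search(line)
        if m and m.group(2).endswith(m.group(1)):
            print(line, end="")  # e.g. GET /robots.txt with a .../robots.txt referer
```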