-- Search Engine Spider and User Agent Identification
---- Stale bad bot lists
lucy24 - 10:27 am on Mar 25, 2012 (gmt 0)
Fourth, filter by behavior, header content, loads CSS, js, images, etc.
Well, of course. But that's all after the fact. My log-wrangling has a multi-layered pattern of: human, probably human, maybe human, doubtful and (very very rare) definitely robot. Does it pick up the favicon (darn those mobile devices for destroying a perfect test!)? Does it pick up the CSS (the plainclothes bing/msie robots always walk away with errorstyles.css)? Does it come in from a search engine with a plausible query? If it's using g### translate, do all the images go to a second, human IP?
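A minimal sketch of that kind of after-the-fact sorting, in Python. Everything in it is an assumption made for illustration, not anyone's actual log setup: the `Visit` record, the choice and equal weighting of signals, treating a robots.txt request as a sure robot tell, and looking for a `q` parameter in the referer. The translate/second-IP check is left out because it needs cross-IP correlation.

```python
from dataclasses import dataclass
from urllib.parse import urlparse, parse_qs

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")

@dataclass
class Visit:
    """Hypothetical record: everything one visitor asked for in one session."""
    paths: list          # request paths, query strings already stripped
    referer: str = ""    # referer of the first page request, if any

def has_search_query(referer: str) -> bool:
    # Assumption: the search engine passes the query in a "q" parameter.
    query = parse_qs(urlparse(referer).query)
    return bool(query.get("q"))

def classify(visit: Visit) -> str:
    paths = [p.lower() for p in visit.paths]
    # Assumption: asking for robots.txt is treated as a definite robot tell.
    if any(p.endswith("/robots.txt") for p in paths):
        return "definitely robot"
    signals = [
        any(p.endswith("/favicon.ico") for p in paths),   # picked up the favicon
        any(p.endswith(".css") for p in paths),           # picked up a stylesheet
        any(p.endswith(IMAGE_SUFFIXES) for p in paths),   # picked up images
        has_search_query(visit.referer),                  # arrived with a search query
    ]
    score = sum(signals)  # 0..4 human-like signals
    labels = ["doubtful", "maybe human", "probably human", "human"]
    return labels[min(score, 3)]
```

A visitor who requests only the page itself comes out "doubtful"; one who also takes the favicon, a stylesheet and the images lands in "human". Note that none of this can run until the requests are already in the log, which is exactly the limitation.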
A good day is one that doesn't call for reopening the logs to verify someone's humanity. A really good day is one with no unfamiliar robots because all visitors are already on the Ignore list, one way or the other.
The question was, how can your .htaccess (acting as bouncer) identify a new robot that it hasn't met before? Standard human behaviors like requesting the full set of images can't happen until after the critter itself has been admitted. Standard robotic behaviors like going home to a server farm in Moldavia don't become obvious until after you've identified a new robot and looked up where it lives. ("And the horse you rode in on.")