incrediBILL - 3:18 am on Jul 16, 2008 (gmt 0)
robots.pm (the robots list for awstats) is just a long list of known bots. It seems to have some mechanisms for detecting additional details, but I'm not sure it would be much good at detecting the types I'm talking about here, the ones that don't want to be detected in the first place. None of the bots in their list would ever skew the stats of humans vs. bots.

If you merely sort out everything that isn't MSIE/FF/Opera you have a good start, and you can do that with a whitelist rather than that big list of bots, although their list does provide links to the owners' sites, which is nice. However, the bots I'm talking about always claim to be MSIE/FF/Opera, so you need IP ranges for hosting data centers and the like to filter out all of the other automated noise.

Once you've done that, you still have to filter out rogue activity, which isn't always obvious and involves things that aren't even stored in the log files. For instance, a scraper using AOL will hop from IP to IP on a timer, and the only way to really tell it's the same scraper is if it accepts your cookie, which many do these days. You don't find cookie data in log files. I could go on and on, but there's quite a bit of information you don't see in a post-mortem analysis, simply because the data retention would be astronomical and so would the time to process it all.
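To make the layering concrete, here is a rough Python sketch of the three steps described above: a browser whitelist, a data-center range check, and cookie-based linking of requests that hop between IPs. Everything in it is an assumption for illustration only: the token list, the network ranges (RFC 5737 documentation blocks, not real hosts), the function names, and the `max_ips` threshold are all made up, and none of it reflects how awstats works.

```python
import ipaddress
from collections import defaultdict

# Hypothetical whitelist of browser tokens (an assumption, not taken from awstats).
BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")

# Placeholder data-center ranges (RFC 5737 documentation blocks, not real hosts).
DATACENTER_NETS = [
    ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")
]


def classify(ip, user_agent):
    """Return 'bot', 'datacenter', or 'candidate-human' for a single request."""
    # Step 1: anything that does not even claim to be a mainstream browser.
    if not any(token in user_agent for token in BROWSER_TOKENS):
        return "bot"
    # Step 2: claims to be a browser but comes from a hosting range.
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in DATACENTER_NETS):
        return "datacenter"
    return "candidate-human"


def ips_per_cookie(requests):
    """Map each cookie ID to the set of IPs that presented it."""
    seen = defaultdict(set)
    for ip, cookie_id in requests:
        if cookie_id is not None:  # requests without a cookie cannot be linked
            seen[cookie_id].add(ip)
    return dict(seen)


def hopping_cookies(requests, max_ips=3):
    """Step 3: cookie IDs seen from more than max_ips distinct IPs."""
    return sorted(
        cookie for cookie, ips in ips_per_cookie(requests).items()
        if len(ips) > max_ips
    )
```

Note that the third step only works if the cookie value is captured at request time, which is exactly the data a standard access log does not carry.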
what of awstats