I've been using a script I threw together a long time ago to get a quick view of which spiders are hitting my sites. It's nothing fancy, and I kind of cringe when I look at some of the code, but it has no dependencies, and the nice thing is that it does host lookups (once per IP), so you no longer have to go check whether an IP really belongs to Google. It works well from cron; you can have it email you a report.
GitHub makes it so easy to put code out there that I figured why not, so here it is:
find_spiders.pl - A script to find spiders in Apache web logs and report on them.
This is old code, but it has tended to work like a charm for me. You can put this in cron and get daily emails about who's hammering your site. It's also helpful for forensic analysis after some jerk crawler takes your server down. One nice feature is that it does hostname lookups on the bots' IPs (once per IP), so it's easier to tell whether a bot is actually from Google or another legit search engine.
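For anyone curious what the once-per-IP lookup amounts to, here is a rough Python sketch of the idea. It is not the code from find_spiders.pl (that script is Perl); the function names, the regex, and the assumption that each log line starts with an IPv4 address are all mine, purely for illustration.

```python
import re
import socket
from collections import Counter

# Assumption: each log line starts with the client's IPv4 address,
# as in Apache's common/combined log formats.
IP_AT_START = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or the IP itself if none resolves."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return ip


def count_hits_by_host(lines, resolver=reverse_lookup):
    """Count hits per hostname, resolving each distinct IP only once."""
    cache = {}       # ip -> hostname, filled on first sight of each IP
    hits = Counter()
    for line in lines:
        match = IP_AT_START.match(line)
        if not match:
            continue  # skip lines that do not start with an IPv4 address
        ip = match.group(1)
        if ip not in cache:
            cache[ip] = resolver(ip)
        hits[cache[ip]] += 1
    return hits
```

You would feed it the lines of a log file and print something like hits.most_common(20). Keep in mind that a reverse lookup on its own can be faked by whoever controls the PTR record, so for anything important you would also want to resolve the returned hostname forward and check that it maps back to the same IP.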
Analyze and report on the last 1000 lines of your domain's Apache log:
./find_spiders.pl -f /home/domlogs/YOURDOMAIN.com -l 1000