Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Announcing find_spiders.pl

 11:12 pm on Feb 21, 2014 (gmt 0)

I've been using a script I threw together a long time ago to get a quick view of which spiders are hitting my sites. It's nothing fancy, and I cringe a bit when I look at some of the code, but it has no dependencies, and it does host lookups (once per IP), so you no longer have to check by hand whether an IP really belongs to Google. It works well from cron; you can have it email you a report.

GitHub makes it so easy to put code out there that I figured why not, so here it is:



find_spiders.pl - A script to find spiders in Apache web logs and report on them.


This is old code, but it has tended to work like a charm for me. You can put it in a cron job and get daily emails about who's hammering your site. It's also helpful for forensic analysis after some jerk crawler takes your server down. One nice feature is that it does hostname lookups on the bots' IPs (once per IP), so it's easier to tell whether a bot is actually from Google or another legit search engine.
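To illustrate the once-per-IP hostname lookup idea, here is a minimal Python sketch. It is not taken from find_spiders.pl; the class name `HostCache`, the `TRUSTED_SUFFIXES` tuple, and the forward-confirmation step (resolving the returned hostname back to the IP) are assumptions added for illustration. The forward lookup used here only covers IPv4.

```python
import socket

# Assumed suffix list for illustration; extend it for other search engines.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


class HostCache:
    """Resolve each IP at most once and remember the answer (hypothetical helper)."""

    def __init__(self, reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
        self._reverse = reverse   # ip -> (hostname, aliases, addresses)
        self._forward = forward   # hostname -> (hostname, aliases, addresses)
        self._seen = {}           # ip -> confirmed hostname, or None

    def hostname(self, ip):
        # Only hit DNS the first time an IP is seen.
        if ip not in self._seen:
            self._seen[ip] = self._lookup(ip)
        return self._seen[ip]

    def _lookup(self, ip):
        try:
            host = self._reverse(ip)[0]
            # Forward-confirm: the claimed hostname must resolve back to the IP,
            # otherwise anyone controlling their own PTR record could fake it.
            addresses = self._forward(host)[2]
        except OSError:
            return None
        return host if ip in addresses else None

    def is_trusted(self, ip):
        host = self.hostname(ip)
        return host is not None and host.lower().endswith(TRUSTED_SUFFIXES)
```

With the default resolvers, `HostCache().is_trusted(ip)` makes real DNS queries the first time it sees an address and answers from memory after that, which keeps a large log from triggering one lookup per line.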


Analyze and report on the last 1000 lines of your domain's Apache log:

./find_spiders.pl -f /home/domlogs/YOURDOMAIN.com -l 1000
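For readers who want to see the general idea behind tallying hits from the tail of a log, here is a rough Python sketch. It is not the script's own code: the function name `tally_agents`, the regular expression, and the assumption that the log uses Apache's "combined" format are all illustrative. Lines that don't match, such as those with escaped quotes inside a field, are simply skipped.

```python
import re
from collections import Counter, deque

# Assumes Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def tally_agents(lines, last=1000):
    """Count hits per (ip, user agent) over the last `last` lines."""
    # A bounded deque keeps only the tail, so a huge log is never held in memory.
    tail = deque(lines, maxlen=last)
    counts = Counter()
    for line in tail:
        match = LINE_RE.match(line)
        if match:
            ip, agent = match.groups()
            counts[(ip, agent)] += 1
    return counts
```

Since an open file is an iterable of lines, it can be passed in directly, for example `with open(path) as fh: counts = tally_agents(fh, last=1000)`, and `counts.most_common(10)` then lists the busiest IP and user-agent pairs.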



 5:24 am on Feb 22, 2014 (gmt 0)

Good man!

Good ol' Perl!


 5:51 am on Feb 22, 2014 (gmt 0)

Got a few massive log files to try this out on and see if it finds anything I missed.

Thanks for sharing!


 8:49 am on Feb 22, 2014 (gmt 0)



 4:14 pm on Feb 23, 2014 (gmt 0)

Thanks guys. If you're not a git user, here's the short course on how to download this from the command line:

git clone https://github.com/physicsdude/FindSpiders.git
