Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Announcing find_spiders.pl
Wow.
physics




msg:4647626
 11:12 pm on Feb 21, 2014 (gmt 0)

I've been using a script I threw together a long time ago to get a quick view of what spiders are hitting my sites. It's nothing fancy and I kind of cringe when I look at some of the code, but it has no dependencies, and the nice thing is that it does host lookups (once per IP), so you no longer have to check by hand whether an IP really belongs to Google. It also runs well from cron; you can have it email you a report.

GitHub makes it so easy to put code out there that I figured why not, so here it is:

https://github.com/physicsdude/FindSpiders

NAME

find_spiders.pl - A script to find spiders in Apache web logs and report on them.

DESCRIPTION

This is old code but has tended to work like a charm for me. You can put this in a cron job and get daily emails about who's hammering your site. It's also helpful for forensic analysis after some jerk crawler takes your server down. One nice feature is that it does hostname lookups on the bots' IPs (once per IP), so it's easier to tell whether a bot is actually from Google or another legit search engine.
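
For anyone curious how a once-per-IP hostname check can work, here is a minimal Python sketch. It is not the Perl code from find_spiders.pl, only an illustration of the idea: look up the PTR hostname, check it against a suffix list, then confirm that the hostname resolves back to the same IP. The names `TRUSTED_SUFFIXES` and `make_verifier` and the suffix list itself are assumptions made for this example, and the forward lookup covers IPv4 only.

```python
import socket
from functools import lru_cache

# Hypothetical suffix list for illustration; check each search engine's
# own documentation for the hostnames its crawlers actually use.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host):
    """Return the set of IPv4 addresses that host resolves to."""
    try:
        return set(socket.gethostbyname_ex(host)[2])
    except OSError:
        return set()


def make_verifier(reverse=reverse_lookup, forward=forward_lookup):
    """Build a verifier that performs each IP's lookups at most once."""

    @lru_cache(maxsize=None)
    def verify(ip):
        host = reverse(ip)
        if host is None:
            return (None, False)
        trusted = host.lower().endswith(TRUSTED_SUFFIXES)
        # Forward-confirm: the claimed hostname must resolve back to the IP,
        # otherwise anyone controlling their own PTR record could fake it.
        confirmed = trusted and ip in forward(host)
        return (host, confirmed)

    return verify
```

Calling `verify = make_verifier()` once and then `verify(ip)` for every log line returns a `(hostname, confirmed)` pair; repeated IPs are answered from the cache instead of hitting DNS again. The resolver functions are passed in as parameters so the logic can be exercised without network access.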

USAGE EXAMPLE

Analyze and report on the last 1000 lines of your domain's Apache log.

./find_spiders.pl -f /home/domlogs/YOURDOMAIN.com -l 1000
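
To give a rough idea of what "analyze the last N lines" involves, here is a small Python sketch under stated assumptions. It is not how find_spiders.pl is implemented. It assumes the log is in Apache "combined" format, and the `BOT_HINTS` substrings and the function names `tail` and `count_spiders` are hypothetical choices made for this example.

```python
import re
from collections import Counter, deque

# Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical heuristic: substrings that often appear in crawler user agents.
BOT_HINTS = ("bot", "spider", "crawl", "slurp")


def tail(path, n):
    """Return the last n lines of a file (one pass, keeps only n lines)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def count_spiders(lines):
    """Count hits per (ip, user agent) for agents that look like a bot."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not match the combined format
        agent = m.group("agent")
        if any(hint in agent.lower() for hint in BOT_HINTS):
            hits[(m.group("ip"), agent)] += 1
    return hits
```

Something like `count_spiders(tail("/home/domlogs/YOURDOMAIN.com", 1000)).most_common(20)` would then list the busiest bot-looking clients. A user-agent match alone proves nothing about who is really crawling, which is why the hostname lookup matters.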

 

Angonasec




msg:4647688
 5:24 am on Feb 22, 2014 (gmt 0)

Good man!

Good ol' Perl!

incrediBILL




msg:4647703
 5:51 am on Feb 22, 2014 (gmt 0)

Got a few massive log files to try this out on and see if it finds anything I missed.

Thanks for sharing!

keyplyr




msg:4647732
 8:49 am on Feb 22, 2014 (gmt 0)

Thanks!

physics




msg:4648323
 4:14 pm on Feb 23, 2014 (gmt 0)

Thanks guys. If you're not a git user, here's the short course on how to download this from the command line:


git clone https://github.com/physicsdude/FindSpiders.git

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved