
Webmaster General Forum

    
Detecting Hard-hitting Bots via Live Stats
How to identify bots hitting your site
moishier




msg:343523
 8:56 pm on Nov 7, 2005 (gmt 0)

We have an issue with bots hitting our site hard and slowing down the site. What strategies do people use to quickly identify bots and block them?

We use a Microsoft ISA server / Windows IIS server config.

 

JAB Creations




msg:343524
 4:58 am on Nov 8, 2005 (gmt 0)

You will want to let Google/MSN/Yahoo in, of course; most other search engines use those indexes instead of running their own bots.

You can use robots.txt to keep some/all bots from accessing certain files. For example, if all your images are in www.example.com/images then you could just deny access to bots via robots.txt if you have no interest in being found via Google images.
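As a rough sketch of the robots.txt idea above: the two-line rule set below keeps every compliant bot out of /images/, and Python's standard `urllib.robotparser` is used only to check how a well-behaved crawler would read it. The helper name `is_allowed`, the bot name, and the file names are illustrative assumptions. Keep in mind robots.txt is advisory, so a badly behaved bot can simply ignore it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: keep all bots out of /images/
# (the directory comes from the example above; the rest is an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robots.txt-compliant bot may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical file names on the example host.
    print(is_allowed("Googlebot", "http://www.example.com/images/logo.png"))
    print(is_allowed("Googlebot", "http://www.example.com/index.html"))
```

To shut out a single bot instead of all of them, the `User-agent: *` line would name that bot's user-agent token.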

I use a version of awstats that I've personally modified to detect a much larger number of browsers and robots, though the default install is half decent. So far this month...

Yahoo - 2080+242 - 17.60 MB
MSN - 1400+46 - 30.11 MB
Google - 520+13 - 4.79 MB
(+242 = 242 hits on robots.txt)
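One way to read these figures is bandwidth per request rather than raw totals. The sketch below just does that arithmetic on the three lines above. It assumes the robots.txt hits count as requests and that 1 MB = 1024 KB; neither is confirmed by awstats here, and the function names are made up for the example.

```python
# Monthly figures from the list above: (hits, robots.txt hits, bandwidth in MB).
MONTHLY = {
    "Yahoo": (2080, 242, 17.60),
    "MSN": (1400, 46, 30.11),
    "Google": (520, 13, 4.79),
}


def kb_per_request(hits: int, robots_hits: int, megabytes: float) -> float:
    """Average KB per request, counting robots.txt hits as requests (assumption)."""
    return megabytes * 1024 / (hits + robots_hits)


def rank_by_cost(stats: dict) -> list:
    """Bot names sorted from most to least bandwidth per request."""
    return sorted(stats, key=lambda bot: kb_per_request(*stats[bot]), reverse=True)


if __name__ == "__main__":
    for bot in rank_by_cost(MONTHLY):
        print(f"{bot}: {kb_per_request(*MONTHLY[bot]):.1f} KB per request")
```

With these numbers MSN works out to roughly 21 KB per request, against roughly 8 to 9 KB for the other two.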

Those are the three biggies... MSN tends to be kind of a bandwidth whore, so I would suggest finding out where MSN is crawling that may be costing more bandwidth than you desire. Here are links to the major three's bot pages...

[search.msn.com...]
[help.yahoo.com...]
[google.com...]

You should of course know how to work with robots.txt...
[robotstxt.org...]

If you wanted to play with awstats... (install is kinda hard though)
awstats.sourceforge.net

There are occasional bots that will do a moderately hard crawl (I'm not concerned about bandwidth right now, thankfully), but they fluctuate to the point where if one doesn't hit, another does. Here are the totals for the bots, well known or not, that have hit my site the hardest so far this year (including the big three)...

Yahoo - 88310+9433 - 596.00 MB
MSN - 53551+1526 - 1.03 GB
Googlebot - 37079+647 - 262.85 MB
WISENutbot - 7525+88 - 65.23 MB
Kolinka - 6033+604 - 139.67 MB (Forum spider)
BecomeBot - 6341+269 - 45.87 MB (Google ties?)
Ichiro - 6070+20 - 166.37 MB (Japan)
Grub - 4283+17 - 75.18 MB
ConveraCrawler - 2885+20 - 33.32 MB
Ask Jeeves - 2303+424 - 85.03 MB
LmCrawler - 2155+30 - 16.86 MB
psbot - 2026+115 - 17.63 MB (pic search)
Texas A&M IRLbot - 1745+303 - 7.45 MB
Alexa - 1201+403 - 56.44 MB
Asterias - 875+3 - 27.11 MB (Singingfish Spider)
Accoona - 857+7 - 6.05 MB

The rest of the bots stay at about 6 MB or less.

Keep in mind that a bot may hit my site again years after it first roamed it, and vice versa (so in effect you may have to detect unknown bots that I am not currently aware of).

I'm not sure, but if the size of a file can be determined by a HEAD request, then it would (if I'm correct) make better sense to use HEAD requests on files (specifically images) to reduce bandwidth.
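For what it's worth, a HEAD response carries headers only and no body, and servers usually (not always) include Content-Length for static files. Here is a minimal sketch of checking a file's size that way with Python's standard `http.client`. The function names, the 2048-byte test file, and the throwaway local server are all assumptions made so the example runs without touching a real site.

```python
import functools
import http.client
import http.server
import tempfile
import threading
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit


def size_via_head(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length a server reports for a HEAD request, else None."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)          # headers only, no body is transferred
        response = conn.getresponse()
        length = response.getheader("Content-Length")
        if response.status != 200 or length is None:
            return None
        return int(length)
    finally:
        conn.close()


def demo() -> Optional[int]:
    """Serve one 2048-byte file from a throwaway local server and HEAD it."""
    with tempfile.TemporaryDirectory() as folder:
        Path(folder, "image.bin").write_bytes(b"\0" * 2048)
        handler = functools.partial(
            http.server.SimpleHTTPRequestHandler, directory=folder
        )
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            port = server.server_address[1]
            return size_via_head(f"http://127.0.0.1:{port}/image.bin")
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    print(demo())
```

Whether a given bot actually does this is up to the bot; the sketch only shows that the size is available without downloading the file.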

I have also saved bandwidth by blocking spammers using various methods (very effective if you have a high level of abuse), though I won't discuss those methods right now.

Anyway I hope this helps some...

- John

moishier




msg:343525
 5:27 pm on Nov 8, 2005 (gmt 0)

Thanks for your detailed reply.

Does Awstats give me a report for the most active IPs making requests in the last 10 minutes? That's really the data that would be helpful in detecting bots.

Also, does it process the log files on the fly?
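To make the "most active IPs in the last 10 minutes" idea concrete, here is a minimal sketch that counts requests per client IP straight from IIS W3C extended log lines. It relies on the `#Fields:` directive and the standard `date`, `time` and `c-ip` column names. The helper name `top_ips`, the sample rows, and the 192.0.2.x addresses are invented for the example, and the timestamps are assumed to be UTC, which is how the W3C format normally logs them.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_ips(log_lines, now, window=timedelta(minutes=10), limit=10):
    """Count requests per client IP in the `window` ending at `now`."""
    fields = []
    counts = Counter()
    cutoff = now - window
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]       # column names follow the directive
            continue
        if not line or line.startswith("#") or not fields:
            continue                        # other directives, blanks, no header yet
        row = dict(zip(fields, line.split()))
        try:
            stamp = datetime.strptime(
                f"{row['date']} {row['time']}", "%Y-%m-%d %H:%M:%S"
            )
        except (KeyError, ValueError):
            continue                        # malformed row or missing date/time
        if cutoff <= stamp <= now and "c-ip" in row:
            counts[row["c-ip"]] += 1
    return counts.most_common(limit)


# Made-up sample in W3C extended format (documentation-range IP addresses).
SAMPLE_LOG = """\
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status
2005-11-08 17:05:10 192.0.2.10 GET /index.htm 200
2005-11-08 17:18:01 192.0.2.10 GET /a.htm 200
2005-11-08 17:19:30 192.0.2.10 GET /b.htm 200
2005-11-08 17:21:45 192.0.2.77 GET /index.htm 200
""".splitlines()

if __name__ == "__main__":
    now = datetime(2005, 11, 8, 17, 27, 0)
    for ip, hits in top_ips(SAMPLE_LOG, now):
        print(ip, hits)
```

This re-reads whatever lines it is given, so for something closer to live stats it would have to be run every minute or so against the tail of the current log file.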

JAB Creations




msg:343526
 8:46 pm on Nov 8, 2005 (gmt 0)

Well, I can see a list (1,000?) of the highest-bandwidth IPs. There are tons of options (DNS etc.). Someone from the school I attend burned 1.3 GB of bandwidth this weekend (not really concerned but I still notice such things as I radar below the radar, bwhahahah).

You can also configure awstats to allow updates from the browser (or to update at a regular interval).


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved