You can use robots.txt to keep some or all bots away from certain files. For example, if all your images are in www.example.com/images, you could deny bots access to that directory via robots.txt if you have no interest in being found via Google Images.
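To make the robots.txt idea concrete, here is a minimal sketch that checks a rule set with Python's standard urllib.robotparser. The two-line robots.txt and the helper name is_allowed are assumptions for illustration, not anything taken from a real site, and they only show how a well-behaved bot would interpret the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every bot to stay out of /images/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/images/logo.png",
        "http://www.example.com/index.html",
    ):
        print(url, is_allowed("Googlebot-Image", url))
```

Keep in mind robots.txt is only a request. Bots that ignore it will still pull the files, so it helps with the polite crawlers and not with abusers.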
I use a version of AWStats that I've personally modified to detect an extensive number of browsers and robots, though the default install is half decent. So far this month...
Yahoo - 2080+242 - 17.60 MB
MSN - 1400+46 - 30.11 MB
Google - 520+13 - 4.79 MB
(+242 = 242 hits on robots.txt)
Those are the three biggies... MSN tends to be a bandwidth hog, so I would suggest finding out where MSN is crawling that may be costing more bandwidth than you want. Here are links to the big three's bot pages...
[search.msn.com...]
[help.yahoo.com...]
[google.com...]
You should of course know how to work with robots.txt...
[robotstxt.org...]
If you want to play with AWStats... (the install is kind of hard, though)
awstats.sourceforge.net
There are occasional bots that will do a moderately hard crawl (I'm not concerned about bandwidth right now, thankfully), but they fluctuate to the point where if one doesn't hit, another does. Here are the totals for the bots that have hit my site the hardest so far this year, the big three plus the lesser-known ones...
Yahoo - 88310+9433 - 596.00 MB
MSN - 53551+1526 - 1.03 GB
Googlebot - 37079+647 - 262.85 MB
WISENutbot - 7525+88 - 65.23 MB
Kolinka - 6033+604 - 139.67 MB (Forum spider)
BecomeBot - 6341+269 - 45.87 MB (Google ties?)
Ichiro - 6070+20 - 166.37 MB (Japan)
Grub - 4283+17 - 75.18 MB
ConveraCrawler - 2885+20 - 33.32 MB
Ask Jeeves - 2303+424 - 85.03 MB
LmCrawler - 2155+30 - 16.86 MB
psbot - 2026+115 - 17.63 MB (pic search)
Texas A&M IRLbot - 1745+303 - 7.45 MB
Alexa - 1201+403 - 56.44 MB
Asterias - 875+3 - 27.11 MB (Singingfish Spider)
Accoona - 857+7 - 6.05 MB
The rest of the bots stay at about 6 MB or less.
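If you want to rank figures like these by bandwidth rather than by hits, a rough sketch along these lines could work. It parses lines in the "name - hits+robots.txt hits - size" layout used in this post (not the AWStats report format itself), and it assumes 1 GB = 1024 MB. The names BotStat, parse_line and rank_by_bandwidth are made up for the example.

```python
import re
from typing import List, NamedTuple

# Matches e.g. "MSN - 53551+1526 - 1.03 GB"; any trailing note is ignored.
LINE_RE = re.compile(
    r"^(?P<name>.+?) - (?P<hits>\d+)\+(?P<robots>\d+) - "
    r"(?P<size>\d+(?:\.\d+)?) (?P<unit>MB|GB)"
)


class BotStat(NamedTuple):
    name: str
    hits: int
    robots_txt_hits: int
    megabytes: float


def parse_line(line: str) -> BotStat:
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised stats line: {line!r}")
    size = float(match["size"])
    if match["unit"] == "GB":
        size *= 1024  # assumption: 1 GB = 1024 MB
    return BotStat(match["name"], int(match["hits"]), int(match["robots"]), size)


def rank_by_bandwidth(text: str) -> List[BotStat]:
    stats = [parse_line(line) for line in text.splitlines() if line.strip()]
    return sorted(stats, key=lambda s: s.megabytes, reverse=True)


# A few of the yearly totals quoted above.
SAMPLE = """\
Yahoo - 88310+9433 - 596.00 MB
MSN - 53551+1526 - 1.03 GB
Googlebot - 37079+647 - 262.85 MB
Ichiro - 6070+20 - 166.37 MB (Japan)
Texas A&M IRLbot - 1745+303 - 7.45 MB
"""

if __name__ == "__main__":
    for stat in rank_by_bandwidth(SAMPLE):
        print(f"{stat.name:<18} {stat.megabytes:>9.2f} MB  {stat.hits} hits")
```

Sorted this way, MSN comes out on top even though Yahoo made more requests, which is the reason to look at where MSN is crawling.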
Keep in mind that bots may hit a site years after they last roamed it, and vice versa, so you may have to detect unknown bots that I am not currently aware of.
I'm not sure, but if the size of a file can be determined from a HEAD request, then it would (if I'm correct) make more sense for bots to HEAD files (images specifically) instead of downloading them, to reduce bandwidth.
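For what it's worth, a HEAD response carries the same headers a GET would, without the body, and servers normally include Content-Length for static files. Here is a small sketch of reading it with Python's standard http.client. The function name head_content_length is made up for the example, and it returns None when the server sends no Content-Length (which can happen with dynamically generated pages).

```python
import http.client
from typing import Optional
from urllib.parse import urlsplit


def head_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return the reported size in bytes, if any."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)  # headers only, no body is transferred
        response = conn.getresponse()
        value = response.getheader("Content-Length")
    finally:
        conn.close()
    return int(value) if value is not None else None


if __name__ == "__main__":
    # Placeholder URL from the example above; substitute a file on your own site.
    print(head_content_length("http://www.example.com/images/logo.png"))
```

Whether a given bot actually does this is up to the bot's author. As a site owner you can't make a crawler use HEAD.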
I have also saved bandwidth by blocking spammers using various methods (very effective if you have a high level of abuse), though I won't discuss those methods right now.
Anyway I hope this helps some...
- John
You can also configure AWStats to allow updates from the browser (or to update at a regular interval).