Forum Moderators: goodroi


Help with Spiders

Cutting back on the bandwidth


devnet

4:30 pm on Jun 30, 2006 (gmt 0)

10+ Year Member



Hiya

I'm new here and had questions concerning what others do to cut back on traffic from spiders and also how to direct spiders more efficiently to robots.txt. I know I'm probably not putting this in the right forum...but there are so many here and I didn't want to put it in the wrong place right away.

Here's what I get for bot traffic (Month of June):

Inktomi Slurp
Hits: 13191+26234
Bandwidth: 42.47 MB
Last Visit: 30 Jun 2006 - 11:46

Googlebot
Hits: 6963+1152
Bandwidth: 14.49 MB
Last Visit: 30 Jun 2006 - 11:30

MSNBot
Hits: 3263+7129
Bandwidth: 5.94 MB
Last Visit: 30 Jun 2006 - 11:42

Unknown robot (identified by 'crawl')
Hits: 2618+162
Bandwidth: 59.89 MB
Last Visit: 30 Jun 2006 - 11:44

Unknown robot (identified by hit on 'robots.txt')
Hits: 0+1306
Bandwidth: 296.94 KB
Last Visit: 30 Jun 2006 - 11:25

Unknown robot (identified by 'robot')
Hits: 1132+25
Bandwidth: 73.43 MB
Last Visit: 30 Jun 2006 - 11:19

EchO!
Hits: 933
Bandwidth: 12.02 MB
Last Visit: 30 Jun 2006 - 05:26

Unknown robot (identified by 'spider')
Hits: 239+81
Bandwidth: 11.26 MB
Last Visit: 30 Jun 2006 - 04:45

Alexa (IA Archiver)
Hits: 147+17
Bandwidth: 9.29 MB
Last Visit: 30 Jun 2006 - 08:23

WISENutbot
Hits: 118+10
Bandwidth: 4.37 MB
Last Visit: 27 Jun 2006 - 21:34

AskJeeves
Hits: 63+22
Bandwidth: 5.73 MB
Last Visit: 29 Jun 2006 - 04:59

Voyager
Hits: 61+15
Bandwidth: 3.49 MB
Last Visit: 30 Jun 2006 - 06:41

Netcraft
Hits: 260
Bandwidth: None
Last Visit: 29 Jun 2006 - 20:36

Walhello appie
Hits: 5+32
Bandwidth: 72.92 KB
Last Visit: 10 Jun 2006 - 13:45

Scooter
Hits: 3+3
Bandwidth: 32.78 KB
Last Visit: 25 Jun 2006 - 07:02

ht://Dig
Hits: 2+3
Bandwidth: 94.44 KB
Last Visit: 11 Jun 2006 - 05:30

The question I have is: on the "hits" above, the first number is how many times the spider 'hit' my pages, and the second number (after the "+") is how many times it successfully requested robots.txt. Should a spider hit robots.txt each time it 'hits' (i.e., requests) a page?

Secondly, my robots file is in place for all agents and disallows all directories except my images (which don't change much unless I post a new blog entry) and the blog entries themselves. Should I put more stringent rules there, or in .htaccess, to direct spiders? Does anyone think I'm being hit too often and for too much bandwidth?

I have a plan with 500 GB monthly, so I'm not in danger of going over... but I have 2 other sites sharing bandwidth on this same plan, and both of those sites have graphics and documents to download, so there may come a day when I'll get close to 500 GB. I want to squeeze what I can out of it now to prepare for that. Anyone have advice/suggestions for me?

Pfui

4:56 am on Jul 5, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



devnet, I'm sorry but I find your post a bit mind-boggling (which may explain why you've yet to receive any reply after five days, sorry). To sort of touch on most of your Qs, here goes:

...what others do to cut back on traffic from spiders...

People use everything from bot traps to special programs, to server-specific features such as .htaccess and mod_rewrite, to passwords and captchas, to blacklisting by UA and IP, to whitelisting known non-bots. And more.

...how to direct spiders more efficiently to robots.txt...

Proper robots ask for it by name; rude/bad bots don't. Making the non-compliant eat it won't matter a bit because they'll spit it out anyway. Best thing to do is make sure your robots.txt file is up to snuff. (Tons of posts in this forum will show you how-to.)
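For reference, a minimal, well-formed robots.txt that compliant crawlers will honor might look like the sketch below. It must live at the site root, and the directory names here are placeholders -- substitute whatever your own site actually uses:

```text
# http://www.example.com/robots.txt
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/

# Crawl-delay is a non-standard extension, but Slurp and MSNBot
# honor it -- handy for slowing a compliant-but-hungry crawler
User-agent: Slurp
Crawl-delay: 10
```

Note that Disallow only keeps out bots that choose to obey; it's a request, not an access control.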

...I know I'm probably not putting this in the right forum...but there are so many here and I didn't want to put it in the wrong place right away...

Actually, there's crossover, and since you're talkin' 'bout bots, you've landed in a pretty good place -- you're surrounded by semi-obsessive log-scanning bot-watchers [webmasterworld.com] :)

...Should a spider hit robots each time it 'hits' a page aka requests a page?...

In short: No. Some retrieve robots.txt seemingly every time they visit; others hardly ever seem to, yet still heed it. The patterns and frequency depend on each company/engine, their current algorithms, the number of crawlers/servers/etc., what may be happening upstream, how often your site updates, and about a gazillion other factors.

...my robots file is in place for all agents and disallows to all directories except my images (which don't change that much unless I post a new blog entry) and the entries themselves. Should I put more stringent rules there or in .htaccess to direct spiders?...

If you don't want SEs to use your bandwidth to show your images in their 'image search' results (also opening yourself up to image hijacking, and/or making a mess of any anti-hijacking code), then Disallow any image directories. That will cover the good bots.
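If you go that route, keeping the engines out of your image directories is just another Disallow line or two. The directory names below are examples -- use whatever your blog software actually creates -- and the Googlebot-Image block is an optional extra for Google's image crawler specifically:

```text
User-agent: *
Disallow: /images/
Disallow: /uploads/

# Optionally shut out Google Image Search entirely
User-agent: Googlebot-Image
Disallow: /
```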

As far as outright blocking bad bots and scrapers and such, again because they ignore robots.txt, you can use .htaccess. That's easier said than done, so check out Jim Morgan's Apache Web Server [webmasterworld.com] forum, its Charter and Library. And if you're new to mod_rewrite, mod_access, SetEnv, etc., go slowly. (And if you're on a Windows box -- well, there are some similarities. Again, check the Charter and Library.)
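As a starting point, here's a conservative .htaccess sketch using mod_rewrite to refuse requests from a couple of user-agents. The UA strings are illustrative only (EmailSiphon is a known address-harvester; "BadBot" is a stand-in) -- build your real list from your own logs, and test on a copy of your site first:

```apache
# Requires mod_rewrite to be enabled on the server
RewriteEngine On

# [NC] = case-insensitive match; [OR] chains the conditions
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC]

# [F] sends a 403 Forbidden for any request matching the conditions above
RewriteRule .* - [F]
```

A 403 costs you a few bytes per request instead of the full page, which is the whole point for bandwidth. Just be careful with short patterns -- an over-broad match can lock out legitimate visitors.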

...Anyone think I'm being hit too much and for too much bandwidth?...

I'll leave the math (and your data recaps) to others. :) Suffice it to say that even some good bots 'cost' a lot, while other good bots don't. (And, of course, bad bots are a complete waste.) Whether any bot's incoming-traffic 'benefits' outweigh its bandwidth costs is for you to assess, based on your pocketbook, your type of site(s) (e.g., ads or not), your long-range goals, etc.

For me, Google's worth it, with MSN and Yahoo tied for a distant second place. Most of the others I see are already using G's or MSN's SERPs anyway, so I'm more of a hard-liner than some, not as much as others.

Whew. YMMV & HTH! :)