Welcome to WebmasterWorld
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
How to block all robots except google, yahoo, msn, and ask
vphoner




msg:4187056
 3:40 pm on Aug 14, 2010 (gmt 0)

Can someone give me an example of a robots.txt that allows Google, Yahoo, MSN (Bing), Ask, and Google AdWords but blocks all others? (Are there any other good robots I may want to add to this list?)

Is this a wise way to go? Can you block bad robots that ignore robots.txt? I just saw a robot called 80legs scrape my site, so I think it's time to block these robots that can do no good.

 

vphoner




msg:4187072
 4:32 pm on Aug 14, 2010 (gmt 0)

One more question. Is it a good idea to block the bixolabs and 80legs robots? They seem to be the ones with the most hits on my server. Is there any good reason to keep them?

wilderness




msg:4187078
 5:10 pm on Aug 14, 2010 (gmt 0)

Just an attempt to assist you with wording.

robots.txt is aimed at honorable bots that are willing to comply with your requests; most dishonorable bots (even the pests) don't read it or comply with it.
robots.txt does NOT really offer any controls that restrict access to your website(s).

.htaccess, by contrast, enforces restrictions imposed by the webmaster and does NOT offer the visitor (bot or otherwise) a choice.
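To make that distinction concrete, here is a rough .htaccess sketch (Apache mod_rewrite syntax). The user-agent names are just the ones mentioned in this thread, not an authoritative blocklist:

```apache
# Sketch only, not a drop-in rule set: deny any request whose User-Agent
# header matches one of these patterns, whether or not the bot reads
# robots.txt. [NC] = case-insensitive, [F] = return 403 Forbidden.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (80legs|008|MJ12bot) [NC]
RewriteRule .* - [F,L]
```

Unlike a robots.txt entry, this is enforced by the server itself, so a bot that ignores robots.txt still gets a 403.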

kkrugler




msg:4187080
 5:23 pm on Aug 14, 2010 (gmt 0)

Hi vphoner,

If you want to explicitly block bixolabs, then you can add the following to your robots.txt:

User-agent: bixolabs
Disallow: /


For bots (like bixolabs) that obey robots.txt, I'd much prefer that you use the Crawl-delay directive to constrain request rates to whatever you feel is reasonable, rather than doing an outright block - but that's from the perspective of Bixo Labs (my company), of course.

If you do decide to restrict crawl rates, then it would look like:

User-agent: bixolabs
Crawl-delay: 30


This caps requests at 120 per hour (an average 30-second delay between requests), or about 3K requests per day.
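A quick back-of-envelope check of those numbers, as a small Python sketch:

```python
# Convert a robots.txt Crawl-delay value (seconds between requests)
# into the request-rate caps it implies per hour and per day.
def rate_caps(crawl_delay_seconds: int) -> tuple[int, int]:
    per_hour = 3600 // crawl_delay_seconds
    per_day = per_hour * 24
    return per_hour, per_day

print(rate_caps(30))  # (120, 2880) -- i.e. about 3K requests/day
```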

-- Ken

wilderness




msg:4187094
 6:13 pm on Aug 14, 2010 (gmt 0)

If you want to explicitly block bixolabs, then you can


Once again, and merely for clarity.

robots.txt "blocks" nothing, rather it is a "request" for compliance.

thetrasher




msg:4187117
 7:33 pm on Aug 14, 2010 (gmt 0)

User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: bingbot
User-agent: Teoma
User-agent: AdsBot-Google
Disallow:

User-agent: *
Disallow: /

# end
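As a sanity check, Python's standard urllib.robotparser shows how a compliant crawler would interpret the whitelist above (the example.com URL is illustrative):

```python
from urllib.robotparser import RobotFileParser

# The whitelist robots.txt from the post above, inlined for testing.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: bingbot
User-agent: Teoma
User-agent: AdsBot-Google
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Whitelisted crawlers may fetch anything; everyone else is asked to stay out.
print(rp.can_fetch("Googlebot", "http://example.com/page"))  # True
print(rp.can_fetch("008", "http://example.com/page"))        # False (80legs)
```

Note this only predicts the behavior of bots that honor robots.txt; it enforces nothing.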

dstiles




msg:4187122
 7:58 pm on Aug 14, 2010 (gmt 0)

I'm against 80legs, but at least it honours robots.txt - I haven't seen it around since I added it. It has the really dumb robot identifier of 008 (those are zeroes, not o's).

80legs is a distributed bot that cannot easily be traced to any specific IP, so you cannot determine whether it's a real bot simply by IP. MJ12bot is another one worth blocking on this principle.

Bixo is probably in a similar category since it looks to be a mining tool offered for "hire". It seems to run in Amazon's "cloud" which is well worth blocking by IP ranges - see other threads here re: Amazon AWS.
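Blocking a cloud provider by IP range in .htaccess might look like the sketch below (Apache 2.2 syntax). The range shown is deliberately a documentation range, not a real AWS allocation - look up Amazon's currently published ranges yourself:

```apache
# Sketch only: deny an example IP range at the server level.
# 203.0.113.0/24 is a reserved documentation range (RFC 5737),
# standing in for whatever cloud ranges you decide to block.
Order Allow,Deny
Allow from all
Deny from 203.0.113.0/24
```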

© Webmaster World 1996-2014 all rights reserved