
How to block all robots except Google, Yahoo, MSN, and Ask
3:40 pm on Aug 14, 2010 (gmt 0)

Full Member

10+ Year Member

joined:Aug 27, 2005
posts:300
votes: 0


Can someone give me an example of a robots.txt that allows Google, Yahoo, MSN (Bing), Ask, and Google AdWords, but blocks all others? (Are there any other good robots I might want to add to this list?)

Is this a wise way to go? Can you block bad robots that ignore robots.txt? I just saw a robot called 80legs scrape my site, so I think it's time to block these robots that can do no good.
4:32 pm on Aug 14, 2010 (gmt 0)

Full Member

10+ Year Member

joined:Aug 27, 2005
posts:300
votes: 0


One more question. Is it a good idea to block the bixolabs and 80legs robots? They seem to be the ones with the most hits on my server. Is there any good reason to keep them?
5:10 pm on Aug 14, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


Just an attempt to help with the wording.

robots.txt is a request aimed at honorable bots that are willing to comply (most dishonorable bots, even the pests, don't read it or comply with it). robots.txt does NOT actually offer any control that restricts access to your website(s).

.htaccess, by contrast, imposes restrictions that the server enforces, and does NOT offer the visitor (bot or otherwise) a choice.
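
For example, a minimal .htaccess sketch (Apache 2.2-era syntax, assuming mod_setenvif is available; the bot names here are only illustrative):

# Tag any request whose User-Agent matches a known pest (case-insensitive)
SetEnvIfNoCase User-Agent "80legs" bad_bot
SetEnvIfNoCase User-Agent "008" bad_bot

# Serve everyone except tagged requests
Order Allow,Deny
Allow from all
Deny from env=bad_bot

A bot can't opt out of this the way it can ignore robots.txt; the server answers 403 before any page is served.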
5:23 pm on Aug 14, 2010 (gmt 0)

New User

5+ Year Member

joined:Dec 9, 2009
posts: 4
votes: 0


Hi vphoner,

If you want to explicitly block bixolabs, then you can add the following to your robots.txt:

User-agent: bixolabs
Disallow: /


For bots (like bixolabs) that obey robots.txt, I'd much prefer that you use the crawl-delay directive to constrain request rates to whatever you feel is reasonable, rather than doing an outright block - but that's from the perspective of Bixo Labs (my company), of course.

If you do decide to restrict crawl rates, then it would look like:

User-agent: bixolabs
Crawl-delay: 30


This caps the crawler at a maximum of 120 requests/hour (an average 30-second delay between requests), or about 3K requests/day.

-- Ken
6:13 pm on Aug 14, 2010 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


If you want to explicitly block bixolabs, then you can


Once again, and merely for clarity:

robots.txt "blocks" nothing; rather, it is a "request" for compliance.
7:33 pm on Aug 14, 2010 (gmt 0)

Junior Member

10+ Year Member

joined:June 25, 2005
posts:179
votes: 1


# The big search engines: Googlebot, Slurp (Yahoo), msnbot/bingbot (MSN/Bing),
# Teoma (Ask), and AdsBot-Google (AdWords). An empty Disallow permits everything.
User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: bingbot
User-agent: Teoma
User-agent: AdsBot-Google
Disallow:

# Everyone else is asked to stay out entirely.
User-agent: *
Disallow: /

# end
7:58 pm on Aug 14, 2010 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3092
votes: 2


I'm against 80legs, but at least it honours robots.txt - I haven't seen it around since I added it. It has the really dumb robot identifier of 008 (those are zeroes, not O's).

80legs is a distributed bot that cannot easily be traced to any specific IP, so you cannot determine whether it's the real bot simply by IP. MJ12bot is another one worth blocking on this principle.

Bixo is probably in a similar category, since it looks to be a mining tool offered for "hire". It seems to run in Amazon's "cloud", which is well worth blocking by IP range - see other threads here re: Amazon AWS.
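
As a sketch, denying by IP range in .htaccess looks like this (Apache 2.2-era syntax; the CIDR ranges below are purely illustrative - look up Amazon's currently published EC2 ranges before using any):

# Refuse example EC2 address ranges outright (ranges are illustrative only)
Order Allow,Deny
Allow from all
Deny from 174.129.0.0/16
Deny from 72.44.32.0/19

Any request from a denied range gets a 403, regardless of what User-Agent it claims to be.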
 
