
Forum Moderators: Ocean10000 & incrediBILL


How to block all robots except google, yahoo, msn, and ask

3:40 pm on Aug 14, 2010 (gmt 0)

5+ Year Member



Can someone give me an example of a robots.txt that allows Google, Yahoo, MSN (Bing), Ask, and Google AdWords, but blocks all others? (Are there any other good robots that I may want to add to this list?)

Is this a wise way to go? Can you block bad robots that ignore robots.txt? I just saw a robot called 80legs scrape my site, so I think it's time to block these robots that can do no good.
4:32 pm on Aug 14, 2010 (gmt 0)

5+ Year Member



One more question. Is it a good idea to block the bixolabs and 80legs robots? They seem to be the ones with the most hits on my server. Is there any good reason to keep them?
5:10 pm on Aug 14, 2010 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Just an attempt to assist you with wording.

robots.txt is for honorable bots that are willing to comply with your requests; most dishonorable bots (even the pests) don't read it or comply with it.
robots.txt does NOT actually offer any control that restricts access to your website(s).

htaccess, by contrast, imposes restrictions decided by the webmaster and enforced by the server, and does NOT offer the visitor (bot or otherwise) a choice.
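For example, a minimal .htaccess sketch (Apache 2.2 syntax; the bot names here are just examples, substitute whatever you actually want to block):

# Flag requests whose User-Agent matches a known pest (case-insensitive)
SetEnvIfNoCase User-Agent "80legs" bad_bot
SetEnvIfNoCase User-Agent "MJ12bot" bad_bot

# Allow everyone except flagged agents
Order Allow,Deny
Allow from all
Deny from env=bad_bot

The bot never gets a say: the server returns 403 Forbidden whether or not the bot reads robots.txt.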
5:23 pm on Aug 14, 2010 (gmt 0)

5+ Year Member



Hi vphoner,

If you want to explicitly block bixolabs, then you can add the following to your robots.txt:

User-agent: bixolabs
Disallow: /


For bots (like bixolabs) that obey robots.txt, I'd prefer that you use the Crawl-delay directive to constrain request rates to whatever you feel is reasonable, rather than doing an outright block. But that's from the perspective of Bixo Labs (my company), of course.

If you do decide to restrict crawl rates, then it would look like:

User-agent: bixolabs
Crawl-delay: 30


This caps the bot at one request every 30 seconds, i.e. a maximum of 120 requests/hour, or roughly 3K requests/day.

-- Ken
6:13 pm on Aug 14, 2010 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



If you want to explicitly block bixolabs, then you can


Once again, and merely for clarity.

robots.txt "blocks" nothing, rather it is a "request" for compliance.
7:33 pm on Aug 14, 2010 (gmt 0)

5+ Year Member



User-agent: Googlebot
User-agent: Slurp
User-agent: msnbot
User-agent: bingbot
User-agent: Teoma
User-agent: AdsBot-Google
Disallow:

User-agent: *
Disallow: /

# end
7:58 pm on Aug 14, 2010 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I'm against 80legs, but at least it honours robots.txt - I haven't seen it around since I added it. It has the really dumb robots identifier of 008 (those are zeroes, not o's).

80legs is a distributed bot that cannot be traced easily to any specific IP. You cannot determine if it's a real bot or not simply by IP. MJ12bot is another one worth blocking on this principle.
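Since IP isn't a reliable handle for these distributed bots, the practical fallback is matching their User-Agent strings in .htaccess. A rough sketch, assuming mod_rewrite is available (the patterns are examples based on the agent names mentioned above):

RewriteEngine On
# "008" is the 80legs agent; [NC] makes the match case-insensitive
RewriteCond %{HTTP_USER_AGENT} (^008/|MJ12bot) [NC]
# Return 403 Forbidden for any matching request
RewriteRule .* - [F]

Of course this only catches bots that identify themselves honestly; anything spoofing a browser UA needs other measures.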

Bixo is probably in a similar category, since it looks to be a mining tool offered for "hire". It seems to run in Amazon's "cloud", which is well worth blocking by IP range - see the other threads here regarding Amazon AWS.
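If you do go the IP-range route, the Apache 2.2 form looks like this. The ranges below are purely illustrative - look up Amazon's currently allocated EC2 blocks (discussed in the AWS threads here) before denying anything:

Order Allow,Deny
Allow from all
# Illustrative EC2-style ranges only - verify current AWS allocations first
Deny from 174.129.0.0/16
Deny from 184.72.0.0/15

Bear in mind cloud ranges change over time, and blanket-blocking them can also catch legitimate services hosted there.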