|How to block all robots except Google, Yahoo, MSN, and Ask|
| 3:40 pm on Aug 14, 2010 (gmt 0)|
Can someone give me an example of a robots.txt that allows Google, Yahoo, MSN (Bing), Ask, and Google AdWords, but blocks all others? (Are there any other good robots that I may want to add to this list?)
Is this a wise way to go? Can you block bad robots that ignore the robots.txt? I just saw a robot called 80legs scrape my site, so I think it's time to block these robots that can do no good.
| 4:32 pm on Aug 14, 2010 (gmt 0)|
One more question. Is it a good idea to block the bixolabs and 80legs robots? They seem to be the ones with the most hits on my server. Is there any good reason to keep them?
| 5:10 pm on Aug 14, 2010 (gmt 0)|
Just an attempt to assist you on wording.
robots.txt is for honorable bots that are willing to comply with your requests; most dishonorable bots (pests included) either don't read it or don't comply with it.
robots.txt does NOT really offer any control that restricts access to your website(s).
htaccess offers valid and, in most instances, effective restrictions imposed by the webmaster, and does NOT offer the visitor (bot or otherwise) a choice.
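For bots that do comply, the allowlist asked about at the top of the thread could look something like the robots.txt text embedded below. Treat it as a rough sketch: the user-agent tokens (Googlebot, AdsBot-Google, Slurp, msnbot, bingbot, Teoma) are assumptions about what each crawler announces, so check them against each engine's own documentation. Python's standard `urllib.robotparser` is used here only to sanity-check how a compliant bot would read the file; `is_allowed` is a helper name made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumed tokens for Google, Google AdWords, Yahoo, MSN/Bing and Ask.
# An empty "Disallow:" means "nothing is disallowed" for that group.
ALLOWLIST_ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: AdsBot-Google
User-agent: Slurp
User-agent: msnbot
User-agent: bingbot
User-agent: Teoma
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if a compliant bot with this token may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "Slurp", "bingbot", "Teoma", "008"):
        verdict = "allowed" if is_allowed(ALLOWLIST_ROBOTS_TXT, agent) else "asked to stay out"
        print(f"{agent}: {verdict}")
```

Keep in mind this only tells you what a well-behaved bot would conclude. A bot that ignores robots.txt never runs this logic at all.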
| 5:23 pm on Aug 14, 2010 (gmt 0)|
If you want to explicitly block bixolabs, then you can add the following to your robots.txt:
For bots (like bixolabs) that obey robots.txt, I'd much prefer that you use the crawl-delay directive to constrain request rates to whatever you feel is reasonable, rather than doing an outright block. That's from the perspective of Bixo Labs (my company), of course.
If you do decide to restrict crawl rates, then it would look like:
This sets the max # of requests/hour to be 120 (average 30 second delay between requests), or about 3K requests/day.
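To double-check that arithmetic, here is a small sketch. It assumes a 30-second `Crawl-delay` (taken from the figures above) and uses `bixolabs` as the user-agent token, which is a guess on my part and not confirmed anywhere in this thread. The helper names are made up for the example; `crawl_delay` is the standard-library `urllib.robotparser` method.

```python
from urllib.robotparser import RobotFileParser

# "bixolabs" is an assumed token; the 30-second delay comes from the post above.
CRAWL_DELAY_ROBOTS_TXT = """\
User-agent: bixolabs
Crawl-delay: 30
"""


def delay_for(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to this token, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def max_requests(delay_seconds: int):
    """Upper bound on requests per hour and per day for a given delay."""
    per_hour = 3600 // delay_seconds
    return per_hour, per_hour * 24


if __name__ == "__main__":
    delay = delay_for(CRAWL_DELAY_ROBOTS_TXT, "bixolabs")
    per_hour, per_day = max_requests(delay)
    print(f"delay={delay}s, at most {per_hour}/hour, {per_day}/day")
```

With a 30-second delay that works out to 3600 / 30 = 120 requests per hour and 120 x 24 = 2,880 per day, which is the "about 3K" quoted above.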
| 6:13 pm on Aug 14, 2010 (gmt 0)|
|If you want to explicitly block bixolabs, then you can |
Once again, and merely for clarity.
robots.txt "blocks" nothing; rather, it is a "request" for compliance.
| 7:33 pm on Aug 14, 2010 (gmt 0)|
| 7:58 pm on Aug 14, 2010 (gmt 0)|
I'm against 80legs, but at least it honours robots.txt: I haven't seen it around since I added a rule for it. It has the really dumb robots identifier of 008 (that's zeroes, not o's).
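For anyone wanting to try the same thing, a rule for that 008 identifier might look like the robots.txt text below. This is only a sketch under the assumption that the token is matched as written above; the check uses Python's standard `urllib.robotparser`, and `can_crawl` is a name invented for this example.

```python
from urllib.robotparser import RobotFileParser

# Zeroes, not letter o's. Every other bot is left unrestricted by this file.
BLOCK_80LEGS_ROBOTS_TXT = """\
User-agent: 008
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if a compliant bot with this token may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("008", "Googlebot"):
        print(agent, "may crawl" if can_crawl(BLOCK_80LEGS_ROBOTS_TXT, agent) else "is asked to stay out")
```

Since no `User-agent: *` group is present, bots other than 008 are unaffected by this file.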
80legs is a distributed bot that cannot be traced easily to any specific IP. You cannot determine if it's a real bot or not simply by IP. MJ12bot is another one worth blocking on this principle.
Bixo is probably in a similar category since it looks to be a mining tool offered for "hire". It seems to run in Amazon's "cloud" which is well worth blocking by IP ranges - see other threads here re: Amazon AWS.
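Blocking by IP range is normally done at the server (htaccess or a firewall), but the underlying check is simple enough to sketch. The example below assumes you have collected the ranges yourself; the two networks shown are reserved documentation ranges used as placeholders, NOT real Amazon ranges, and `is_blocked` is a hypothetical helper.

```python
import ipaddress

# Placeholder networks from the reserved documentation blocks.
# Replace with the ranges you actually want to deny.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("203.0.113.42", "192.0.2.1"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Unlike a robots.txt rule, a check like this does not depend on the visitor's cooperation, which is the point made earlier in the thread about htaccess.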