The definition of a "bad bot" varies depending on who you ask, as there are various behaviours that can be (or are) considered bad,
e.g. not reading robots.txt, disobeying robots.txt, requesting too many pages in a short time span, revisiting pages too often, e-mail harvesting, guestbook spamming, log spamming, munging URLs.
Some of the above are rather subjective measures (e.g. how often is too often) and others require some effort to identify (e.g. e-mail harvesting).
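To make the "too many pages in a short time span" measure concrete, here is a rough sketch of a sliding-window counter per IP. The threshold values and the names (WINDOW_SECONDS, MAX_REQUESTS, too_fast) are my own assumptions, not anything standard; pick numbers that suit your own traffic.

```python
from collections import defaultdict, deque

# Assumed thresholds: purely illustrative, tune for your own site
WINDOW_SECONDS = 60
MAX_REQUESTS = 120

# Per-IP queue of recent request timestamps (seconds)
_hits = defaultdict(deque)

def too_fast(ip, now):
    """Record a request from ip at time now; return True if it exceeds the limit."""
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```

In practice you would call too_fast(ip, time.time()) for each request, or replay timestamps parsed from your access log.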
Trying to use robots.txt against bad bots that either don't read it or disobey it will of course fail, so other methods, e.g. .htaccess, have to be used against them.
Some bad bots change both their User Agent and IP address frequently, so the only way to ban these is to have a robot trap (search: robot trap) that bans them automatically.
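The idea behind a robot trap, roughly sketched: pick a path that is disallowed in robots.txt and only linked invisibly, then ban any IP that requests it. Everything named here (TRAP_PATH, banned_ips, handle_request) is hypothetical and just for illustration; a real trap would write the ban into .htaccess or a firewall rather than keep it in memory.

```python
# Hypothetical trap path: assumed to be listed under Disallow in robots.txt
# and reachable only through a link that human visitors never see
TRAP_PATH = "/trap/"

# In-memory ban list for the sketch; a real setup would persist this
banned_ips = set()

def handle_request(ip, path):
    """Return the HTTP status code to send for this request."""
    if ip in banned_ips:
        return 403
    if path.startswith(TRAP_PATH):
        # Only a client that ignores robots.txt should ever get here
        banned_ips.add(ip)
        return 403
    return 200
```

Because the ban is keyed on what the bot does rather than on its User Agent, changing the User Agent string doesn't help it. A new IP gets through until it hits the trap again.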
My biggest concern is content theft down the road. For this reason, the thought of banning all bots aside from Yahoo, Googlebot, and msnbot seems appealing. Also, since the site will be an English-language one, I have no need for visitors from China, South Korea, or any number of other countries. Ideally, I would like to ban their access, though I haven't a clue how you can gather all the various IP ranges for certain countries and ban their access wholesale.
How do you do this? I downloaded a list of IP ranges once (I think it might have been geoip-something), but it didn't give you, say, one contiguous list of all the ranges for China. Instead it listed all the ranges and showed which sliver of each range went to China, which piece went to Canada, and so on.
From looking at it, it seemed that to ban one country such as China or India (just examples), you'd have to track down and input hundreds and hundreds of chunks of IP ranges.
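You wouldn't have to type those chunks in by hand, though. As a rough sketch, assuming the downloaded list can be read as rows of (first IP, last IP, country code), a short script can pull out one country's rows and merge them into the fewest CIDR blocks. The rows and the country codes "AA"/"BB" below are made up for the example, and ranges_to_cidrs is just a name I picked.

```python
import ipaddress

def ranges_to_cidrs(rows, country):
    """Collapse all (first_ip, last_ip, code) rows for one country into CIDR blocks."""
    nets = []
    for first, last, code in rows:
        if code != country:
            continue
        # A single first-last range may need several CIDR blocks to cover it
        nets.extend(ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)))
    # Merge adjacent or overlapping blocks into larger ones where possible
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

# Made-up sample rows using documentation-only address space
rows = [
    ("192.0.2.0", "192.0.2.127", "AA"),
    ("192.0.2.128", "192.0.2.255", "AA"),
    ("198.51.100.0", "198.51.100.255", "BB"),
    ("203.0.113.0", "203.0.113.63", "AA"),
]

# Print one deny line per merged block
for cidr in ranges_to_cidrs(rows, "AA"):
    print("Deny from " + cidr)
```

With the sample rows, the two adjacent "AA" halves merge into a single /24, so you end up with two lines to paste instead of three. A real country will still produce a long list, just a shorter one than the raw file.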
I use [ip-to-country.webhosting.info...]
In my case I don't want anyone from a list of countries to be able to sign up with us:
1. Get the IP of the person trying to sign up.
2. Use that to get their country from the MySQL database.
3. If that country is not allowed, stop them.
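Those steps might look something like the sketch below. To keep it self-contained I've used Python's built-in sqlite3 in place of MySQL, and the table name, column names, sample rows and blocked-country set are all assumptions for illustration, not the actual schema of the ip-to-country data.

```python
import ipaddress
import sqlite3

# In-memory stand-in for the MySQL table; schema and rows are hypothetical
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ip_ranges (ip_from INTEGER, ip_to INTEGER, country TEXT)")
db.executemany("INSERT INTO ip_ranges VALUES (?, ?, ?)", [
    (int(ipaddress.ip_address("192.0.2.0")),
     int(ipaddress.ip_address("192.0.2.255")), "AA"),
    (int(ipaddress.ip_address("198.51.100.0")),
     int(ipaddress.ip_address("198.51.100.255")), "BB"),
])

# Made-up country codes that may not sign up
BLOCKED = {"AA"}

def country_of(ip):
    """Look up the country code for an IPv4 address, or None if unknown."""
    n = int(ipaddress.ip_address(ip))  # dotted quad -> integer
    row = db.execute(
        "SELECT country FROM ip_ranges WHERE ? BETWEEN ip_from AND ip_to LIMIT 1",
        (n,)).fetchone()
    return row[0] if row else None

def signup_allowed(ip):
    """Allow the signup unless the IP maps to a blocked country."""
    return country_of(ip) not in BLOCKED
```

Note that an IP not found in the table is allowed here; whether unknown addresses should be let through or stopped is a choice you'd have to make for your own site.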
You could read the thread I mentioned above re: htaccess and use REMOTE_ADDR in a method such as:
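# Flag requests from the listed addresses/prefixes (example values only)
SetEnvIf REMOTE_ADDR ^(127\.0\.0\.1|192\.168\.2\.|192\.168\.3\.|10\.) bad-ip
<Directory /docroot>
# Apache 2.2-style access control: with Allow,Deny the Allow rules are
# checked first and a matching Deny then wins, so flagged IPs are blocked
Order Allow,Deny
Allow from all
Deny from env=bad-ip
</Directory>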
The IPs would need to be changed and the paths fixed for your own site.
You could also read this for reference:
[httpd.apache.org...]