dstiles - 10:59 pm on Feb 12, 2012 (gmt 0)
Lucy - I decided about two years ago to put a list of my "allowed" bots in the Alternative forum hereabouts - still haven't found time. :(
Yours is a good list BUT - you give several IP ranges that I would never have permitted or, once blocked, released. Once a server farm has been discovered it's killed. If it's ever switched to DSL in future it'll have to convince me. If there is a good bot on the range then I whitelist the bot WITH the IPs it uses.
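To illustrate "whitelist the bot WITH the IPs it uses", here is a minimal Python sketch. It assumes the check runs somewhere you can script it. The bot name GoodBot and both address ranges are hypothetical (reserved documentation addresses), not real bot or host ranges.

```python
import ipaddress

# Hypothetical data: reserved documentation ranges, not real server farms or bots.
BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]
ALLOWED_BOTS = {
    # UA token -> the only networks that bot is accepted from
    "GoodBot": [ipaddress.ip_network("203.0.113.16/28")],
}

def is_allowed(ip: str, user_agent: str) -> bool:
    """Return True if the request may proceed, False if it should be refused."""
    addr = ipaddress.ip_address(ip)
    # The exception applies only when BOTH the UA token and the source IP match.
    for token, nets in ALLOWED_BOTS.items():
        if token in user_agent and any(addr in net for net in nets):
            return True
    # Everything else inside a blocked range is refused.
    return not any(addr in net for net in BLOCKED_RANGES)
```

The point of pairing the two checks is that a faked UA from elsewhere on the farm still gets refused, and so does a browser UA coming from the bot's own addresses.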
I do not have php on my server so ANYTHING that comes in asking for a php extension gets killed. There have been an awful lot of them recently, all as far as I can tell from botnets.
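The php rule is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not my actual setup. The function names are made up for the example, and it assumes the site genuinely serves no php at all.

```python
from urllib.parse import urlsplit

def wants_php(request_target: str) -> bool:
    """True if any path segment of the request ends in .php (case-insensitive)."""
    # Only the path is inspected, so a query string mentioning .php does not count.
    path = urlsplit(request_target).path.lower()
    return any(segment.endswith(".php") for segment in path.split("/"))

def status_for(request_target: str) -> int:
    """Hypothetical helper: 403 for php probes, 200 for everything else."""
    return 403 if wants_php(request_target) else 200
```

Checking each path segment also catches requests of the form /index.php/something, which a plain "ends with .php" test would let through.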
I've allowed Seznam for some time with no perceived problems. It's a "local" SE for CZ as far as I recall.
I began by allowing YahooCacheSystem but it's killing itself slowly, especially from the Eastern IPs. Since MS provides Yahoo's SERPs now there's no reason to allow its bot traffic, so I may also kill Slurp.
Like you, I kill baidu China and allow baidu Japan. I'm not at all sure they do not share their scans, though. I only allow the JP one because a client has trade from there.
Bill - I cannot see a way (at least using my system) of going white-list only. A LOT of bots come with perfectly legit UAs but they are obviously bots. Many come from server farms which, once found, are easy to block (my IP ranges are in MySQL so "file size" isn't much of a problem). I get a lot of botnet junk from servers and broadband, so I can't white-list any dynamics as such. I can only allow a broadband range until one of its IPs misbehaves, when it gets a 403 for a while or, if persistent, gets added to the Always Ban list.
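The "403 for a while, then Always Ban" escalation can be sketched roughly like this in Python. It is a guess at the shape of the logic, not my real code: the real ranges live in MySQL, whereas this keeps everything in memory, tracks single IPs rather than ranges, and uses a made-up ban length and strike count.

```python
import time

TEMP_BAN_SECONDS = 3600        # assumed length of "a while"
STRIKES_BEFORE_PERMANENT = 3   # assumed threshold for "persistent"

class BanList:
    def __init__(self):
        self.temp_until = {}     # ip -> time when the temporary ban expires
        self.strikes = {}        # ip -> number of recorded misbehaviours
        self.always_ban = set()  # permanently banned IPs

    def report_misbehaviour(self, ip, now=None):
        """Record one offence; escalate to the permanent list when persistent."""
        now = time.time() if now is None else now
        self.strikes[ip] = self.strikes.get(ip, 0) + 1
        if self.strikes[ip] >= STRIKES_BEFORE_PERMANENT:
            self.always_ban.add(ip)
            self.temp_until.pop(ip, None)
        else:
            self.temp_until[ip] = now + TEMP_BAN_SECONDS

    def status(self, ip, now=None):
        """Return the HTTP status this IP should receive right now."""
        now = time.time() if now is None else now
        if ip in self.always_ban or self.temp_until.get(ip, 0) > now:
            return 403
        return 200
```

A first offender is let back in once the temporary ban runs out, but the strike count is kept, so repeat offenders end up on the permanent list.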
Frank - if you really mean "robots.txt" then you will miss 90% of evil bot traffic. It has to be htaccess or something similar, because bad bots never check robots.txt in the first place.