With regard to spiders and rogue bots, and controlling their access to my charity websites...
Given that rogue spiders often ignore robots.txt, what methods can be used (e.g. .htaccess statements) to restrict access to the whole site to a very small number of recognised, legitimate search engines?
I'm not worried about commercial SEO requirements; my "search" requirements are very limited. People searching on the obvious terms relating to the name of the site/charity, or on specific popular page titles, should find us, and that already works fine with our current keyword, title and content strategy. A listing in Open Directory/dmoz.org, plus sensible keywords, content and page title meta tags, seems to satisfy most of my ranking requirements.
But I would like to save a bit of bandwidth and improve security with a robust, blunderbuss ban on all spiders except the ones I choose - probably Google, Yahoo Slurp, Bing, and one or two others.
I already run bot traps via .htaccess which result in some rogue bots collecting automatic IP address bans for themselves - but that does make for a very big .htaccess file as the list grows.
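To give an idea of the current setup, the trap simply appends deny lines, so the file ends up looking roughly like this (the addresses below are placeholders rather than real offenders, and I'm assuming the old Apache 2.2 allow/deny syntax):

# IP bans accumulated by the bot trap (placeholder addresses)
order allow,deny
allow from all
deny from 203.0.113.45
deny from 198.51.100.0/24
# ...one more "deny from" line for every trapped bot, which is why the file keeps growing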
Does anyone have suggestions for appropriate global-ban .htaccess statements, and for the specific allow statements to let my chosen engines through?
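What I have in mind is something along these lines, assuming mod_rewrite is available - but it's only a rough, untested sketch, and the user-agent strings are my own guesses rather than anything authoritative:

RewriteEngine On
# Let the engines I actually want straight through (substrings are guesses)
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|bingbot|msnbot) [NC]
# Refuse anything else that identifies itself as a crawler of some kind
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|scraper) [NC]
RewriteRule .* - [F]

The idea being that ordinary browsers pass because they don't match the crawler pattern, the named engines pass because of the first condition, and anything else calling itself a bot/crawler/spider gets a 403. Whether that is actually a sensible way to build a whitelist, I don't know - hence the question.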
And a list of mainstream web search engines which it would be sensible to allow?
Many thanks to any who can help.