I've been playing with the WebmasterWorld robots.txt checker and, out of interest, ran webmasterworld.com/robots.txt through it. I was surprised at the number of excluded agents. Many seem to be email harvesters and site downloaders, which clearly makes sense.
Do people have a list of "nuisance" agents they suggest should be excluded by default for most sites?
Check the site in my profile. One of the files I offer for download is a regularly updated robots.txt. For robots that don't obey robots.txt, you can check the [Website Strippers] section of my browscap.ini file for the user agents I consider a nuisance.
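For anyone wondering how per-agent exclusion in robots.txt plays out, here is a rough sketch using Python's standard urllib.robotparser. The agent names (ExampleHarvester, ExampleSiteCopier), the rules, and the example.com URLs are made up for illustration and are not taken from any real file mentioned in this thread. It only shows what a well-behaved robot would conclude; robots that ignore robots.txt are unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The agent names are invented
# for illustration and do not come from any real exclusion list.
ROBOTS_TXT = """\
User-agent: ExampleHarvester
Disallow: /

User-agent: ExampleSiteCopier
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(agent: str, url: str) -> bool:
    """Return True if a robot obeying ROBOTS_TXT may fetch the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("ExampleHarvester", "ExampleSiteCopier", "SomeBrowser"):
        verdict = is_allowed(agent, "http://www.example.com/index.html")
        print(agent, "allowed" if verdict else "excluded")
```

The named agents are shut out of the whole site, while everything else falls through to the `*` record and is only kept out of /cgi-bin/.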
[edited by: GaryK at 12:36 am (utc) on Nov. 4, 2002]
GaryK - thanks for the feedback, but there is no website listed in your profile.
Macguru - wow, [webmasterworld.com] what a thread. I'm not a code jockey and not confident about playing with the .htaccess file - am I right in thinking you use that for bots that don't respect robots.txt?