I've just been crawled by another Nutch variant. Rather than having to add the user agent to robots.txt for each variant, is it possible to use a wildcard to disallow all spiders which have 'Nutch' anywhere in the user agent?
Different installations of the Nutch software may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:
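Presumably the block meant here is the standard two-line rule, assuming you want Nutch kept off the whole site rather than just certain paths:

```
User-agent: Nutch
Disallow: /
```

This only helps against crawlers that actually fetch and obey robots.txt.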
The only way to truly block all the noise, including all the Nutch variants, is to write your robots.txt and .htaccess files in whitelist format: name the crawlers you want and turn everything else away. Tell 'em nicely in robots.txt and keep 'em out by force in .htaccess.
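A rough sketch of what that whitelist pair could look like. The agent names (Googlebot, bingbot, Mozilla, Opera) are only example entries, not a recommended list, and the .htaccess part assumes Apache with mod_rewrite enabled. First the polite version in robots.txt:

```
# Crawlers you want: an empty Disallow means nothing is off limits
User-agent: Googlebot
User-agent: bingbot
Disallow:

# Everyone else, Nutch variants included
User-agent: *
Disallow: /
```

Then the enforced version in .htaccess:

```
RewriteEngine On

# Always serve robots.txt so polite crawlers can still read the rules
RewriteRule ^robots\.txt$ - [L]

# Return 403 Forbidden to any user agent not on the whitelist (example names)
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Mozilla|Opera) [NC]
RewriteRule ^ - [F]
```

Browser tokens have to be on the .htaccess list too, or human visitors get the 403 as well. Any bot that fakes a browser user agent will still get through, so treat this as a filter rather than a guarantee.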