I've seen Nutch being used by almost everyone in the last few years: school CS projects, Yahoo, Overture, Internet Archive, and dozens of unknown sources. At one time I supported the NutchOrg project's efforts and allowed access to all Nutch agents, but it really got out of hand. I was seeing it come from everywhere.
So I changed my mind and decided to no longer allow Nutch. At first I denied it in robots.txt, but most Nutch variants ignored it, so I pulled the plug altogether and banned all UAs containing "nutch" via .htaccess.
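For anyone wanting to do the same, a ban along these lines would work. This is only a sketch, not my exact rule set, and it assumes mod_rewrite is enabled and that the host allows rewrite rules in .htaccess:

```apache
# Sketch only: assumes mod_rewrite is loaded and overrides are allowed.
RewriteEngine On
# Match "nutch" anywhere in the User-Agent header, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} nutch [NC]
# Leave the URL untouched and answer 403 Forbidden.
RewriteRule ^ - [F]
```

This only catches agents that still identify themselves with "nutch" in the UA string. A variant that changes its UA will get straight through.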
How do I know Nutch is not being used to scrape content or copy entire sites to remote servers in other countries that I will never know about? I find my content infringed on web sites, forums and blogs all the time, and I almost always have DMCA paperwork in progress.
All the threads I find at WW are pretty old, from before we had much info on this bot. What's the latest? Any hard data that Nutch is being used for nefarious purposes?