
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Ban All Nutch Variants?
keyplyr




msg:3407576
 8:30 am on Jul 29, 2007 (gmt 0)

I've seen Nutch being used by almost everyone in the last few years: school CS projects, Yahoo, Overture, Internet Archive, and dozens of unknown sources. At one time I supported the NutchOrg project's efforts and allowed access to all Nutch agents. However, it really got out of hand; I was seeing it come from everywhere.

So I changed my mind and decided to no longer allow Nutch. At first I denied it in robots.txt, but most Nutch variants ignored it, so I pulled the plug altogether and banned all UAs containing "nutch" via .htaccess.
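
A ban like that can be sketched as follows. This is only an illustration, assuming Apache with mod_rewrite enabled and that a 403 Forbidden response is acceptable; the actual rules used are not given in this post.

```
# Sketch only: refuse any request whose User-Agent contains "nutch".
# Assumes mod_rewrite is available; pattern and response are illustrative.
RewriteEngine On
# [NC] makes the match case-insensitive, so "Nutch" and "nutch" both match.
RewriteCond %{HTTP_USER_AGENT} nutch [NC]
# "-" leaves the URL unchanged; [F] returns 403 Forbidden.
RewriteRule ^ - [F]
```

Because the condition is an unanchored, case-insensitive substring match, it should catch any variant that keeps "nutch" somewhere in its user-agent string, whatever else is added around it. It does nothing against a crawler that changes or spoofs its UA.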

How do I know Nutch is not being used to scrape content or copy entire sites to remote servers in other countries that I will never know about? I find my content infringed on web sites, forums and blogs all the time and almost always have DMCA papers in action.

All the threads I find at WW are pretty old and date from before we had much info on this bot. What's the latest? Any hard data that Nutch is being used for nefarious purposes?

 

wilderness




msg:3407743
 5:34 pm on Jul 29, 2007 (gmt 0)

keyplyr,
I've had all versions of Nutch denied for more than four and a half years.

I seem to recall Jim having some formal discussions with the Nutch folks, and when no solution came of them, Jim added a deny for Nutch as well?

Don


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved