
Sitemaps, Meta Data, and robots.txt Forum

Block all Nutch variants. Possible?

 12:57 pm on Jun 4, 2007 (gmt 0)

I've just been crawled by another Nutch variant. Rather than having to add the user agent to robots.txt for each variant, is it possible to use a wildcard to disallow all spiders which have 'Nutch' anywhere in the user agent?


User-agent: *Nutch*
Disallow: /




 7:33 pm on Jun 4, 2007 (gmt 0)

According to the Nutch website:
Different installations of the Nutch software may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:

User-agent: Nutch
Disallow: /

Good luck.
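To see how a parser might treat the two forms, here is a minimal sketch using Python's standard-library urllib.robotparser. It only shows how that one parser matches agent names; it says nothing about what any given Nutch build actually does. The variant name "NutchCVS/0.8-dev" is a made-up example, not taken from a real log.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text):
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The plain token recommended by the Nutch site.
token_rules = parse_rules("User-agent: Nutch\nDisallow: /\n")

# The wildcard form from the original question.
wildcard_rules = parse_rules("User-agent: *Nutch*\nDisallow: /\n")

# Hypothetical variant name, used only for illustration.
VARIANT = "NutchCVS/0.8-dev"

# True means "allowed to fetch", False means "disallowed".
print("token form allows variant:   ", token_rules.can_fetch(VARIANT, "/page.html"))
print("wildcard form allows variant:", wildcard_rules.can_fetch(VARIANT, "/page.html"))
```

In this parser the plain token is matched as a case-insensitive substring of the crawler's name, so "Nutch" already covers variants, while the asterisks are treated as literal characters and match nothing. Other parsers may behave differently, and none of this helps against a bot that ignores robots.txt.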


 8:28 pm on Jun 5, 2007 (gmt 0)

Thanks for the pointer. I've had this specified in robots.txt for some time, and still they appear. Some variants may obey the instruction, but sadly not all.

I'll add the wildcard version too and see if it makes any difference.


 8:37 pm on Jun 5, 2007 (gmt 0)

The only way to truly block all the noise, including all the Nutch variants, is to write both robots.txt and .htaccess in WHITELIST format, so that everything not explicitly allowed goes away. Tell 'em nicely in robots.txt and keep 'em out by force in .htaccess.


# allowed bots here
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Crawl-delay: 2
Disallow: /cgi-bin

# everyone else jump off a cliff
User-agent: *
Disallow: /
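A quick way to sanity-check a whitelist file like this before uploading it is to run it through a parser. The sketch below uses Python's urllib.robotparser with the rules copied from this post; the "NutchCVS/0.8-dev" agent and the URL paths are invented test values.

```python
from urllib.robotparser import RobotFileParser

# Whitelist-style robots.txt, copied from the post above.
WHITELIST = """\
# allowed bots here
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Crawl-delay: 2
Disallow: /cgi-bin

# everyone else jump off a cliff
User-agent: *
Disallow: /
"""

rules = RobotFileParser()
rules.parse(WHITELIST.splitlines())

# Listed bots may fetch normal pages but not /cgi-bin.
print("Googlebot /index.html:", rules.can_fetch("Googlebot", "/index.html"))
print("Googlebot /cgi-bin/x: ", rules.can_fetch("Googlebot", "/cgi-bin/x"))

# Anything not listed falls through to the catch-all "*" group.
print("Unlisted  /index.html:", rules.can_fetch("NutchCVS/0.8-dev", "/index.html"))

# Crawl-delay is read per group (it is a common extension, not part of the original standard).
print("Slurp crawl delay:    ", rules.crawl_delay("Slurp"))
```

This only confirms how a well-behaved parser reads the file. It is the polite half of the setup; the enforcement half still has to happen on the server.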


 8:37 pm on Jun 5, 2007 (gmt 0)

Trying to block all the Nutch bots with robots.txt will not stop them.

I've caught several variants in directories they are "asked" not to crawl. I block them and all bots with .htaccess from places they shouldn't be.


 9:02 pm on Jun 5, 2007 (gmt 0)

"I block them and all bots with .htaccess"

You block the ones that want to be seen. The rest are having a party on your server right now, as I'm typing this, zipping right past .htaccess in stealth mode.

However, it's the best you can do with the tools Apache gives you.
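For anyone not on Apache, the same idea can be sketched in application code. The example below is a minimal Python WSGI wrapper that refuses any request whose User-Agent contains "nutch" anywhere, ignoring case, which is the server-side version of the wildcard asked about at the top of the thread. The names block_user_agents, site and status_for, and the sample agent strings, are all made up for illustration.

```python
def block_user_agents(app, blocked_tokens=("nutch",)):
    """Wrap a WSGI app and answer 403 when the User-Agent contains a blocked token."""
    tokens = tuple(token.lower() for token in blocked_tokens)

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Not a blocked agent: hand the request to the real application.
        return app(environ, start_response)

    return middleware


def site(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    body = b"Hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(user_agent):
    """Send one fake request through the wrapper and return the status line."""
    seen = []

    def start_response(status, headers, exc_info=None):
        seen.append(status)

    b"".join(block_user_agents(site)({"HTTP_USER_AGENT": user_agent}, start_response))
    return seen[0]


print(status_for("NutchCVS/0.8-dev"))          # self-identified variant
print(status_for("Mozilla/5.0 (compatible)"))  # browser-like string
```

The second request gets through, which is the point made above: a check like this only stops crawlers that announce themselves, and a bot sending a browser-like string walks straight past it.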

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved