Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Block all Nutch variants. Possible?
morags
12:57 pm on Jun 4, 2007 (gmt 0)

I've just been crawled by another Nutch variant. Rather than having to add the user agent to robots.txt for each variant, is it possible to use a wildcard to disallow all spiders which have 'Nutch' anywhere in the user agent?

e.g.

User-agent: *Nutch*
Disallow: /

Ta.

 

goodroi
7:33 pm on Jun 4, 2007 (gmt 0)

According to the Nutch website:

Different installations of the Nutch software may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:

User-agent: Nutch
Disallow: /

Good luck.

morags
8:28 pm on Jun 5, 2007 (gmt 0)

Thanks for the pointer. I've had this specified in robots.txt for some time - and still they appear. Maybe some variants do obey this instruction, but sadly not all.

Will add the wildcards too - and see if it makes any difference.

incrediBILL
8:37 pm on Jun 5, 2007 (gmt 0)

The only way to truly block all the noise, including all the Nutch variants, is to write your robots.txt and .htaccess in WHITELIST format so everything else goes away. Tell 'em nicely in robots.txt and keep 'em out by force in .htaccess.

Example:

# allowed bots here
User-agent: Googlebot
User-agent: Slurp
User-agent: Teoma
Crawl-delay: 2
Disallow: /cgi-bin

# everyone else jump off a cliff
User-agent: *
Disallow: /
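For the .htaccess side of the whitelist, here's a minimal mod_rewrite sketch (assumes mod_rewrite is enabled; the bot names are examples matching the robots.txt above - adjust to your own whitelist):

```apache
# Whitelist enforcement sketch. Let only the named crawlers through;
# anything else that self-identifies as a crawler gets a 403.
# Regular browsers without bot-like user agents pass untouched.
RewriteEngine On

# Skip the block for whitelisted crawlers
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|Teoma) [NC]
# Everything else that looks like a crawler is forbidden
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider|nutch) [NC]
RewriteRule .* - [F]
```

The two RewriteCond lines AND together: a request is only blocked if its user agent is not on the whitelist and does match a crawler-ish pattern.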


Conard
8:37 pm on Jun 5, 2007 (gmt 0)

Trying to block all the Nutch bots with robots.txt will not stop them.

I've caught several variants in directories they are "asked" not to crawl. I block them and all bots with .htaccess from places they shouldn't be.
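One way to do that with mod_rewrite (a sketch, assuming mod_rewrite is available) - the case-insensitive match catches renamed variants that still contain "nutch" anywhere in the user agent:

```apache
# Forbid any user agent containing "nutch", case-insensitive,
# regardless of what the variant otherwise calls itself.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} nutch [NC]
RewriteRule .* - [F]
```

Placed in a directory's .htaccess, this only applies to that directory and below, so you can restrict it to the places bots shouldn't be.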

incrediBILL
9:02 pm on Jun 5, 2007 (gmt 0)

"I block them and all bots with .htaccess"

You block the ones that want to be seen; the rest are having a party on your server right now, as I type this, zipping right past .htaccess in stealth mode.

However, it's the best you can do with the tools Apache gives you.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved